Seven rounds of synthetic persona simulation against joinotto.com. Discovered three methodology bugs, fixed them mid-study, reached the copy ceiling, and documented exactly where static pages hand off to product experience.
Otto is an AI-powered accounting platform built specifically for self-employed creative professionals. It handles bookkeeping, taxes, invoicing, and contracts. Before investing in paid traffic or conversion rate optimization, the team wanted a baseline: how does the homepage land with the target audience, what's working, what's blocking signups, and what's the highest-leverage place to improve?
We ran a baseline simulation using 20 synthetic personas, then iterated across six more rounds — improving the page in some rounds and fixing our own measurement system in others. What we found about the methodology turned out to be as valuable as what we found about the page.
Each persona is a synthetic profile built from demographic and behavioral patterns matching Otto's target market: self-employed creatives, 1-10 years in business, variable income, doing their own finances or relying on spreadsheets. Each persona reads the homepage text and answers structured questions about first impression, comprehension, signup intent (1-10), and what would make them more confident.
A separate judge model then reads all 20 responses and produces a structured verdict. Rounds run sequentially on the same 20 personas for consistent comparison.
Models: `qwen2.5:7b` for persona responses, `llama3.3:70b` as judge. All run locally on a Mac Studio with 256GB unified RAM. No cloud APIs, no per-token cost.
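The write-up doesn't say how the local models are served. The sketch below assumes an Ollama server on its default port, which is one common way to run these exact model tags on a Mac Studio; the prompt builder and field names are illustrative, not the simulator's actual code.

```python
import json
import urllib.request

# Assumption: models are served by a local Ollama instance on the default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_persona_prompt(persona: dict, page_text: str) -> str:
    """Illustrative prompt: persona profile, the page copy, and the
    structured questions the study describes."""
    return (
        f"You are {persona['name']}, a {persona['profession']} "
        f"({persona['years']} years self-employed).\n\n"
        f"Read this homepage:\n{page_text}\n\n"
        "Answer: FIRST IMPRESSION, COMPREHENSION, "
        "SIGNUP INTENT (1-10) as 'Score: N', CONFIDENCE GAPS."
    )

def ask(model: str, prompt: str) -> str:
    """One non-streaming completion from the local model."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama server with the model pulled):
# reply = ask("qwen2.5:7b", build_persona_prompt(persona, page_text))
```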
| Round | Avg Intent | CTA Rate | Verdict | Primary change |
|---|---|---|---|---|
| v1 baseline | 6.20 | not measured | Getting There | Original joinotto.com homepage, no changes |
| v2 | 6.55 | not measured | Getting There | New hero copy, free tier callout, rewritten features, UI mock |
| v3 | 6.35 | not measured | Getting There | 14-day trial CTA, pricing tiers, AI+human explainer, FAQ expansion |
| v4 | 6.25 | not measured | Getting There | Integrations strip, inline accuracy steps, cost comparison table |
| v5 | 6.80* | not measured | Getting There | Story cards (before/after), bookkeeper correction in UI, "How it works" moved earlier, accuracy FAQ |
| v6 | 6.35 | 5/20 (25%) | Nearly There | Methodology fix only. Same v5 page. Fixed intent extraction, replaced forced-objection prompt, added binary CTA question, recalibrated judge |
| v7 | 6.95 | 8/20 (40%) | Getting There | Profession-specific grid, accuracy stats section, 6 segmented story cards, 2 better-fit personas added |
* v5 avg corrected manually after identifying extraction bug; automated avg was unreliable that round.
The verdict inconsistency in v7: The judge returned "Getting There" despite v7 showing higher scores (6.95 vs 6.35) and a better CTA rate (40% vs 25%) than v6's "Nearly There." This is judge model variance, not page regression. The quantitative metrics — intent average and CTA click rate — are the reliable signal. The verdict label is noisy at this score range.
The most important finding in this study wasn't about Otto's homepage — it was about our own measurement system. After five rounds of flat "Getting There" verdicts despite significant page improvements, we ran a debugging session on the simulator itself. Three compounding bugs surfaced.
Bug 1: intent extraction. The regex `r'SIGNUP INTENT.*?(\d+)'` grabbed the "1" from "(1-10)" in the prompt format before reaching the actual score. In rounds v1-v5, up to 9 of 20 personas returned `None` instead of their real score. The judge was left to reconstruct intent from qualitative text alone, and with incomplete data it defaulted conservative.
Fix: Rewrote extraction to look for `Score: [N]` first, then an `N/10` pattern, then a line-by-line scan after the label. From v6 onward: 20/20 scores extracted, zero failures.
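Both the failure mode and the fix can be reproduced in a few lines. This is a minimal sketch: `extract_intent` is a reconstruction of the described fallback chain, not the simulator's actual code.

```python
import re

RESPONSE = """SIGNUP INTENT (1-10)
Score: 7
I'd try the free tier first."""

# Buggy pattern: the non-greedy .*? stops at the first digit after the
# label, which is the "1" inside the "(1-10)" scale hint, not the score.
buggy = re.search(r'SIGNUP INTENT.*?(\d+)', RESPONSE)
print(buggy.group(1))  # "1" (the scale hint), not 7

def extract_intent(text: str):
    """Fixed fallback chain: explicit 'Score: N' label first, then an
    'N/10' pattern, then a line-by-line scan after the section label."""
    m = re.search(r'Score:\s*(\d+)', text)
    if m:
        return int(m.group(1))
    m = re.search(r'\b(\d+)\s*/\s*10\b', text)
    if m:
        return int(m.group(1))
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if 'SIGNUP INTENT' in line.upper():
            for later in lines[i + 1:]:
                m = re.search(r'\b([1-9]|10)\b', later)
                if m:
                    return int(m.group(1))
    return None  # signal a real extraction failure instead of guessing

print(extract_intent(RESPONSE))  # 7
```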
"TOP OBJECTION: What's the single biggest thing STOPPING YOU from signing up today?" forced every persona to find a reason not to convert — even ones who would have clicked the CTA. The judge read all these fabricated blockers as genuine conversion barriers.
Fix: Replaced with "CONFIDENCE GAPS: What specific information or experience would make you more confident before signing up?" Same information, framed as a gap rather than a blocker. Added a binary CTA question: "Would you click Start free trial right now?" — cuts through hedging language entirely.
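Scoring the binary question needs its own small parser. A hedged sketch (the function name and the accepted phrasings are assumptions; anything hedged deliberately maps to `None` for manual review rather than inflating the CTA rate):

```python
def parse_cta_click(answer: str):
    """Map a persona's free-text answer to the binary CTA question
    ("Would you click Start free trial right now?") to True/False/None.
    Only an explicit leading yes/no counts."""
    words = answer.strip().lstrip('"\'').lower().split(maxsplit=1)
    if not words:
        return None
    head = words[0].rstrip('.,!:;')
    if head in ("yes", "yeah", "definitely", "absolutely"):
        return True
    if head in ("no", "nope"):
        return False
    return None  # hedged ("probably", "maybe") -> flag for manual review
```

Counting only explicit leading answers keeps the reported CTA rate conservative; hedged responses surface for review instead of being silently miscounted either way.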
Bug 3: trial interest miscoded as hesitation. Persona language like "I'd need to try the free tier to be fully confident" was being coded as hesitation. For a freemium product, "I want to try the free tier" *is* the conversion; that's what the CTA is for. The judge was penalizing the most desirable user behavior.
Fix: Added explicit calibration note to judge prompt: "I want to try the free trial" = conversion success, not hesitation. Treat as signal of intent, not of doubt.
Proof the fixes worked: v6 ran the exact same v5 page with only methodology fixes applied. Verdict jumped from "Getting There" (v5) to "Nearly There" (v6). The page had already earned a better verdict — we just couldn't measure it.
We also ran a control test: gave the judge a fabricated perfect scenario (all 20 personas at 10/10, zero objections) and confirmed it could return "Ship It." Then tested what score threshold triggered "Nearly There" — it required approximately 7.5/10 average with remaining gaps framed as "want to try it." This calibrated our expectations for what the verdicts actually mean.
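The control scenarios themselves are easy to fabricate as structured data; the judging step is a model call, sketched below as a hypothetical `run_judge()`. The field names and the `cta_click` heuristic are assumptions, not the simulator's actual schema.

```python
def make_round(score: int, gap: str, n: int = 20) -> list[dict]:
    """Fabricate n identical persona responses for judge calibration.
    The cta_click threshold here is illustrative only."""
    return [
        {"intent": score, "cta_click": score >= 8, "confidence_gap": gap}
        for _ in range(n)
    ]

perfect = make_round(10, gap="none")          # should earn "Ship It"
strong = make_round(8, gap="want to try it")  # above ~7.5 avg -> "Nearly There"

# Hypothetical judge call (llama3.3:70b behind run_judge):
# assert run_judge(perfect) == "Ship It"
# assert run_judge(strong) == "Nearly There"
```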
Replacing three generic star-rating testimonials with before/after story cards — each with a specific dollar outcome, named persona, and quote tied to that outcome — was the highest-impact page change across all rounds. The judge noted the cards in every subsequent verdict. By v7, 17/20 personas found a story that felt written for them.
"The before/after user stories made me more likely to sign up because they show real improvements and savings, but I need to see more case studies specific to my industry."
Priya Mehta, UX/UI designer (v5 response)
The "Who Otto is built for" grid with six creative type tiles (photographer, designer, content creator, consultant, artist/maker, service pro) increased story match rate to 17/20. But it didn't move intent scores substantially — recognition alone isn't conversion. Personas who saw their profession in the grid still needed to see their expenses in the story card to feel understood.
Pulling the 94% / 100% / 168 corrections stats out of the FAQ and into a prominent blue band addressed Priya and Tyler's specific confidence gap. But three personas still cited AI accuracy as their primary hesitation even after seeing the stats. Stats alone don't build trust — they set a floor. Stories and product experience build on top.
Adding a 14-day free trial option for personas earning above the $25K free tier threshold was the single structural fix that addressed the most-cited baseline objection (18/20 had flagged "no free trial"). This change held through every subsequent round and is reflected in the final page.
The free tier callout box listed "human bookkeeper review" as a free feature while the pricing cards showed it as paid-only. Fixed in v5. Small change, but this kind of inconsistency is caught by careful readers and damages trust disproportionately.
| Persona type | v7 intent | CTA click | Primary gap |
|---|---|---|---|
| Event photographer (Emma Walsh) | 8/10 | yes | Specific category accuracy |
| Business coach (Nina Foster) | 8/10 | yes | None — high intent |
| YouTuber/content creator (Jasmine Cole) | 8/10 | yes | Platform-specific income tracking |
| Graphic designer (Jake Reeves) | 7/10 | no | Wants to try before committing |
| Motion graphics (Jordan Lee) | 7/10 | yes | None — would click |
| Marketing consultant (Rachel Torres) | 7/10 | yes | None — replaced old bookkeeper |
| Copywriter (Tyler Brooks) | 7/10 | yes | None — accuracy stats satisfied concern |
| Social media manager (Alex Rivera) | 7/10 | yes | None — invoicing resonated |
| Ceramics artist (Lily Morgan) | 7/10 | yes | None — story card matched |
| Illustrator (Sofia Ruiz) | 7/10 | no | Wants free trial period |
| Brand strategist (Aisha Okafor) | 7/10 | no | Integration with existing setup |
| Podcast producer (Sam Jordan) | 7/10 | no | Wants longer-term proof |
| Interior designer (Chloe Dubois) | 7/10 | no | AI accuracy — wants demo |
| Tattoo artist (Dev Patel) | 7/10 | no | Story helped but wants specific trial |
| Consultant (Daniel Park) | 7/10 | no | Industry-specific case studies |
| Freelance photographer (Maya Chen) | 5/10 | no | Wants to trial before committing |
| Music producer (Carlos Vega) | 6/10 | no | No musician story on page |
| Videographer (Marco Santos) | 6/10 | yes | Photographer story close enough |
The sweet spot: Freelancers 2-6 years in who've outgrown spreadsheets but haven't committed to a full accountant relationship. They recognized their problem, found their story, and clicked. Year-one freelancers and very established operators with dedicated CPAs both scored lower — not Otto's best-fit audience.
Removed from target set: Marcus Hill (10-year video producer with dedicated accountant) and Ben Cho (developer managing finances via Stripe/spreadsheets) were removed after v6 as genuinely outside the Otto target. Including them suppressed the core audience signal.
Across seven rounds, the most common remaining gap — even from personas who found their story, understood the pricing, and said they'd click the CTA — was some version of: "I'd feel fully confident after trying it." That's not a copywriting problem. That's the free trial being the conversion mechanism.
A 40% CTA click rate (8/20) from a cold read of a homepage is strong performance for a freemium product. Real-world cold traffic converts at much lower rates than synthetic personas because personas don't have competing tabs open, don't get distracted, and read every word. 40% simulation likely maps to 8-12% in production, which is healthy for a free trial offer.
The remaining 60% are not lost — they're pre-trial. Their confidence gap is "I want to see it work for me," which is exactly what the free tier delivers. The page's job is to get them to click, not to eliminate all uncertainty before the click. It's doing that job.
Final assessment: The v7 page has reached its copy ceiling. Average intent of 6.95/10 across the core target audience, 40% CTA click rate, 17/20 story match rate. Further copy changes are unlikely to meaningfully move the needle. Ship v7 and let the product close.
The v7 prototype is the deliverable. It outperforms the original on every measurable dimension: free tier clarity, trial availability, social proof depth, profession recognition, AI accuracy transparency. Use it to replace the current homepage, or at minimum to inform its next revision.
Carlos Vega (music producer) was the clearest unmet need in v7. No story on the page speaks to his expenses: DAW software, sample packs, studio time, sync licensing income. One additional story card targeting audio/music professionals would address this gap. It's the only creative type in Otto's target market that v7 doesn't cover.
The most common confidence gap across all rounds was "I want to see it work." Video is the closest a static page can get to product experience. A 60-second screen recording of Otto categorizing real transactions — with a correction shown in real time — would address the AI accuracy hesitation that copy alone cannot fully resolve.
The v7 story cards are fabricated to represent the target audience. As real users accumulate, replace them with actual before/after stories using the same structure: profession label, problem before, specific outcome, quote. One real story with real numbers outperforms three constructed ones.
This study included a mid-run methodology audit that should be standard practice in future simulation work. Key lessons:
Always include a binary CTA question. "Would you click the primary CTA right now?" gives a cleaner conversion signal than intent scores alone and cuts through hedging language.
Never ask for the "single biggest thing stopping you." This prompt guarantees every persona finds a blocker, even if they'd convert. Ask what's missing, not what's wrong.
Test the judge with control scenarios before trusting its verdicts. A "perfect" scenario (all 10/10, no objections) should return "Ship It." A strong scenario (7.5/10, "I want to try it") should return "Nearly There." Calibrate before running real rounds.
Validate intent extraction before running. Print 3-4 sample responses and manually check that the regex captures the actual score, not a number from the prompt format.
Verdict labels are noisy at the margin. The quantitative metrics (avg intent, CTA rate, story match rate) are more reliable than the label. A "Getting There" at 6.95 with 40% CTA is performing better than a "Nearly There" at 6.35 with 25% CTA.
Click any version to open the full HTML prototype in a new tab.
| Version | Avg Intent | CTA Rate | Key change | Prototype |
|---|---|---|---|---|
| v1 baseline | 6.20 | — | Original joinotto.com homepage | joinotto.com ↗ |
| v2 | 6.55 | — | New hero, free tier callout, rewritten features | Open ↗ |
| v3 | 6.35 | — | 14-day trial, pricing tiers, AI+human explainer | Open ↗ |
| v4 | 6.25 | — | Integrations strip, cost comparison table | Open ↗ |
| v5 | 6.80* | — | Story cards, bookkeeper correction in UI, "How it works" moved up | Open ↗ |
| v6 | 6.35 | 25% | Methodology fix only — same v5 page, clean measurement | Same as v5 ↗ |
| v7 — final | 6.95 | 40% | Profession grid, accuracy stats, 6 segmented story cards | Open ↗ |
* v5 avg corrected manually after identifying extraction bug. Automated avg that round was unreliable.