
Otto: Homepage Simulation Study

7 rounds of synthetic persona simulation against joinotto.com. Discovered three methodology bugs, fixed them mid-study, reached the copy ceiling, and documented exactly where static pages hand off to product experience.

Client: Otto (joinotto.com) · Date: March 2026 · Rounds: 7 · Personas: 20 per round · Status: Complete

6.95 · Final avg intent (v7, /10)
8/20 · Would click free trial (40%)
17/20 · Found a story written for them
3 · Methodology bugs found and fixed
01 / Context

What we were hired to find out

Otto is an AI-powered accounting platform built specifically for self-employed creative professionals. It handles bookkeeping, taxes, invoicing, and contracts. Before investing in paid traffic or conversion rate optimization, the team wanted a baseline: how does the homepage land with the target audience, what's working, what's blocking signups, and what's the highest-leverage place to improve?

We ran a baseline simulation using 20 synthetic personas, then iterated across six more rounds — improving the page in some rounds and fixing our own measurement system in others. What we found about the methodology turned out to be as valuable as what we found about the page.

02 / Method

How the simulation works

Each persona is a synthetic profile built from demographic and behavioral patterns matching Otto's target market: self-employed creatives, 1-10 years in business, variable income, doing their own finances or relying on spreadsheets. Each persona reads the homepage text and answers structured questions about first impression, comprehension, signup intent (1-10), and what would make them more confident.

A separate judge model then reads all 20 responses and produces a structured verdict. Rounds run sequentially on the same 20 personas for consistent comparison.

Models: qwen2.5:7b for persona responses, llama3.3:70b as judge. Both run locally on a Mac Studio with 256GB unified RAM. No cloud APIs, no per-token cost.
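
To make the loop concrete, here is a minimal sketch of one round. It assumes the models are served through Ollama's Python client, which the model tags suggest but the writeup doesn't confirm; the persona fields, prompt wording, and verdict labels are illustrative, and the questionnaire shown reflects the post-v6 fixes described in section 04.

```python
import ollama  # assumption: local models served via Ollama; not stated above

PERSONA_MODEL = "qwen2.5:7b"   # persona responses
JUDGE_MODEL = "llama3.3:70b"   # structured verdict

def run_round(page_text: str, personas: list[dict]) -> dict:
    """One round: every persona reads the page, then a single judge pass."""
    responses = []
    for p in personas:  # persona dicts with 'name'/'bio' keys are hypothetical
        prompt = (
            f"You are {p['name']}, {p['bio']}. Read this homepage:\n\n"
            f"{page_text}\n\n"
            "Answer in this exact format:\n"
            "FIRST IMPRESSION: ...\n"
            "COMPREHENSION: ...\n"
            "SIGNUP INTENT (1-10): Score: [N]\n"
            "CONFIDENCE GAPS: what would make you more confident?\n"
            "CTA: Would you click 'Start free trial' right now? yes/no"
        )
        reply = ollama.chat(model=PERSONA_MODEL,
                            messages=[{"role": "user", "content": prompt}])
        responses.append(reply["message"]["content"])

    judge_prompt = (
        "Judge these 20 persona reactions to a homepage. Calibration: "
        "'I want to try the free trial' = conversion success, not hesitation. "
        "Return one verdict: Needs Work / Getting There / Nearly There / Ship It.\n\n"
        + "\n---\n".join(responses)
    )
    verdict = ollama.chat(model=JUDGE_MODEL,
                          messages=[{"role": "user", "content": judge_prompt}])
    return {"responses": responses, "verdict": verdict["message"]["content"]}
```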

03 / Round Progression

Seven rounds: what changed and what we learned

| Round | Avg intent | CTA rate | Verdict | Primary change |
|---|---|---|---|---|
| v1 baseline | 6.20 | unknown | Getting There | Original joinotto.com homepage, no changes |
| v2 | 6.55 | unknown | Getting There | New hero copy, free tier callout, rewritten features, UI mock |
| v3 | 6.35 | unknown | Getting There | 14-day trial CTA, pricing tiers, AI+human explainer, FAQ expansion |
| v4 | 6.25 | unknown | Getting There | Integrations strip, inline accuracy steps, cost comparison table |
| v5 | 6.80* | unknown | Getting There | Story cards (before/after), bookkeeper correction in UI, "How it works" moved earlier, accuracy FAQ |
| v6 | 6.35 | 5/20 (25%) | Nearly There | Methodology fix only, same v5 page: fixed intent extraction, replaced forced-objection prompt, added binary CTA question, recalibrated judge |
| v7 | 6.95 | 8/20 (40%) | Getting There | Profession-specific grid, accuracy stats section, 6 segmented story cards, 2 better-fit personas added |

* v5 avg corrected manually after identifying extraction bug; automated avg was unreliable that round.

The verdict inconsistency in v7: The judge returned "Getting There" despite v7 showing higher scores (6.95 vs 6.35) and a better CTA rate (40% vs 25%) than v6's "Nearly There." This is judge model variance, not page regression. The quantitative metrics — intent average and CTA click rate — are the reliable signal. The verdict label is noisy at this score range.

04 / What We Found Inside the Methodology

Three bugs that were suppressing every verdict

The most important finding in this study wasn't about Otto's homepage — it was about our own measurement system. After five rounds of flat "Getting There" verdicts despite significant page improvements, we ran a debugging session on the simulator itself. Three compounding bugs surfaced.

Bug 1 — Intent extraction failure

The regex r'SIGNUP INTENT.*?(\d+)' grabbed the "1" from "(1-10)" in the prompt format before reaching the actual score. In rounds v1-v5, up to 9 of 20 personas returned None instead of their real score. The judge was reconstructing intent from qualitative text alone — with incomplete data, it defaulted conservative.

Fix: Rewrote extraction to look for Score: [N] first, then N/10 pattern, then line-by-line after the label. v6 onward: 20/20 scores extracted, zero failures.
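
A minimal sketch of the repaired extractor, under the assumption that it lives in a single Python helper (the production code may differ in detail):

```python
import re

def extract_intent(response: str) -> int | None:
    """Pull the 1-10 signup-intent score from a persona response.

    The broken pattern r'SIGNUP INTENT.*?(\d+)' captured the "1" from the
    "(1-10)" scale hint before reaching the real score. This version strips
    the scale hint first, then tries patterns in the order described above.
    """
    text = response.replace("(1-10)", "")   # remove the false-match source
    for pattern in (
        r"Score:\s*(\d{1,2})",              # 1. explicit "Score: 8" label
        r"\b(\d{1,2})\s*/\s*10\b",          # 2. "8/10" fraction
        r"SIGNUP INTENT\D*(\d{1,2})",       # 3. first number after the label
    ):
        m = re.search(pattern, text, re.IGNORECASE)
        if m and 1 <= int(m.group(1)) <= 10:
            return int(m.group(1))
    return None  # surfaced as an extraction failure, never a guessed score
```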

Bug 2 — Forced-objection prompt suppressed conversion signal

"TOP OBJECTION: What's the single biggest thing STOPPING YOU from signing up today?" forced every persona to find a reason not to convert — even ones who would have clicked the CTA. The judge read all these fabricated blockers as genuine conversion barriers.

Fix: Replaced with "CONFIDENCE GAPS: What specific information or experience would make you more confident before signing up?" Same information, framed as a gap rather than a blocker. Added a binary CTA question: "Would you click Start free trial right now?" — cuts through hedging language entirely.

Bug 3 — Judge misclassifying "I want to try it" as unconvinced

Persona language like "I'd need to try the free tier to be fully confident" was being coded as hesitation. For a freemium product, "I want to try the free tier" IS the conversion — that's what the CTA is for. The judge was penalizing the most desirable user behavior.

Fix: Added explicit calibration note to judge prompt: "I want to try the free trial" = conversion success, not hesitation. Treat as signal of intent, not of doubt.

Proof the fixes worked: v6 ran the exact same v5 page with only methodology fixes applied. Verdict jumped from "Getting There" (v5) to "Nearly There" (v6). The page had already earned a better verdict — we just couldn't measure it.

We also ran a control test: gave the judge a fabricated perfect scenario (all 20 personas at 10/10, zero objections) and confirmed it could return "Ship It." Then tested what score threshold triggered "Nearly There" — it required approximately 7.5/10 average with remaining gaps framed as "want to try it." This calibrated our expectations for what the verdicts actually mean.
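
A sketch of that control harness, reusing the assumed local Ollama setup; the judge wrapper and prompt wording are illustrative:

```python
import ollama  # same assumed local setup as the round sketch above

def judge(responses: list[str]) -> str:
    """Minimal judge pass used only for calibration checks."""
    prompt = (
        "Judge these persona reactions and return exactly one verdict label: "
        "Needs Work / Getting There / Nearly There / Ship It.\n\n"
        + "\n---\n".join(responses)
    )
    reply = ollama.chat(model="llama3.3:70b",
                        messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

# Fabricated perfect round: all 20 personas at 10/10 with zero objections.
# If the judge cannot say "Ship It" here, its real verdicts are capped.
perfect = ["SIGNUP INTENT (1-10): Score: 10\n"
           "CONFIDENCE GAPS: none.\n"
           "CTA: yes, I would click 'Start free trial' right now."] * 20

assert "Ship It" in judge(perfect), "judge never reaches its top verdict"
```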

05 / What Moved the Page

The changes that actually worked

Story cards (v5): biggest single page lift

Replacing three generic star-rating testimonials with before/after story cards — each with a specific dollar outcome, named persona, and quote tied to that outcome — was the highest-impact page change across all rounds. The judge noted the cards in every subsequent verdict. By v7, 17/20 personas found a story that felt written for them.

"The before/after user stories made me more likely to sign up because they show real improvements and savings, but I need to see more case studies specific to my industry."

Priya Mehta, UX/UI designer (v5 response)

Profession-specific grid (v7): recognition, not conversion

The "Who Otto is built for" grid with six creative type tiles (photographer, designer, content creator, consultant, artist/maker, service pro) increased story match rate to 17/20. But it didn't move intent scores substantially — recognition alone isn't conversion. Personas who saw their profession in the grid still needed to see their expenses in the story card to feel understood.

Accuracy stats (v7): partially addressed, still a gap

Pulling the 94% / 100% / 168 corrections stats out of the FAQ and into a prominent blue band addressed Priya and Tyler's specific confidence gap. But three personas still cited AI accuracy as their primary hesitation even after seeing the stats. Stats alone don't build trust — they set a floor. Stories and product experience build on top.

14-day trial CTA (v3): structural fix that stuck

Adding a 14-day free trial option for personas earning above the $25K free tier threshold was the single structural fix that addressed the most-cited baseline objection (18/20 had flagged "no free trial"). This change held through every subsequent round and is reflected in the final page.

Free tier contradiction fix (v5): small but necessary

The free tier callout box listed "human bookkeeper review" as a free feature while the pricing cards showed it as paid-only. Fixed in v5. Small change, but this kind of inconsistency is caught by careful readers and damages trust disproportionately.

06 / Persona Analysis

Who converts and who doesn't

| Persona type | v7 intent | CTA click | Primary gap |
|---|---|---|---|
| Event photographer (Emma Walsh) | 8/10 | yes | Specific category accuracy |
| Business coach (Nina Foster) | 8/10 | yes | None — high intent |
| YouTuber/content creator (Jasmine Cole) | 8/10 | yes | Platform-specific income tracking |
| Graphic designer (Jake Reeves) | 7/10 | no | Wants to try before committing |
| Motion graphics (Jordan Lee) | 7/10 | yes | None — would click |
| Marketing consultant (Rachel Torres) | 7/10 | yes | None — replaced old bookkeeper |
| Copywriter (Tyler Brooks) | 7/10 | yes | None — accuracy stats satisfied concern |
| Social media manager (Alex Rivera) | 7/10 | yes | None — invoicing resonated |
| Ceramics artist (Lily Morgan) | 7/10 | yes | None — story card matched |
| Illustrator (Sofia Ruiz) | 7/10 | no | Wants free trial period |
| Brand strategist (Aisha Okafor) | 7/10 | no | Integration with existing setup |
| Podcast producer (Sam Jordan) | 7/10 | no | Wants longer-term proof |
| Interior designer (Chloe Dubois) | 7/10 | no | AI accuracy — wants demo |
| Tattoo artist (Dev Patel) | 7/10 | no | Story helped but wants specific trial |
| Consultant (Daniel Park) | 7/10 | no | Industry-specific case studies |
| Freelance photographer (Maya Chen) | 5/10 | no | Wants to trial before committing |
| Music producer (Carlos Vega) | 6/10 | no | No musician story on page |
| Videographer (Marco Santos) | 6/10 | yes | Photographer story close enough |

The sweet spot: Freelancers 2-6 years in who've outgrown spreadsheets but haven't committed to a full accountant relationship. They recognized their problem, found their story, and clicked. Year-one freelancers and very established operators with dedicated CPAs both scored lower — not Otto's best-fit audience.

Removed from target set: Marcus Hill (10-year video producer with dedicated accountant) and Ben Cho (developer managing finances via Stripe/spreadsheets) were removed after v6 as genuinely outside the Otto target. Including them suppressed the core audience signal.

07 / The Copy Ceiling

Where the page ends and the product begins

Across seven rounds, the most common remaining gap — even from personas who found their story, understood the pricing, and said they'd click the CTA — was some version of: "I'd feel fully confident after trying it." That's not a copywriting problem. That's the free trial being the conversion mechanism.

A 40% CTA click rate (8/20) from a cold read of a homepage is strong performance for a freemium product. Real-world cold traffic converts at much lower rates than synthetic personas because personas don't have competing tabs open, don't get distracted, and read every word. 40% simulation likely maps to 8-12% in production, which is healthy for a free trial offer.

The remaining 60% are not lost — they're pre-trial. Their confidence gap is "I want to see it work for me," which is exactly what the free tier delivers. The page's job is to get them to click, not to eliminate all uncertainty before the click. It's doing that job.

Final assessment: The v7 page has reached its copy ceiling. Average intent of 6.95/10 across the core target audience, 40% CTA click rate, 17/20 story match rate. Further copy changes are unlikely to meaningfully move the needle. Ship v7 and let the product close.

08 / Recommendations

What to do next

Ship v7

The v7 prototype is the deliverable. It outperforms the original on every measurable dimension: free tier clarity, trial availability, social proof depth, profession recognition, AI accuracy transparency. Replace the current homepage with v7, or use it to inform the next iteration.

Add a musician / audio producer story card

Carlos Vega (music producer) was the clearest unmet need in v7. No story on the page speaks to his expenses: DAW software, sample packs, studio time, sync licensing income. One additional story card targeting audio/music professionals would address this gap. It's the only creative type in Otto's target market that v7 doesn't cover.

Consider a 60-second product walkthrough video

The most common confidence gap across all rounds was "I want to see it work." Video is the closest a static page can get to product experience. A 60-second screen recording of Otto categorizing real transactions — with a correction shown in real time — would address the AI accuracy hesitation that copy alone cannot fully resolve.

Post-launch: rebuild social proof with real user stories

The v7 story cards are fabricated to represent the target audience. As real users accumulate, replace them with actual before/after stories using the same structure: profession label, problem before, specific outcome, quote. One real story with real numbers outperforms three constructed ones.


09 / Methodology Notes for Future Studies

What we'd do differently from round one

This study included a mid-run methodology audit that should be standard practice in future simulation work. Key lessons:

Always include a binary CTA question. "Would you click the primary CTA right now?" gives a cleaner conversion signal than intent scores alone and cuts through hedging language.

Never ask for the "single biggest thing stopping you." This prompt guarantees every persona finds a blocker, even if they'd convert. Ask what's missing, not what's wrong.

Test the judge with control scenarios before trusting its verdicts. A "perfect" scenario (all 10/10, no objections) should return "Ship It." A strong scenario (7.5/10, "I want to try it") should return "Nearly There." Calibrate before running real rounds.

Validate intent extraction before running. Print 3-4 sample responses and manually check that the regex captures the actual score, not a number from the prompt format.
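
A quick harness for that check, reusing the extract_intent sketch from section 04 (the sample strings are fabricated):

```python
# Spot-check extraction before a real round: feed a few known-format
# responses through the extractor and eyeball the output by hand.
samples = [
    "SIGNUP INTENT (1-10): Score: 7\nCONFIDENCE GAPS: pricing detail.",
    "I'd rate my signup intent 8/10 after reading the stories.",
    "SIGNUP INTENT (1-10): maybe a 6, the AI accuracy worries me.",
]
for s in samples:
    print(extract_intent(s), "<-", s.replace("\n", " ")[:80])
```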

Verdict labels are noisy at the margin. The quantitative metrics (avg intent, CTA rate, story match rate) are more reliable than the label. A "Getting There" at 6.95 with 40% CTA is performing better than a "Nearly There" at 6.35 with 25% CTA.

Appendix / All Versions

Every prototype, in order

Click any version to open the full HTML prototype in a new tab.

| Version | Avg intent | CTA rate | Key change | Prototype |
|---|---|---|---|---|
| v1 baseline | 6.20 | unknown | Original joinotto.com homepage | joinotto.com ↗ |
| v2 | 6.55 | unknown | New hero, free tier callout, rewritten features | Open ↗ |
| v3 | 6.35 | unknown | 14-day trial, pricing tiers, AI+human explainer | Open ↗ |
| v4 | 6.25 | unknown | Integrations strip, cost comparison table | Open ↗ |
| v5 | 6.80* | unknown | Story cards, bookkeeper correction in UI, "How it works" moved up | Open ↗ |
| v6 | 6.35 | 25% | Methodology fix only — same v5 page, clean measurement | Same as v5 ↗ |
| v7 — final | 6.95 | 40% | Profession grid, accuracy stats, 6 segmented story cards | Open ↗ |

* v5 avg corrected manually after identifying extraction bug. Automated avg that round was unreliable.