All Case Studies

The Work

Every project documented with full methodology, data tables, what worked, what failed, and all deliverables linked. Updated as new work ships.

Landing Page Simulation  ·  March 7, 2026
Talk Stories: Landing Page to 6.65/10
5 rounds of synthetic persona simulation lifted conversion intent by 55% in a single day. No real users. No waiting. All models ran locally.
+55% intent lift  ·  5 sim rounds  ·  100 persona evals  ·  4 variants tested  ·  2x share rate

Key Findings

  • Subtraction beat addition in round one: removing three things (the ghostwriter label, "beta," a scary Slack line) lifted intent by 1.75 points
  • "Voice Engine" won the framing test at 7.35/10. "Story Engineer" failed for the same reason abstract labels always fail
  • Security section dropped privacy objections from 35% to 25% in one iteration: mechanisms, not reassurances
  • Testimonials doubled the share rate when engineered to the exact objection, not generic praise
  • The page hit a copy ceiling at 6.65/10: the remaining objection requires product experience, not more words
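The round-over-round mechanics behind these numbers reduce to a small aggregation step. A minimal sketch, assuming each persona eval carries an intent score and a list of objection tags (a hypothetical schema; the study's raw format isn't shown here):

```python
from statistics import mean

def summarize_round(evals):
    """Aggregate one simulation round of persona evals.

    `evals` is a list of dicts like {"intent": 7.0, "objections": ["privacy"]}.
    This per-persona schema is an assumption, not the study's published format.
    """
    return {
        "mean_intent": mean(e["intent"] for e in evals),
        "objection_rate": sum(1 for e in evals if e["objections"]) / len(evals),
    }

def intent_lift_pct(baseline, current):
    """Percent lift in mean intent between two round summaries."""
    return (current["mean_intent"] - baseline["mean_intent"]) / baseline["mean_intent"] * 100
```

On this arithmetic, a 6.65 final score over a ~4.29 baseline is the headline +55% lift.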

Homepage Simulation  ·  March 7, 2026
Otto: Baseline Homepage Study
4 rounds of synthetic persona simulation against joinotto.com. Strong baseline, three persistent objections, and a clear ceiling: some conversion problems require product evidence, not more copy.
6.20 baseline intent  ·  20/20 explore rate  ·  4 sim rounds  ·  80 persona evals  ·  $0 research cost

Key Findings

  • Strong baseline for the category: 6.20/10 intent, 20/20 would explore. The homepage communicates clearly and lands with the right audience
  • Free tier "$25K limit" confused 17/20 personas — unclear if it meant revenue, transactions, or something else
  • No free trial blocked 18/20 who earn more than $25K/year. Adding a 14-day trial was the highest single-round lift
  • AI accuracy anxiety persisted through all 4 rounds despite adding explanations. Personas want proof, not mechanism descriptions
  • Sweet spot audience is 2-6 years in: outgrown spreadsheets, not yet committed to a full accountant. Year-one and 10+ year veterans both scored lower
  • Pricing is not the conversion problem: guesses matched actual price, and the $500/month bookkeeper comparison landed cleanly
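Findings like "17/20 confused" fall out of tallying objection tags across the persona panel. A minimal sketch, again assuming a hypothetical per-eval list of objection tags:

```python
from collections import Counter

def objection_table(evals, panel_size=None):
    """Tally objection tags across persona evals, most common first.

    Returns (tag, "count/panel") pairs for n/20-style reporting.
    The tag schema is assumed, not taken from the study.
    """
    counts = Counter(tag for e in evals for tag in e["objections"])
    n = panel_size or len(evals)
    return [(tag, f"{c}/{n}") for tag, c in counts.most_common()]
```

Running this per round also makes the persistence of an objection (like the AI-accuracy anxiety that survived all four rounds) visible as a tag that never drops off the top of the table.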

AI Infrastructure  ·  March 6–7, 2026
Local LLM Eval Farm: 24 Models, Zero Cloud
Complete evaluation infrastructure on a single Mac Studio. Six eval dimensions, a 5.8x throughput advantage discovered in a different backend, and a routing system built on real data.
24 models  ·  6 eval dims  ·  5.8x MLX vs Ollama  ·  88k peak tok/s  ·  $0 cloud cost

Key Findings

  • Size is not quality for conversation: qwen2.5:7b (5GB) scored 100% multi-turn; llama3.3:70b (42GB) scored 47.8%
  • MLX delivers 5.8x aggregate throughput vs Ollama at 32 concurrent users; the gap is invisible at low concurrency, massive at scale
  • --decode-concurrency 8 made things worse. MLX's dynamic batcher outperforms any fixed value
  • qwen2.5:7b wins on value: 80.6% quality, 100% multi-turn, 93% domain, 10k tok/s, 5GB
  • When every model scores 0%, the task is broken: found and fixed a wrong answer key mid-run
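The last point generalizes into a cheap harness guard: when the entire roster zeroes out on one task, suspect the task before the models. A sketch of that check (function name and score layout are assumptions, not the farm's actual API):

```python
def suspect_tasks(scores, floor=0.0):
    """Return tasks where every model scored at or below `floor`.

    `scores` maps task -> {model: score}. If all 24 models hit 0% on
    the same task, the likelier bug is the task or its answer key,
    not 24 independent model failures.
    """
    return [task for task, by_model in scores.items()
            if by_model and all(s <= floor for s in by_model.values())]
```

Run between eval batches, a guard like this catches a wrong answer key mid-run instead of after the full sweep.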