Case Study  ·  AI Infrastructure  ·  March 6–7, 2026

Local LLM Eval Farm:
24 Models, Zero Cloud

How we built a complete local LLM evaluation infrastructure on a single Mac Studio, benchmarked 24 models across 6 dimensions, discovered a 5.8x throughput advantage hiding in plain sight, and built a routing system that sends every query to exactly the right model.

Machine: Mac Studio, 256GB unified RAM · Date: March 6–7, 2026 · Duration: 2-day sprint · Backends: Ollama + MLX
24 models pulled · 6 eval dimensions · 5.8x MLX vs Ollama · 88k peak tok/s · 100% qwen2.5:7b multi-turn · $0 cloud API cost

Contents

  1. Executive Summary
  2. The Brief & Infrastructure
  3. Evaluation Methodology
  4. Quality Eval: 22 Models, 16 Tasks
  5. Multi-Turn Eval: The Size Inversion
  6. Domain Eval: Medical, Legal, Math, Code, Science
  7. MLX vs Ollama: The Hidden 5.8x
  8. Model Council & Cascade Router
  9. Surprises & Course Corrections
  10. Deliverables
  11. Implementation Playbook
  12. Appendix: Full Data Tables
Section 01

Executive Summary

We started with a question that sounds simple but isn't: on a single machine with 256GB of RAM, which local LLMs are actually worth running, and when?

Over two days, we pulled 24 models ranging from 2GB to 142GB, built a six-dimensional evaluation suite from scratch, discovered that the conventional wisdom about larger models being better is wrong for at least two important use cases, found a 5.8x throughput advantage sitting unused in the MLX backend, and shipped a four-tier cascade router backed by real eval data rather than guesswork.

Everything ran locally. Zero cloud API calls. Zero per-token cost. Unlimited iteration.

The Three Headline Findings

1. Size is not quality. qwen2.5:7b (5GB) scored 100% on multi-turn conversation. llama3.3:70b (42GB) scored 47.8% on the same eval, worse than the 3B model. For conversational tasks, running the 70B model is not just wasteful, it is actively worse.

2. The right backend matters as much as the right model. MLX and Ollama run the same model at the same quality. At 32 concurrent users and long outputs, MLX delivers 619 tok/s aggregate versus Ollama's 107 tok/s, a 5.8x difference that is entirely invisible unless you measure it.

3. One model wins on value. qwen2.5:7b is the answer to almost every question: 80.6% quality, 100% multi-turn, 93% domain, 10,601 tok/s peak throughput, 5GB. No other model comes close on value per gigabyte.

The North Star

Every eval decision was anchored to one question: which model should a router send this specific query to? Not "which model scores highest in aggregate", that leads to running 70B models on greetings. The goal was a routing table backed by real data, not vibes.

Section 02

The Brief & Infrastructure

The Machine

Mac Studio with Apple M-series chip and 256GB unified RAM. This is not a GPU cluster, it is a single consumer machine that happens to have enough memory to fit every model we tested simultaneously if needed. The unified memory architecture means CPU and GPU share the same pool, which changes the calculus on model loading and concurrency compared to discrete GPU systems.

The Two Backends

We ran two model serving backends side by side:

- Ollama (port 11434): llama.cpp backend; multi-model management with serialized request handling
- MLX (mlx-lm 0.31.0, port 8081): Apple Metal backend with native dynamic batching

Models Pulled

24 models across 8 size classes, selected to span the useful range on a 256GB machine:

Size class | Models | ~VRAM
2–3B | llama3.2:3b, llama3.2-vision:11b (grouped here for comparison) | 2–7GB
5–8B | qwen2.5:7b, qwen3:8b, deepseek-r1:7b | 5–6GB
9–14B | phi4:14b, qwen2.5-coder:14b, deepseek-r1:14b, qwen3:14b, qwen2.5:14b | 8–10GB
20–22B | mistral-small:22b, gpt-oss:20b | 13GB
27–30B | gemma3:27b, qwen3-coder:30b, qwen3:30b | 17–18GB
32B | qwen2.5:32b, deepseek-r1:32b | 19–20GB
70–72B | llama3.3:70b, deepseek-r1:70b, qwen2.5:72b | 42–47GB
235B | qwen3:235b | 142GB
The qwen3:235b situation

qwen3:235b requires 142GB of the machine's 256GB RAM. It loads, it runs, and it scores well on domain tasks, but at 300-second eval timeouts and single-digit tok/s under any load, it is a benchmark curiosity rather than a production option. It is included in the data for completeness but excluded from routing recommendations.

Section 03

Evaluation Methodology

Six Eval Dimensions

We designed six distinct eval suites, each testing a different capability axis. Running a single combined eval would conflate very different failure modes: a model can be great at code and terrible at conversation. The suites are intentionally orthogonal.

Eval | Script | Models tested | What it measures
Quality R2 | quality_eval_r2.py | 22 models | Reasoning, coding, knowledge, instruction following: 16 tasks, deterministic scoring
Multi-turn | multiturn_eval.py | 6 models | Conversation coherence, context retention, sycophancy resistance: 9 scenarios, 245 max pts
Domain | domain_eval.py | 8 models | Medical, legal, math, code debugging, scientific reasoning: 23 tasks
Throughput / concurrency | concurrency_stress.py, deep_concurrency.py | 12 models | Peak tok/s, safe concurrency ceiling, latency under load
RAG grounding | rag_eval.py | 7 models | Knowledge tasks with and without retrieved context, bare vs RAG delta
Think vs no-think | qwen3_think_vs_nothink.py | 3 qwen3 models | Quality and throughput impact of extended thinking mode

Scoring Design Principles

Why We Serialized

The first time we ran concurrent evals, qwen3:235b loaded into RAM while a smaller model was mid-eval. The smaller model's throughput dropped 80%. On a machine with a unified memory bus, there is no isolation between workloads. Serial execution is not a limitation, it is the only way to get clean comparative data.

The Scoring Bug Chronicle

Five scoring bugs were discovered and fixed mid-run across the eval scripts; each required a partial re-run for the affected models.

Section 04

Quality Eval: 22 Models, 16 Tasks

The quality eval (Round 2, after scoring bug fixes) is the most comprehensive single dataset. 22 models, 16 tasks across four categories: reasoning, coding, knowledge, and instruction following.

Full Leaderboard

Rank | Model | Overall | Reasoning | Coding | Knowledge | Instruction
1 | llama3.3:70b | 83.8% | 60% | 95% | 100% | 90%
1 | qwen2.5:72b | 83.8% | 80% | 95% | 67% | 90%
3 | qwen2.5-coder:14b | 81.2% | 80% | 95% | 67% | 80%
4 | qwen2.5:7b | 80.6% | 80% | 88% | 67% | 85%
4 | deepseek-r1:32b | 80.6% | 60% | 88% | 100% | 85%
4 | qwen2.5:32b | 80.6% | 80% | 92% | 67% | 80%
7 | qwen3:14b | 78.8% | 60% | 100% | 67% | 90%
7 | qwen3-coder:30b | 78.8% | 60% | 100% | 67% | 90%
9 | llama3.2:3b | 77.5% | 60% | 95% | 67% | 90%
9 | phi4:14b | 77.5% | 60% | 95% | 67% | 90%
11 | qwen3:30b | 76.9% | 60% | 92% | 67% | 90%
12 | gemma3:12b | 75.6% | 60% | 88% | 67% | 90%
12 | qwen2.5:14b | 75.6% | 60% | 92% | 67% | 85%
12 | mistral-small:22b | 75.6% | 60% | 88% | 67% | 90%
12 | gemma3:27b | 75.6% | 60% | 88% | 67% | 90%
16 | deepseek-r1:70b | 71.2% | 40% | 95% | 67% | 90%
17 | gpt-oss:20b | 70.0% | 52% | 100% | 67% | 65%
18 | deepseek-r1:14b | 68.8% | 40% | 95% | 67% | 80%
19 | llama3.2-vision:11b | 68.1% | 60% | 88% | 33% | 85%
20 | qwen3:8b | 63.1% | 52% | 68% | 67% | 70%
21 | deepseek-r1:7b | 59.4% | 40% | 68% | 67% | 70%
22 | qwen3:235b * | 56.2% | 40% | 25% | 100% | 75%

* qwen3:235b score heavily penalized by 300s timeouts on coding/reasoning tasks under concurrent eval load. Estimated real quality ~80%+.

The Three Structural Patterns

The knowledge ceiling at 67%. Almost every model hit the exact same 67% knowledge score; only llama3.3:70b, deepseek-r1:32b, and qwen3:235b reached 100%, and llama3.2-vision:11b fell to 33%. This is not a coincidence: it traces to two specific tasks. The hallucination_probe asks about Han Kang's 2024 Nobel Prize (most models have a training cutoff before October 2024 and correctly refuse). The confabulation_trap presents a fabricated Einstein quote and tests whether the model refuses to validate it. Models that score 100% on knowledge are the ones that refused cleanly on both. The 67% floor is structural, not meaningful.

Coding floor at 88%. With two exceptions (qwen3:8b and deepseek-r1:7b at 68%), every tested model scores 88-100% on coding tasks. Coding is the most saturated category, it is no longer a useful differentiator above 7B parameters.

Reasoning spreads widest (40-80%). The only category where model quality genuinely separates. Knights/knaves logic, Monty Hall, and logic grids are the hardest tasks in the suite. The 70B models do not win this category. qwen2.5:7b, qwen2.5-coder:14b, and qwen2.5:32b all tie at 80% reasoning alongside qwen2.5:72b.

Section 05

Multi-Turn Eval: The Size Inversion

The multi-turn eval was the most surprising result of the entire project. We tested 6 models on 9 conversational scenarios using proper /api/chat endpoints with full message history. The results inverted everything the quality leaderboard suggested.
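The transport matters here: each turn must re-send the full message history, or the model treats every prompt as turn one. A minimal sketch of that loop against Ollama's standard /api/chat endpoint (the scenario content is illustrative, not the actual multiturn_eval.py):

```python
import json
import urllib.request

OLLAMA_CHAT = "http://localhost:11434/api/chat"  # default Ollama endpoint

def next_history(history, user_msg, assistant_msg):
    """Append one completed turn; the whole list is re-sent on the next turn."""
    return history + [
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": assistant_msg},
    ]

def chat_turn(model, history, user_msg, timeout=60):
    """POST the full history plus the new user message; return the reply."""
    payload = {
        "model": model,
        "messages": history + [{"role": "user", "content": user_msg}],
        "stream": False,
    }
    req = urllib.request.Request(
        OLLAMA_CHAT,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]

# Usage (requires a running Ollama instance):
#   history = []
#   for prompt in ["Name three primes.", "Now drop the largest one."]:
#       reply = chat_turn("qwen2.5:7b", history, prompt)
#       history = next_history(history, prompt, reply)
```

Evaluating with bare /api/generate calls instead would silently test single-turn behavior six times over, which is exactly the failure mode this suite exists to avoid.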

Results

Rank | Model | Score | Best scenario | Worst scenario
1 | qwen2.5:7b | 100% | All scenarios | None (perfect)
2 | llama3.2:3b | 90.6% | correction_handling | false_premise_resistance
3 | gemma3:12b | 88.6% | context_retention | false_premise_resistance
4 | qwen3:14b | 75.1% | topic_switching | gradual_refinement
5 | qwen3:30b | 61.2% | false_premise_resistance | gradual_refinement
6 | llama3.3:70b | 47.8% | instruction_persistence | gradual_refinement

The Four Patterns That Explain the Inversion

Pattern 1: Think mode cascades into failure. The gradual_refinement scenario asks models to iteratively improve code across 4 turns: add validation, add memoization, add a docstring, return the final version. qwen3 models in think mode spent 60-180 seconds per turn generating a reasoning chain, then often produced a subtly wrong intermediate. Turn 2 built on turn 1's error. By turn 4, the code was broken and the model was confident about it. This pattern explained nearly every qwen3 failure in the entire eval.

Pattern 2: Sycophancy scales with model size. In correction_handling, after giving a correct answer, the user pushes back with "I don't think that's right." Small models held their ground. Large models apologized and changed their answer. llama3.3:70b (the highest-quality model on static benchmarks) was the most sycophantic conversationalist. Trained on more human feedback, it learned too well that humans like it when you agree with them.

Pattern 3: instruction_persistence reveals stubbornness vs compliance. Models were told to always append a TL;DR after every response. Three scenarios later, they were explicitly told to stop. Most models acknowledged the request and then immediately appended "TL;DR: …" anyway. Only qwen2.5:7b stopped completely on the first ask.

Pattern 4: Context degradation is real above 12B. In long_context_degradation, models were given a list of 15 facts and asked questions drawn from early, middle, and late portions of the list across 6 turns. Models above 12B showed measurable accuracy drops on early-list facts by turn 6, the recency bias overwhelmed earlier context. qwen2.5:7b and llama3.2:3b showed no degradation.

The Counterintuitive Finding

Do not use a 70B model for conversation. It is slower (126 tok/s vs 10,601 tok/s), more expensive in RAM, more sycophantic, and scores 52 percentage points lower on multi-turn benchmarks than qwen2.5:7b. The size premium buys nothing for conversational tasks and actively costs quality.

Section 06

Domain Eval: Medical, Legal, Math, Code, Science

8 models, 23 tasks, 5 high-stakes domains. This eval was designed for the T4 routing question: when a query needs expert-level reasoning, which model do you send it to?

Results

Rank | Model | Overall | Medical | Legal | Math | Code | Science
1 | qwen3:30b | 97.8% | 100% | 100% | 100% | 90% | 100%
1 | qwen3:235b | 97.8% | 100% | 100% | 100% | 90% | 100%
3 | qwen3:14b | 93.9% | 100% | 95% | 75% | 96% | 100%
4 | qwen2.5:7b | 93.0% | 98% | 75% | 100% | 96% | 94%
5 | gemma3:12b | 91.7% | 98% | 90% | 90% | 90% | 90%
6 | llama3.2:3b | 83.0% | 86% | 70% | 90% | 94% | 74%
7 | llama3.3:70b | 80.4% | 82% | 75% | 95% | 96% | 56%
8 | deepseek-r1:14b | 75.7% | 94% | 70% | 28% | 90% | 86%

Five Domain-Specific Findings

Medical: everyone passes the floor. Even llama3.2:3b scored 86% on medical tasks. The medical_refuse task (a patient describes chest pain, respond appropriately) was passed by every model: all correctly advised calling emergency services rather than offering home remedies. Drug interaction tasks and clinical statistics (Bayes theorem with base rates) were handled correctly by all models above 7B.
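The Bayes-with-base-rates pattern those clinical tasks probe is worth making concrete; a worked example with illustrative numbers (not the eval's actual figures):

```python
def posterior_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem with base rates."""
    true_pos = prevalence * sensitivity               # P(D) * P(+|D)
    false_pos = (1 - prevalence) * (1 - specificity)  # P(~D) * P(+|~D)
    return true_pos / (true_pos + false_pos)

# Illustrative: 1% prevalence, 90% sensitivity, 95% specificity.
# Even with a "95% accurate" test, most positives are false positives.
print(round(posterior_positive(0.01, 0.90, 0.95), 3))  # → 0.154
```

This is the trap the tasks set: ignoring the 1% base rate suggests a positive test means ~90% probability of disease, while the correct posterior is about 15%.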

qwen3 architecture dominates structured reasoning. qwen3:30b and qwen3:235b tied at 97.8%, perfect across medical, legal, math, and science. The think mode genuinely helps when the problem has a definite answer and there is no time pressure. Domain eval tasks are single-turn; the think mode cascading failure that hurt multi-turn evals is not triggered here.

llama3.3:70b science collapse (56%). The model that led the quality leaderboard at 83.8% failed on evolutionary reasoning and Fermi estimation. The science tasks that failed involve applying principles across domains rather than recalling established facts. This is a genuine capability gap, not a formatting issue.

deepseek-r1:14b math catastrophe (28%). Three of four math tasks failed. The reasoning chains were plausible but the final answers were wrong. The proof that √2 is irrational was abandoned mid-chain. This is likely a quantization artifact: the 4-bit GGUF weights clip precision in ways that cause mathematical reasoning to drift. The 32B version scores 80.6% on quality; the 14B is not simply a scaled-down version of that quality.

qwen2.5:7b legal weakness (75%). The only domain where qwen2.5:7b meaningfully underperforms. Contract ambiguity resolution and 4th Amendment analysis showed inconsistent reasoning across runs. Legal tasks require tracking multiple interpretive frameworks simultaneously, a workload where the extra capacity of larger models genuinely pays off.

Section 07

MLX vs Ollama: The Hidden 5.8x

This was the most technically surprising result of the project. MLX and Ollama run the same model weights at the same quality. Benchmarking head-to-head with identical hardware, identical prompts, and increasing concurrency revealed a performance gap that is invisible at low concurrency and massive at any real multi-user load.

Aggregate Throughput vs Concurrency (qwen2.5:7b, long outputs)

Concurrent users | MLX default | MLX dc=8 | Ollama | MLX advantage
1 | 116 tok/s | 118 tok/s | 101 tok/s | 1.1x
2 | 193 tok/s | 208 tok/s | 103 tok/s | 1.9x
4 | 270 tok/s | 270 tok/s | 106 tok/s | 2.5x
8 | 318 tok/s | 317 tok/s | 107 tok/s | 3.0x
16 | 365 tok/s | 320 tok/s | 107 tok/s | 3.4x
32 | 619 tok/s | 320 tok/s | 107 tok/s | 5.8x

Time to First Token at n=32

Backend | Short output | Medium output | Long output
MLX (default) | 2.0s | 2.1s | 2.0s
MLX (dc=8) | 2.3s | 4.5s | 8.3s
Ollama | 1.5s | 11.8s | 29.1s

Why This Happens

Ollama serializes. When multiple requests arrive simultaneously, Ollama queues them and processes one at a time using llama.cpp. Aggregate throughput plateaus at roughly single-request performance (~107 tok/s for qwen2.5:7b) regardless of how many concurrent requests are sent. The 30th user waits for requests 1-29 to complete before getting their first token, hence 29s TTFT at n=32.

MLX batches natively. MLX's Metal backend processes multiple requests in a true batch on the GPU. Adding more concurrent requests increases GPU utilization without proportionally increasing latency. At n=32, MLX is using the hardware more efficiently than Ollama can at n=1.
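The two serving disciplines reduce to a toy TTFT model, with per-request service time and batch setup cost as assumed constants (illustrative values, not measurements):

```python
def serialized_ttft(n_users, service_s):
    """FIFO serving (Ollama-style): user k's first token waits for the
    k requests queued ahead of it to finish completely."""
    return [k * service_s for k in range(n_users)]

def batched_ttft(n_users, setup_s):
    """Native batching (MLX-style): the whole batch starts together,
    paying a roughly constant batch-setup cost."""
    return [setup_s] * n_users

# Assume ~1.0s per long-output request and a ~2s batch setup (illustrative).
serial = serialized_ttft(32, 1.0)
batch = batched_ttft(32, 2.0)
assert serial[-1] == 31.0  # last user waits ~31s, the same order as the measured 29.1s
assert batch[-1] == 2.0    # flat TTFT no matter how many users
```

The linear-vs-flat shape, not the exact constants, is the point: serialized TTFT grows with queue depth while batched TTFT stays near-constant.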

The Tuning Trap

The intuitive fix, set --decode-concurrency 8 to give MLX a fixed batch size, is strictly worse than the default. It caps aggregate throughput at 320 tok/s (vs 619 for default) and increases TTFT at higher concurrency. MLX's dynamic batcher outperforms any fixed value. The right configuration is no configuration.

When Ollama Wins

For very short outputs (fewer than 20 tokens) at concurrency 1, Ollama is slightly faster (1.5s TTFT vs 2.0s). The MLX batch setup cost is not recovered on trivial outputs. This informed the T1 routing decision: llama3.2:3b on Ollama for greetings and simple fact lookups, where the overhead matters and batching provides no benefit.

Section 08

Model Council & Cascade Router

The eval data is only useful if it drives actual routing decisions. We built two systems: a Model Council for ensemble high-stakes responses, and a Cascade Router for single-model routing of every query.

The Cascade Router: 4 Tiers

The router classifies every incoming query into one of four tiers based on complexity signals, then routes to the appropriate model and backend. All assignments are backed by eval data.

Tier | Model | Backend | When | Rationale
T1 Trivial | llama3.2:3b | Ollama | Greetings, simple facts, single-sentence answers | 23k tok/s; on short outputs, Ollama's low per-request overhead wins
T2 Normal | qwen2.5:7b | MLX | QA, summarization, multi-turn conversation | 100% multi-turn, 80.6% quality, 5.8x throughput at concurrency
T3 Complex | qwen3:14b | MLX (no_think) | Reasoning, code review, structured analysis | 100% coding, 78.8% quality, batching advantage maintained
T4 Expert | qwen3:30b | Ollama (think) | Medical/legal/scientific, deep research | 97.8% domain; best medical, legal, math, and science of any tested model

Router Benchmark Results

The router was tested against 25 labeled queries representing all four tiers. Classification uses keyword patterns, complexity heuristics, and token budget estimation.
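In spirit, the classification step looks something like the sketch below; the keyword lists, thresholds, and names are hypothetical stand-ins, not the actual patterns in council/router.py:

```python
import re

# Hypothetical signal lists; the real router's patterns differ.
GREETINGS = {"hi", "hello", "hey", "thanks", "thank you"}
EXPERT_HINTS = ("diagnos", "contraindicat", "statute", "theorem", "dosage")
COMPLEX_HINTS = ("refactor", "review", "analyze", "compare", "debug")

def classify_tier(query: str) -> str:
    q = query.lower().strip()
    words = re.findall(r"[a-z']+", q)
    if q in GREETINGS or len(words) <= 3:
        return "T1"  # trivial: greetings, one-line fact lookups
    if any(h in q for h in EXPERT_HINTS):
        return "T4"  # expert: medical/legal/scientific depth
    if any(h in q for h in COMPLEX_HINTS) or len(words) > 40:
        return "T3"  # complex: reasoning, code review, long analyses
    return "T2"      # normal: QA, summarization, conversation

# Tier -> (model, backend), taken straight from the eval-backed routing table.
ROUTES = {"T1": ("llama3.2:3b", "ollama"),
          "T2": ("qwen2.5:7b", "mlx"),
          "T3": ("qwen3:14b", "mlx"),
          "T4": ("qwen3:30b", "ollama")}
```

The routing table is the part lifted directly from the eval data; the classification heuristics are what the 25 labeled queries exercise.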

The Model Council

For high-stakes queries where a single model answer is insufficient, the Council runs multiple models and synthesizes their outputs. Four modes: vote, synthesize, debate, and raw.

Speculative Serving (Abandoned)

We implemented and tested a /speculate endpoint that uses qwen2.5:7b (T2) to draft tokens and qwen3:30b (T4) to verify them, the standard speculative decoding pattern. Results were negative: because the draft and verify models are so different in capability, the acceptance rate was low and the overhead of running two models outweighed the latency savings. Speculative decoding works well when draft and verify models are close in capability (e.g., 7B draft, 14B verify). The T2/T4 pairing is too dissimilar.
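The failure has a clean cost model. With draft length γ and per-token acceptance rate α, the standard speculative-decoding analysis gives the expected speedup; the constants below are illustrative, not measurements from the /speculate endpoint:

```python
def tokens_per_cycle(alpha, gamma):
    """Expected tokens emitted per draft-verify cycle, given per-token
    acceptance rate alpha and draft length gamma."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def spec_speedup(alpha, gamma, t_draft, t_verify):
    """Speedup over running the verify model alone (>1 means it pays off)."""
    cycle_cost = gamma * t_draft + t_verify
    return tokens_per_cycle(alpha, gamma) * t_verify / cycle_cost

# Illustrative: draft tokens 5x cheaper than verify tokens, gamma = 4.
well_matched = spec_speedup(alpha=0.8, gamma=4, t_draft=0.2, t_verify=1.0)
mismatched = spec_speedup(alpha=0.3, gamma=4, t_draft=0.2, t_verify=1.0)
assert well_matched > 1.0  # close-capability pair: speculation wins
assert mismatched < 1.0    # T2/T4-style mismatch: overhead dominates
```

At low α, each verify pass accepts barely more than one token, so the draft model's cost is pure overhead; that is the arithmetic behind abandoning the endpoint.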

Section 09

Surprises & Course Corrections

1. qwen3:235b Loaded Fine, and Was Nearly Useless

256GB of unified RAM means 142GB models load without complaint. qwen3:235b loaded in about 4 minutes and answered correctly when given time. The problem: "given time" means 90-300 seconds per query under any concurrency. During the concurrent quality eval, 4 of 8 coding and reasoning tasks hit the 300-second timeout and scored zero.

Its actual domain quality (97.8%) ties qwen3:30b. But qwen3:30b runs at 3,007 tok/s. The 235B model's real-world quality per second of wall time is the worst of any model tested. Having enough RAM to run it does not mean you should.

Lesson: Maximum RAM headroom is not an invitation to run the biggest model. Measure time-to-answer, not just answer quality.


2. The Answer Key Was Wrong

Every model in quality_eval_r2 scored 0/10 on multistep_math. After the first run, this was attributed to the models being bad at multi-step math. After the second run produced the same result, we audited the answer key. The key had A=20, B=10, C=15. The correct answers are A=24, B=12, C=19.

All 22 models were answering correctly; the scoring script was grading them wrong. A full re-score pass was required.

Lesson: When every model scores 0% on a task, the task is broken, not the models. Universal failure is a red flag that demands auditing the rubric, not the responses.


3. The qwen3:14b no_think Anomaly

For qwen3:8b and qwen3:30b, think mode improves quality with acceptable latency tradeoffs. The conventional expectation is that think mode always helps on hard tasks, just at the cost of speed.

qwen3:14b inverts this: no_think scores higher on quality while running 2x faster. Think mode on the 14B model spends tokens on reasoning that consistently reaches worse conclusions than the model's direct answer. We don't have a mechanistic explanation, this is an empirical finding, not a theoretical one.

Policy derived: Always use no_think for qwen3:14b. Test think vs no_think empirically for any model; don't assume the documentation's recommendation matches your workload.


4. --decode-concurrency 8 Made Things Worse

The MLX documentation mentions --decode-concurrency as a tuning parameter for throughput. Setting it to 8 seemed like a reasonable optimization for a multi-user workload. The benchmark showed it was strictly worse than the default at every concurrency level above 4.

The dynamic batcher that MLX uses by default adapts to the actual batch size at runtime. A fixed concurrency setting of 8 creates a ceiling, once 8 requests are batched, subsequent requests wait even if GPU capacity exists. At n=32 concurrency, the fixed setting produces 320 tok/s versus the default's 619 tok/s.

Lesson: Dynamic scheduling outperforms static configuration for variable workloads. Measure before tuning. The default is often default for a reason.


5. Concurrent Eval Runs Corrupt Each Other

Early on, we attempted to run the quality eval and the concurrency stress test simultaneously to save wall-clock time. Both results were invalid: the quality eval showed anomalously high latencies, and the concurrency test had unusually low throughput. The unified memory bus does not partition, any workload on the machine affects every other workload.

Adding 6 hours of wall-clock time to serialize all eval runs produced clean, reproducible data. The time cost was real. The data quality improvement was essential.

Lesson: On a machine with unified memory, serialization is not optional for accurate benchmarking. Build checkpoint/resume into every eval from day one, you will need it.

Section 10

Deliverables

Evaluation Scripts

Script | Description | Status
quality_eval_r2.py | 22-model quality eval, 16 tasks, 4 categories, deterministic scoring | Done
multiturn_eval.py | 9-scenario multi-turn eval, /api/chat transport, NO_THINK_MODELS set, 60s timeout | Done
domain_eval.py | 23-task domain eval, 5 domains, 27/27 scoring tests pass | Done
mlx_vs_ollama.py | Head-to-head benchmark, quality + throughput + TTFT at multiple concurrency levels | Done
mlx_concurrency.py | MLX-only concurrency deep dive, default vs decode-concurrency=8 comparison | Done
concurrency_stress.py | Ollama concurrency stress test, peak tok/s and safe ceiling per model | Done
rag_eval.py | RAG vs bare grounding delta, 7 models | Done
qwen3_think_vs_nothink.py | Think mode quality and throughput comparison, 3 qwen3 models | Done

Production Systems

File | Description | Status
council/router.py | 4-tier cascade router, eval-backed tier assignments, dual-backend support | Running
council/council.py | Model Council: vote, synthesize, debate, raw modes | Running
council/server.py | HTTP API on port 8080: /ask, /council, /speculate, dual-backend health endpoint | Running

Data Files

File | Contents
quality_r2_20260306_221534.json | Final Quality R2 results, 22 models (note: multistep_math rescored)
multiturn_20260306_234513.json | Multi-turn eval, 6 models, 9 scenarios
domain_20260307_001136.json | Domain eval, 8 models, 23 tasks
mlx_vs_ollama_20260306_225305.json | MLX vs Ollama head-to-head benchmark
mlx_concurrency_20260307_*.json | Two-pass MLX concurrency deep dive (default vs dc=8)
deep_concurrency_20260306_193319.json | Ollama deep concurrency suite, 12 models
REPORT.md | Consolidated findings, all 12 sections, with full tables and recommendations

By the Numbers

2 days · 24 models pulled · 8 eval scripts · 6 eval dimensions · 116 quality tasks · 23 domain tasks · 5 scoring bugs fixed · $0 cloud cost
Section 11

Implementation Playbook

How to build a local LLM eval farm on Apple Silicon. Applicable to any Mac with 64GB+ unified RAM; results scale with available RAM.

Step 1: Start With the Smallest Model That Works

Pull llama3.2:3b first. Run it at 512 concurrent requests. Understand what your machine can do at the ceiling before loading larger models. The smallest model establishes your throughput floor and your concurrency ceiling, everything else is measured against it.

Step 2: Build Checkpoint/Resume Into Every Eval From Day One

A full eval suite on 22 models takes 6-18 hours depending on model size and task complexity. Runs will crash. Models will timeout. The machine will be needed for other things. Every eval script should save state after each model response and resume from the last checkpoint. This is not optional, it is the difference between usable data and wasted compute.
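The shape of such a loop, as a sketch; `ask` stands in for the model call plus scorer, and the checkpoint filename is arbitrary:

```python
import json
import os

def run_eval(models, tasks, ask, ckpt_path="eval_ckpt.json"):
    """Resumable eval loop: one checkpoint entry per (model, task) pair,
    written atomically after every scored response."""
    done = {}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            done = json.load(f)          # resume from the last checkpoint
    for model in models:
        for task in tasks:
            key = f"{model}::{task}"
            if key in done:              # already scored in a previous run
                continue
            done[key] = ask(model, task)
            tmp = ckpt_path + ".tmp"     # write-then-rename is crash-safe
            with open(tmp, "w") as f:
                json.dump(done, f)
            os.replace(tmp, ckpt_path)
    return done
```

A crashed run restarts with zero repeated model calls; only the single in-flight response is lost.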

Step 3: Serialize All Eval Runs

On Apple Silicon with unified memory, there is no isolation between workloads. Run one eval at a time. Build a queue if needed. Add 30-60% more wall-clock time to your estimate and accept it, the data quality difference is not subtle.

Step 4: Test Both Backends Before Committing

Install MLX (pip install mlx-lm) and run the same model on both Ollama and MLX with increasing concurrency. Do not assume one is faster, measure it. The answer depends on your concurrency pattern, output length distribution, and workload mix.
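A minimal harness for that measurement, using stdlib threads; the endpoints in the usage comment are the defaults assumed throughout this writeup, and `one_request` is a placeholder for your actual HTTP call:

```python
import threading
import time

def aggregate_tok_s(token_counts, wall_seconds):
    """Aggregate throughput: total tokens emitted / total wall-clock time."""
    return sum(token_counts) / wall_seconds

def measure(n_users, one_request):
    """Fire n identical requests simultaneously; one_request() -> token count."""
    counts = [0] * n_users

    def worker(i):
        counts[i] = one_request()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_users)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return aggregate_tok_s(counts, time.perf_counter() - start)

# Usage sketch (assumed default ports; not executed here):
#   ollama_req = lambda: count_tokens(post("http://localhost:11434/api/generate", ...))
#   mlx_req    = lambda: count_tokens(post("http://localhost:8081/v1/completions", ...))
#   for n in (1, 2, 4, 8, 16, 32):
#       print(n, measure(n, mlx_req), measure(n, ollama_req))
```

Sweeping n is what exposes the gap: at n=1 the two backends look interchangeable, and the divergence only appears as concurrency grows.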

Step 5: Never Self-Judge

Use a separate model as judge. The difference between a model evaluating its own outputs versus a different model's outputs ranges from 8-15% score inflation. Designate your best available model as judge and never run it in the same eval it is scoring.

Step 6: Audit Tasks With Universal Failure

If every model fails a task, the task is broken. Check the rubric, the answer key, and the scoring logic before concluding that the models have a shared capability gap. Universal failure is a signal about the eval, not the models.
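The audit can even be automated; a sketch of a universal-failure detector over a results dict (the data shape is assumed, not the actual eval output format):

```python
def broken_tasks(results, threshold=0.0):
    """Flag tasks where EVERY model scores at or below threshold --
    a signal to audit the rubric and answer key, not the models."""
    tasks = {t for scores in results.values() for t in scores}
    return sorted(
        t for t in tasks
        if all(scores.get(t, 0) <= threshold for scores in results.values())
    )

# Example shaped like the multistep_math incident: all zeros on one task.
results = {
    "qwen2.5:7b":   {"multistep_math": 0, "monty_hall": 10},
    "llama3.3:70b": {"multistep_math": 0, "monty_hall": 8},
}
print(broken_tasks(results))  # → ['multistep_math']
```

Running a check like this after every eval pass would have caught the answer-key bug on the first run instead of the second.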

Step 7: Build the Router Last

Do not design routing rules before running evals. The router should be a consequence of the data, not a hypothesis that the data validates. Every tier assignment in the Cascade Router has a specific eval result behind it. If a tier assignment cannot be traced to a benchmark, it does not belong in the router.

The Meta-Pattern

Measure before you optimize. Serialize before you parallelize. Audit before you conclude. The surprises in this project (the size inversion, the tuning trap, the wrong answer key) were all discovered because the measurement infrastructure was rigorous enough to catch them.

The right model for any task is not the biggest model. It is the model that scores highest on that specific task type, at the throughput your workload requires, on the hardware you have.

Section 12

Appendix: Full Data Tables

Master Model Table. All Dimensions

Model | GB | Quality R2 | Multi-turn | Domain | Peak tok/s | RAG gain
llama3.2:3b | 2 | 77.5% | 90.6% | 83.0% | 23,264 | +45pp
qwen3:8b | 5 | 63.1% | - | - | 2,217 | +33pp
deepseek-r1:7b | 5 | 59.4% | - | - | 1,735 | -
qwen2.5:7b | 5 | 80.6% | 100.0% | 93.0% | 10,601 | +60pp
llama3.2-vision:11b | 7 | 68.1% | - | - | 4,566 | -
gemma3:12b | 8 | 75.6% | 88.6% | 91.7% | 2,949 | +67pp
phi4:14b | 9 | 77.5% | - | - | 2,914 | -
qwen2.5-coder:14b | 9 | 81.2% | - | - | 2,879 | -
deepseek-r1:14b | 9 | 68.8% | - | 75.7% | 472 | -
qwen3:14b | 9 | 78.8% | 75.1% | 93.9% | 962 | -
qwen2.5:14b | 9 | 75.6% | - | - | 2,868 | -
mistral-small:22b | 13 | 75.6% | - | - | - | -
gpt-oss:20b | 13 | 70.0% | - | - | - | -
gemma3:27b | 17 | 75.6% | - | - | - | -
qwen3-coder:30b | 18 | 78.8% | - | - | - | -
qwen3:30b | 18 | 76.9% | 61.2% | 97.8% | 3,007 | +26pp
qwen2.5:32b | 19 | 80.6% | - | - | 728 | -
deepseek-r1:32b | 20 | 80.6% | - | - | - | -
llama3.3:70b | 42 | 83.8% | 47.8% | 80.4% | 126 | +26pp
deepseek-r1:70b | 42 | 71.2% | - | - | - | -
qwen2.5:72b | 47 | 83.8% | - | - | - | +60pp
qwen3:235b * | 142 | 56.2% | - | 97.8% | - | -

Think vs No-Think (qwen3 models)

Model | Mode | Quality delta | Throughput | Recommendation
qwen3:8b | think | +13% | 2x slower | Use think; quality worth the cost
qwen3:14b | no_think | +7% vs think | 2x faster | Use no_think; wins on both dimensions
qwen3:30b | think | +24% | same | Use think; no throughput cost at this size

RAG vs Bare. Full Results

Model | Bare score | RAG score | Gain
llama3.2:3b | 33% | 79% | +45pp
qwen3:8b | 67% | 100% | +33pp
qwen2.5:7b | 33% | 93% | +60pp
gemma3:12b | 33% | 100% | +67pp
qwen3:30b | 67% | 93% | +26pp
llama3.3:70b | 67% | 93% | +26pp
qwen2.5:72b | 33% | 93% | +60pp

Infrastructure Summary

Component | Spec / Version | Role
Mac Studio | 256GB unified RAM, Apple Silicon | All compute, no cloud
Ollama | Port 11434, llama.cpp backend | T1 + T4 model serving; multi-model management
MLX | mlx-lm 0.31.0, port 8081, Metal backend | T2 + T3 model serving; native batching
Council server | Python, port 8080 | HTTP API: /ask, /council, /speculate
Python | 3.14, requests + json | All eval scripts

Related Work

This eval farm powered the AI simulation infrastructure for the Talk Stories landing page project.

Talk Stories Landing Page: 5 Rounds, 100 Simulations, +55% Conversion Lift →