You cannot decide between open and closed models with adjectives. "Faster," "smarter," and "cheaper" are claims, not measurements, and the only way to compare two models honestly is to instrument both against the same workload and read the numbers. The teams that get this wrong almost always skipped this step.
This guide defines the metrics that actually separate open from closed options, shows how to instrument them, and explains how to read the signal when the numbers conflict — because they will. Quality, cost, and latency rarely point in the same direction, and the art is in weighting them for your use case.
Quality Metrics: Measure on Your Data, Not Theirs
Public benchmarks tell you how a model performs on someone else's test set. They tell you almost nothing about your task.
Build a task-specific eval set
Collect 50 to 200 real examples from your actual workload — real prompts, real expected outputs. This is your ground truth. Run every candidate model against it. A model that tops MMLU can still fumble your specific extraction task, and a smaller open model can quietly beat a frontier closed one on a narrow domain you have data for.
Choose a scoring method that fits the task
- Exact match or F1 for classification and structured extraction, where there is one right answer.
- LLM-as-judge for open-ended generation, where a stronger model grades outputs against a rubric. It is cheap and fast, and reliable enough when the rubric is specific and you spot-check the judge's grades against a sample of human labels.
- Human review for the highest-stakes outputs, sampled rather than exhaustive.
The step-by-step guide covers building this eval harness in detail.
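As a rough illustration of the exact-match and F1 options, here is a minimal Python sketch of a scorer for a small eval set. The function names, the whitespace tokenization, and the `predict` callable (standing in for whichever model client you use) are assumptions made for this example, not part of any particular library.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def exact_match(predicted: str, expected: str) -> bool:
    """True when the strings match after trimming whitespace and lowercasing."""
    return predicted.strip().lower() == expected.strip().lower()


def token_f1(predicted: str, expected: str) -> float:
    """Token-overlap F1 on lowercased, whitespace-split tokens."""
    pred_tokens = predicted.lower().split()
    gold_tokens = expected.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return 1.0 if pred_tokens == gold_tokens else 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score_eval_set(
    examples: List[Tuple[str, str]],
    predict: Callable[[str], str],
) -> Dict[str, float]:
    """Run `predict` once per (prompt, expected) pair and average both scores."""
    if not examples:
        raise ValueError("eval set is empty")
    outputs = [(predict(prompt), expected) for prompt, expected in examples]
    return {
        "exact_match": sum(exact_match(o, e) for o, e in outputs) / len(outputs),
        "mean_f1": sum(token_f1(o, e) for o, e in outputs) / len(outputs),
    }
```

Running the same `score_eval_set` call with a different `predict` function per candidate keeps the comparison on identical examples and identical scoring rules.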
Cost Metrics: Get Past the Sticker Price
Cost per successful task, not per token
Per-token pricing is a trap. A cheaper model that needs three retries or longer prompts to hit acceptable quality can cost more per successful task than an expensive one that nails it first try. Always normalize to cost per completed unit of work.
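The normalization can be sketched in a few lines of Python. Every price, token count, retry count, and success rate below is a hypothetical placeholder chosen to illustrate the arithmetic; substitute figures measured on your own workload.

```python
def cost_per_successful_task(
    price_in_per_mtok: float,   # USD per million input tokens
    price_out_per_mtok: float,  # USD per million output tokens
    avg_input_tokens: float,
    avg_output_tokens: float,
    avg_attempts: float,        # average calls made per task, retries included
    success_rate: float,        # fraction of tasks that end in an acceptable result
) -> float:
    """Average spend per task divided by the fraction of tasks that succeed."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = (
        avg_input_tokens * price_in_per_mtok
        + avg_output_tokens * price_out_per_mtok
    ) / 1_000_000
    return cost_per_attempt * avg_attempts / success_rate


# Hypothetical "cheap" model: low prices, longer prompts, three attempts per task.
cheap = cost_per_successful_task(0.50, 1.50, 2000, 300, 3, 0.80)

# Hypothetical "expensive" model: higher prices, short prompt, one attempt per task.
expensive = cost_per_successful_task(2.50, 10.00, 500, 300, 1, 0.95)

print(f"cheap: ${cheap:.6f} per success, expensive: ${expensive:.6f} per success")
```

With these made-up inputs the model with the lower sticker price ends up costing more per successful task, which is the effect the paragraph above describes.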
Total cost of ownership for open models
Self-hosted open models have no per-token line item, so people assume they are free. They are not. Track GPU rental or amortized hardware, idle-time waste from low utilization, and the engineering hours to keep inference healthy. Divide all of it by tokens served to get a real per-token number you can compare to a closed API. The ROI breakdown shows the full calculation.
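A minimal sketch of that division, assuming GPU time is billed for every hour the instance is up (so idle hours are already inside the GPU bill) and that engineering time is valued at a flat hourly rate. All figures are hypothetical.

```python
def effective_cost_per_million_tokens(
    gpu_hours_billed: float,     # every billed hour, idle or busy
    gpu_hourly_rate: float,      # USD per GPU-hour (rental or amortized hardware)
    engineering_hours: float,    # time spent keeping inference healthy
    engineering_hourly_rate: float,
    tokens_served: float,        # tokens actually served in the same period
) -> float:
    """Total spend for the period divided by tokens served, per million tokens."""
    if tokens_served <= 0:
        raise ValueError("tokens_served must be positive")
    total_cost = (
        gpu_hours_billed * gpu_hourly_rate
        + engineering_hours * engineering_hourly_rate
    )
    return total_cost / tokens_served * 1_000_000


# Same hypothetical monthly bill at two utilization levels.
busy = effective_cost_per_million_tokens(720, 2.50, 20, 100, 2_000_000_000)
quiet = effective_cost_per_million_tokens(720, 2.50, 20, 100, 200_000_000)

print(f"busy: ${busy:.2f}/Mtok, quiet: ${quiet:.2f}/Mtok")
```

Because the bill is fixed while the denominator shrinks, the effective rate scales inversely with utilization: in this example, a tenth of the traffic means ten times the per-token cost.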
Latency and Throughput
The four numbers worth tracking
- Time to first token (TTFT): What the user feels as responsiveness, especially in streaming UIs.
- Tokens per second: Generation speed once output starts, which sets how long the full response takes to complete.
- P95 and P99 latency: Tail latency, not averages — the slow requests are what generate complaints.
- Throughput under concurrency: How the model holds up when 50 requests arrive at once.
Closed APIs abstract throughput away but cap your concurrency and can throttle you. Self-hosted open models give you full control of throughput but make it your job to provision for peak load.
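The latency numbers above can be captured with very little code. The sketch below assumes your client exposes the response as an iterable of streamed chunks (chunks are not always exactly one token, so treat the rate as an approximation) and uses the nearest-rank definition of a percentile; both helper names are made up for this example.

```python
import math
import time
from typing import Dict, Iterable, List, Optional


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def measure_stream(chunks: Iterable[str]) -> Dict[str, Optional[float]]:
    """Consume a streamed response and time the first chunk and the whole stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in chunks:
        if ttft is None:
            # Time to first token: delay until the first chunk arrives.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {
        "ttft_s": ttft,
        "total_s": total,
        "chunks": count,
        "chunks_per_s": count / total if total > 0 else 0.0,
    }
```

Collect `total_s` across many requests, then report `percentile(latencies, 95)` and `percentile(latencies, 99)` alongside the mean. To measure throughput under concurrency, run the same measurement from many workers at once and compare the tail values against the single-request baseline.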
Operational and Reliability Metrics
Availability and rate limits
For closed models, track provider uptime, rate-limit rejections, and how often you hit quota ceilings. For open self-hosted models, track your own uptime, GPU failures, and cold-start times. A model that is 2% better on quality but rate-limited during your peak hours is a worse choice.
Reproducibility and drift
Closed model versions can change underneath you, shifting outputs without warning. Pin versions where the provider allows it, and re-run your eval set on every version bump. Open weights are frozen: as long as you also pin the serving stack (inference engine, quantization, sampling settings), what you tested is what you keep running, which is a genuine advantage for regulated or audit-heavy workloads.
Putting It Together: The Scorecard
Build a single weighted scorecard. Assign each metric a weight that reflects your priorities, score each candidate model, and let the total decide. A customer-facing chat product might weight latency and quality heavily. A nightly batch pipeline might weight cost above all and barely care about TTFT.
The discipline is forcing yourself to assign the weights before you see the scores. Decide what matters first, then measure — otherwise you will rationalize whichever model you already preferred. The framework article provides a scorecard template you can adapt.
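One way to sketch the scorecard in Python, assuming every metric has already been normalized to a 0-to-1 scale where higher is better (so low cost and low latency map to high scores). The metric names, weights, and candidate scores are hypothetical.

```python
from typing import Dict

# Fix the weights first, before any candidate has been scored.
WEIGHTS: Dict[str, float] = {
    "quality": 0.4,
    "cost": 0.3,
    "latency": 0.2,
    "reliability": 0.1,
}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of normalized scores; metric names must match exactly."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same metrics")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Hypothetical normalized results for two candidates.
candidates = {
    "open-candidate": {"quality": 0.82, "cost": 0.90, "latency": 0.70, "reliability": 0.80},
    "closed-incumbent": {"quality": 0.90, "cost": 0.50, "latency": 0.80, "reliability": 0.90},
}

totals = {name: weighted_score(s, WEIGHTS) for name, s in candidates.items()}
for name, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {total:.3f}")
```

Keeping `WEIGHTS` in version control, committed before the evaluation run, gives you a record that the priorities were set before the scores were known.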
Instrumenting the Measurement in Practice
Defining metrics is half the job; capturing them reliably is the other half. The teams that measure well treat instrumentation as a permanent fixture, not a one-time bake-off.
Log every request with its full context
For each model call, record the input, the output, latency (TTFT and total), token counts, the model and version, and whether the result passed your quality check. This turns every production request into a data point. When you later evaluate a new open model, you already have a real distribution of traffic to replay against it instead of a synthetic test.
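A possible shape for that per-request record, sketched as a Python dataclass serialized to one JSON line per call. The field names and the JSON Lines format are assumptions for illustration; adapt them to whatever logging pipeline you already run, and apply your own redaction rules to prompts and outputs.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ModelCallRecord:
    model: str
    model_version: str
    prompt: str
    output: str
    ttft_ms: float
    total_ms: float
    input_tokens: int
    output_tokens: int
    # None means the quality check has not been run for this request yet.
    passed_quality_check: Optional[bool] = None
    timestamp: str = field(default_factory=_utc_now)


def to_json_line(record: ModelCallRecord) -> str:
    """Serialize one record as a single JSON line, ready to append to a log."""
    return json.dumps(asdict(record), ensure_ascii=False)


record = ModelCallRecord(
    model="example-model",       # hypothetical identifiers
    model_version="2024-01-01",
    prompt="Extract the invoice total.",
    output="42.00",
    ttft_ms=180.0,
    total_ms=950.0,
    input_tokens=12,
    output_tokens=3,
    passed_quality_check=True,
)
print(to_json_line(record))
```

Because each line carries the prompt, the model version, and the pass/fail result, the same log can later be filtered and replayed against a new candidate.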
Run candidates in shadow mode
The most honest comparison sends real production traffic to a candidate model in parallel with the live one, scores both, and ships nothing user-facing until the numbers justify it. Shadow evaluation surfaces the messy reality — the long-tail inputs, the format quirks, the latency under real concurrency — that a curated eval set misses. It is the single most reliable way to compare an open candidate against your incumbent closed model.
Watch the metrics that move slowly
Some signals only appear over time: gradual quality drift after a closed-model version change, creeping cost as prompts grow, or rising tail latency as traffic increases. Set thresholds and alerts on these, not just point-in-time checks. The common mistakes guide covers the measurement traps teams fall into here.
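A threshold check for slow-moving metrics can be as simple as comparing a recent window against a baseline window. The sketch below is one hypothetical way to do it; the metric names, the window contents, and the threshold values are placeholders rather than recommendations.

```python
from statistics import mean
from typing import Dict, List, Sequence


def relative_change(baseline: Sequence[float], recent: Sequence[float]) -> float:
    """Signed change of the recent mean relative to the baseline mean."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    return (mean(recent) - base) / abs(base)


def check_thresholds(
    baseline: Dict[str, Sequence[float]],
    recent: Dict[str, Sequence[float]],
    limits: Dict[str, float],
) -> List[str]:
    """Return the metrics whose relative change crossed their limit.

    A positive limit alerts on increases (cost, latency); a negative limit
    alerts on decreases (quality pass rate).
    """
    alerts = []
    for name, limit in limits.items():
        change = relative_change(baseline[name], recent[name])
        crossed = change >= limit if limit > 0 else change <= limit
        if crossed:
            alerts.append(f"{name}: {change:+.1%} (limit {limit:+.0%})")
    return alerts


# Hypothetical weekly aggregates.
baseline = {"pass_rate": [0.90, 0.91], "p95_latency_s": [1.0, 1.0], "cost_per_task": [0.010, 0.010]}
recent = {"pass_rate": [0.84, 0.85], "p95_latency_s": [1.05, 1.05], "cost_per_task": [0.013, 0.013]}
limits = {"pass_rate": -0.05, "p95_latency_s": 0.20, "cost_per_task": 0.20}

alerts = check_thresholds(baseline, recent, limits)
```

In this made-up data the pass rate and cost per task cross their limits while P95 latency does not, so only the first two would raise an alert.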
Reading Conflicting Signals
The hard part is rarely a single number. It is deciding what to do when quality, cost, and latency disagree. A cheaper open model that is 3% worse on quality but half the cost and faster might be the right call for a batch pipeline and the wrong call for a high-stakes customer interaction. There is no universal formula; the weighted scorecard exists precisely to force you to encode your priorities in advance, so the trade-off is settled by weights you chose before seeing the results. When two models are within noise on your scorecard, default to the one with lower operational risk, usually the one you already run well.
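One common way to judge whether a gap is "within noise" is a paired bootstrap over per-example scores. The sketch below assumes both models were scored on the same examples in the same order; the resample count, the 95% interval, and the fixed seed are illustrative defaults, not prescriptions.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of (score_a - score_b)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the per-example differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return low, high


def within_noise(scores_a: Sequence[float], scores_b: Sequence[float]) -> bool:
    """True when the interval for the mean difference includes zero."""
    low, high = paired_bootstrap_ci(scores_a, scores_b)
    return low <= 0.0 <= high
```

If `within_noise` returns `True` for your two candidates, the eval set cannot distinguish them, and the operational-risk tiebreak described above applies.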
Frequently Asked Questions
Why not just use public benchmark scores?
Public benchmarks measure performance on standardized test sets that rarely resemble your task. A model can rank first on a leaderboard and still underperform on your specific extraction, classification, or domain-specific generation. Always validate on an eval set built from your own real examples.
What is the single most important metric?
Cost per successful task, because it folds quality, retries, and pricing into one comparable number. A model that is cheap per token but frequently wrong is expensive per outcome. That said, for real-time user-facing products latency acts as a hard constraint: a model that misses your latency budget is out regardless of its cost per task.
How big should my eval set be?
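Between 50 and 200 real examples is usually enough to separate clearly good candidates from clearly bad ones. Below 50 the results are too noisy to trust; above 200 you hit diminishing returns for early decisions. Keep in mind that a set this size cannot reliably resolve gaps of only a few percentage points, so treat small differences as ties until you have more data. Grow the set over time as you discover edge cases in production.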
How do I compare a self-hosted open model's cost to a closed API?
Sum your GPU costs, idle waste, and engineering time over a period, then divide by tokens served in that period. That gives an effective per-token cost you can compare directly to the closed provider's published rate. Most teams are surprised by how high the open number runs at low utilization.
Key Takeaways
- Decide metric weights before you see results, then measure on your own data.
- Cost per successful task beats cost per token every time.
- Track TTFT, tokens per second, and P95/P99 — not just averages.
- Open weights win on reproducibility; closed models can drift under you.
- A weighted scorecard turns a subjective debate into a decision.