Self-consistency makes a specific promise: sample several reasoning paths, vote, and the majority will be right more often than any single sample. Whether that promise holds for your task is an empirical question, and you answer it with measurement, not faith. Teams that adopt the technique on reputation and never instrument it often pay for sampling that buys nothing.
The good news is that the technique generates rich telemetry by construction. Every request produces multiple answers, which means you can observe agreement, the distribution of votes, and how the winning margin relates to correctness. That data tells you whether to add samples, cut them, or abandon voting entirely for a given workflow.
This guide defines the KPIs that matter, explains how to instrument them without building a research lab, and shows how to read the signal so your sample count is a decision rather than a habit.
The Metrics That Actually Matter
Plenty of numbers are available; only a few drive decisions. Focus your dashboard on these.
Accuracy lift over single-shot
The headline metric. Compare voted accuracy against the accuracy of one sample on the same labeled set. If the lift is small, voting is not earning its cost, full stop. This is the number that justifies the whole technique.
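As a minimal sketch of that comparison, the snippet below scores a labeled set both ways. The function names and the toy records are illustrative assumptions, not part of any library, and single-shot accuracy is estimated here as the mean accuracy of the individual samples, which is one reasonable choice among several.

```python
from collections import Counter

def majority_vote(samples):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(samples).most_common(1)[0][0]

def accuracy_lift(records):
    """records: list of (samples, gold) pairs, where samples is a list of answers.

    Single-shot accuracy is estimated as the mean per-sample accuracy,
    which is less noisy than scoring only the first sample.
    """
    voted_hits = 0
    single_hits = 0.0
    for samples, gold in records:
        voted_hits += majority_vote(samples) == gold
        single_hits += sum(s == gold for s in samples) / len(samples)
    n = len(records)
    voted = voted_hits / n
    single = single_hits / n
    return voted, single, voted - single

# Toy labeled set (illustrative values only, not real results).
records = [
    (["42", "42", "41"], "42"),
    (["7", "8", "7"], "7"),
    (["a", "b", "c"], "b"),
    (["x", "x", "x"], "x"),
]
voted, single, lift = accuracy_lift(records)
print(f"voted={voted:.2f} single={single:.2f} lift={lift:+.2f}")
```

On a real evaluation set, answers usually need normalizing (whitespace, units, casing) before they are compared, or the vote will split on formatting rather than substance.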
Agreement rate
The fraction of requests where samples converge on the same answer. High agreement means voting has little left to add, usually because the task is easy for the model; low agreement means samples scatter and voting has room to help. Agreement is your early warning that sample count needs tuning.
Winning margin
How decisively the majority wins, for example three of five versus five of five. Narrow margins often correlate with errors, which can make margin a useful confidence signal you can route on, escalating low-margin answers to review.
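Both agreement and margin fall out of the same vote count. The sketch below assumes answers are already normalized strings. The names `vote_stats` and `agreement_rate` are hypothetical, and treating "agreement" as unanimity is one possible definition. A looser threshold on the top answer's share works the same way.

```python
from collections import Counter

def vote_stats(samples):
    """Summarize one request's samples."""
    counts = Counter(samples).most_common()
    winner, top = counts[0]
    runner_up = counts[1][1] if len(counts) > 1 else 0
    n = len(samples)
    return {
        "winner": winner,
        "top_share": top / n,           # e.g. 3 of 5 -> 0.6
        "lead": (top - runner_up) / n,  # gap over second place
        "unanimous": top == n,
    }

def agreement_rate(batches):
    """Fraction of requests whose samples all agree."""
    return sum(vote_stats(s)["unanimous"] for s in batches) / len(batches)

stats = vote_stats(["a", "a", "a", "b", "c"])
rate = agreement_rate([["a", "a", "a"], ["a", "b", "a"]])
print(stats, rate)
```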
Cost per resolved request
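Total token cost across all samples, divided by the number of requests. With a fixed sample count, this is roughly the tokens per sample multiplied by the sample count. Because self-consistency is expensive by design, this number anchors every other decision and feeds directly into the business case for the technique.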
Margin-accuracy correlation
A second-order metric that pays for itself: how well the winning margin predicts correctness. Plot accuracy against margin buckets and you learn whether a narrow majority really is riskier than a decisive one. If the correlation is strong, the margin becomes a reliable confidence signal you can route on. If it is weak, you know not to trust it, which is itself valuable to learn before you build escalation logic on a signal that does not hold.
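One way to build that plot's underlying table is to group labeled results by vote split and compute accuracy per group. This is a sketch under assumptions: the row format and the function name are made up for illustration, and the toy rows are not real measurements.

```python
from collections import defaultdict

def accuracy_by_margin(rows):
    """rows: iterable of (top_votes, n_samples, is_correct) from the labeled set."""
    buckets = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for top, n, ok in rows:
        label = f"{top}/{n}"
        buckets[label][0] += int(ok)
        buckets[label][1] += 1
    return {label: c / t for label, (c, t) in sorted(buckets.items())}

# Toy rows (illustrative only).
rows = [
    (3, 5, False), (3, 5, True),
    (4, 5, True), (4, 5, True),
    (5, 5, True), (5, 5, True),
]
table = accuracy_by_margin(rows)
print(table)
```

Buckets with only a handful of requests give unstable accuracy estimates, so it is worth reporting the count per bucket alongside the accuracy before trusting the trend.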
Cost per correct answer
Cost per resolved request is incomplete, because a cheap wrong answer is not a bargain. Dividing total cost by the number of correct answers, rather than total answers, gives you the metric that actually trades off against value. Two configurations can have identical cost per request but very different cost per correct answer, and the second number is the one that should drive your sample-count decision.
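The difference between the two cost metrics is easy to see with a small calculation. The sketch below assumes a flat, hypothetical price per thousand tokens and a made-up request format; real pricing often differs between input and output tokens.

```python
def cost_metrics(requests, price_per_1k_tokens):
    """requests: list of dicts with 'sample_tokens' (list of int) and 'correct' (bool)."""
    total_tokens = sum(sum(r["sample_tokens"]) for r in requests)
    total_cost = total_tokens / 1000 * price_per_1k_tokens
    n_correct = sum(r["correct"] for r in requests)
    return {
        "cost_per_request": total_cost / len(requests),
        # No correct answers means the cost per correct answer is unbounded.
        "cost_per_correct": total_cost / n_correct if n_correct else float("inf"),
    }

# Two toy configurations with the same spend but different accuracy.
config_a = [{"sample_tokens": [200] * 5, "correct": c} for c in (True, False, True, False)]
config_b = [{"sample_tokens": [200] * 5, "correct": c} for c in (True, True, True, True)]

a = cost_metrics(config_a, price_per_1k_tokens=1.0)
b = cost_metrics(config_b, price_per_1k_tokens=1.0)
print(a, b)
```

Both configurations cost the same per request, but the first pays twice as much for each correct answer.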
Instrumenting Without a Research Lab
You do not need an academic harness to measure these. You need discipline about logging.
Log every sample, not just the winner
The common mistake is recording only the aggregated answer. Without the individual samples you cannot compute agreement or margin after the fact. Log all of them, with their token counts, from day one.
Maintain a labeled evaluation set
Accuracy lift requires ground truth. A modest set of a few hundred labeled examples, kept current, lets you compare configurations honestly. This set is the single most valuable asset in your self-consistency program.
Track configuration alongside results
Sample count, temperature, and model version all change the numbers. Tag every logged result with its configuration so you can compare runs rather than averaging incomparable settings together.
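A log record that satisfies the points so far might look like the sketch below: every sample with its token count, plus the configuration tags. The class and field names are hypothetical, and JSON Lines is only one convenient storage format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SampleLog:
    answer: str
    tokens: int

@dataclass
class VoteRecord:
    request_id: str
    samples: list        # list of SampleLog, one per sampled reasoning path
    winner: str
    # Configuration tags, so runs can be compared like for like.
    n_samples: int
    temperature: float
    model_version: str

def to_jsonl(record):
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)

record = VoteRecord(
    request_id="req-001",
    samples=[SampleLog("42", 120), SampleLog("42", 135), SampleLog("41", 118)],
    winner="42",
    n_samples=3,
    temperature=0.7,
    model_version="example-model-v1",  # placeholder value
)
line = to_jsonl(record)
print(line)
```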
Separate offline evaluation from production monitoring
Offline, you tune sample count against the labeled set. In production, you cannot measure accuracy directly, so you watch agreement and margin as proxies and alert when they drift. The two loops use different metrics for different jobs.
Sample with the production distribution, not a clean one
A subtle instrumentation failure is evaluating on a tidy, balanced dataset that does not resemble live traffic. Self-consistency can look great on textbook examples and mediocre on the messy edge cases that dominate production. Whenever possible, draw your labeled set from real inputs, including the ambiguous and adversarial ones, so the agreement and accuracy numbers you tune against reflect what users actually send.
Reading the Signal
Numbers without interpretation are decoration. Here is how to turn them into decisions.
Flat accuracy lift means stop sampling
If voting barely beats a single sample, the task is either too easy or too hard for the technique. Easy tasks do not need it; hard tasks where samples are uniformly wrong cannot be fixed by voting on wrong answers.
A plateau in agreement and accuracy means you have enough
As you add samples, agreement and accuracy climb then plateau. The plateau is your answer for sample count. Spending past it is pure waste, a pattern the advanced guide explores in depth.
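Finding the plateau can be as simple as looking for the first step where the gain drops below a threshold. The sketch below assumes you already have voted accuracy per sample count from the labeled set. The function name, the threshold, and the toy curve are illustrative assumptions.

```python
def pick_sample_count(accuracy_by_n, min_gain=0.01):
    """accuracy_by_n: dict mapping sample count -> voted accuracy on the labeled set.

    Returns the smallest count after which the next step up gains less than min_gain.
    """
    counts = sorted(accuracy_by_n)
    for current, nxt in zip(counts, counts[1:]):
        if accuracy_by_n[nxt] - accuracy_by_n[current] < min_gain:
            return current
    return counts[-1]  # still climbing at the largest count measured

# Toy curve (illustrative only).
curve = {1: 0.70, 3: 0.78, 5: 0.82, 7: 0.825, 9: 0.826}
chosen = pick_sample_count(curve)
print(chosen)
```

On a small evaluation set the curve is noisy, so a single flat step may be chance. Setting `min_gain` with the set's sampling error in mind reduces the risk of stopping too early.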
Low margin as a routing signal
When the winning margin is narrow, the answer is more likely wrong. Route those cases to a verifier or a human rather than treating all voted answers as equally trustworthy. This turns a metric into an operational control.
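A routing rule of this kind can be a few lines. In the sketch below, the threshold `min_share` and the labels `auto_accept` and `escalate` are placeholders. The right threshold is whatever your margin-accuracy data supports.

```python
from collections import Counter

def route(samples, min_share=0.8):
    """Return (winner, top_share, decision) for one request's samples."""
    winner, top = Counter(samples).most_common(1)[0]
    share = top / len(samples)
    decision = "auto_accept" if share >= min_share else "escalate"
    return winner, share, decision

decisive = route(["a", "a", "a", "a", "a"])
narrow = route(["a", "a", "a", "b", "c"])
print(decisive, narrow)
```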
Watch for drift in production
Model updates and shifting inputs change agreement over time. A drop in agreement rate is an early signal that your tuned sample count no longer fits, and it should trigger a re-evaluation against the labeled set.
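A rolling-window check is one simple way to catch that drop. The sketch below assumes a baseline agreement rate measured during offline tuning. The class name, window size, and tolerance are hypothetical defaults to adjust for your traffic volume.

```python
from collections import deque

class AgreementDriftMonitor:
    """Flags when the rolling agreement rate falls below the tuned baseline."""

    def __init__(self, baseline, window=500, tolerance=0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # oldest observations fall off automatically

    def observe(self, unanimous):
        self.recent.append(1 if unanimous else 0)

    def drifted(self):
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet to judge
        rate = sum(self.recent) / len(self.recent)
        return self.baseline - rate > self.tolerance

monitor = AgreementDriftMonitor(baseline=0.8, window=10, tolerance=0.1)
for flag in [True] * 8 + [False] * 2:
    monitor.observe(flag)
before = monitor.drifted()
for flag in [True] * 5 + [False] * 5:
    monitor.observe(flag)
after = monitor.drifted()
print(before, after)
```

When the monitor fires, the response is a re-evaluation against the labeled set, not an automatic change to sample count, since the proxy cannot say whether accuracy moved.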
Distinguish the two kinds of low agreement
Low agreement has two very different causes, and conflating them leads to wrong fixes. One is healthy disagreement that voting resolves correctly, where samples explore different valid paths and converge on the right answer. The other is chaotic disagreement that voting cannot resolve, where samples scatter because the task is genuinely beyond the model. The first calls for keeping or slightly increasing samples; the second calls for a different approach entirely, such as retrieval or a stronger model. The way to tell them apart is to check whether voting on the scattered samples actually raises accuracy. If it does not, more samples will not save you.
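That check can be run directly on the low-agreement slice of the labeled set. The sketch below is an assumption-laden illustration: the function name, the `max_share` cutoff for "low agreement", and the `min_lift` threshold are all placeholders, and the labels "healthy" and "chaotic" simply mirror the two cases described above.

```python
from collections import Counter

def diagnose_low_agreement(records, max_share=0.6, min_lift=0.05):
    """records: list of (samples, gold) pairs.

    Looks only at requests whose top answer share is <= max_share and asks
    whether voting beats the average single sample on that slice.
    """
    voted, single, n = 0, 0.0, 0
    for samples, gold in records:
        winner, top = Counter(samples).most_common(1)[0]
        if top / len(samples) > max_share:
            continue  # not a low-agreement request
        n += 1
        voted += winner == gold
        single += sum(s == gold for s in samples) / len(samples)
    if n == 0:
        return "no low-agreement requests"
    lift = (voted - single) / n
    return "healthy" if lift >= min_lift else "chaotic"

# Toy slices (illustrative only).
resolves = [(["a", "a", "a", "b", "c"], "a"), (["x", "x", "x", "y", "z"], "x")]
scatters = [(["a", "b", "c", "d", "e"], "z"), (["p", "q", "p", "r", "s"], "q")]
print(diagnose_low_agreement(resolves), diagnose_low_agreement(scatters))
```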
Common Measurement Mistakes
Two errors recur. The first is judging self-consistency by cost alone and cutting samples until accuracy quietly collapses; cost only means something next to accuracy lift. The second is measuring on a tiny or stale evaluation set, which produces confident numbers that do not generalize. Both are avoidable with the discipline above, and both are far cheaper to prevent than to discover after a quarter of bad answers. For teams standardizing these practices across multiple workflows, the team rollout guide covers how to make measurement a shared standard.
Frequently Asked Questions
What is the single most important metric?
Accuracy lift over a single sample. Every other metric supports the question of whether voting beats not voting, and if the lift is negligible, none of the others matter.
How big should my evaluation set be?
A few hundred labeled examples is enough to start, provided they represent your real distribution. The quality and freshness of the set matter more than raw size.
Can I measure accuracy in production?
Not directly, because production lacks ground truth. Instead, you watch two proxies, agreement rate and winning margin, and periodically re-evaluate against your labeled set offline.
What does a low agreement rate tell me?
That samples are scattering, which means voting has room to help but also that the task is hard. Pair low agreement with accuracy lift to decide whether more samples actually resolve the scatter.
How does winning margin help operationally?
Narrow margins correlate with errors, so you can route low-margin answers to extra verification or human review while letting decisive majorities pass automatically.
How do I know when I have enough samples?
Plot accuracy and agreement against sample count. Both rise then plateau; the plateau is your number. Adding samples past it spends money for no gain.
Key Takeaways
- Measure accuracy lift over single-shot first; it is the metric that justifies the technique.
- Log every individual sample, not just the winner, or you cannot compute agreement and margin later.
- Maintain a labeled evaluation set and tag results with their configuration to compare honestly.
- Use winning margin as a routing signal, escalating low-margin answers to verification.
- Watch agreement rate for drift in production and re-tune sample count when it moves.