Reading the Signal: Agreement, Accuracy, and Cost in Sampled Voting

Self-consistency makes a specific promise: sample several reasoning paths, vote, and the majority will be right more often than any single sample. Whether that promise holds for your task is an empirical question, and you answer it with measurement, not faith. Teams that adopt the technique on reputation and never instrument it often pay for sampling that buys nothing.

The good news is that the technique generates rich telemetry by construction. Every request produces multiple answers, which means you can observe agreement, the distribution of votes, and how the winning margin relates to correctness. That data tells you whether to add samples, cut them, or abandon voting entirely for a given workflow.

This guide defines the KPIs that matter, explains how to instrument them without building a research lab, and shows how to read the signal so your sample count is a decision rather than a habit.

The Metrics That Actually Matter

Plenty of numbers are available; only a few drive decisions. Focus your dashboard on these.

Accuracy lift over single-shot

The headline metric. Compare voted accuracy against the accuracy of one sample on the same labeled set. If the lift is small, voting is not earning its cost, full stop. This is the number that justifies the whole technique.
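As a minimal sketch of that comparison, the snippet below computes voted accuracy, an estimated single-sample accuracy, and the lift between them. The record layout (a list of sampled answers paired with a gold label) and the function names are assumptions for illustration. The single-sample baseline is estimated here as the average correctness of the individual samples, not from a separate one-sample run.

```python
from collections import Counter


def majority_vote(samples):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(samples).most_common(1)[0][0]


def accuracy_lift(records):
    """records: list of (samples, gold) pairs, where samples is a list of answers.

    Returns (voted_accuracy, single_sample_accuracy, lift).
    """
    n = len(records)
    voted = sum(majority_vote(samples) == gold for samples, gold in records) / n
    # Assumption: one sample's accuracy is estimated as the mean correctness
    # of the individual samples drawn for each request.
    single = sum(
        sum(answer == gold for answer in samples) / len(samples)
        for samples, gold in records
    ) / n
    return voted, single, voted - single


if __name__ == "__main__":
    demo = [
        (["42", "42", "41"], "42"),
        (["7", "8", "7"], "7"),
        (["a", "b", "c"], "c"),
        (["x", "x", "x"], "x"),
    ]
    print(accuracy_lift(demo))
```

If your single-shot baseline in production uses different decoding settings, such as a lower temperature, measure it with a separate run instead of reusing the voting samples.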

Agreement rate

The fraction of requests where the samples converge on the same answer. High agreement usually means the task is easy and voting is redundant; low agreement means samples scatter, so voting has room to help. A shift in agreement is your early warning that sample count needs tuning.

Winning margin

How decisively the majority wins, for example three of five versus five of five. Narrow margins correlate with errors, which makes margin a useful confidence signal you can route on, escalating low-margin answers to review.
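Both agreement and margin fall out of the same vote count, so one rough sketch covers them. Two assumptions are mine rather than fixed definitions: margin is expressed as the winner's share of the samples (three of five is 0.6), and a request counts as "agreeing" when that share reaches a configurable threshold, unanimity by default.

```python
from collections import Counter


def winning_margin(samples):
    """Return (winner, share), where share is the winner's fraction of the votes."""
    winner, votes = Counter(samples).most_common(1)[0]
    return winner, votes / len(samples)


def agreement_rate(requests, threshold=1.0):
    """Fraction of requests whose winning share is at least `threshold`.

    threshold=1.0 counts only unanimous requests; the parameter is an
    illustrative knob, not a standard definition.
    """
    hits = sum(winning_margin(samples)[1] >= threshold for samples in requests)
    return hits / len(requests)


if __name__ == "__main__":
    demo = [
        ["a", "a", "a", "a", "a"],
        ["a", "a", "a", "b", "c"],
        ["x", "y", "x", "y", "x"],
    ]
    print(winning_margin(demo[1]))
    print(agreement_rate(demo))
```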

Cost per resolved request

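Total tokens across all samples, priced at your model's rate, divided by the number of requests. In practice that is roughly tokens per sample multiplied by sample count. Because self-consistency is expensive by design, this number anchors every other decision and feeds directly into the business case for the technique.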

Margin-accuracy correlation

A second-order metric that pays for itself: how well the winning margin predicts correctness. Plot accuracy against margin buckets and you learn whether a narrow majority really is riskier than a decisive one. If the correlation is strong, the margin becomes a reliable confidence signal you can route on. If it is weak, you know not to trust it, which is itself valuable to learn before you build escalation logic on a signal that does not hold.
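The bucketing can be sketched in a few lines. The input layout, tuples of winner votes, total samples, and a correctness flag from the labeled set, is an assumption for illustration. The function returns each bucket's size next to its accuracy so that a bucket with only a few examples is not over-read.

```python
from collections import defaultdict


def accuracy_by_margin(records):
    """records: iterable of (votes_for_winner, total_samples, is_correct).

    Returns {"votes/total": (accuracy, count)} for each observed margin.
    """
    buckets = defaultdict(lambda: [0, 0])  # margin label -> [correct, total]
    for votes, total, is_correct in records:
        label = f"{votes}/{total}"
        buckets[label][0] += int(is_correct)
        buckets[label][1] += 1
    return {
        label: (correct / count, count)
        for label, (correct, count) in sorted(buckets.items())
    }


if __name__ == "__main__":
    demo = [
        (5, 5, True), (5, 5, True),
        (4, 5, True),
        (3, 5, True), (3, 5, False), (3, 5, False),
    ]
    for label, (accuracy, count) in accuracy_by_margin(demo).items():
        print(label, round(accuracy, 2), count)
```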

Cost per correct answer

Cost per resolved request is incomplete, because a cheap wrong answer is not a bargain. Dividing total cost by the number of correct answers, rather than total answers, gives you the metric that actually trades off against value. Two configurations can have identical cost per request but very different cost per correct answer, and the second number is the one that should drive your sample-count decision.
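To make the difference concrete, here is a small sketch that computes both numbers from the same runs. The field names and the per-1,000-token price are hypothetical; substitute whatever your billing and logs actually use.

```python
def cost_metrics(runs, price_per_1k_tokens):
    """runs: list of dicts with 'tokens' (summed over all samples) and 'correct'.

    Returns (cost_per_request, cost_per_correct_answer).
    """
    total_cost = sum(run["tokens"] for run in runs) / 1000 * price_per_1k_tokens
    n_correct = sum(bool(run["correct"]) for run in runs)
    per_request = total_cost / len(runs)
    # With no correct answers, the cost per correct answer is unbounded.
    per_correct = total_cost / n_correct if n_correct else float("inf")
    return per_request, per_correct


if __name__ == "__main__":
    # Two hypothetical configurations with identical token spend.
    config_a = [{"tokens": 1000, "correct": c} for c in (True, True, False, False)]
    config_b = [{"tokens": 1000, "correct": c} for c in (True, True, True, True)]
    print(cost_metrics(config_a, price_per_1k_tokens=0.5))
    print(cost_metrics(config_b, price_per_1k_tokens=0.5))
```

In this toy data both configurations cost the same per request, but the first pays twice as much per correct answer because half its answers are wrong.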

Instrumenting Without a Research Lab

You do not need an academic harness to measure these. You need discipline about logging.

Log every sample, not just the winner

The common mistake is recording only the aggregated answer. Without the individual samples you cannot compute agreement or margin after the fact. Log all of them, with their token counts, from day one.
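A log record that keeps every sample might look like the sketch below. The class and field names are illustrative assumptions, not a required schema; the point is only that each sample's answer and token count survive alongside the request.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SampleLog:
    answer: str
    tokens: int


@dataclass
class RequestLog:
    request_id: str
    samples: list  # list of SampleLog, one per sampled reasoning path
    config: dict = field(default_factory=dict)


def to_jsonl(record):
    """Serialize one request, with all of its samples, as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    record = RequestLog(
        request_id="r1",
        samples=[SampleLog("42", 120), SampleLog("41", 95), SampleLog("42", 130)],
        config={"n_samples": 3, "temperature": 0.7},
    )
    print(to_jsonl(record))
```

Because the individual answers are preserved, agreement, margin, and per-request token cost can all be recomputed later under any definition you settle on.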

Maintain a labeled evaluation set

Accuracy lift requires ground truth. A modest set of a few hundred labeled examples, kept current, lets you compare configurations honestly. This set is the single most valuable asset in your self-consistency program.
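One rough way to check whether a set of that size can resolve the lift you care about is the normal-approximation margin of error for an accuracy estimate. This is a back-of-the-envelope sketch, not a full statistical treatment; the function name is illustrative, and z = 1.96 assumes a 95% interval.

```python
import math


def accuracy_margin_of_error(accuracy, n, z=1.96):
    """Approximate half-width of the confidence interval for an accuracy
    measured on n labeled examples (normal approximation to the binomial)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(accuracy * (1 - accuracy) / n)


if __name__ == "__main__":
    for n in (100, 300, 1200):
        print(n, round(accuracy_margin_of_error(0.8, n), 4))
```

At 80% accuracy on 300 examples the margin is roughly 4.5 percentage points either way, so a measured lift smaller than that deserves a larger set or a per-example paired comparison before you act on it.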

Track configuration alongside results

Sample count, temperature, and model version all change the numbers. Tag every logged result with its configuration so you can compare runs rather than averaging incomparable settings together.
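Once results carry their configuration, comparing runs is a simple group-by. The sketch below assumes each logged result is a dict with hypothetical `model`, `temperature`, `n_samples`, and `correct` fields.

```python
from collections import defaultdict


def accuracy_by_config(results):
    """Group results by (model, temperature, n_samples) and report accuracy
    for each group, so different settings are never averaged together."""
    groups = defaultdict(list)
    for result in results:
        key = (result["model"], result["temperature"], result["n_samples"])
        groups[key].append(bool(result["correct"]))
    return {key: sum(flags) / len(flags) for key, flags in groups.items()}


if __name__ == "__main__":
    demo = [
        {"model": "m-v1", "temperature": 0.7, "n_samples": 5, "correct": True},
        {"model": "m-v1", "temperature": 0.7, "n_samples": 5, "correct": False},
        {"model": "m-v1", "temperature": 0.7, "n_samples": 1, "correct": False},
        {"model": "m-v2", "temperature": 0.7, "n_samples": 5, "correct": True},
    ]
    for config, accuracy in accuracy_by_config(demo).items():
        print(config, accuracy)
```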

Separate offline evaluation from production monitoring

Offline, you tune sample count against the labeled set. In production, you cannot measure accuracy directly, so you watch agreement and margin as proxies and alert when they drift. The two loops use different metrics for different jobs.

Sample with the production distribution, not a clean one

A subtle instrumentation failure is evaluating on a tidy, balanced dataset that does not resemble live traffic. Self-consistency can look great on textbook examples and mediocre on the messy edge cases that dominate production. Whenever possible, draw your labeled set from real inputs, including the ambiguous and adversarial ones, so the agreement and accuracy numbers you tune against reflect what users actually send.

Reading the Signal

Numbers without interpretation are decoration. Here is how to turn them into decisions.

Flat accuracy lift means stop sampling

If voting barely beats a single sample, the task is either too easy or too hard for the technique. Easy tasks do not need it; hard tasks where samples are uniformly wrong cannot be fixed by voting on wrong answers.

A plateau with more samples means you have enough

As you add samples, agreement and accuracy climb and then plateau. The point where the curve flattens is your answer for sample count. Spending past it is pure waste, a pattern the advanced guide explores in depth.
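If you have logged the full set of samples per request, you can trace that curve offline without new model calls by voting on only the first k samples. The sketch below does that and picks the smallest k within a tolerance of the best accuracy. It assumes samples are independent draws, so any prefix is representative; the function names and the tolerance default are illustrative.

```python
from collections import Counter


def voted_accuracy_at_k(records, k):
    """Voted accuracy when only the first k samples of each request are used."""
    hits = 0
    for samples, gold in records:
        winner = Counter(samples[:k]).most_common(1)[0][0]
        hits += winner == gold
    return hits / len(records)


def smallest_sufficient_k(records, ks, tolerance=0.01):
    """Return (chosen_k, curve), where chosen_k is the smallest k whose
    accuracy is within `tolerance` of the best accuracy on the curve."""
    curve = {k: voted_accuracy_at_k(records, k) for k in ks}
    best = max(curve.values())
    chosen = min(k for k, accuracy in curve.items() if accuracy >= best - tolerance)
    return chosen, curve


if __name__ == "__main__":
    demo = [
        (["a", "b", "a", "a", "a"], "a"),
        (["b", "a", "a", "a", "a"], "a"),
        (["c", "c", "c", "c", "c"], "c"),
        (["y", "x", "x", "x", "x"], "x"),
    ]
    print(smallest_sufficient_k(demo, ks=[1, 3, 5]))
```

Odd values of k avoid most two-way ties. On a small labeled set the curve is noisy, so treat the chosen k as a starting point, not a precise optimum.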

Low margin as a routing signal

When the winning margin is narrow, the answer is more likely to be wrong, provided the margin-accuracy correlation holds on your labeled set. Route those cases to a verifier or a human rather than treating all voted answers as equally trustworthy. This turns a metric into an operational control.
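A routing rule of this kind can be as small as the sketch below. The 0.7 threshold and the "accept" and "review" labels are placeholders; the threshold in particular should come from your own margin-versus-accuracy data, not from this example.

```python
from collections import Counter


def route(samples, min_share=0.7):
    """Return (winner, share, decision).

    decision is "accept" when the winner's share of the votes reaches
    min_share, otherwise "review" (send to a verifier or a human).
    """
    winner, votes = Counter(samples).most_common(1)[0]
    share = votes / len(samples)
    decision = "accept" if share >= min_share else "review"
    return winner, share, decision


if __name__ == "__main__":
    print(route(["a", "a", "a", "a", "b"]))  # four of five
    print(route(["a", "a", "a", "b", "c"]))  # three of five
```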

Watch for drift in production

Model updates and shifting inputs change agreement over time. A drop in agreement rate is an early signal that your tuned sample count no longer fits, and it should trigger a re-evaluation against the labeled set.
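A drift check can start as a comparison between the agreement rate measured at tuning time and the rate over a recent window of traffic. The sketch below is deliberately simple: the function name, the boolean-flag input, and the five-point alert threshold are all assumptions to adapt.

```python
def agreement_drift(baseline_rate, recent_flags, max_drop=0.05):
    """Compare recent agreement against a baseline.

    recent_flags: list of bools, True when a recent request met your
    agreement definition (for example, unanimous samples).
    Returns (recent_rate, alert), where alert is True if agreement fell
    by more than max_drop relative to the baseline.
    """
    recent_rate = sum(recent_flags) / len(recent_flags)
    alert = (baseline_rate - recent_rate) > max_drop
    return recent_rate, alert


if __name__ == "__main__":
    window = [True] * 7 + [False] * 3
    print(agreement_drift(baseline_rate=0.80, recent_flags=window))
```

An alert here is a prompt to re-run the offline evaluation, not proof that accuracy has dropped.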

Distinguish the two kinds of low agreement

Low agreement has two very different causes, and conflating them leads to wrong fixes. One is healthy disagreement that voting resolves correctly, where samples explore different valid paths and converge on the right answer. The other is chaotic disagreement that voting cannot resolve, where samples scatter because the task is genuinely beyond the model. The first calls for keeping or slightly increasing samples; the second calls for a different approach entirely, such as retrieval or a stronger model. The way to tell them apart is to check whether voting on the scattered samples actually raises accuracy. If it does not, more samples will not save you.
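That check can be run on the labeled set by restricting the lift calculation to low-agreement requests. In the sketch below, "low agreement" means the winner's share of the votes is under a threshold. That definition, the 0.7 default, and the function name are assumptions for illustration.

```python
from collections import Counter


def lift_on_low_agreement(records, max_share=0.7):
    """records: list of (samples, gold) pairs.

    Returns voted accuracy minus average single-sample accuracy, computed
    only over requests whose winning share is below max_share, or None if
    no request qualifies.
    """
    low = []
    for samples, gold in records:
        winner, votes = Counter(samples).most_common(1)[0]
        if votes / len(samples) < max_share:
            low.append((samples, gold, winner))
    if not low:
        return None
    voted = sum(winner == gold for _, gold, winner in low) / len(low)
    single = sum(
        sum(answer == gold for answer in samples) / len(samples)
        for samples, gold, _ in low
    ) / len(low)
    return voted - single


if __name__ == "__main__":
    healthy = [(["a", "a", "a", "b", "c"], "a"), (["x", "y", "x", "z", "x"], "x")]
    chaotic = [(["a", "b", "c", "d", "e"], "z"), (["p", "q", "p", "r", "s"], "q")]
    print(lift_on_low_agreement(healthy))
    print(lift_on_low_agreement(chaotic))
```

A clearly positive value suggests the healthy kind of disagreement; a value near zero or below suggests the scatter is not something voting can resolve.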

Common Measurement Mistakes

Two errors recur. The first is judging self-consistency by cost alone and cutting samples until accuracy quietly collapses; cost only means something next to accuracy lift. The second is measuring on a tiny or stale evaluation set, which produces confident numbers that do not generalize. Both are avoidable with the discipline above, and both are far cheaper to prevent than to discover after a quarter of bad answers. For teams standardizing these practices across multiple workflows, the team rollout guide covers how to make measurement a shared standard.

Frequently Asked Questions

What is the single most important metric?

Accuracy lift over a single sample. Every other metric supports the question of whether voting beats not voting, and if the lift is negligible, none of the others matter.

How big should my evaluation set be?

A few hundred labeled examples is enough to start, provided they represent your real distribution. The quality and freshness of the set matter more than raw size.

Can I measure accuracy in production?

Not directly, because production lacks ground truth. Instead, you watch two proxies, agreement rate and winning margin, and periodically re-evaluate against your labeled set offline.

What does a low agreement rate tell me?

That samples are scattering, which means voting has room to help but also that the task is hard. Pair low agreement with accuracy lift to decide whether more samples actually resolve the scatter.

How does winning margin help operationally?

Narrow margins correlate with errors, so you can route low-margin answers to extra verification or human review while letting decisive majorities pass automatically.

How do I know when I have enough samples?

Plot accuracy and agreement against sample count. Both rise then plateau; the plateau is your number. Adding samples past it spends money for no gain.

Key Takeaways

Measure accuracy lift over single-shot first; it is the metric that justifies the technique.
Log every individual sample, not just the winner, or you cannot compute agreement and margin later.
Maintain a labeled evaluation set and tag results with their configuration to compare honestly.
Use winning margin as a routing signal, escalating low-margin answers to verification.
Watch agreement rate for drift in production and re-tune sample count when it moves.

Agency Script Editorial
