The abstract case for self-consistency is easy to follow but hard to feel. What makes it click is watching it applied to specific tasks, seeing exactly what the votes looked like, and noticing what tipped each one toward success or failure. This article walks through several concrete scenarios, including one where the technique was the wrong choice, because the failures teach as much as the wins.
Each example follows the same shape: the task, why a single pass was risky, what voting did, and what made it work or not. The scenarios are illustrative composites of common patterns, not reports of specific named deployments, so treat the numbers as representative rather than measured. The underlying mechanics are covered in Sampling Many Answers and Voting on the Best One.
Multi-Step Arithmetic Word Problems
The task
A model has to read a word problem, set up the arithmetic, and produce a single number. The reasoning involves several dependent steps, and one slip anywhere flips the final answer.
Why one pass was risky
On these problems, a single chain of thought sometimes drops a term or mis-multiplies. The answer looks confident and is simply wrong. Run the same prompt twice and you may get two different numbers.
What voting did
Sampling seven times at moderate temperature produced a clear majority on the correct number, with the occasional arithmetic slip showing up as isolated minority answers. The minority votes were exactly the bad chains, outvoted by the many correct routes. This is the textbook case where self-consistency shines.
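The tally itself is small enough to sketch. The snippet below is a minimal illustration, not a reference implementation: the `majority_vote` helper and the seven sample values are hypothetical, and it assumes each sampled chain has already been reduced to its final number.

```python
from collections import Counter

def majority_vote(answers):
    """Return (winner, votes_for_winner, total_samples) for a list of final answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes, len(answers)

# Hypothetical final answers from seven sampled chains of thought.
samples = [42, 42, 38, 42, 42, 45, 42]
winner, votes, total = majority_vote(samples)
print(f"{winner} won with {votes}/{total} votes")
```

With these made-up samples the two slips (38 and 45) each get a single vote, so the correct value wins five to one to one.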
Invoice and Document Field Extraction
The task
Pull a discrete field, such as a total amount or a due date, out of a messy document. The answer is a single comparable value.
Why one pass was risky
Layout noise and ambiguous formatting cause occasional misreads. A single extraction might grab the subtotal instead of the total, and nothing flags it.
What voting did
Five samples usually agreed on the correct field, with misreads appearing as minority votes. The key was normalization: amounts had to be standardized so "1,200.00" and "1200" counted as one vote. Without that step, covered in Seven Ways Self-Consistency Voting Quietly Goes Wrong, the votes for the real winner would have been split across its surface forms.
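A rough sketch of that normalization step follows. It assumes US-style formatting (comma as thousands separator, period as decimal point, optional leading dollar sign). The `normalize_amount` helper and the sample strings are hypothetical, and real documents will need locale-aware handling.

```python
from collections import Counter
from decimal import Decimal, InvalidOperation

def normalize_amount(raw):
    """Map '1,200.00', '$1200', ' 1200 ' to one canonical Decimal, or None if unparseable.

    Assumes comma thousands separators and a period decimal point.
    """
    cleaned = raw.strip().replace(",", "").lstrip("$")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    if not value.is_finite():
        return None
    return value.quantize(Decimal("0.01"))

# Hypothetical raw extractions from five samples.
raw_samples = ["1,200.00", "1200", "$1,200", "1,150.00", "1200.00"]

raw_counts = Counter(raw_samples)
norm_counts = Counter(normalize_amount(s) for s in raw_samples)

print(raw_counts.most_common(1))   # every raw string is unique, so the top count is 1
print(norm_counts.most_common(1))  # the four spellings of 1200.00 merge into one bucket
```

Counted as raw strings, no answer gets more than one vote. After normalization, four of the five samples land on the same value. A production version would also decide what to do with `None` results, for example by excluding them from the tally.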
Support Ticket Triage
The task
Classify an incoming ticket into one of a fixed set of categories so it routes to the right queue.
Why one pass was risky
Borderline tickets sit between two categories, and a single pass can land on either depending on phrasing. Misroutes cost time and frustrate customers.
What voting did
Voting across samples turned borderline cases into visible close margins. A clean majority routed automatically; a near-tie flagged the ticket for a human. The margin itself became the triage signal, which is the habit recommended in Sharp Habits for Voting Across Model Samples.
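One way to turn the margin into a routing decision is sketched below. The `route_ticket` function, the category names, and the 0.4 threshold are all hypothetical; the right threshold depends on your own misroute costs and should be tuned on real tickets.

```python
from collections import Counter

def route_ticket(labels, min_margin=0.4):
    """Decide between auto-routing and human review from sampled category labels.

    min_margin is a hypothetical threshold on (top votes - runner-up votes) / total.
    """
    if not labels:
        raise ValueError("need at least one sampled label")
    ranked = Counter(labels).most_common(2)
    top_label, top_votes = ranked[0]
    runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
    margin = (top_votes - runner_up_votes) / len(labels)
    decision = "auto" if margin >= min_margin else "human_review"
    return decision, top_label, margin

# Hypothetical votes: a clean majority and a near-tie.
clear = ["billing", "billing", "billing", "billing", "account"]
close = ["billing", "account", "billing", "account", "billing"]

print(route_ticket(clear))  # 4 vs 1 votes: routed automatically
print(route_ticket(close))  # 3 vs 2 votes: flagged for a human
```

Both examples pick "billing" as the top label, but only the first clears the margin threshold. The second is sent to a person along with the leading guess.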
A Logic Puzzle With a Discrete Answer
The task
Solve a constraint puzzle where the answer is one specific arrangement or value, reachable by several reasoning routes.
Why one pass was risky
These puzzles are exactly where models reason confidently into a dead end. A single sample has a meaningful chance of taking a wrong branch.
What voting did
Because correct solutions converge and wrong branches diverge, the right answer accumulated votes while errors scattered across several distinct wrong answers. The margin was wide, giving high confidence in the result.
Where Voting Was the Wrong Tool
The task
Generate a polished customer-facing email. The "answer" is a paragraph of prose.
Why voting failed
No two emails are identical, so there was no majority to find. Tallying produced as many unique answers as samples. The technique had nothing to count.
The better approach
For open-ended generation, voting on the whole output does not apply. If you still want a vote, extract a discrete attribute from each draft, such as a tone label or a yes-no policy check, and vote on that. Otherwise reach for a different method. Recognizing this boundary is the difference between using the tool and misusing it.
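The contrast can be shown with a small sketch. The drafts and tone labels below are invented, and the labels are assumed to come from some separate classification step that is not shown here.

```python
from collections import Counter

# Hypothetical samples: (draft text, tone label assigned by a separate step).
samples = [
    ("Hi Dana, sorry for the delay on your order.", "apologetic"),
    ("Hello Dana, we apologize that your order is late.", "apologetic"),
    ("Dana, your order has been delayed by two days.", "neutral"),
    ("Hi Dana, we are so sorry your order is running late.", "apologetic"),
    ("Hello, your order will arrive two days later than planned.", "neutral"),
]

text_counts = Counter(draft for draft, _ in samples)
tone_counts = Counter(tone for _, tone in samples)

print(max(text_counts.values()))   # 1: every draft is unique, nothing to tally
print(tone_counts.most_common(1))  # the discrete label does produce a majority
```

The vote on tone tells you which label the samples agree on. It does not pick or produce the email itself, which still needs a different selection method.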
Multiple-Choice Knowledge Questions
The task
Answer a question by selecting one option from a fixed list, A through D, where the reasoning requires combining several facts.
Why one pass was risky
When the distractors are plausible, a single chain of thought can talk itself into a wrong option, especially if it fixates on one supporting detail and ignores a disqualifying one.
What voting did
Sampling several times surfaced the correct option as a clear majority on questions the model genuinely knew, while questions it did not know produced fractured votes spread across options. That fragmentation was useful in itself: it flagged the questions where the model was guessing rather than reasoning, which is exactly the signal you want when accuracy matters.
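Fragmentation can be quantified in several ways; the sketch below uses Shannon entropy of the vote distribution as one option. The function names, the sample votes, and the 1.0-bit cutoff are hypothetical choices for illustration, not values from a measured system.

```python
import math
from collections import Counter

def vote_entropy_bits(votes):
    """Shannon entropy of the vote distribution in bits (0 means unanimous)."""
    total = len(votes)
    if total == 0:
        raise ValueError("need at least one vote")
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(votes).values()
    )

def looks_like_guessing(votes, max_bits=1.0):
    """Flag a question when its votes are spread out; max_bits is a hypothetical cutoff."""
    return vote_entropy_bits(votes) > max_bits

# Hypothetical votes over options A through D.
known = ["B", "B", "B", "B", "B", "B", "C"]
unknown = ["A", "B", "C", "D", "B", "A", "C"]

print(looks_like_guessing(known))    # False: one option dominates
print(looks_like_guessing(unknown))  # True: votes are spread across all four options
```

With four options the entropy tops out at 2 bits, so a cutoff around 1 bit separates a dominant answer from a near-uniform spread in this toy data. The share of votes held by the top option is a simpler alternative if entropy feels heavier than the task needs.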
Code Output Prediction
The task
Given a short function and an input, predict the exact value it returns. The answer is a single discrete value.
Why one pass was risky
Mentally executing code is error-prone. A single trace can mishandle a loop boundary or an off-by-one and confidently report the wrong return value.
What voting did
Across several traces, the correct value accumulated votes while individual tracing slips appeared as scattered wrong answers. Because the return value is discrete and comparable, the tally worked cleanly once values were normalized for formatting. This is a strong fit for the same reasons arithmetic is: many correct routes, idiosyncratic wrong ones.
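For predicted return values, one plausible normalization is to parse each answer as a Python literal and re-serialize it, so spacing and quote style stop mattering. This sketch assumes the function under test is Python and returns plain literals; `canonical_value` and the sample predictions are hypothetical.

```python
import ast
from collections import Counter

def canonical_value(raw):
    """Parse a predicted return value as a Python literal and re-serialize it.

    Falls back to the stripped string when the text is not a valid literal.
    """
    text = raw.strip()
    try:
        return repr(ast.literal_eval(text))
    except (ValueError, SyntaxError, TypeError):
        return text

# Hypothetical predictions from five traces of the same function call.
predictions = ["[1, 2, 3]", "[1,2,3]", " [1, 2, 3]", "[1, 2]", "[1,2,3]"]

raw_counts = Counter(predictions)
canonical_counts = Counter(canonical_value(p) for p in predictions)

print(raw_counts.most_common(1))        # formatting variants split the vote
print(canonical_counts.most_common(1))  # four traces agree once formatting is removed
```

Note that `1` and `1.0` stay distinct under this scheme, which matches the "exact value" framing of the task. Unordered containers such as sets would need sorting before serialization.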
What the Winning Cases Share
A single comparable answer
Every successful example above produces one value you can check for exact equality: a number, a label, an option letter, a date. That property is not incidental; it is the precondition for voting to mean anything. When you evaluate a new task, the first question is always whether two correct answers would be identical.
Convergent correct reasoning
The wins also share a deeper trait: there are many valid ways to reach the right answer and comparatively few, scattered ways to reach each wrong one. That asymmetry is what makes the correct answer cluster. Tasks without it do not benefit. If wrong answers converge as readily as right ones, the majority can simply be wrong, and if nothing converges at all, as in the email task, there is no majority to find.
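A back-of-the-envelope calculation shows why the asymmetry matters. The sketch below assumes each sample is independently correct with a fixed probability, which real samples only approximate, and takes the worst case where every wrong sample lands on the same wrong answer, so the correct answer needs a strict majority. The function name and the probabilities are illustrative assumptions.

```python
from math import comb

def p_strict_majority_correct(n_samples, p_correct):
    """Probability that more than half of n independent samples are correct.

    Worst-case model: all wrong samples agree, so a strict majority is required.
    """
    need = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(need, n_samples + 1)
    )

# Illustrative per-sample accuracies, not measurements.
print(round(p_strict_majority_correct(7, 0.6), 3))  # 0.71
print(round(p_strict_majority_correct(7, 0.4), 3))  # 0.29
```

Under these assumptions, seven votes lift a 60 percent single-pass accuracy to about 71 percent, but they drag a 40 percent accuracy down to about 29 percent. When wrong answers scatter instead of agreeing, the correct answer can win with a plurality, so the worst-case figures understate the benefit.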
Normalization done right
In the extraction and code examples, the tally only worked because answers were standardized before counting. The pattern is consistent enough to state as a rule: any task with multiple surface forms for the same value needs deliberate normalization, or its vote will fracture regardless of how well the model reasons.
Frequently Asked Questions
What kinds of tasks are the best fit?
Anything with a discrete, comparable answer: arithmetic, structured extraction, classification, multiple-choice questions, code output prediction, and constraint puzzles. The shared trait is that two correct answers are exactly equal, which is what voting requires.
Why did normalization matter in the extraction example?
Because amounts and dates have many surface forms. "1,200.00" and "1200" are the same value but different strings, and an unnormalized tally counts them separately, splitting the real winner. Standardizing the format restores the true majority.
How does the margin help with triage?
A wide margin means confident routing; a thin one means the case is genuinely ambiguous. By auto-routing clear majorities and flagging close calls, you get both automation and a safety valve from the same vote.
Could I use voting for the email task with any tweak?
Only by changing what you vote on. You cannot vote on prose, but you can vote on a discrete decision derived from it, like whether the draft meets a policy. The full text still needs a different technique.
Are these example numbers from real deployments?
They are representative composites of common patterns, not measurements from a specific named system. Use them to understand the shape of results, then measure your own, since exact rates vary by model and task.
How many samples did these examples assume?
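Five to seven, which is a typical working range. The arithmetic example used seven samples and the extraction example used five. Harder multi-step tasks like the arithmetic problems and the logic puzzle benefit from the higher end; cleaner tasks like field extraction often stabilize at five.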
Key Takeaways
- Self-consistency shines on multi-step math, extraction, classification, multiple-choice questions, code output prediction, and logic puzzles, all of which have discrete answers.
- In each winning case, correct reasoning converged into a majority while errors scattered as minority votes.
- Normalization is essential whenever one value has several surface forms, as in extraction and code output; otherwise formatting differences split the real winner.
- The vote margin doubles as a triage signal: auto-route clear majorities, flag close calls.
- Voting fails on open-ended prose because no two outputs are identical, so there is nothing to tally.
- For open-ended tasks, vote on a discrete attribute extracted from the output, or use a different method.