Sample, Cluster, Vote: A Reusable Model for Consistency

Agency Script Editorial

Editorial Team

October 24, 2021 · 7 min read

Running self-consistency well is not hard, but doing it consistently across many tasks and many people requires a shared mental model. A framework gives you that: a named set of stages so that "we use self-consistency here" means the same thing to everyone, and so that reviewing a deployment is a matter of checking each stage rather than re-deriving the whole approach.

This article lays out a four-stage model (Frame, Sample, Resolve, Gate) that covers the full life of a self-consistency decision. Each stage answers a specific question, has clear inputs and outputs, and has a rule for when it matters most. The stages run in sequence, and the framework is reusable: once you know it, you can apply it to any discrete-answer task without starting from scratch.

If you want the raw mechanics first, "Sampling Many Answers and Voting on the Best One" covers them. This piece is about structuring those mechanics into something repeatable.

Stage One: Frame

The question it answers

Frame decides whether self-consistency belongs here at all, and if so, what counts as an answer. Its inputs are the task and its stakes; its output is a go or no-go plus a defined answer format.

What it decides

Two things. First, task fit: the answer must be discrete and comparable, single-pass answers must vary from run to run, and the stakes must justify the cost of the extra calls. Second, the answer contract: the exact format the model must emit so answers can be extracted and compared. Skipping Frame is how teams end up voting on prose.
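One way to make the Frame decision explicit is to record it as a small data structure. The sketch below is illustrative only: the class name `FrameDecision` and its field names are hypothetical, chosen to mirror the three fit criteria and the answer contract described above.

```python
from dataclasses import dataclass


@dataclass
class FrameDecision:
    """Hypothetical record of the Frame stage for one task type."""
    discrete_answer: bool       # answers can be compared for equality
    single_pass_varies: bool    # repeated single runs disagree with each other
    stakes_justify_cost: bool   # an error costs more than the extra calls
    answer_contract: str        # exact output format the model must emit

    def go(self) -> bool:
        # Go only if every fit criterion holds and a contract is defined.
        return (
            self.discrete_answer
            and self.single_pass_varies
            and self.stakes_justify_cost
            and bool(self.answer_contract.strip())
        )


# Illustrative use: a prose-summary task fails Frame because the answer
# is not discrete, no matter how the other criteria look.
summary_task = FrameDecision(False, True, True, "")
classification_task = FrameDecision(True, True, True, "Final line: ANSWER: <label>")
```

Here `classification_task.go()` is true and `summary_task.go()` is false, which is the "voting on prose" case Frame exists to catch.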

When it matters most

Always, but especially on new task types. Most failed self-consistency deployments fail here, by applying voting where it cannot work or where it adds nothing.

Stage Two: Sample

The question it answers

Sample decides how to generate the diversity that voting feeds on. Its inputs are the base prompt and the answer contract; its output is a set of independent samples.

What it decides

Temperature, sample count, and independence. A temperature near 0.7 diversifies reasoning; the count is tuned to the point where the winner stabilizes; and every call must be isolated so that agreement signals correctness, not herding. The full setup sequence is in "Running a Self-Consistency Vote, One Step at a Time."
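The three Sample settings can be sketched as follows. This is a minimal illustration, not a reference implementation: `generate` is a hypothetical callable standing in for whatever model client you use, and `stabilization_point` is one assumed way to find where the winner stops changing on a batch of already-collected answers.

```python
from collections import Counter
from typing import Callable, List


def draw_samples(
    generate: Callable[[str, float], str],  # hypothetical model call
    prompt: str,
    n: int = 7,
    temperature: float = 0.7,
) -> List[str]:
    # Each call sees only the base prompt; no earlier sample is fed back in,
    # which keeps the samples independent of one another.
    return [generate(prompt, temperature) for _ in range(n)]


def leading_answer(answers: List[str]) -> str:
    return Counter(answers).most_common(1)[0][0]


def stabilization_point(answers: List[str]) -> int:
    """Smallest prefix length from which the leading answer never changes."""
    final = leading_answer(answers)
    point = len(answers)
    for k in range(len(answers), 0, -1):
        if leading_answer(answers[:k]) != final:
            break
        point = k
    return point
```

Running `stabilization_point` over a larger pilot batch gives a rough starting count for production; harder tasks will tend to push it higher.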

When it matters most

On hard tasks, where too few samples or too little randomness leaves the vote noisy. Harder problems push the count higher and demand careful temperature tuning.

Stage Three: Resolve

The question it answers

Resolve turns a pile of samples into a single answer and a confidence margin. Its inputs are the raw samples; its outputs are the winning answer and the vote split.

What it decides

Extraction, normalization, and tallying. Pull the answer from each sample using the contract, normalize so equal answers compare as equal, then count and take the mode. Normalization is the stage's load-bearing step; neglecting it produces false ties, the failure cataloged in "Seven Ways Self-Consistency Voting Quietly Goes Wrong."
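The three Resolve steps can be sketched for a numeric task as shown below. The `ANSWER:` line format and the helper names are assumptions made for illustration; the article does not prescribe a particular contract.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

# Assumed contract: the model ends with a line such as "ANSWER: 1200".
ANSWER_LINE = re.compile(r"^ANSWER:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def extract(sample: str) -> Optional[str]:
    matches = ANSWER_LINE.findall(sample)
    return matches[-1] if matches else None  # last match is the final answer


def normalize_number(raw: str) -> str:
    # Collapse surface forms such as "1,200", "$1200", "1200.0" to "1200".
    value = float(raw.replace(",", "").replace("$", "").strip())
    return str(int(value)) if value.is_integer() else str(value)


def resolve(samples: List[str]) -> Tuple[Optional[str], Counter]:
    answers = []
    for sample in samples:
        raw = extract(sample)
        if raw is None:
            continue  # sample broke the contract; it casts no vote
        try:
            answers.append(normalize_number(raw))
        except ValueError:
            continue  # extracted text was not a number
    tally = Counter(answers)
    winner = tally.most_common(1)[0][0] if tally else None
    return winner, tally


samples = [
    "Working...\nANSWER: 1,200",
    "ANSWER: 1200.0",
    "ANSWER: $1200",
    "ANSWER: 1250",
    "I could not decide.",
]
winner, tally = resolve(samples)
```

Without `normalize_number`, the first three samples would count as three different answers and the tally would be a four-way tie.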

When it matters most

On tasks with many surface forms for the same value, like numbers and dates, where unnormalized tallying silently fractures the winner.

Stage Four: Gate

The question it answers

Gate decides what to do with the result given its confidence. Its inputs are the winning answer and margin; its output is an action: accept, resample, or escalate.

What it decides

The confidence threshold and the routing rule. Clear majorities accept automatically; thin margins trigger more sampling or human review. This is where the margin earns its keep, a habit emphasized in "Sharp Habits for Voting Across Model Samples."
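A Gate can be as small as a margin calculation plus a routing function. In the sketch below, the threshold of two votes and the policy of resampling before escalating are assumptions for illustration; the right values depend on the task's stakes.

```python
from collections import Counter
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    RESAMPLE = "resample"
    ESCALATE = "escalate"


def vote_margin(tally: Counter) -> int:
    """Votes separating the winner from the runner-up (0 for an empty tally)."""
    ranked = tally.most_common(2)
    if not ranked:
        return 0
    top = ranked[0][1]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return top - runner_up


def gate(tally: Counter, min_margin: int = 2, resamples_left: int = 1) -> Action:
    # Clear majority: accept automatically.
    if vote_margin(tally) >= min_margin:
        return Action.ACCEPT
    # Thin margin: assumed policy is to sample more first, then hand to a human.
    return Action.RESAMPLE if resamples_left > 0 else Action.ESCALATE
```

For example, a 5-to-2 split is accepted, while a 4-to-3 split triggers a resample and, once the resample budget is spent, an escalation.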

When it matters most

On high-stakes tasks, where the cost of acting on a shaky majority is severe. Gate is what converts a raw vote into a safe operational decision.

Putting the Stages Together

The loop in one pass

Frame once per task type to set fit and format. Then for each query: Sample to generate diversity, Resolve to vote and measure confidence, Gate to decide. Frame is design-time; the other three run every query.
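The per-query part of the loop can be composed into a single function. This is a compact sketch under stated assumptions: `generate` and `parse` are hypothetical callables (the model call and the contract-specific extract-and-normalize step, both fixed at Frame time), and the Gate here only distinguishes accept from escalate.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def run_query(
    generate: Callable[[str, float], str],   # hypothetical model call
    prompt: str,
    parse: Callable[[str], Optional[str]],   # extract + normalize, set in Frame
    n: int = 7,
    temperature: float = 0.7,
    min_margin: int = 2,
) -> Tuple[Optional[str], str]:
    # Sample: independent calls against the same base prompt.
    samples = [generate(prompt, temperature) for _ in range(n)]

    # Resolve: parse each sample, drop unparseable ones, and tally.
    parsed = [parse(s) for s in samples]
    tally = Counter(a for a in parsed if a is not None)
    ranked = tally.most_common(2)
    if not ranked:
        return None, "escalate"
    winner, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0

    # Gate: act on the margin, not merely on the existence of a winner.
    action = "accept" if top - runner_up >= min_margin else "escalate"
    return winner, action
```

Frame does not appear in the function body because it has already run: its outputs are the `prompt`, the `parse` function, and the chosen thresholds.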

Where teams skip stages

The two most-skipped stages are Frame and Gate, the bookends. Teams jump straight to sampling and voting, applying the technique where it does not fit and acting on every majority regardless of margin. Naming the stages makes those omissions visible in review.

Applying the Framework to a New Task

Walk the stages in order

Suppose a new request arrives: extract a contract's renewal date. Frame first: the answer is a discrete date (comparable), single passes wobble on messy contracts (worth voting), and a wrong renewal date is costly (stakes justify it). Define the answer contract as an ISO date on a final line. Frame passes.

Configure Sample and Resolve

Sample: set temperature near 0.7, start at seven independent calls, and tune from there. Resolve: extract the date line, normalize all date formats to ISO so equal dates tally as equal, then take the mode. The normalization step is doing the heavy lifting here, because dates have many surface forms.
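To see why normalization carries the weight on this task, compare a raw tally with a normalized one. The sketch below uses Python's standard `datetime.strptime`; the list of accepted formats is an assumption, as is reading `03/01/2025` month-first, and month names are parsed according to the current locale (English is assumed here).

```python
from collections import Counter
from datetime import datetime
from typing import Optional

# Assumed set of surface forms; extend it for the contracts you actually see.
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y", "%b %d, %Y")


def to_iso(raw: str) -> Optional[str]:
    text = raw.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None  # unrecognized form; the sample casts no vote


extracted = [
    "2025-03-01",
    "03/01/2025",
    "March 1, 2025",
    "2025-04-01",
    "1 March 2025",
    "2025-04-01",
    "Mar 1, 2025",
]

raw_tally = Counter(extracted)
iso_tally = Counter(d for d in map(to_iso, extracted) if d is not None)

raw_winner = raw_tally.most_common(1)[0][0]
iso_winner = iso_tally.most_common(1)[0][0]
```

In this made-up batch, five of seven samples agree on March 1, but the raw tally splits them across five strings and hands the win to April 1 with two votes. After normalization the tally is five to two for `2025-03-01`.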

Set the Gate

Gate: decide that a margin under two votes routes to human review, since a wrong renewal date is expensive. Now the task is fully specified by the four stages, and anyone reviewing it can check each stage's decision independently. The same walk applies to any discrete-answer task, which is the point of having a reusable model.

Why a Named Framework Helps

Shared language across a team

When everyone uses Frame, Sample, Resolve, and Gate, a review conversation becomes precise. "The Gate is missing" is clearer and more actionable than "the confidence handling feels off." Named stages turn vague unease into specific, fixable gaps.

A structure that survives model changes

Models will change, and the right settings inside Sample will change with them. But the questions each stage answers do not. A framework anchored on questions rather than parameters stays valid across model generations, which is what makes it worth internalizing rather than re-deriving each time.

A built-in audit checklist

Because each stage makes one explicit decision, reviewing a deployment becomes a matter of confirming four decisions were made deliberately rather than by default. A missing Frame, an untuned Sample, a normalization gap in Resolve, or an absent Gate each maps to a documented failure mode. The framework thus pulls double duty: it structures how you build a self-consistency system and how you inspect one someone else built.
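The audit can be expressed as a lookup from stage to the gap named above. The shape of the `decisions` mapping is hypothetical: it simply records, per stage, whether a deliberate decision was found during review.

```python
from typing import Dict, List

# Gap names follow the four failure modes listed in the text.
STAGE_GAPS = {
    "frame": "missing Frame",
    "sample": "untuned Sample",
    "resolve": "normalization gap in Resolve",
    "gate": "absent Gate",
}


def audit(decisions: Dict[str, bool]) -> List[str]:
    """Return the gap for every stage lacking an explicit decision."""
    # A stage absent from the mapping is treated the same as a False entry.
    return [gap for stage, gap in STAGE_GAPS.items() if not decisions.get(stage)]


findings = audit({"frame": True, "sample": True, "resolve": False})
```

Here `findings` lists the Resolve and Gate gaps, in stage order, because Resolve was marked as undecided and Gate was never recorded at all.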

Common Ways the Framework Is Misapplied

Collapsing Sample and Resolve

Teams sometimes treat sampling and tallying as one undifferentiated step and skip the extraction and normalization work that lives in Resolve. The result is a tally that counts raw, inconsistent strings and fractures the real winner. Keeping Resolve distinct forces you to confront normalization as its own deliberate task rather than an afterthought.

Letting Gate default to accept-everything

The most quietly damaging misapplication is implementing Frame, Sample, and Resolve faithfully but never building Gate, so every majority is accepted regardless of margin. The pipeline runs, produces winners, and looks complete, yet it acts on thin majorities as confidently as on landslides. Naming Gate as a required stage is what makes its absence conspicuous instead of invisible.

Frequently Asked Questions

Why separate Frame from Sample?

Because they answer different questions. Frame decides whether and what; Sample decides how. Collapsing them is how teams start sampling before confirming the task even suits voting, which wastes the whole effort.

Is Resolve just tallying?

It is tallying plus the two steps that make tallying trustworthy: extraction and normalization. The count is trivial; getting comparable, clean answers into the count is the real work of the stage.

Can I run the framework without Gate?

You can accept every majority blindly, but then you ignore the confidence signal the technique produces and act on thin margins as if they were certain. Gate is what makes the result safe to use operationally.

Does the framework change for different model sizes?

The stages stay the same; the settings inside Sample may shift. A stronger model might stabilize with fewer samples or a different temperature, but Frame, Resolve, and Gate are unaffected.

How does this relate to comparative analysis prompting?

The same Sample-Resolve-Gate loop applies to comparison: run a judgment several times, aggregate, and gate on agreement. The connection is drawn out in "Side-by-Side Reasoning Is Getting Cheaper and Sharper."

What is the fastest way to audit a deployment with this?

Walk the four stages and confirm each made an explicit decision. A missing Frame, an untuned Sample, a normalization gap in Resolve, or an absent Gate each maps to a known failure mode, so the framework doubles as a review structure.

Key Takeaways

  • The framework has four stages: Frame, Sample, Resolve, and Gate, each answering a distinct question.
  • Frame decides whether voting fits and what counts as an answer; it is design-time and, along with Gate, among the most often skipped.
  • Sample sets temperature, count, and independence to generate the diversity voting needs.
  • Resolve extracts, normalizes, and tallies, with normalization as its load-bearing step.
  • Gate uses the vote margin to accept, resample, or escalate, making the result safe to act on.
  • Frame runs once per task type; Sample, Resolve, and Gate run on every query, and the four stages double as an audit structure.
