Drawing Honest Uncertainty Out of a Model, Step by Step

Agency Script Editorial

Editorial Team

August 9, 2020 · 7 min read

Knowing that a model can sound more sure than it should is one thing. Doing something about it, reliably, on a real task, is another. This guide is the do-this-then-that version. It walks through a sequence you can run today to make a model's expressed confidence track its actual reliability, and to keep it that way as your prompts evolve.

The process has a shape: define what confidence should mean for your task, build a small test set, capture a baseline, write calibration instructions, measure whether they worked, and tighten. You do not need a research lab. You need a handful of questions where you know the right answer, a place to record results, and the discipline to compare before and after. Skip the measurement step and you are just guessing whether your prompt helped.

Work through the steps in order the first time. Once you have done it on one task, the loop becomes fast — most of the effort is the one-time setup of a test set you can reuse. Treat the steps below as a checklist you actually execute, not a description you skim.

Step 1: Define What "Confident" Should Mean Here

Before changing any prompts, decide what good calibration looks like for your specific task. Calibration is not abstract; it is relative to consequences.

Set your stakes

Ask what happens if the model is confidently wrong. A confidently wrong recipe substitution is annoying. A confidently wrong dosage or legal figure is dangerous. The higher the stakes, the more you should bias toward explicit uncertainty.

Choose a confidence vocabulary

Pick one scheme and use it everywhere so results are comparable:

  • A three-level scale: high, medium, low.
  • A numeric percentage the model estimates.
  • A binary "verified claim" versus "inference" tag.

Three levels are the right starting point for most people — granular enough to sort by, simple enough to stay consistent.
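
One way to keep the vocabulary consistent is to pin it down in code. The sketch below assumes the three-level scale; the names `Confidence` and `parse_confidence` are illustrative choices, not part of any library.

```python
from enum import Enum


class Confidence(Enum):
    """Three-level confidence vocabulary shared by every logged answer."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def parse_confidence(label: str) -> Confidence:
    """Map a free-text label such as ' High ' onto the shared scale.

    Raises KeyError for any label outside the agreed vocabulary, which
    surfaces inconsistent labelling early instead of hiding it.
    """
    return Confidence[label.strip().upper()]
```

Because the levels carry ordered values, results can later be sorted or grouped by confidence without re-reading the raw text.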

Step 2: Build a Small Test Set

You cannot improve what you cannot measure, and you cannot measure calibration without questions whose answers you already know.

Assemble twenty graded questions

Write or collect around twenty items in the domain you care about, mixing difficulty:

  • Several you are certain the model should get right.
  • Several genuinely hard or ambiguous ones.
  • A few where the honest answer is "there is no single answer."

That last group is gold. A well-calibrated model should refuse to fake certainty on the unanswerable ones.

Record the ground truth

Note the correct answer, or "unknowable," beside each question. This becomes your answer key for every test run.
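
A test set with its answer key can be as simple as a list of records. The sketch below is one possible layout; `GradedQuestion`, `TEST_SET`, and `mix` are hypothetical names, and the three questions are placeholders to be replaced with items from your own domain.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GradedQuestion:
    question: str
    answer: Optional[str]  # None records "unknowable"
    kind: str              # "easy", "hard", or "unanswerable"


# Placeholder items for illustration only; a real set has around twenty.
TEST_SET = [
    GradedQuestion("What is the chemical symbol for gold?", "Au", "easy"),
    GradedQuestion("In what year was the Peace of Westphalia signed?", "1648", "hard"),
    GradedQuestion("Which programming language is the best?", None, "unanswerable"),
]


def mix(items):
    """Count items per kind to confirm the set covers all three groups."""
    return Counter(item.kind for item in items)
```

Calling `mix(TEST_SET)` before each run is a quick check that the unanswerable group has not been dropped by accident.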

Step 3: Capture a Baseline

Run your test set through the model with no calibration instructions at all — just the raw questions. This is the control you will compare against.

Log expressed confidence and correctness

For each answer, record two things: how sure the model sounded, and whether it was actually right. Even a rough high/medium/low read on tone is enough. You are looking for the gap — cases where it sounded sure and was wrong, or sounded unsure and was right.
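
The two recorded fields are enough to surface the gap mechanically. The following is a minimal sketch, assuming each answer has already been graded by hand; `Result` and `gaps` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Result:
    question_id: str
    confidence: str  # how sure the model sounded: "high", "medium", or "low"
    correct: bool    # whether the answer matched the answer key


def gaps(results):
    """Return the two kinds of mismatch between tone and correctness."""
    sure_but_wrong = [
        r.question_id for r in results
        if r.confidence == "high" and not r.correct
    ]
    unsure_but_right = [
        r.question_id for r in results
        if r.confidence == "low" and r.correct
    ]
    return sure_but_wrong, unsure_but_right
```

A long `sure_but_wrong` list points to overconfidence; a long `unsure_but_right` list points to over-hedging.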

Spot the failure pattern

Most baselines lean one way. Many models are systematically overconfident, especially on the ambiguous items, stating contested answers as fact. Knowing your starting bias tells you what your prompt needs to correct. The common mistakes guide catalogs the patterns you are likely to see here.

Step 4: Write the Calibration Prompt

Now add instructions designed to fix the bias you found. Layer these techniques rather than relying on any single one.

Grant permission and require labels

Combine two moves into your system or task prompt:

"It is acceptable, and expected, to say you are unsure or that an answer cannot be determined. After each claim, attach a confidence level of high, medium, or low, and one sentence justifying it."

Force the reasoning before the verdict

Ask the model to lay out its evidence before committing to a confidence level, not after:

"First state the evidence for and against your answer. Only then give your answer and its confidence level."

Reasoning first tends to produce more honest confidence than reasoning summoned to defend a conclusion already stated.

Demand source-grounding where possible

For factual work:

"Mark any claim you cannot trace to provided context as low confidence."

This connects confidence to evidence rather than to fluency. For the structured version of these layered moves, see a framework for calibrating model confidence through prompts.
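
The three instructions quoted above can be kept as separate pieces and layered on demand, which also makes it easy to toggle one at a time later. This is a sketch only; the constant and function names are assumptions made for illustration.

```python
PERMISSION_AND_LABELS = (
    "It is acceptable, and expected, to say you are unsure or that an answer "
    "cannot be determined. After each claim, attach a confidence level of "
    "high, medium, or low, and one sentence justifying it."
)
REASON_FIRST = (
    "First state the evidence for and against your answer. "
    "Only then give your answer and its confidence level."
)
SOURCE_GROUNDING = (
    "Mark any claim you cannot trace to provided context as low confidence."
)


def build_calibration_prompt(*, reason_first=True, source_grounding=False):
    """Join the selected instruction layers into one prompt string."""
    parts = [PERMISSION_AND_LABELS]
    if reason_first:
        parts.append(REASON_FIRST)
    if source_grounding:
        parts.append(SOURCE_GROUNDING)
    return "\n\n".join(parts)
```

For factual work with supplied context, `build_calibration_prompt(source_grounding=True)` would include all three layers.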

Step 5: Re-Run and Compare

Run the exact same test set with your new prompt. Same questions, same model, only the instructions changed.

Measure the right thing

You are not chasing more hedging. You want the high-confidence answers to be reliably correct and the low-confidence ones to be where the errors cluster. Specifically check:

  • Did confidently-wrong answers drop?
  • Did the model correctly flag the unanswerable items?
  • Did it over-hedge on easy items it should nail?
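
These checks reduce to one number per confidence level: how often answers at that level were right. A minimal sketch, assuming results are stored as `(label, correct)` pairs; `accuracy_by_confidence` is a hypothetical helper name.

```python
from collections import defaultdict


def accuracy_by_confidence(results):
    """Return the fraction of correct answers for each confidence label.

    `results` is an iterable of (confidence_label, correct) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for label, correct in results:
        totals[label] += 1
        hits[label] += int(correct)
    return {label: hits[label] / totals[label] for label in totals}
```

Run it on the baseline log and on the calibrated log. A well-calibrated prompt should push the "high" figure up toward 1.0 while the "low" bucket absorbs most of the errors.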

Watch for the over-correction

A prompt that makes the model hedge on everything has not calibrated it — it has made it useless. If every answer comes back "medium," your instruction is too blunt. Dial it back so confidence still discriminates.
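
Over-correction shows up as one label swallowing the whole run. The sketch below reports the most common label and its share of all answers; the function name is illustrative, and any cut-off you apply (for example, treating a share above 0.8 as a warning) is a judgment call rather than an established threshold.

```python
from collections import Counter


def dominant_label_share(labels):
    """Return the most frequent confidence label and its share of all labels."""
    if not labels:
        raise ValueError("no labels to analyze")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)
```

If nine of ten answers come back "medium", the function returns `("medium", 0.9)`, a sign that confidence has stopped discriminating.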

Step 6: Tighten and Lock It In

Calibration is iterative. One pass rarely lands it.

Adjust one variable at a time

Change a single instruction, re-run the set, compare. Changing several things at once means you will not know what helped. This discipline is the difference between tuning and flailing.
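
A small guard can enforce the one-variable rule before a run starts. This sketch assumes each experiment is described by a plain dictionary of settings; the keys shown in the usage note are hypothetical.

```python
def changed_keys(previous: dict, current: dict) -> list:
    """List the settings that differ between two experiment configurations."""
    keys = previous.keys() | current.keys()
    return sorted(k for k in keys if previous.get(k) != current.get(k))
```

For example, comparing `{"reason_first": True, "temperature": 0.7}` with `{"reason_first": True, "temperature": 0.2}` returns `["temperature"]`. If the list has more than one entry, split the change into separate runs.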

Save the winning prompt as a reusable asset

Once a prompt calibrates well, store it as a template you reuse across tasks in this domain. Re-test it whenever you switch models, since calibration behavior does not transfer cleanly between them. The checklist for 2026 makes a handy pre-flight before you ship a calibrated prompt into production.
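
Storing the model name alongside the prompt makes the re-test rule checkable. The following is a sketch under the assumption that templates are kept as JSON text; the field names and both function names are illustrative.

```python
import json


def dump_template(name: str, prompt: str, tested_on: str) -> str:
    """Serialize a calibrated prompt together with the model it was tested on."""
    record = {"name": name, "prompt": prompt, "tested_on": tested_on}
    return json.dumps(record, indent=2)


def needs_retest(template_json: str, current_model: str) -> bool:
    """True when the template was calibrated against a different model."""
    return json.loads(template_json)["tested_on"] != current_model
```

Checking `needs_retest` before reusing a template turns "remember to re-test" into a step that cannot be skipped silently.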

Frequently Asked Questions

How many test questions do I really need?

Around twenty is enough to start and to see clear patterns without becoming a chore to grade. The key is variety: include easy items, hard items, and genuinely unanswerable ones. As the task grows in stakes, expand the set, but do not let perfect coverage stop you from running the first cheap pass.

What if I do not know the correct answers myself?

Then you cannot truly measure calibration on those items, only the model's internal consistency. Build your test set from questions where you can establish ground truth — documented facts, settled cases, or problems with a checkable solution. Save the genuinely uncertain questions to test whether the model appropriately refuses to fake certainty.

Should I tune the temperature setting too?

It can help. Lower temperature tends to make outputs more deterministic, which interacts with how confidence is expressed, but it is not a substitute for explicit calibration instructions. Treat it as one variable to adjust in Step 6, changing it alone and re-running your set so you can see its isolated effect.

Why reason before stating the confidence level?

Because a confidence level produced after a committed answer tends to rationalize that answer rather than assess it honestly. Asking for the evidence on both sides first means the confidence rating reflects the actual balance of support, which produces more trustworthy self-reports across a test set.

How often do I need to re-run this process?

Re-run whenever you change the model, meaningfully change the prompt, or move to a new domain. Calibration does not transfer cleanly across any of those. A saved test set makes re-running cheap, so treat it as a regression check rather than a one-time project.

Does this work for code or only for facts?

It works for both, with a tweak. For code, the ground truth is "does it run and pass tests," so your test set is small programming tasks with known outcomes. Ask the model to flag suggestions it has not mentally executed as lower confidence, then verify by actually running them.
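
For code, grading means executing the suggestion against known cases. The sketch below is one minimal way to do that; `passes_tests` is a hypothetical helper, and because it uses `exec` it should only be run on model output inside a sandbox or throwaway environment.

```python
def passes_tests(source: str, func_name: str, cases) -> bool:
    """Run model-written code and check it against known input/output cases.

    `cases` is a list of (args_tuple, expected_result) pairs. Any error,
    including a syntax error or a missing function, counts as a failure.
    """
    namespace = {}
    try:
        exec(source, namespace)  # assumption: a trusted or sandboxed context
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False
```

The boolean result plays the role of "correct" in the log, paired with whatever confidence level the model attached to the suggestion.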

Key Takeaways

  • Calibration is a measurable loop: define the stakes, build a test set, capture a baseline, write the calibration prompt, re-measure, tighten.
  • Build a small set of around twenty questions with known answers, including unanswerable ones, before changing any prompts.
  • Capture a no-instruction baseline so you can prove whether your prompt actually helped.
  • Layer techniques — grant permission to be unsure, require confidence labels, and reason before committing.
  • The goal is discrimination, not hedging: high-confidence answers should be reliably right, low-confidence ones where errors cluster.
  • Change one variable at a time, save winning prompts as reusable templates, and re-test whenever the model changes.
