
Signals That an Error-Detection Prompt Is Working

Agency Script Editorial

Editorial Team

July 5, 2021 · 7 min read

A prompt that catches errors is only as trustworthy as your ability to prove it. Most teams run error-detection prompts on faith, assuming that fluent output means correct output, and they discover the gap only when a missed error reaches a client. The fix is to measure, and measuring well means picking the right small set of metrics rather than drowning in numbers nobody acts on.

This article defines the KPIs that actually tell you whether your error-detection prompting works, explains how to instrument each one, and, just as importantly, shows how to read the signal. A number you cannot interpret is worse than no number, because it invites false confidence. The aim is a dashboard small enough to look at and honest enough to trust.

Measurement also closes the loop on everything else. The staged process in The DETECT Loop: A Reusable Model for Catching AI Errors ends in a Track stage, and these metrics are what that stage tracks. Without them, improvement is just guessing dressed up as iteration.

The Foundation: A Known-Bad Test Set

Every meaningful metric requires labeled data.

Why you need it

You cannot measure catch rate without knowing how many errors were there to catch. A set of documents with known, planted errors is the ground truth against which every other metric is computed.

How to build it

Collect real examples, plant or label known defects across the categories you care about, and keep the set versioned. Grow it whenever a new failure type escapes into production. This calibration discipline is the same one in Hard-Won Rules for Error-Checking Prompts That Hold Up.
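One way to keep such a set structured is sketched below. This is a minimal illustration, not a prescribed schema: the class names (`PlantedError`, `KnownBadDocument`, `KnownBadSet`), the field names, and the sample categories are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class PlantedError:
    """One known defect. All field names here are hypothetical."""
    error_id: str
    category: str  # e.g. "numeric", "factual"
    note: str      # what is wrong, for the human maintaining the set


@dataclass
class KnownBadDocument:
    doc_id: str
    text: str
    planted: List[PlantedError] = field(default_factory=list)


@dataclass
class KnownBadSet:
    version: str  # bump whenever documents or labels change
    documents: List[KnownBadDocument] = field(default_factory=list)

    def total_planted(self) -> int:
        return sum(len(doc.planted) for doc in self.documents)

    def categories(self) -> Set[str]:
        return {err.category for doc in self.documents for err in doc.planted}


example_set = KnownBadSet(
    version="v1",
    documents=[
        KnownBadDocument(
            "doc-001",
            "Revenue grew 12% from $4.0M to $4.1M.",
            [PlantedError("e1", "numeric", "stated growth does not match the totals")],
        ),
        KnownBadDocument(
            "doc-002",
            "The report was published in 2019 and cites 2021 data from 14 offices.",
            [
                PlantedError("e2", "factual", "cited data postdates the publication year"),
                PlantedError("e3", "numeric", "office count contradicts the source table"),
            ],
        ),
    ],
)

print(example_set.total_planted(), sorted(example_set.categories()))
```

Keeping the version on the set itself means every metric you record later can say which ground truth it was computed against.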

Catch Rate (Recall)

The headline metric is the share of real errors the prompt finds.

How to instrument it

Run the prompt against your known-bad set and compute caught errors divided by total planted errors. Track it per error category, because a prompt can have great overall recall while systematically missing one class.
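The calculation can be sketched in a few lines. The function name `catch_rate_by_category`, the `(error_id, category)` tuple layout, and the sample data are assumptions for illustration; the point is that overall recall can look acceptable while one category is mostly missed.

```python
from collections import Counter


def catch_rate_by_category(planted, caught_ids):
    """planted: list of (error_id, category) pairs from the known-bad set.
    caught_ids: set of error_ids the prompt actually flagged.
    Returns (overall_rate, {category: rate})."""
    total = Counter(category for _, category in planted)
    caught = Counter(category for error_id, category in planted if error_id in caught_ids)
    per_category = {category: caught[category] / count for category, count in total.items()}
    overall = sum(caught.values()) / len(planted) if planted else 0.0
    return overall, per_category


# Hypothetical run: eight planted errors, six caught.
planted = [
    ("e1", "numeric"), ("e2", "numeric"), ("e3", "numeric"),
    ("e4", "factual"), ("e5", "factual"),
    ("e6", "citation"), ("e7", "citation"), ("e8", "citation"),
]
caught_ids = {"e1", "e2", "e3", "e4", "e5", "e6"}

overall, per_category = catch_rate_by_category(planted, caught_ids)
print(overall)       # 0.75
print(per_category)  # citation recall is only one in three
```

Here the overall figure of 0.75 hides the fact that two of the three citation errors were missed, which is exactly what the per-category breakdown is for.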

How to read it

Rising catch rate is good, but read it together with false positives. A prompt that flags everything has perfect recall and is useless. Recall only means something alongside precision.

False-Positive Rate (Precision)

The counterweight is how often a flagged item is not actually an error.

How to instrument it

Of the items the prompt flagged, measure the fraction that were not real errors. Strictly speaking, this figure is one minus precision, so lower is better. Computing it requires checking each flag against ground truth, which your known-bad set supports.
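As a rough sketch, the fraction of flags that were not real errors can be computed with set arithmetic. The name `false_flag_fraction` and the sample IDs are hypothetical. Precision is one minus this fraction, and the value is treated as undefined when nothing was flagged.

```python
def false_flag_fraction(flagged_ids, true_error_ids):
    """Share of flagged items that are not real errors.
    Returns None when nothing was flagged, since the ratio is undefined."""
    flagged = set(flagged_ids)
    if not flagged:
        return None
    false_flags = flagged - set(true_error_ids)
    return len(false_flags) / len(flagged)


# Hypothetical run: five flags, two of which point at text that was fine.
flagged = {"e1", "e2", "e4", "x1", "x2"}
true_errors = {"e1", "e2", "e3", "e4"}

fraction = false_flag_fraction(flagged, true_errors)
precision = 1 - fraction
print(fraction)  # 0.4
```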

How to read it

A high false-positive rate erodes trust and wastes reviewer time, eventually causing editors to ignore flags entirely. The false-positive storms described in Five Error-Detection Prompts, Walked Through End to End are precisely what this metric catches.

Escaped-Error Rate

The metric that matters most to clients is what slips all the way through.

How to instrument it

Count the errors discovered after the work shipped and divide by the volume shipped, ideally normalized per thousand words or per release. Source these from client reports, post-publication audits, or production incidents.
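A minimal sketch of the per-thousand-words normalization follows. The function name, the release records, and their numbers are invented for illustration.

```python
def escaped_per_thousand_words(escaped_errors, words_shipped):
    """Escaped errors normalized per 1,000 words shipped."""
    if words_shipped <= 0:
        raise ValueError("words_shipped must be positive")
    return escaped_errors * 1000 / words_shipped


# Hypothetical release log; in practice the counts would come from
# client reports, post-publication audits, or production incidents.
releases = [
    {"release": "r1", "words": 12000, "escaped": 3},
    {"release": "r2", "words": 8000, "escaped": 1},
]

per_release = {
    r["release"]: escaped_per_thousand_words(r["escaped"], r["words"])
    for r in releases
}
overall_rate = escaped_per_thousand_words(
    sum(r["escaped"] for r in releases),
    sum(r["words"] for r in releases),
)
print(per_release)   # {'r1': 0.25, 'r2': 0.125}
print(overall_rate)  # 0.2
```

Normalizing by volume keeps a large release from looking worse than a small one simply because it contains more text.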

How to read it

This is your true outcome metric. Catch rate and precision are leading indicators; escaped-error rate is the lagging reality. A workflow can look healthy on leading metrics and still leak, which is why you track both.

Correction-Introduced Error Rate

A subtle but vital metric is how often correction creates new problems.

How to instrument it

In your verification pass, count corrections that resolved the flagged error but introduced a new one, divided by total corrections. This isolates the danger of overcorrection.
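The ratio described above can be sketched as follows, assuming the verification pass records two yes-or-no judgments per correction. The `CorrectionResult` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CorrectionResult:
    """Outcome of verifying one correction (hypothetical record shape)."""
    correction_id: str
    resolved_flag: bool         # the flagged error is gone
    introduced_new_error: bool  # the edit created a different defect


def correction_introduced_rate(results):
    """Corrections that fixed the flag but added a new error, over all corrections.
    Returns None when there were no corrections to verify."""
    if not results:
        return None
    regressions = sum(1 for r in results if r.resolved_flag and r.introduced_new_error)
    return regressions / len(results)


results = [
    CorrectionResult("c1", resolved_flag=True, introduced_new_error=False),
    CorrectionResult("c2", resolved_flag=True, introduced_new_error=True),
    CorrectionResult("c3", resolved_flag=True, introduced_new_error=False),
    CorrectionResult("c4", resolved_flag=False, introduced_new_error=False),
]
rate = correction_introduced_rate(results)
print(rate)  # 0.25
```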

How to read it

A nonzero rate here is the quantitative case for never skipping verification. It is the metric that proves the failure mode from Seven Ways Error-Detection Prompts Quietly Fail You is real in your own workflow.

Human Review Load

An operational metric keeps the workflow sustainable.

How to instrument it

Track the share of flagged items routed to human review and the time spent per item. This tells you whether your confidence thresholds are calibrated.
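Both numbers can come from one small helper. This is a sketch under the assumption that you log the minutes spent on each item a human reviewed; the function name and sample figures are illustrative.

```python
def review_load(flagged_count, review_minutes):
    """flagged_count: total items the prompt flagged.
    review_minutes: minutes spent on each flagged item routed to a human.
    Returns (share_routed, average_minutes_per_item), or None if nothing was flagged."""
    if flagged_count == 0:
        return None
    routed = len(review_minutes)
    share_routed = routed / flagged_count
    average_minutes = sum(review_minutes) / routed if routed else 0.0
    return share_routed, average_minutes


# Hypothetical week: 20 flags, 5 of them escalated to an editor.
share_routed, average_minutes = review_load(20, [4, 6, 2, 8, 5])
print(share_routed, average_minutes)  # 0.25 5.0
```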

How to read it

If review load is climbing without a matching drop in escaped errors, your thresholds are too conservative and you are paying for scrutiny that is not buying safety. Tune until review effort concentrates on the items that actually need it.

Reading the Metrics Together

Individual numbers mislead; the pattern across them tells the real story.

Common patterns and what they mean

  • High catch rate, high false positives: the prompt is over-flagging. Tighten the error taxonomy and watch precision recover without recall collapsing.
  • High catch rate, low false positives, but rising escaped errors: your test set no longer reflects production. New error types are escaping because they were never in the labeled data. Refresh the set.
  • Low correction-introduced errors but climbing review load: thresholds are too conservative. You are paying for human scrutiny that is not preventing escapes.
  • Everything looks healthy but escaped errors hold steady: there is a class of error your prompt simply cannot see. Add it to the test set and redesign the detection prompt for it.
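The four patterns above can be written down as explicit checks, which forces you to decide what "high" and "rising" mean for your own numbers. The sketch below assumes you have already reduced each metric to a yes-or-no observation or a trend label; the function name and argument names are hypothetical.

```python
def diagnose(catch_rate_high, false_flags_high, escaped_trend,
             correction_errors_low, review_load_rising):
    """Map a set of observations to the patterns listed above.
    escaped_trend is one of "rising", "steady", or "falling"."""
    findings = []
    if catch_rate_high and false_flags_high:
        findings.append("over-flagging: tighten the error taxonomy")
    if catch_rate_high and not false_flags_high and escaped_trend == "rising":
        findings.append("stale test set: refresh it with the new error types")
    if correction_errors_low and review_load_rising:
        findings.append("thresholds too conservative: review is not preventing escapes")
    all_else_healthy = (catch_rate_high and not false_flags_high
                        and correction_errors_low and not review_load_rising)
    if all_else_healthy and escaped_trend == "steady":
        findings.append("blind spot: add the unseen error class and redesign the prompt")
    return findings


print(diagnose(True, False, "rising", True, False))
print(diagnose(True, True, "falling", True, True))
```

More than one pattern can fire at once, so the function returns a list rather than a single verdict.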

Why the pattern beats the number

Any single metric can be gamed or can mislead in isolation. A prompt with perfect recall and terrible precision looks great on one axis and is useless. Reading the metrics as a set is what turns a dashboard into a diagnosis.

Instrumenting Without Heavy Tooling

You do not need a platform to start measuring.

A lightweight setup

  • Keep the known-bad set as a folder of labeled documents under version control.
  • Run prompts against it with a simple script and record caught, missed, and false-flagged counts in a spreadsheet.
  • Log escaped errors as they surface in client reports or production, tagged by type.
  • Review the small dashboard on a fixed cadence and after every prompt change.
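If a spreadsheet is the dashboard, a CSV file that a script appends to is one simple way to feed it. The sketch below uses only the Python standard library; the column names, file name, and sample counts are assumptions, and the demo writes to a temporary directory so it leaves nothing behind.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical column layout for one row per prompt run.
FIELDS = ["run_date", "prompt_version", "test_set_version",
          "caught", "missed", "false_flagged"]


def append_run(log_path, row):
    """Append one run to the CSV log, writing the header on first use."""
    log_path = Path(log_path)
    write_header = not log_path.exists()
    with log_path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)


def read_runs(log_path):
    """Read the log back as a list of dicts (all values are strings)."""
    with Path(log_path).open(newline="") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "detection_runs.csv"
    append_run(log, {"run_date": "2021-07-01", "prompt_version": "p1",
                     "test_set_version": "v1", "caught": 6, "missed": 2,
                     "false_flagged": 2})
    append_run(log, {"run_date": "2021-07-05", "prompt_version": "p2",
                     "test_set_version": "v1", "caught": 7, "missed": 1,
                     "false_flagged": 1})
    runs = read_runs(log)

print(len(runs), runs[-1]["caught"])
```

Recording the prompt version and test set version on every row is what lets you compare runs after a prompt change.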

Why lightweight is enough at first

The discipline of measuring matters more than the sophistication of the tooling. A spreadsheet you actually update beats a dashboard nobody reads. As volume grows, the tooling categories in Choosing Tooling That Backs Your Error-Detection Prompts become worth the investment, but they are an optimization, not a prerequisite.

Frequently Asked Questions

Which single metric matters most?

Escaped-error rate, because it is the outcome clients actually experience. Catch rate and precision are leading indicators that predict it, but escaped-error rate is the lagging truth you are ultimately accountable for.

Why measure catch rate and false positives together?

Because either alone is gameable. A prompt that flags everything has perfect recall and useless precision; a prompt that flags nothing produces no false positives and catches nothing. Only the pair tells you whether the prompt is genuinely discriminating.

How big does the known-bad test set need to be?

Large enough to expose systematic misses per error category, which often means a few dozen labeled examples to start. Grow it every time a new failure type escapes, so the set reflects your real risk surface.

What does a high correction-introduced error rate tell me?

That correction is creating new defects and that skipping verification would let them ship. It is the quantitative justification for keeping the verification pass mandatory on anything that matters.

How often should I recompute these metrics?

Recompute the labeled-set metrics whenever you change a prompt, and review escaped-error rate on a regular cadence such as monthly or per release. The first tells you if a change helped; the second tells you if reality agrees.

How do I avoid drowning in metrics?

Track this small set and act on it. Catch rate, false positives, escaped errors, correction-introduced errors, and review load cover the workflow. More numbers without corresponding actions just manufacture false confidence.

Turning Metrics Into Decisions

A metric only earns its place if it changes what you do.

Pairing each metric with an action

  • Catch rate drops below your bar: redesign the detection prompt for the missed category and re-test against the known-bad set.
  • False positives climb: tighten the error taxonomy and narrow what the model is allowed to flag.
  • Escaped errors rise while leading metrics look fine: refresh the test set, because production has outgrown it.
  • Correction-introduced errors appear: make the verification pass mandatory and investigate the overcorrection.
  • Review load climbs without fewer escapes: loosen overly conservative confidence thresholds.
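The pairings above can be kept next to the numbers as a small rule table, so each review of the dashboard ends with a list of actions rather than a glance. This sketch compares the current snapshot with the previous one; the dictionary keys, the `catch_rate_bar` parameter, and the sample values are all assumptions.

```python
def actions_for(current, previous, catch_rate_bar):
    """Return the actions triggered by the change between two metric snapshots."""
    actions = []
    false_flags_climbing = current["false_flag_fraction"] > previous["false_flag_fraction"]
    leading_ok = current["catch_rate"] >= catch_rate_bar and not false_flags_climbing
    escapes_rising = current["escaped_per_1k"] > previous["escaped_per_1k"]
    escapes_not_falling = current["escaped_per_1k"] >= previous["escaped_per_1k"]

    if current["catch_rate"] < catch_rate_bar:
        actions.append("redesign the detection prompt for the missed category and re-test")
    if false_flags_climbing:
        actions.append("tighten the error taxonomy")
    if escapes_rising and leading_ok:
        actions.append("refresh the test set")
    if current["correction_introduced_rate"] > 0:
        actions.append("make verification mandatory and investigate the overcorrection")
    if current["review_minutes"] > previous["review_minutes"] and escapes_not_falling:
        actions.append("loosen overly conservative confidence thresholds")
    return actions


previous = {"catch_rate": 0.90, "false_flag_fraction": 0.10, "escaped_per_1k": 0.2,
            "correction_introduced_rate": 0.0, "review_minutes": 120}
current = {"catch_rate": 0.92, "false_flag_fraction": 0.08, "escaped_per_1k": 0.3,
           "correction_introduced_rate": 0.05, "review_minutes": 150}

for action in actions_for(current, previous, catch_rate_bar=0.85):
    print(action)
```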

Why the pairing matters

Numbers without paired actions become wallpaper, glanced at and ignored. Tying every metric to a specific response is what makes measurement a steering wheel rather than a rearview mirror, and it is what lets the Track stage of The DETECT Loop: A Reusable Model for Catching AI Errors actually improve the loop over time.

Key Takeaways

  • A versioned known-bad test set is the foundation every other metric depends on.
  • Read catch rate and false-positive rate together; neither is meaningful alone.
  • Escaped-error rate is the true outcome metric clients actually experience.
  • Correction-introduced error rate is the quantitative case for mandatory verification.
  • Human review load tells you whether confidence thresholds are well calibrated.
  • Keep the metric set small and act on it, or it just manufactures false confidence.
