A prompt that catches errors is only as trustworthy as your ability to prove it. Most teams run error-detection prompts on faith, assuming that fluent output means correct output, and they discover the gap only when a missed error reaches a client. The fix is to measure, and measuring well means picking the right small set of metrics rather than drowning in numbers nobody acts on.
This article defines the KPIs that actually tell you whether your error-detection prompting works, explains how to instrument each one, and, just as importantly, shows how to read the signal. A number you cannot interpret is worse than no number, because it invites false confidence. The aim is a dashboard small enough to look at and honest enough to trust.
Measurement also closes the loop on everything else. The staged process in The DETECT Loop: A Reusable Model for Catching AI Errors ends in a Track stage, and these metrics are what that stage tracks. Without them, improvement is just guessing dressed up as iteration.
The Foundation: A Known-Bad Test Set
Every meaningful metric requires labeled data.
Why you need it
You cannot measure catch rate without knowing how many errors were there to catch. A set of documents with known, planted errors is the ground truth against which every other metric is computed.
How to build it
Collect real examples, plant or label known defects across the categories you care about, and keep the set versioned. Grow it whenever a new failure type escapes into production. This calibration discipline is the same one in Hard-Won Rules for Error-Checking Prompts That Hold Up.
Catch Rate (Recall)
The headline metric is the share of real errors the prompt finds.
How to instrument it
Run the prompt against your known-bad set and compute caught errors divided by total planted errors. Track it per error category, because a prompt can have great overall recall while systematically missing one class.
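As a minimal sketch of that calculation, assuming each planted error carries an ID and a category label and that the prompt's flags have already been matched to those IDs (the names `planted`, `caught_ids`, and `catch_rate_by_category` are illustrative, not part of any particular tool):

```python
from collections import defaultdict


def catch_rate_by_category(planted, caught_ids):
    """Return (overall_recall, {category: recall}).

    planted: list of (error_id, category) pairs from the known-bad set.
    caught_ids: set of error_ids the prompt flagged correctly.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for error_id, category in planted:
        totals[category] += 1
        if error_id in caught_ids:
            hits[category] += 1
    per_category = {cat: hits[cat] / totals[cat] for cat in totals}
    overall = sum(hits.values()) / len(planted) if planted else 0.0
    return overall, per_category


# Hypothetical data: four planted errors in two categories, three caught.
planted = [("e1", "factual"), ("e2", "factual"), ("e3", "numeric"), ("e4", "numeric")]
caught = {"e1", "e2", "e3"}
overall, per_category = catch_rate_by_category(planted, caught)
print(overall)       # 0.75
print(per_category)  # {'factual': 1.0, 'numeric': 0.5}
```

In this made-up run the overall figure of 0.75 hides the fact that half the numeric errors were missed, which is the per-category blind spot the breakdown is meant to expose.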
How to read it
Rising catch rate is good, but read it together with false positives. A prompt that flags everything has perfect recall and is useless. Recall only means something alongside precision.
False-Positive Rate (Precision)
The counterweight is how often a flagged item is not actually an error.
How to instrument it
Of the items the prompt flagged, measure the fraction that were not real errors; precision is the complement, the fraction that were. This requires reviewing every flag against ground truth, which your known-bad set supports.
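One way to compute this, sketched under the assumption that flags and ground-truth errors share the same ID scheme (the function name `flag_quality` and the sample IDs are hypothetical):

```python
def flag_quality(flagged_ids, true_error_ids):
    """Return (precision, false_flag_share) for one run.

    Returns (None, None) when nothing was flagged, because both
    ratios are undefined with an empty denominator.
    """
    flagged = set(flagged_ids)
    if not flagged:
        return None, None
    true_positives = len(flagged & set(true_error_ids))
    false_flags = len(flagged) - true_positives
    return true_positives / len(flagged), false_flags / len(flagged)


# Hypothetical run: four flags, two of which match real planted errors.
precision, false_flag_share = flag_quality(
    ["e1", "e2", "x1", "x2"], ["e1", "e2", "e3"]
)
print(precision, false_flag_share)  # 0.5 0.5
```

Returning `None` for an empty flag list is a deliberate choice in this sketch: a prompt that flags nothing should not be reported as having zero false positives.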
How to read it
A high false-positive rate erodes trust and wastes reviewer time, eventually causing editors to ignore flags entirely. The false-positive storms described in Five Error-Detection Prompts, Walked Through End to End are precisely what this metric catches.
Escaped-Error Rate
The metric that matters most to clients is what slips all the way through.
How to instrument it
Count the errors discovered after the work shipped and divide by the volume shipped, normalizing per thousand words or per release so that periods of different size stay comparable. Source the counts from client reports, post-publication audits, or production incidents.
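The per-thousand-words normalization could look like the following sketch; the function name and the sample figures are invented for illustration:

```python
def escaped_per_thousand_words(escaped_errors, words_shipped):
    """Escaped errors normalized per 1,000 shipped words."""
    if words_shipped <= 0:
        raise ValueError("words_shipped must be positive")
    return 1000 * escaped_errors / words_shipped


# Hypothetical month: 3 escaped errors across 60,000 shipped words.
rate = escaped_per_thousand_words(3, 60_000)
print(rate)  # 0.05
```

Normalizing this way keeps a busy month from looking worse than a quiet one simply because more work went out the door.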
How to read it
This is your true outcome metric. Catch rate and precision are leading indicators; escaped-error rate is the lagging reality. A workflow can look healthy on leading metrics and still leak, which is why you track both.
Correction-Introduced Error Rate
A subtle but vital metric is how often correction creates new problems.
How to instrument it
In your verification pass, count corrections that resolved the flagged error but introduced a new one, divided by total corrections. This isolates the danger of overcorrection.
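A small sketch of that count, assuming the verification pass records two yes/no observations per correction (the `Correction` record and its field names are assumptions made for this example):

```python
from dataclasses import dataclass


@dataclass
class Correction:
    resolved_flag: bool   # the fix resolved the flagged error
    introduced_new: bool  # verification found a new error caused by the fix


def correction_introduced_rate(corrections):
    """Share of all corrections that fixed the flag but added a new error."""
    if not corrections:
        return 0.0
    harmful = sum(1 for c in corrections if c.resolved_flag and c.introduced_new)
    return harmful / len(corrections)


# Hypothetical verification log: four corrections, one of which overcorrected.
log = [
    Correction(resolved_flag=True, introduced_new=False),
    Correction(resolved_flag=True, introduced_new=True),
    Correction(resolved_flag=True, introduced_new=False),
    Correction(resolved_flag=False, introduced_new=False),
]
print(correction_introduced_rate(log))  # 0.25
```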
How to read it
A nonzero rate here is the quantitative case for never skipping verification. It is the metric that proves the failure mode from Seven Ways Error-Detection Prompts Quietly Fail You is real in your own workflow.
Human Review Load
An operational metric keeps the workflow sustainable.
How to instrument it
Track the share of flagged items routed to human review and the time spent per item. This tells you whether your confidence thresholds are calibrated.
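Both numbers can come from one pass over the flag log. The sketch below assumes each flag is stored as a dict with a `routed_to_human` boolean and a `minutes` field; those key names are illustrative:

```python
def review_load(flags):
    """Return (share_routed_to_human, average_minutes_per_reviewed_item)."""
    if not flags:
        return 0.0, 0.0
    routed = [f for f in flags if f["routed_to_human"]]
    share = len(routed) / len(flags)
    avg_minutes = sum(f["minutes"] for f in routed) / len(routed) if routed else 0.0
    return share, avg_minutes


# Hypothetical log: four flags, two sent to a human reviewer.
flags = [
    {"routed_to_human": True, "minutes": 4.0},
    {"routed_to_human": True, "minutes": 2.0},
    {"routed_to_human": False, "minutes": 0.0},
    {"routed_to_human": False, "minutes": 0.0},
]
share, avg_minutes = review_load(flags)
print(share, avg_minutes)  # 0.5 3.0
```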
How to read it
If review load is climbing without a matching drop in escaped errors, your thresholds are too conservative and you are paying for scrutiny that is not buying safety. Tune until review effort concentrates on the items that actually need it.
Reading the Metrics Together
Individual numbers mislead; the pattern across them tells the real story.
Common patterns and what they mean
- High catch rate, high false positives: the prompt is over-flagging. Tighten the error taxonomy and watch precision recover without recall collapsing.
- High catch rate, low false positives, but rising escaped errors: your test set no longer reflects production. New error types are escaping because they were never in the labeled data. Refresh the set.
- Low correction-introduced errors but climbing review load: thresholds are too conservative. You are paying for human scrutiny that is not preventing escapes.
- Everything looks healthy, but escaped errors hold steady instead of falling: there is a class of error your prompt simply cannot see. Add it to the test set and redesign the detection prompt for it.
Why the pattern beats the number
Any single metric can be gamed or can mislead in isolation. A prompt with perfect recall and terrible precision looks great on one axis and is useless. Reading the metrics as a set is what turns a dashboard into a diagnosis.
Instrumenting Without Heavy Tooling
You do not need a platform to start measuring.
A lightweight setup
- Keep the known-bad set as a folder of labeled documents under version control.
- Run prompts against it with a simple script and record caught, missed, and false-flagged counts in a spreadsheet.
- Log escaped errors as they surface in client reports or production, tagged by type.
- Review the small dashboard on a fixed cadence and after every prompt change.
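The recording step in the setup above can be as simple as appending one row per run to a CSV file that opens in any spreadsheet. This is a sketch only: the column names, the `record_run` function, and the file layout are assumptions, not a prescribed format.

```python
import csv
from datetime import date
from pathlib import Path

FIELDS = ["run_date", "prompt_version", "caught", "missed", "false_flagged"]


def record_run(csv_path, prompt_version, caught, missed, false_flagged, run_date=None):
    """Append one run's counts to a CSV log, writing the header on first use."""
    path = Path(csv_path)
    is_new = not path.exists()
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "run_date": run_date or date.today().isoformat(),
            "prompt_version": prompt_version,
            "caught": caught,
            "missed": missed,
            "false_flagged": false_flagged,
        })
```

A call such as `record_run("runs.csv", "v3", caught=18, missed=2, false_flagged=4)` after each test run (the values here are made up) builds the history the dashboard review reads from, and the file can live in the same repository as the known-bad set.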
Why lightweight is enough at first
The discipline of measuring matters more than the sophistication of the tooling. A spreadsheet you actually update beats a dashboard nobody reads. As volume grows, the tooling categories in Choosing Tooling That Backs Your Error-Detection Prompts become worth the investment, but they are an optimization, not a prerequisite.
Frequently Asked Questions
Which single metric matters most?
Escaped-error rate, because it is the outcome clients actually experience. Catch rate and precision are leading indicators that predict it, but escaped-error rate is the lagging truth you are ultimately accountable for.
Why measure catch rate and false positives together?
Because either alone is gameable. A prompt that flags everything has perfect recall and useless precision; a prompt that flags only the single most obvious error can have perfect precision and useless recall. Only the pair tells you whether the prompt is genuinely discriminating.
How big does the known-bad test set need to be?
Large enough to expose systematic misses per error category, which often means a few dozen labeled examples to start. Grow it every time a new failure type escapes, so the set reflects your real risk surface.
What does a high correction-introduced error rate tell me?
That correction is creating new defects and that skipping verification would let them ship. It is the quantitative justification for keeping the verification pass mandatory on anything that matters.
How often should I recompute these metrics?
Recompute the labeled-set metrics whenever you change a prompt, and review escaped-error rate on a regular cadence such as monthly or per release. The first tells you if a change helped; the second tells you if reality agrees.
How do I avoid drowning in metrics?
Track this small set and act on it. Catch rate, false positives, escaped errors, correction-introduced errors, and review load cover the workflow. More numbers without corresponding actions just manufacture false confidence.
Turning Metrics Into Decisions
A metric only earns its place if it changes what you do.
Pairing each metric with an action
- Catch rate drops below your bar: redesign the detection prompt for the missed category and re-test against the known-bad set.
- False positives climb: tighten the error taxonomy and narrow what the model is allowed to flag.
- Escaped errors rise while leading metrics look fine: refresh the test set, because production has outgrown it.
- Correction-introduced errors appear: make the verification pass mandatory and investigate the overcorrection.
- Review load climbs without fewer escapes: loosen overly conservative confidence thresholds.
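The pairings above can be encoded as a small rule check that compares the current review period with the previous one. Everything specific here is an assumption for illustration: the metric key names, the 0.9 recall bar, and the choice to treat any increase as a "climb". Real thresholds depend on your own risk tolerance.

```python
REDESIGN = "redesign the detection prompt for the missed category and re-test"
TIGHTEN = "tighten the error taxonomy and narrow what may be flagged"
REFRESH = "refresh the known-bad set: production has outgrown it"
VERIFY = "make the verification pass mandatory and investigate overcorrection"
LOOSEN = "loosen overly conservative confidence thresholds"


def suggest_actions(prev, curr, recall_bar=0.9):
    """Map metric movements between two periods to the paired actions.

    prev and curr are dicts with the keys catch_rate, false_flag_share,
    escaped_rate, correction_introduced_rate, and review_share.
    """
    actions = []
    recall_ok = curr["catch_rate"] >= recall_bar
    precision_ok = curr["false_flag_share"] <= prev["false_flag_share"]
    if not recall_ok:
        actions.append(REDESIGN)
    if not precision_ok:
        actions.append(TIGHTEN)
    if curr["escaped_rate"] > prev["escaped_rate"] and recall_ok and precision_ok:
        actions.append(REFRESH)
    if curr["correction_introduced_rate"] > 0:
        actions.append(VERIFY)
    if curr["review_share"] > prev["review_share"] and curr["escaped_rate"] >= prev["escaped_rate"]:
        actions.append(LOOSEN)
    return actions


# Hypothetical periods: leading metrics look fine, but escapes went up.
prev = {"catch_rate": 0.92, "false_flag_share": 0.10, "escaped_rate": 0.05,
        "correction_introduced_rate": 0.0, "review_share": 0.20}
curr = {"catch_rate": 0.93, "false_flag_share": 0.10, "escaped_rate": 0.08,
        "correction_introduced_rate": 0.0, "review_share": 0.20}
print(suggest_actions(prev, curr))
```

The rules are intentionally crude. Their value is that every metric on the dashboard has a named response attached, so a changed number always prompts a decision.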
Why the pairing matters
Numbers without paired actions become wallpaper, glanced at and ignored. Tying every metric to a specific response is what makes measurement a steering wheel rather than a rearview mirror, and it is what lets the Track stage of The DETECT Loop: A Reusable Model for Catching AI Errors actually improve the loop over time.
Key Takeaways
- A versioned known-bad test set is the foundation every other metric depends on.
- Read catch rate and false-positive rate together; neither is meaningful alone.
- Escaped-error rate is the true outcome metric clients actually experience.
- Correction-introduced error rate is the quantitative case for mandatory verification.
- Human review load tells you whether confidence thresholds are well calibrated.
- Keep the metric set small and act on it, or it just manufactures false confidence.