A single number — "our prompt is 92% accurate" — feels like it tells you something, but on its own it tells you almost nothing useful. Accurate by what definition? On which kinds of problems? When it is wrong, how wrong, and does it know it is wrong? Two prompts can post identical accuracy figures while one is safe to ship and the other is a liability waiting to surface.
Measuring numerical reasoning well means instrumenting for the failure modes that actually cause damage, not just counting right answers. A model that is right 92% of the time and silently confident on the wrong 8% is more dangerous than one that is right 88% of the time and flags its own uncertainty. The metrics you choose decide which of those two you can tell apart.
This article defines the KPIs worth tracking, explains how to instrument them so the numbers are honest, and shows how to read the resulting signal to make decisions. The throughline: measure the things that change what you ship, and ignore the vanity numbers that only make a slide look good.
The Metrics That Matter
Exact-match accuracy, segmented
Start with the obvious metric — did the final number match the expected answer — but never report it as a single aggregate. Segment it by problem type, difficulty, and magnitude. A prompt can be flawless on small integers and fall apart on percentages or large multiplications, and the aggregate hides exactly that.
- By difficulty: simple, multi-step, and edge cases reported separately.
- By operation: arithmetic, ratios, date math, financial calculations.
- By magnitude: small numbers, large numbers, and values near zero or sign boundaries.
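As a rough illustration of the segmentation above, here is a minimal sketch that groups exact-match results by one field. The record layout (`operation`, `difficulty`, `expected`, `predicted`) and the sample data are assumptions made for this example, not a prescribed schema.

```python
from collections import defaultdict

def segmented_accuracy(records, key):
    """Return {segment: (correct, total, accuracy)} grouped by records[key]."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [correct, total]
    for r in records:
        bucket = counts[r[key]]
        if r["predicted"] == r["expected"]:
            bucket[0] += 1
        bucket[1] += 1
    return {seg: (c, n, c / n) for seg, (c, n) in counts.items()}

# Hypothetical results, for illustration only.
SAMPLE_RESULTS = [
    {"operation": "arithmetic", "difficulty": "simple", "expected": 12, "predicted": 12},
    {"operation": "arithmetic", "difficulty": "simple", "expected": 7, "predicted": 7},
    {"operation": "ratios", "difficulty": "multi-step", "expected": 25, "predicted": 20},
    {"operation": "ratios", "difficulty": "multi-step", "expected": 40, "predicted": 40},
]

by_operation = segmented_accuracy(SAMPLE_RESULTS, "operation")
by_difficulty = segmented_accuracy(SAMPLE_RESULTS, "difficulty")
```

The aggregate here is 75%, yet arithmetic sits at 100% and ratios at 50%, which is the gap a single figure would hide. The sketch compares integers directly; for decimal answers, compare `decimal.Decimal` values or normalized strings rather than raw floats.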
Error magnitude, not just error rate
Knowing how often the model is wrong is half the picture; knowing how far off it is when wrong is the other half. A pricing prompt that errs by a cent is annoying; one that errs by a factor of ten is catastrophic. Track the distribution of error sizes, not just the count of errors.
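One way to track error size is relative error on the wrong answers only. The sketch below assumes `(expected, predicted)` pairs and uses made-up sample values; the function names are hypothetical.

```python
import statistics

def relative_errors(pairs):
    """Relative error |predicted - expected| / |expected| for each wrong answer."""
    errors = []
    for expected, predicted in pairs:
        # Skip correct answers. Cases with expected == 0 are also skipped because
        # relative error is undefined there; track absolute error for those separately.
        if predicted == expected or expected == 0:
            continue
        errors.append(abs(predicted - expected) / abs(expected))
    return errors

def summarize(errors):
    """Summarize the error-size distribution, not just the error count."""
    if not errors:
        return {"count": 0, "median": None, "max": None}
    return {
        "count": len(errors),
        "median": statistics.median(errors),
        "max": max(errors),
    }

# Hypothetical pairs: one correct, one off by a cent, one off by a factor of ten.
SAMPLE_PAIRS = [(100.0, 100.0), (19.99, 20.00), (50.0, 500.0)]
summary = summarize(relative_errors(SAMPLE_PAIRS))
```

Both wrong answers count equally toward the error rate, but the maximum relative error separates the one-cent slip from the factor-of-ten failure.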
Calibration
Calibration measures whether the model's confidence matches its correctness. A well-calibrated system is right when it sounds sure and hedges when it is not. Poor calibration, meaning confident wrong answers, is the most dangerous property in numerical work, because reviewers tend to trust a confident answer and skip the check that would have caught the error.
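A common way to put a number on calibration is expected calibration error (ECE): bin answers by stated confidence, then take the weighted average gap between each bin's accuracy and its mean confidence. The sketch below assumes each answer comes with a numeric confidence in [0, 1], which is an assumption about your setup; the article does not say how confidence is obtained.

```python
def expected_calibration_error(samples, n_bins=10):
    """samples: list of (confidence in [0, 1], correct as bool). Returns ECE."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in samples:
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the last bin
        bins[idx].append((conf, correct))
    total = len(samples)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece

# Hypothetical samples: always sure and always right vs. sure but right half the time.
well_calibrated = [(1.0, True)] * 4
overconfident = [(0.9, True), (0.9, False)]

ece_good = expected_calibration_error(well_calibrated)
ece_bad = expected_calibration_error(overconfident)
```

A lower value is better. The overconfident sample scores about 0.4 because stated confidence is 0.9 while accuracy is 0.5.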
Tool-use compliance
If your design routes calculations to a deterministic tool, measure how often the model actually used the tool versus computing in its head. A high silent-bypass rate is a leading indicator of accuracy decay that aggregate accuracy will not reveal until it is too late.
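The silent-bypass rate can be computed from logged traces. This sketch assumes each trace carries a `requires_calculation` flag and a list of tool names in `tool_calls`, and that the tool is named `calculator`; all three are hypothetical names chosen for illustration.

```python
def silent_bypass_rate(traces, tool_name="calculator"):
    """Share of calculation-bearing responses that never called the tool."""
    needing = [t for t in traces if t["requires_calculation"]]
    if not needing:
        return 0.0
    bypassed = sum(1 for t in needing if tool_name not in t["tool_calls"])
    return bypassed / len(needing)

# Hypothetical traces, for illustration only.
SAMPLE_TRACES = [
    {"requires_calculation": True, "tool_calls": ["calculator"]},
    {"requires_calculation": True, "tool_calls": []},           # computed "in its head"
    {"requires_calculation": True, "tool_calls": ["calculator"]},
    {"requires_calculation": False, "tool_calls": []},          # no math needed, ignored
]

bypass = silent_bypass_rate(SAMPLE_TRACES)
```

Traces that need no calculation are excluded from the denominator so they do not dilute the rate.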
Verification coverage and catch rate
Two metrics describe the health of your safety net rather than the model. Coverage is the share of numeric outputs that pass through a verifier at all; a number that never reaches a check is unprotected regardless of how accurate the model is on average. Catch rate is the share of injected or naturally occurring errors that the verifier actually rejects. Together they tell you whether your verification is doing its job or merely present for show. A verifier with high coverage but a low catch rate is a comfort blanket, not a control.
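The two safety-net metrics can be sketched as follows. The toy verifier, its 10% tolerance, and the injected cases are all assumptions made for illustration; the verifier is deliberately loose so that one error slips through.

```python
def coverage(outputs):
    """Share of numeric outputs that passed through a verifier at all."""
    return sum(1 for o in outputs if o["verified"]) / len(outputs)

def catch_rate(verifier, known_bad_cases):
    """Share of known-bad cases that the verifier rejects."""
    caught = sum(1 for case in known_bad_cases if not verifier(case))
    return caught / len(known_bad_cases)

def loose_verifier(case):
    """Toy verifier: accepts a total within 10% of the recomputed sum."""
    recomputed = sum(case["items"])
    return abs(case["total"] - recomputed) <= 0.1 * abs(recomputed)

# Hypothetical injected errors; every total here is wrong on purpose.
INJECTED = [
    {"items": [100, 200], "total": 3000},  # factor-of-ten error
    {"items": [100, 200], "total": 310},   # small error, inside the tolerance
    {"items": [50, 50], "total": -100},    # sign error
    {"items": [40, 60], "total": 150},     # 50% too high
]
OUTPUTS = [{"verified": True}, {"verified": True}, {"verified": False}, {"verified": True}]

cov = coverage(OUTPUTS)
caught = catch_rate(loose_verifier, INJECTED)
```

The small error passes the tolerance check, so the catch rate is 0.75 rather than 1.0. Whether a miss of that size is acceptable depends on what the number is used for.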
Instrumenting So the Numbers Are Honest
Good metrics depend on honest measurement, and there are several ways to fool yourself.
Build a labeled evaluation set that reflects reality
Your test set must include the hard cases, not just the easy ones, in roughly the proportion they appear in production. A set dominated by trivial problems will report a flattering accuracy that collapses the moment real traffic arrives. Deliberately seed it with edge cases: zero, negatives, very large values, and the malformed inputs users actually send.
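A quick audit can flag an evaluation set that drifts from those two requirements. The sketch below assumes each case has a `difficulty` label and an optional `tags` list, and that you have an estimate of the production mix; the tag names and the 5-point gap are hypothetical.

```python
from collections import Counter

# Edge cases named in the text; the tag spellings are assumptions.
REQUIRED_EDGE_TAGS = {"zero", "negative", "very_large", "malformed"}

def eval_set_gaps(cases, production_mix, max_gap=0.05):
    """Return (missing edge tags, segments whose share is off from production)."""
    present = {tag for c in cases for tag in c.get("tags", [])}
    missing = REQUIRED_EDGE_TAGS - present
    counts = Counter(c["difficulty"] for c in cases)
    total = len(cases)
    skewed = {}
    for segment, prod_share in production_mix.items():
        share = counts.get(segment, 0) / total
        if abs(share - prod_share) > max_gap:
            skewed[segment] = (share, prod_share)
    return missing, skewed

# Hypothetical evaluation set dominated by simple problems.
CASES = [
    {"difficulty": "simple", "tags": ["zero"]},
    {"difficulty": "simple"},
    {"difficulty": "simple", "tags": ["negative"]},
    {"difficulty": "multi-step"},
]
missing_tags, skewed_segments = eval_set_gaps(CASES, {"simple": 0.5, "multi-step": 0.5})
```

Here the audit reports two missing edge-case tags and a set that is 75% simple against an assumed 50% in production.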
Capture intermediate steps, not just final answers
Logging only the final number tells you that a prompt failed, never why. Capture the reasoning trace, every tool call, and the intermediate values. When a number is wrong, you want to replay the exact path and see whether the model set up the problem wrong, called the wrong tool, or fumbled a handoff. This connects directly to the observability layer described in Which Tools Actually Make Models Do Math Reliably.
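One possible shape for such a trace record is sketched below, serialized as one JSON line per case so a failure can be replayed later. The field names and the example values are assumptions, not a schema taken from any particular tool.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: str

@dataclass
class Trace:
    case_id: str
    prompt_version: str
    reasoning: list = field(default_factory=list)           # reasoning steps, in order
    tool_calls: list = field(default_factory=list)          # ToolCall entries, in order
    intermediate_values: dict = field(default_factory=dict)
    final_answer: str = ""

def to_log_line(trace):
    """Serialize a trace to a single JSON line; asdict also converts nested ToolCalls."""
    return json.dumps(asdict(trace), sort_keys=True)

# Hypothetical trace, for illustration only. Numbers are kept as strings so the
# log preserves exactly what the model and tool produced.
EXAMPLE = Trace(
    case_id="case-001",
    prompt_version="v3",
    reasoning=["Total is unit price times quantity."],
    tool_calls=[ToolCall("calculator", {"expression": "19.99 * 3"}, "59.97")],
    intermediate_values={"unit_price": "19.99", "quantity": "3"},
    final_answer="59.97",
)
LOG_LINE = to_log_line(EXAMPLE)
```

With the setup, the tool call, and the final answer in one record, a wrong number can be traced to the step where it first diverged.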
Separate evaluation data from development data
If you tune your prompt against the same examples you measure on, your numbers are fiction. Hold out a clean evaluation set the prompt never sees during development, and refresh it periodically so it does not leak into your iteration loop.
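A simple way to keep the two sets apart is to assign each case to a split by hashing a stable identifier, so a case never migrates between development and evaluation across runs. This is a sketch under the assumption that every case has a stable `case_id`; the 20% fraction is arbitrary.

```python
import hashlib

def assign_split(case_id, eval_fraction=0.2):
    """Deterministically map a case ID to 'eval' or 'dev'."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return "eval" if bucket < eval_fraction * 2**64 else "dev"

# Hypothetical case IDs, for illustration only.
splits = {cid: assign_split(cid) for cid in ("case-001", "case-002", "case-003")}
```

Because the split depends only on the ID, newly added cases land in a split without reshuffling existing ones. Refreshing the evaluation set then means adding new cases rather than re-drawing old ones.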
Reading the Signal
Numbers are only useful if you know how to act on them.
Watch the segments, not the average
A flat aggregate accuracy can mask a sharp decline in one segment. When you ship a prompt change, compare segment-by-segment, because an improvement on easy cases can hide a regression on the hard ones that matter most.
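A segment-by-segment comparison can be automated as a gate on prompt changes. The sketch below assumes per-segment accuracy dictionaries for the baseline and the candidate; the segment names and figures are made up.

```python
def segment_regressions(baseline, candidate, tolerance=0.0):
    """Return segments where the candidate dropped by more than tolerance.

    A segment missing from the candidate is reported with None, since a
    dropped segment should never pass silently.
    """
    regressions = {}
    for segment, base_acc in baseline.items():
        cand_acc = candidate.get(segment)
        if cand_acc is None or base_acc - cand_acc > tolerance:
            regressions[segment] = (base_acc, cand_acc)
    return regressions

# Hypothetical per-segment accuracy before and after a prompt change.
BASELINE = {"simple": 0.90, "multi-step": 0.80, "edge": 0.70}
CANDIDATE = {"simple": 0.98, "multi-step": 0.80, "edge": 0.60}

regressed = segment_regressions(BASELINE, CANDIDATE)
```

If simple cases dominate the set, the aggregate rises after this change even though the edge segment lost ten points.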
Treat calibration drift as an early warning
If the model starts sounding confident on cases it gets wrong, investigate before accuracy itself drops. A rising share of confident wrong answers often shows up before a visible decline in aggregate accuracy.
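One simple drift signal is the share of answers that were both confident and wrong, compared across two measurement windows. The sketch assumes `(confidence, correct)` samples; the 0.9 confidence floor and the 2-point alert threshold are hypothetical values to tune for your own system.

```python
def confident_wrong_rate(samples, confidence_floor=0.9):
    """Share of all answers that were wrong with stated confidence >= floor."""
    if not samples:
        return 0.0
    bad = sum(1 for conf, correct in samples if conf >= confidence_floor and not correct)
    return bad / len(samples)

def calibration_drifted(previous, current, max_increase=0.02):
    """True when the confident-wrong rate rose by more than max_increase."""
    return confident_wrong_rate(current) - confident_wrong_rate(previous) > max_increase

# Hypothetical windows: the later one has twice as many confident misses.
PREVIOUS = [(0.95, True)] * 9 + [(0.95, False)]
CURRENT = [(0.95, True)] * 8 + [(0.95, False)] * 2

drifted = calibration_drifted(PREVIOUS, CURRENT)
```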
Set thresholds tied to consequence
Define the accuracy and error-magnitude bars in terms of what a failure costs, not an arbitrary round number. A medical or financial figure demands a far tighter bar than an internal estimate. The right threshold is a business decision informed by the analysis in Putting Real Numbers on the Payback of Better Math Prompts.
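Consequence-tied thresholds can live in a small table that the release check reads. Every number below is a placeholder chosen for illustration; the real bars are the business decision described above.

```python
# Hypothetical bars per use case; the values are placeholders, not recommendations.
THRESHOLDS = {
    "financial": {"min_accuracy": 0.999, "max_relative_error": 0.0001},
    "internal_estimate": {"min_accuracy": 0.95, "max_relative_error": 0.05},
}

def meets_bar(use_case, accuracy, worst_relative_error):
    """Check measured accuracy and worst-case error size against the use case's bar."""
    bar = THRESHOLDS[use_case]
    return (
        accuracy >= bar["min_accuracy"]
        and worst_relative_error <= bar["max_relative_error"]
    )

# The same measured results pass one bar and fail the other.
ok_for_estimate = meets_bar("internal_estimate", accuracy=0.96, worst_relative_error=0.01)
ok_for_finance = meets_bar("financial", accuracy=0.96, worst_relative_error=0.01)
```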
Avoiding Metric Theater
The failure mode here is optimizing the number instead of the outcome. A team that chases a single accuracy figure will eventually game it — tuning on the test set, dropping hard cases, or reporting the friendliest aggregate. The result is a dashboard that looks healthy while production quietly degrades.
The antidote is a small basket of metrics that resist gaming because they pull in different directions. Exact-match keeps you honest on correctness; error magnitude keeps you honest on severity; calibration keeps you honest on trustworthiness; tool-use compliance keeps you honest on whether the safe path is actually being taken. When all four move in the right direction, you have real improvement, not a better-looking slide. Sustaining this across a group requires shared definitions, which is why Spreading Math-Prompt Discipline Through a Whole Team treats measurement as a standard rather than an afterthought.
Frequently Asked Questions
Is accuracy a bad metric?
Not bad, just incomplete. Accuracy belongs in your basket, but reported alone and unsegmented it hides where and how badly a prompt fails. Pair it with error magnitude and calibration to get a trustworthy picture.
What is calibration and why does it matter so much?
Calibration is the match between how confident the model sounds and how often it is correct. It matters because a confident wrong answer defeats human review — the people checking the output trust the confidence and miss the error. Good calibration makes the whole system safer.
How big should my evaluation set be?
Large enough to include your hard cases in realistic proportions and to give stable segment-level numbers. A few hundred carefully chosen examples that cover edge cases beat thousands of trivial ones. Quality and coverage matter more than raw size.
Why log intermediate steps if I only care about the answer?
Because the final answer tells you a prompt failed but never why. Intermediate traces let you replay the failure and see whether the setup, the tool call, or the handoff broke — which is the difference between guessing at a fix and knowing one.
How do I keep my metrics from being gamed?
Use several metrics that pull in different directions, hold out evaluation data the prompt is never tuned against, and refresh that set periodically. Gaming one metric then shows up as a regression in another, which keeps the basket honest.
How often should I re-measure?
Re-measure on every meaningful prompt or model change, and on a regular cadence even without changes, because upstream model updates can shift behavior underneath you. Continuous measurement turns surprises into early warnings.
Key Takeaways
- A single accuracy figure hides more than it reveals; segment by difficulty, operation, and magnitude.
- Track error magnitude alongside error rate, because how wrong a number is often matters more than how often.
- Calibration — confidence matching correctness — is the most safety-critical metric, since confident wrong answers defeat human review.
- Honest measurement requires a realistic evaluation set, captured intermediate steps, and evaluation data held out from development.
- Read segments rather than averages, and treat calibration drift as an early warning of coming accuracy decline.
- Use a small basket of metrics that pull in different directions to resist gaming and surface real improvement.