An extraction pipeline that looks flawless in a demo can be quietly wrong a third of the time in production, and you will not know until a client points at a misfiled invoice total. The gap exists because demos test a handful of clean documents while production faces the long tail of formats, edge cases, and partial pages your test set never included. The only way to close that gap is to measure the right things continuously, not to eyeball a few outputs and call it good.
The trouble is that most teams either measure nothing or measure the wrong number. A single "accuracy" figure averaged across all fields hides the fact that one critical field is failing badly while four easy fields prop up the average. Real visibility comes from breaking the signal apart, instrumenting it where the work happens, and learning to read what each metric is actually telling you.
This article defines the KPIs that matter for extraction, shows how to instrument them without building a research lab, and explains how to interpret the signal so you act on the right problem.
The Metrics That Actually Matter
Aggregate accuracy is the metric everyone reaches for and the one that misleads most. Better instruments exist.
Field-level accuracy
Measure correctness per field, not per document. A pipeline pulling vendor, date, amount, and line items might score 95 percent on vendor and 60 percent on amount. The document-level average buries that, and amount is the field that gets you fired. Always break accuracy down to the field.
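A minimal sketch of field-level scoring, assuming predictions and gold labels are both lists of plain dictionaries keyed by field name. The function name `field_accuracy` and the sample records are hypothetical, chosen only to show how a per-field breakdown exposes what a single average would blur.

```python
from collections import defaultdict

def field_accuracy(predictions, gold):
    """Return {field: fraction of documents where the prediction equals the gold value}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, truth in zip(predictions, gold):
        for field, expected in truth.items():
            total[field] += 1
            if pred.get(field) == expected:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}

# Hypothetical data: vendor is always right, amount is wrong on half the documents.
gold = [
    {"vendor": "Acme", "amount": "120.00"},
    {"vendor": "Globex", "amount": "75.50"},
]
predictions = [
    {"vendor": "Acme", "amount": "120.00"},
    {"vendor": "Globex", "amount": "7550"},
]

scores = field_accuracy(predictions, gold)
print(scores)  # {'vendor': 1.0, 'amount': 0.5}
```

Averaged across both fields this toy run would report 75 percent, which says nothing about where the errors sit.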
Precision and recall, separately
For each field, precision asks "when the model extracted a value, was it right?" and recall asks "when a value existed, did the model find it?" These fail differently. A model that skips hard fields has high precision and low recall; one that hallucinates values for missing fields can keep its recall high while its precision drops. Collapsing them into one number hides which failure you have.
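The two questions above translate directly into two ratios. The sketch below assumes a missing value is represented as `None` on both the predicted and the gold side; `precision_recall` and the two toy models are hypothetical names for illustration.

```python
def precision_recall(pairs):
    """pairs: list of (predicted, expected) values for one field; None means absent."""
    extracted = sum(1 for pred, _ in pairs if pred is not None)
    present = sum(1 for _, gold in pairs if gold is not None)
    correct = sum(1 for pred, gold in pairs if pred is not None and pred == gold)
    precision = correct / extracted if extracted else None  # undefined if nothing was extracted
    recall = correct / present if present else None         # undefined if nothing was there to find
    return precision, recall

# A model that skips hard cases: everything it returns is right, but it misses half.
skipper = [("100", "100"), (None, "250"), ("80", "80"), (None, "40")]
# A model that fills in values even when the field is absent from the document.
guesser = [("100", "100"), ("250", "250"), ("99", None), ("12", None)]

print(precision_recall(skipper))  # (1.0, 0.5)
print(precision_recall(guesser))  # (0.5, 1.0)
```

Both toy models get exactly half of their four cases right, so a single accuracy figure cannot tell them apart; the split shows they need opposite fixes.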
Exact match versus normalized match
"$1,200.00" and "1200" are either the same value or a mismatch, depending on what your downstream system accepts. Decide whether you score exact string match or normalized match, and apply that choice consistently. Scoring exact match when your consumer normalizes anyway will make you chase phantom failures.
Schema validity rate
The share of outputs that parse cleanly and satisfy your schema. This is a leading indicator: a falling validity rate usually precedes an accuracy drop, because malformed output is the first symptom of a model struggling with new input formats.
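A validity rate can be computed with nothing beyond the standard library. The sketch below assumes the model is asked to return a JSON object; the `SCHEMA` mapping, the field names, and the sample outputs are hypothetical, and a dedicated schema validator could replace the hand-rolled type check.

```python
import json

# Hypothetical schema: field name -> Python types allowed after JSON parsing.
SCHEMA = {"vendor": (str,), "date": (str,), "amount": (str, type(None))}

def is_valid(raw_output):
    """True if the raw output parses as a JSON object with exactly the expected fields and types."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict) or set(parsed) != set(SCHEMA):
        return False
    return all(isinstance(parsed[name], types) for name, types in SCHEMA.items())

def validity_rate(raw_outputs):
    return sum(is_valid(raw) for raw in raw_outputs) / len(raw_outputs)

outputs = [
    '{"vendor": "Acme", "date": "2024-03-01", "amount": "120.00"}',
    '{"vendor": "Acme", "date": "2024-03-01", "amount": null}',
    '{"vendor": "Acme", "date": "2024-03-01"}',    # missing field
    'Sure! Here is the JSON: {"vendor": "Acme"}',  # does not parse
]
print(validity_rate(outputs))  # 0.5
```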
Instrumenting Without Overbuilding
You do not need an evaluation platform to get reliable numbers. You need a labeled set and a place to log results.
Build a held-out gold set
Hand-label a few hundred real documents covering your format spread, including the ugly ones. This is your ground truth. Resist the urge to use only clean samples — the gold set exists precisely to expose the cases demos hide. Refresh it as new formats appear.
Log every extraction in production
Capture the input reference, the raw model output, the parsed result, and the schema validity result for every call. You cannot diagnose a regression you did not record. Structured logging here pays for itself the first time a client disputes a result.
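One way to capture those four items is a JSON line per call through the standard `logging` module. This is a sketch under assumptions: the logger name, the record keys, and the example file reference are all hypothetical, and where the lines end up (file, log aggregator, database) is left to your stack.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("extraction")  # hypothetical logger name

def build_record(input_ref, raw_output, parsed, schema_valid):
    """Assemble the four items worth keeping for every call, plus a timestamp."""
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "input_ref": input_ref,
        "raw_output": raw_output,
        "parsed": parsed,
        "schema_valid": schema_valid,
    }

def log_extraction(input_ref, raw_output, parsed, schema_valid):
    record = build_record(input_ref, raw_output, parsed, schema_valid)
    logger.info(json.dumps(record))  # one JSON object per line
    return record

logging.basicConfig(level=logging.INFO, format="%(message)s")
record = log_extraction(
    input_ref="invoices/example-001.pdf",  # hypothetical reference, not a real path
    raw_output='{"vendor": "Acme", "amount": "120.00"}',
    parsed={"vendor": "Acme", "amount": "120.00"},
    schema_valid=True,
)
```

Keeping the raw output next to the parsed result is what lets you tell a model error from a parsing error after the fact.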
Sample and review on a cadence
You cannot label all of production, so sample. Pull a random slice weekly, label it, and compare the labels against the model output. This catches drift the gold set misses, because production inputs evolve continuously while the gold set changes only when you refresh it.
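Drawing the weekly slice can be a few lines. The sketch below assumes you can list the IDs of the week's logged calls; `weekly_sample` and the ID format are hypothetical. Seeding the generator with a week label is one possible design choice so the same slice can be re-drawn if a reviewer needs it again.

```python
import random

def weekly_sample(record_ids, k, seed):
    """Return up to k IDs chosen uniformly at random, reproducible for a given seed."""
    rng = random.Random(seed)
    k = min(k, len(record_ids))
    return rng.sample(record_ids, k)

# Hypothetical IDs for one week of logged extraction calls.
ids = [f"call-{i}" for i in range(1000)]
to_label = weekly_sample(ids, k=50, seed="2024-W10")
print(len(to_label))  # 50
```

Sampling uniformly matters: reviewing only the outputs that look suspicious biases the estimate and hides failures that look plausible.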
Automate scoring
Run your gold set through the pipeline on every prompt or model change and compute field-level precision and recall automatically. Manual review of changes does not scale and lets regressions through. A scoring script that takes a minute to run is the cheapest insurance you will buy.
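Automated scoring is most useful when it can fail a change. The sketch below compares per-field metrics from the current run against a stored baseline and lists anything that dropped; the function name, the tolerance, and all of the numbers are hypothetical, and it assumes you already produce a `{field: {"precision": ..., "recall": ...}}` mapping from the gold set.

```python
def find_regressions(baseline, current, tolerance=0.01):
    """Return (field, metric, drop) for every metric that fell by more than `tolerance`."""
    regressions = []
    for field, base in baseline.items():
        now = current[field]
        for metric in ("precision", "recall"):
            drop = base[metric] - now[metric]
            if drop > tolerance:
                regressions.append((field, metric, round(drop, 4)))
    return regressions

# Hypothetical scores before and after a prompt change.
baseline = {
    "vendor": {"precision": 0.97, "recall": 0.95},
    "amount": {"precision": 0.96, "recall": 0.94},
}
current = {
    "vendor": {"precision": 0.97, "recall": 0.96},
    "amount": {"precision": 0.96, "recall": 0.85},
}

print(find_regressions(baseline, current))  # [('amount', 'recall', 0.09)]
```

In a CI job, a non-empty result would typically end the run with a non-zero exit code so the change cannot ship unnoticed.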
Reading the Signal Correctly
Numbers without interpretation cause as much trouble as no numbers. Each metric points at a different fix.
Low recall on one field
The model is missing values that exist. Usually a prompting problem: the field is described ambiguously, or the examples do not cover where it appears. Fixable with sharper instructions or a targeted example, not retraining.
Low precision on one field
The model is inventing or mis-mapping values. Often a schema or grounding problem — the model fills a field rather than admitting absence. Tightening the schema to allow null and instructing the model to leave fields empty when unsure usually helps.
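As an illustration of a schema that allows null, the fragment below uses the JSON Schema type-array form to make two fields nullable. The field names and the instruction wording are hypothetical; how the schema is passed to your model depends on the API you use and is not shown.

```python
# Hypothetical schema fragment in JSON Schema notation.
# "type": ["string", "null"] lets the model return null instead of guessing.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "amount": {"type": ["string", "null"]},
        "po_number": {"type": ["string", "null"]},
    },
    "required": ["vendor", "amount", "po_number"],
    "additionalProperties": False,
}

# Hypothetical instruction to pair with the schema.
ABSENCE_INSTRUCTION = (
    "If a field is not present in the document, or you are unsure of its value, "
    "return null for that field. Do not guess."
)

def nullable_fields(schema):
    """List the fields that are allowed to be null under the schema."""
    names = []
    for name, spec in schema["properties"].items():
        types = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        if "null" in types:
            names.append(name)
    return names

print(nullable_fields(INVOICE_SCHEMA))  # ['amount', 'po_number']
```

Keeping nullable fields in `required` is a deliberate choice in this sketch: the model must state absence explicitly rather than silently omit the key.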
Falling schema validity over time
Your input distribution has shifted. New document formats are arriving that your prompt or examples do not cover. Treat this as a drift alarm and refresh your examples or gold set before accuracy follows it down.
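A drift alarm on validity can be as simple as comparing recent weeks against earlier ones. The sketch below is one possible rule, not a standard; the window size, the threshold, and the weekly rates are hypothetical and should be tuned to your own volume and noise.

```python
def validity_alarm(weekly_rates, window=3, max_drop=0.03):
    """Flag drift when the mean of the last `window` weeks sits more than
    `max_drop` below the mean of all earlier weeks."""
    if len(weekly_rates) < 2 * window:
        return False  # not enough history to compare
    earlier = weekly_rates[:-window]
    recent = weekly_rates[-window:]
    earlier_mean = sum(earlier) / len(earlier)
    recent_mean = sum(recent) / len(recent)
    return earlier_mean - recent_mean > max_drop

# Hypothetical weekly schema validity rates, oldest first.
stable = [0.99, 0.98, 0.99, 0.99, 0.98, 0.99]
drifting = [0.99, 0.99, 0.98, 0.96, 0.94, 0.91]

print(validity_alarm(stable))    # False
print(validity_alarm(drifting))  # True
```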
Good aggregate, bad field
The trap. The headline number looks healthy while a critical field quietly fails. This is exactly why field-level breakdown is non-negotiable — and why you weight your attention by business impact, not by average.
Accuracy fine overall, terrible on one source
A close cousin of the field trap. Your numbers look healthy because most documents come from two clean sources, while a third source (a new vendor, a different template) fails badly and is small enough not to move the average. Segment your metrics by document source as well as by field, or this localized failure stays invisible until that source grows or someone downstream complains. Where a source has enough volume for its numbers to be stable, the per-source breakdown turns a hidden pocket of failure into a visible, addressable one.
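The effect is easy to reproduce with toy numbers. In the sketch below, the source names, counts, and the `accuracy_by_source` helper are hypothetical; it assumes each scored result carries the source it came from.

```python
from collections import defaultdict

def accuracy_by_source(results):
    """results: iterable of (source, field, is_correct) tuples.
    Returns {(source, field): (accuracy, document_count)}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for source, field, is_correct in results:
        total[(source, field)] += 1
        correct[(source, field)] += int(is_correct)
    return {key: (correct[key] / total[key], total[key]) for key in total}

# Hypothetical results: two clean, high-volume sources and one small failing one.
results = (
    [("vendor_a", "amount", True)] * 45
    + [("vendor_b", "amount", True)] * 45
    + [("vendor_c", "amount", True)] * 3
    + [("vendor_c", "amount", False)] * 7
)

overall = sum(ok for _, _, ok in results) / len(results)
segmented = accuracy_by_source(results)

print(overall)                            # 0.93
print(segmented[("vendor_c", "amount")])  # (0.3, 10)
```

Returning the count alongside the accuracy makes it obvious when a segment is too small for its number to be trusted.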
Turning Metrics Into Decisions
Metrics earn their keep only when they change what you do. Set a target accuracy for each field based on its business cost, not a blanket threshold — a payment amount might need 99 percent while a category tag is fine at 90. When a field falls below target, the failure mode tells you whether to fix the prompt, adjust the schema, or escalate to a heavier approach, a decision worth grounding in Choosing Between Few-Shot, Schema, and Fine-Tuned Extraction.
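Per-field targets are straightforward to encode. The sketch below uses the two example thresholds from the paragraph above; the field names, the measured scores, and the `below_target` helper are hypothetical.

```python
# Targets set by business cost: a payment amount needs more than a category tag.
TARGETS = {"amount": 0.99, "category": 0.90}

def below_target(scores, targets):
    """Return {field: (score, target)} for every field that misses its target.
    A field with no score at all is treated as failing."""
    return {
        field: (scores.get(field, 0.0), target)
        for field, target in targets.items()
        if scores.get(field, 0.0) < target
    }

# Hypothetical measured accuracy per field.
scores = {"amount": 0.97, "category": 0.93}

print(below_target(scores, TARGETS))  # {'amount': (0.97, 0.99)}
```

Under a blanket 95 percent threshold both fields would pass; with per-field targets, the 97 percent on amount is correctly flagged as the problem.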
Tie your monitoring to a regular cadence so drift surfaces as a trend, not a surprise. The teams that stay reliable are the ones that treat extraction as a measured system rather than a clever prompt, an outlook reinforced across The Complete Guide to Prompting for Data Extraction and the practices in Advanced Prompting for Data Extraction: Going Beyond the Basics.
Frequently Asked Questions
Is one accuracy number ever enough?
Almost never. A single aggregate figure hides which field is failing and which failure mode you have. It is fine as a top-line dashboard tile but useless for deciding what to fix, so always keep the field-level breakdown behind it.
How big should my gold set be?
A few hundred documents that genuinely cover your format spread beats thousands of near-identical clean samples. The goal is coverage of the cases that break, not raw count. Grow it deliberately as new formats appear in production.
How often should I re-evaluate?
Run the gold set on every prompt or model change, and sample production weekly. Input distributions drift, so a metric that was healthy last month can quietly decay. Continuous sampling catches that before a client does.
What is the difference between precision and recall here?
Precision measures whether the values the model extracted are correct; recall measures whether it found the values that existed. They fail independently, so tracking them separately tells you whether the model is hallucinating or simply missing fields.
Key Takeaways
- Aggregate accuracy hides failures; always break correctness down to the field level.
- Track precision and recall separately so you know whether the model is hallucinating or missing values.
- Schema validity rate is a leading indicator of input drift — watch it before accuracy falls.
- Instrument with a held-out gold set, full production logging, weekly sampling, and automated scoring on every change.
- Each failure mode points to a specific fix; set per-field targets by business cost, not a blanket threshold.