A knowledge graph built from text is only as trustworthy as your ability to measure it. Teams routinely ship extraction pipelines whose accuracy they have never quantified, then act surprised when a downstream query returns nonsense. The graph looked plausible in spot checks, so it shipped. Plausibility is not measurement, and spot checks miss exactly the systematic errors that hurt most.
The difficulty is that graph extraction has more failure modes than a typical classification task. The graph can be wrong because an entity is wrong, because a relationship is wrong, because an entity is correct but duplicated, or because a true relationship was never extracted at all. A single accuracy number flattens all of that into a figure that tells you almost nothing about where to spend your next engineering hour.
This piece defines the metrics that actually distinguish a good extraction pipeline from a bad one, explains how to instrument them without a labeling army, and shows how to read the resulting signal so you intervene on the right problem rather than the loudest one.
The underlying principle is simple even though the execution is not: you cannot improve what you cannot see, and a graph hides its own errors better than almost any other data artifact. A broken web service throws an error; a broken graph quietly returns a confident wrong answer. Measurement is what converts that silence into a signal you can act on, and the teams that take measurement seriously tend to be the ones whose graphs people end up trusting with real decisions.
The Metrics That Separate Real Quality From Hope
Borrow the precision and recall framing from information retrieval, but apply it at the level of triples, not documents.
Triple-level precision and recall
Precision asks: of the triples you extracted, what fraction are correct? Recall asks: of the triples that should have been extracted, what fraction did you capture? The two tend to move in opposite directions as you tune the prompt, and reporting only one hides the trade you are making.
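As a minimal sketch, both numbers fall out of a set intersection once extracted and gold triples share a normalized form. The `Triple` type, the exact-match rule, and the sample data below are illustrative assumptions, not a prescribed format.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical normalized triple; real pipelines may carry more fields."""
    subject: str
    predicate: str
    obj: str


def precision_recall(extracted: set[Triple], gold: set[Triple]) -> tuple[float, float]:
    """Return (precision, recall) using exact match on all three fields."""
    true_positives = len(extracted & gold)
    # Assumed convention: an empty denominator scores 0.0 instead of raising.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


gold = {
    Triple("Acme Corp", "acquired", "Widget Inc"),
    Triple("Acme Corp", "headquartered_in", "Berlin"),
    Triple("Widget Inc", "founded_in", "2009"),
}
extracted = {
    Triple("Acme Corp", "acquired", "Widget Inc"),
    Triple("Acme Corp", "headquartered_in", "Munich"),  # wrong object
}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.33
```

Exact match is the strictest choice. If you relax it, for example by matching entities after resolution, state the rule alongside the number so that runs stay comparable.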
Entity resolution accuracy
Separate from relationship correctness is the question of identity. If "Acme Corp" and "Acme Corporation" become two nodes, your graph is wrong even though every individual triple is correct. Measure the rate at which distinct surface forms collapse to the right canonical node.
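One way to put a number on this, sketched under assumptions, is to score merge decisions pairwise: for every pair of surface forms, did the pipeline put them on the same node when the gold labels say it should? Pairwise scoring avoids having to align predicted node ids with gold ids. The dictionaries and ids below are hypothetical, and both are assumed to cover the same surface forms.

```python
from itertools import combinations


def same_entity_pairs(assignment: dict[str, str]) -> set[frozenset[str]]:
    """Return every pair of surface forms that map to the same node id."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(assignment), 2)
        if assignment[a] == assignment[b]
    }


def pairwise_resolution_scores(
    predicted: dict[str, str], gold: dict[str, str]
) -> tuple[float, float]:
    """Pairwise precision and recall of merge decisions."""
    predicted_pairs = same_entity_pairs(predicted)
    gold_pairs = same_entity_pairs(gold)
    correct = len(predicted_pairs & gold_pairs)
    # Assumed convention: no merges made means no wrong merges (precision 1.0).
    merge_precision = correct / len(predicted_pairs) if predicted_pairs else 1.0
    merge_recall = correct / len(gold_pairs) if gold_pairs else 1.0
    return merge_precision, merge_recall


# Surface form -> canonical id (gold) or node id (pipeline output).
gold_nodes = {"Acme Corp": "E1", "Acme Corporation": "E1", "ACME": "E1", "Apex Ltd": "E2"}
predicted_nodes = {"Acme Corp": "n1", "Acme Corporation": "n2", "ACME": "n1", "Apex Ltd": "n3"}

merge_precision, merge_recall = pairwise_resolution_scores(predicted_nodes, gold_nodes)
print(f"merge precision={merge_precision:.2f} merge recall={merge_recall:.2f}")
```

Here the pipeline never merged two different companies, so merge precision is perfect, but it found only one of the three merges it should have made. Low merge recall is the duplicate-node problem described above.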
Schema conformance rate
If you use a closed schema, measure how often the output actually conforms before any validation cleanup. A low raw conformance rate signals that your prompt or model is fighting the schema, which predicts trouble at scale.
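A rough sketch of the raw conformance check follows. The allowed predicates, the required keys, and the record layout are all hypothetical stand-ins for whatever your closed schema defines; the point is that the check runs on model output before any cleanup.

```python
# Hypothetical closed schema: adjust to your own predicates and fields.
ALLOWED_PREDICATES = {"acquired", "headquartered_in", "founded_in"}
REQUIRED_KEYS = {"subject", "predicate", "object"}


def conforms(record: object) -> bool:
    """True if a raw model output record matches the schema exactly."""
    if not isinstance(record, dict):
        return False
    if set(record) != REQUIRED_KEYS:
        return False
    # Every field must be a non-empty string.
    if not all(isinstance(record[k], str) and record[k].strip() for k in REQUIRED_KEYS):
        return False
    return record["predicate"] in ALLOWED_PREDICATES


def conformance_rate(raw_outputs: list[object]) -> float:
    """Fraction of raw records that conform, measured before validation cleanup."""
    if not raw_outputs:
        return 0.0
    return sum(conforms(r) for r in raw_outputs) / len(raw_outputs)


raw_outputs = [
    {"subject": "Acme Corp", "predicate": "acquired", "object": "Widget Inc"},
    {"subject": "Acme Corp", "predicate": "bought", "object": "Widget Inc"},  # predicate not in schema
    {"subject": "Acme Corp", "predicate": "founded_in"},  # missing object
    {"subject": "Widget Inc", "predicate": "founded_in", "object": "2009"},
]
rate = conformance_rate(raw_outputs)
print(f"raw conformance rate={rate:.2f}")  # raw conformance rate=0.50
```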
Building a Gold Set Without Drowning in Labels
Every meaningful metric needs ground truth, and ground truth needs human judgment. The trick is spending that judgment efficiently.
- Stratify your sample. Pull documents across the range of types you actually process, not just the easy ones. A gold set of clean documents flatters a pipeline that fails on messy input.
- Label triples, not documents. Have annotators mark which extracted triples are correct and which true triples were missed. This directly yields precision and recall.
- Reuse and grow. Each labeling round adds to a permanent evaluation set. Over time you accumulate a regression suite that catches degradation when you change prompts or models.
A few hundred carefully labeled documents beat tens of thousands of unlabeled ones. The discipline is choosing what to label, not labeling more.
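Stratifying the sample can be as simple as drawing a fixed number of documents per type. The sketch below assumes each document carries a `doc_type` field, which is a hypothetical name; any attribute that captures your real input variety would do.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(documents: list[dict], per_stratum: int, seed: int = 0) -> list[dict]:
    """Draw up to `per_stratum` documents from each doc_type."""
    rng = random.Random(seed)  # fixed seed keeps the labeling batch reproducible
    by_type: dict[str, list[dict]] = defaultdict(list)
    for doc in documents:
        by_type[doc["doc_type"]].append(doc)
    sample: list[dict] = []
    for doc_type in sorted(by_type):
        group = by_type[doc_type]
        # Rare types contribute everything they have rather than being skipped.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Illustrative corpus: mostly contracts, few emails, very few scanned forms.
documents = (
    [{"id": f"c{i}", "doc_type": "contract"} for i in range(90)]
    + [{"id": f"e{i}", "doc_type": "email"} for i in range(8)]
    + [{"id": f"s{i}", "doc_type": "scanned_form"} for i in range(2)]
)
to_label = stratified_sample(documents, per_stratum=5)
print(Counter(doc["doc_type"] for doc in to_label))
```

A uniform random draw of 12 documents from this corpus would be almost entirely contracts. The stratified draw gives the rare, messy types the same labeling attention as the common ones.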
Instrumenting the Pipeline in Production
Offline metrics on a gold set tell you about a frozen snapshot. Production metrics tell you what is happening now.
Confidence and abstention signals
Have the model report confidence or allow it to abstain on uncertain extractions. The rate of low-confidence outputs is a leading indicator: a sudden rise usually means your input distribution shifted, often before precision visibly drops.
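A sketch of that leading indicator, assuming each extraction arrives with a confidence between 0 and 1, or `None` when the model abstained. The 0.5 cutoff and the 0.10 alert margin are placeholder values to tune against your own history.

```python
from typing import Optional


def low_confidence_rate(confidences: list[Optional[float]], threshold: float = 0.5) -> float:
    """Fraction of extractions that abstained (None) or scored below threshold."""
    if not confidences:
        return 0.0
    flagged = sum(1 for c in confidences if c is None or c < threshold)
    return flagged / len(confidences)


def looks_shifted(baseline_rate: float, current_rate: float, max_increase: float = 0.10) -> bool:
    """Flag a rise in the low-confidence rate larger than the allowed margin."""
    return current_rate - baseline_rate > max_increase


# Illustrative batches; None marks an abstention.
last_week = [0.9, 0.8, 0.95, None, 0.85, 0.9, 0.7, 0.88, 0.92, 0.6]
today = [0.9, 0.3, None, 0.4, 0.85, None, 0.7, 0.35, 0.92, 0.6]

baseline = low_confidence_rate(last_week)
current = low_confidence_rate(today)
print(f"baseline={baseline:.2f} current={current:.2f} shifted={looks_shifted(baseline, current)}")
# baseline=0.10 current=0.50 shifted=True
```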
Provenance coverage
Every edge should point to a source span. Measure the fraction of edges with valid provenance. Missing provenance is both a quality problem and a debugging blocker, and it pairs directly with the governance concerns in "Silent Schema Drift and Other Graph Extraction Traps".
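What counts as "valid" needs a definition. The sketch below assumes one: an edge is valid when it names a known document and a character span that falls inside that document's text. The `provenance`, `doc_id`, `start`, and `end` field names are hypothetical.

```python
def has_valid_provenance(edge: dict, documents: dict[str, str]) -> bool:
    """True if the edge points at an in-bounds span of a known document."""
    prov = edge.get("provenance")
    if not prov:
        return False
    text = documents.get(prov.get("doc_id"))
    if text is None:
        return False
    start, end = prov.get("start"), prov.get("end")
    if not isinstance(start, int) or not isinstance(end, int):
        return False
    return 0 <= start < end <= len(text)


def provenance_coverage(edges: list[dict], documents: dict[str, str]) -> float:
    """Fraction of edges with valid provenance."""
    if not edges:
        return 0.0
    return sum(has_valid_provenance(e, documents) for e in edges) / len(edges)


documents = {"d1": "Acme Corp acquired Widget Inc in 2019."}
edges = [
    {"predicate": "acquired", "provenance": {"doc_id": "d1", "start": 0, "end": 29}},
    {"predicate": "acquired", "provenance": {"doc_id": "d1", "start": 30, "end": 500}},  # span out of bounds
    {"predicate": "acquired"},  # no provenance at all
    {"predicate": "acquired", "provenance": {"doc_id": "d2", "start": 0, "end": 5}},  # unknown document
]
coverage = provenance_coverage(edges, documents)
print(f"provenance coverage={coverage:.2f}")  # provenance coverage=0.25
```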
Reading the Signal Without Fooling Yourself
Numbers invite self-deception. A high precision figure on an easy gold set means nothing if production data is harder.
Watch the precision-recall frontier, not a single point
When you change a prompt, plot where precision and recall land relative to before. An improvement that trades a lot of recall for a little precision may be a regression in disguise, depending on what your downstream consumer needs.
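Even without a plot, you can classify a prompt change by comparing the two (precision, recall) points. This is a small sketch with made-up numbers; the labels are just one way to name the three cases.

```python
def compare_runs(before: tuple[float, float], after: tuple[float, float]) -> str:
    """Classify a change between two (precision, recall) points."""
    precision_delta = after[0] - before[0]
    recall_delta = after[1] - before[1]
    if precision_delta >= 0 and recall_delta >= 0:
        return "improvement" if (precision_delta > 0 or recall_delta > 0) else "unchanged"
    if precision_delta <= 0 and recall_delta <= 0:
        return "regression"
    # One metric rose and the other fell: the numbers alone cannot decide.
    return "trade-off"


old_prompt = (0.90, 0.80)
new_prompt = (0.92, 0.61)
verdict = compare_runs(old_prompt, new_prompt)
print(verdict)  # trade-off
```

A "trade-off" verdict is where the judgment call lives. Here two points of precision cost nineteen points of recall, which is only worth it if the downstream consumer cares far more about false triples than missing ones.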
Segment by document type
An aggregate metric hides per-segment failures. If your pipeline is excellent on contracts and terrible on emails, the average can look acceptable while everything extracted from emails is garbage. Always report metrics sliced by the dimensions along which your input varies.
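The effect is easy to reproduce. In the sketch below, each labeled triple is reduced to a (segment, is_correct) pair; the segment names and counts are invented for illustration.

```python
from collections import defaultdict


def precision_by_segment(judgments: list[tuple[str, bool]]) -> dict[str, float]:
    """Precision per segment from (segment, is_correct) labels on extracted triples."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for segment, is_correct in judgments:
        totals[segment] += 1
        correct[segment] += int(is_correct)
    return {segment: correct[segment] / totals[segment] for segment in totals}


# Illustrative labels: contracts dominate the volume and mask the email failures.
judgments = (
    [("contract", True)] * 95 + [("contract", False)] * 5
    + [("email", True)] * 12 + [("email", False)] * 8
)

overall = sum(ok for _, ok in judgments) / len(judgments)
by_segment = precision_by_segment(judgments)
print(f"overall={overall:.2f}")  # overall=0.89
for segment, value in sorted(by_segment.items()):
    print(f"{segment}={value:.2f}")  # contract=0.95, then email=0.60
```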
Connecting Metrics to Decisions
Metrics earn their cost only when they change what you do. Tie each metric to an action.
Thresholds that trigger work
Set a precision floor below which output gets human review rather than auto-ingestion. Set a recall floor below which you revisit the prompt or schema. Set a conformance floor below which you suspect a model or formatting regression. Without thresholds, metrics become decoration.
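Written down as code, the three floors become a small routing function. The floor values here are placeholders rather than recommendations; the actions are the ones named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Floors:
    """Placeholder floors; set these from your own cost and risk tolerance."""
    precision: float = 0.90
    recall: float = 0.70
    conformance: float = 0.95


def triggered_actions(
    precision: float, recall: float, conformance: float, floors: Floors = Floors()
) -> list[str]:
    """Return the follow-up work implied by the current metrics."""
    actions = []
    if precision < floors.precision:
        actions.append("route output to human review instead of auto-ingestion")
    if recall < floors.recall:
        actions.append("revisit the prompt or schema")
    if conformance < floors.conformance:
        actions.append("investigate a model or formatting regression")
    return actions


actions = triggered_actions(precision=0.93, recall=0.65, conformance=0.97)
print(actions)  # ['revisit the prompt or schema']
```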
Cost-aware quality targets
Higher quality usually costs more tokens, more review, or both. The right target is the one where marginal quality stops being worth marginal cost for your use case, a calculation that connects directly to "What Knowledge Graph Extraction Actually Saves a Data Team".
Common Measurement Pitfalls
Even teams that measure can measure badly, and a misleading metric is more dangerous than no metric because it manufactures false confidence.
Optimizing the metric instead of the graph
When a single number becomes the goal, people tune the pipeline to move that number rather than to improve the graph. A prompt change that lifts precision by suppressing every uncertain extraction looks like progress and quietly destroys recall. Always watch the metrics you are not optimizing, because that is where the regression hides.
Grading on the easy cases
A gold set assembled from clean, cooperative documents reports flattering numbers that collapse the moment real input arrives. Stratify the gold set across the full difficulty range you actually process, including the messy documents you wish you did not have to handle. A metric is only as honest as the sample it runs on.
Confusing conformance with correctness
A high schema-conformance rate tempts teams into believing the graph is good. Conformance only proves the output has the right shape, not that it states the truth. Treat the two as independent and measure both, because a perfectly shaped graph full of false triples passes every structural check while being worthless.
Frequently Asked Questions
What single metric should I report if I can only pick one?
Resist picking one. If forced, report triple-level F1, the harmonic mean of precision and recall, because it punishes you for ignoring either. But always keep the underlying precision and recall visible, since F1 alone hides which direction you are failing.
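A short worked example shows why the harmonic mean is the harsher judge. The precision and recall values are invented for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.80, 0.80)
lopsided = f1_score(0.99, 0.40)
arithmetic_mean = (0.99 + 0.40) / 2

print(f"balanced F1={balanced:.2f}")  # balanced F1=0.80
print(f"lopsided F1={lopsided:.2f}")  # lopsided F1=0.57
print(f"lopsided arithmetic mean={arithmetic_mean:.3f}")  # 0.695
```

The lopsided pipeline would look respectable under a plain average, but F1 drags it down toward its weaker side. The figure 0.57 still does not say whether precision or recall is the problem, which is why both stay on the report.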
How large does my gold set need to be?
Large enough that your metrics are stable across resampling, which for most extraction tasks means a few hundred labeled documents spanning your real input variety. Stability matters more than raw size; a noisy metric from a tiny set will mislead you.
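One way to check stability, sketched here with a bootstrap over documents: resample the gold set with replacement many times, recompute the metric each time, and look at the spread. The per-document counts are invented, and the larger set is simply the small one repeated, which is enough to show the effect of size.

```python
import random
import statistics


def bootstrap_precision_spread(
    per_doc_counts: list[tuple[int, int]], rounds: int = 1000, seed: int = 0
) -> tuple[float, float]:
    """Mean and standard deviation of precision across bootstrap resamples.

    per_doc_counts holds (correct_extracted, total_extracted) per gold document.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(rounds):
        # Resample whole documents with replacement, then pool their counts.
        resample = rng.choices(per_doc_counts, k=len(per_doc_counts))
        correct = sum(c for c, _ in resample)
        total = sum(t for _, t in resample)
        if total:
            estimates.append(correct / total)
    return statistics.mean(estimates), statistics.stdev(estimates)


# Illustrative per-document results: quality varies a lot from document to document.
small_gold_set = [(8, 10), (3, 10), (10, 10), (5, 10), (9, 10),
                  (2, 10), (10, 10), (7, 10), (6, 10), (10, 10)]
large_gold_set = small_gold_set * 30  # 300 documents with the same mix

small_mean, small_spread = bootstrap_precision_spread(small_gold_set)
large_mean, large_spread = bootstrap_precision_spread(large_gold_set)
print(f"10 docs:  precision={small_mean:.2f} +/- {small_spread:.3f}")
print(f"300 docs: precision={large_mean:.2f} +/- {large_spread:.3f}")
```

If the spread is wider than the difference between two prompt variants you are comparing, the gold set is too small to tell them apart.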
Can I use the model to grade itself?
Model-assisted grading is useful for triage and scaling, but never let it replace a human-labeled gold set entirely. A model that makes a systematic extraction error will often make the same error when grading, hiding the very problem you need to find.
How do I measure recall when I do not know all the true triples?
You estimate it on the gold set where humans have enumerated the true triples for those documents. You cannot measure recall on unlabeled production data directly, which is exactly why a representative gold set is irreplaceable.
What does a sudden drop in conformance rate mean?
Usually a change in input distribution or a model update altering output formatting. Treat it as an early warning to investigate before the quality degradation reaches your stored graph.
Key Takeaways
- Measure at the triple level with precision, recall, and F1; a single accuracy number hides the failure mode you most need to see.
- Entity resolution accuracy and schema conformance are distinct quality dimensions that triple correctness alone does not capture.
- A small, stratified, reusable gold set is worth more than a large pile of unlabeled documents and becomes your regression suite.
- Instrument production with confidence signals and provenance coverage to catch distribution shifts early.
- Tie every metric to a threshold and an action, or it becomes decoration rather than a decision tool.