Best practices for extraction are easy to state as platitudes—"be specific," "validate your output"—and useless at that altitude. What follows is a set of practices earned from extraction projects that worked and ones that did not, each with the reasoning that makes it worth following. Take the reasoning seriously; it tells you when a practice applies and when your situation is the exception.
The through-line of every practice here is the same: a knowledge graph is only as good as its consistency, and consistency comes from constraint. Models, left unconstrained, produce fluent variety, and variety is poison to a graph that depends on matching entities and relationships across thousands of documents. Good practice is the disciplined application of constraint at every stage.
These are not ranked by importance, because their value depends on your domain and scale. Read them as a toolkit and apply the ones that address your actual risks.
Lead With the Schema, Always
Constraint Is the Whole Game
The schema is the first thing in your prompt because it governs everything after it. A closed list of entity and relation types, each with a one-line definition, forces the model into a vocabulary you can match across documents. Skip this and you spend the rest of the project fighting synonym fragmentation. Make the schema the foundation, not an afterthought.
Define Relations Operationally
Do not just name a relation; define when it applies. For example, "acquired: subject organization purchased a controlling stake in object organization" tells the model exactly where the boundary sits. Operational definitions cut ambiguity at the source and make your extraction reproducible. This rigor underpins the framework for structuring extraction prompts.
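One way to keep the schema closed and operational is to hold it as data and render it into the prompt, so the same vocabulary drives both the instructions and later checks. The sketch below is illustrative only: the entity types, the "employs" relation, and the names `RelationType`, `render_schema`, and `is_allowed_relation` are assumptions, and only the "acquired" definition comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationType:
    name: str
    definition: str  # operational: states when the relation applies


# Hypothetical closed vocabulary. Only the "acquired" definition is from the article.
ENTITY_TYPES = {
    "Organization": "a company, agency, or other named institution",
    "Person": "a named individual",
}
RELATION_TYPES = [
    RelationType(
        "acquired",
        "subject organization purchased a controlling stake in object organization",
    ),
    RelationType(
        "employs",
        "subject organization currently has object person on its staff",
    ),
]


def render_schema() -> str:
    """Render the closed vocabulary as the schema section of a prompt."""
    lines = ["Entity types (use only these):"]
    lines += [f"- {name}: {definition}" for name, definition in ENTITY_TYPES.items()]
    lines.append("Relation types (use only these):")
    lines += [f"- {r.name}: {r.definition}" for r in RELATION_TYPES]
    return "\n".join(lines)


def is_allowed_relation(name: str) -> bool:
    """True only for relation names in the closed list."""
    return name in {r.name for r in RELATION_TYPES}
```

Because the rendered text and the membership check share one source, a synonym such as "bought" can be rejected mechanically instead of fragmenting the graph.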
Ground Every Triple in Evidence
Require Source Spans
Make the model attach the exact supporting text to every triple. This serves two purposes: it discourages fabrication, because the model must point at evidence, and it makes verification mechanical, because you can check the span against the source. Source spans are the cheapest insurance against the most expensive failure.
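The mechanical check described here can be as small as a substring test. The following is a minimal sketch, assuming each triple is a dict with a `source_span` field (a field name chosen for illustration) and that differences in whitespace between the span and the source should be tolerated.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not cause false rejections."""
    return re.sub(r"\s+", " ", text).strip()


def span_is_grounded(span: str, source: str) -> bool:
    """True if the non-empty span occurs verbatim (modulo whitespace) in the source."""
    span_norm = normalize_whitespace(span)
    return bool(span_norm) and span_norm in normalize_whitespace(source)


def split_by_grounding(triples, source):
    """Separate triples whose span is found in the source from those whose span is not."""
    grounded, rejected = [], []
    for triple in triples:
        if span_is_grounded(triple.get("source_span", ""), source):
            grounded.append(triple)
        else:
            rejected.append(triple)
    return grounded, rejected
```

An exact-match rule is deliberately strict: a paraphrased span is rejected, which is usually the desired behavior when the span exists to prove the triple was not fabricated.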
Forbid Inference Unless It Is the Goal
By default, instruct the model to extract only stated facts. Inferred relationships are seductive and often wrong. If your use case genuinely needs inference, make it an explicit, separate, clearly labeled task so inferred edges never masquerade as stated ones.
Engineer the Output Contract
Demand Strict, Parseable Structure
Specify the exact JSON structure and field names, and require the model to emit only valid JSON. A strict contract is what lets you automate the pipeline; a loose one buys you endless cleanup code and intermittent parse failures. The corrective value of a strict contract is spelled out in Why Graph Extraction Prompts Silently Drop Half Your Entities.
Validate on Receipt, Not Later
Parse and schema-check every response the moment it arrives. Failing fast keeps corrupt data out of your graph and surfaces prompt drift immediately. Validation deferred is validation skipped.
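A fail-fast check of this kind might look like the sketch below. The contract shape (an object with a `triples` list), the field names, and the `ContractError` class are assumptions made for illustration; the allowed relation set would come from your own schema.

```python
import json

# Assumed contract: {"triples": [{"subject", "relation", "object", "source_span"}]}
REQUIRED_FIELDS = ("subject", "relation", "object", "source_span")
ALLOWED_RELATIONS = {"acquired", "employs"}  # hypothetical closed list


class ContractError(ValueError):
    """Raised when a model response breaks the output contract."""


def parse_response(raw: str) -> list:
    """Parse and schema-check one response; raise immediately on any violation."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"response is not valid JSON: {exc}") from exc

    triples = payload.get("triples") if isinstance(payload, dict) else None
    if not isinstance(triples, list):
        raise ContractError('expected a JSON object with a "triples" list')

    for i, triple in enumerate(triples):
        if not isinstance(triple, dict):
            raise ContractError(f"triple {i} is not an object")
        bad = [
            field for field in REQUIRED_FIELDS
            if not isinstance(triple.get(field), str) or not triple[field].strip()
        ]
        if bad:
            raise ContractError(f"triple {i} has missing or empty fields: {bad}")
        if triple["relation"] not in ALLOWED_RELATIONS:
            raise ContractError(
                f"triple {i} uses unknown relation {triple['relation']!r}"
            )
    return triples
```

Raising on the first violation, rather than repairing the response, keeps malformed output out of the graph and makes a rising error rate visible as soon as the prompt starts to drift.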
Treat Entity Resolution as a First-Class Step
Plan It Before You Scale
Duplicate entities are all but guaranteed once you process more than one document, because different documents name the same entity differently. Decide your canonical-name strategy and resolution approach before you scale, not after you discover a graph full of duplicates. Resolution is not cleanup; it is core infrastructure, as the end-to-end case study makes clear.
Keep a Reference List Where You Can
When you have a known set of entities, such as your customers, a drug database, or a company registry, give the model that list and ask it to map extracted names to canonical ones. Anchoring to a reference is far more consistent than free-form normalization, because the model chooses from a fixed set of names instead of inventing its own.
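Even when the model does the mapping, it helps to confirm on your side that every returned name really is in the reference list. The sketch below assumes a small in-memory reference with known aliases; the company names and the helpers `build_alias_index` and `resolve` are hypothetical.

```python
# Hypothetical reference list: canonical name -> known aliases.
REFERENCE_ENTITIES = {
    "Acme Corporation": ["Acme", "Acme Corp", "Acme Corp."],
    "Globex Industries": ["Globex"],
}


def build_alias_index(reference):
    """Map every canonical name and alias (case-insensitively) to its canonical name."""
    index = {}
    for canonical, aliases in reference.items():
        for name in [canonical, *aliases]:
            index[name.casefold()] = canonical
    return index


def resolve(name, index):
    """Return the canonical name, or None when the name is not in the reference."""
    return index.get(name.strip().casefold())


ALIAS_INDEX = build_alias_index(REFERENCE_ENTITIES)
```

Returning `None` for unknown names, instead of guessing, leaves a clear queue of unresolved entities to review and possibly add to the reference.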
Measure Relentlessly
Build a Gold-Standard Set Early
A small set of manually labeled documents lets you compute precision and recall. Without it you are guessing. Build it before you scale, because you cannot improve what you cannot measure, and you cannot trust what you have not measured.
Track Both Precision and Recall
Precision catches fabricated and mislabeled triples; recall catches the facts you missed. Optimizing one while ignoring the other produces a graph that is either full of noise or full of holes. Watch both and tune toward the balance your use case needs.
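As a concrete reading of the two metrics, the sketch below scores predicted triples against a gold set by exact match. Representing a triple as a `(subject, relation, object)` tuple and requiring exact equality are simplifying assumptions; a real evaluation may need entity resolution before comparison.

```python
def precision_recall(predicted, gold):
    """Score predicted triples against gold triples by exact tuple match.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold triples
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative data: four gold triples, three predictions, two of them correct.
gold = {
    ("Acme", "acquired", "Globex"),
    ("Acme", "employs", "J. Doe"),
    ("Initech", "acquired", "Hooli"),
    ("Initech", "employs", "R. Roe"),
}
predicted = {
    ("Acme", "acquired", "Globex"),
    ("Acme", "employs", "J. Doe"),
    ("Globex", "acquired", "Acme"),  # wrong direction: hurts precision
}
precision, recall = precision_recall(predicted, gold)
```

Here precision is 2/3 (one of three predictions is wrong) and recall is 1/2 (two of four gold facts were missed), which shows how a single run can be noisy and full of holes at the same time.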
Iterate on One Variable at a Time
Change, Measure, Repeat
When precision or recall disappoints, adjust a single element—a relation definition, the grounding rule, the worked example—and remeasure. Changing several things at once tells you nothing about what helped. Disciplined iteration converges; random tweaking wanders. The structured walkthrough in Walk Text Through a Triple-Producing Extraction Pipeline shows where each variable lives.
Keep Failing Cases as Regression Tests
Every time you fix a failure, add that case to a test set. As you tune the prompt you will reintroduce old bugs; a regression set catches them. Your test suite is the memory of every mistake you have already solved.
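A regression set can be a plain list of past failures plus a loop that replays them. In the sketch below, the case shown, the `must_include` and `must_exclude` keys, and the `extract` callable (standing in for whatever function runs your prompt and returns triples as tuples) are all assumptions for illustration.

```python
# Each case records a document that once failed and what the fix must preserve.
REGRESSION_CASES = [
    {
        "id": "passive-voice-acquisition",
        "text": "Globex was acquired by Acme in 2019.",
        "must_include": [("Acme", "acquired", "Globex")],
        "must_exclude": [("Globex", "acquired", "Acme")],
    },
]


def run_regression(extract, cases):
    """Replay every case through `extract` and report the ones that regressed."""
    failures = []
    for case in cases:
        produced = set(extract(case["text"]))
        missing = [t for t in case["must_include"] if t not in produced]
        unexpected = [t for t in case.get("must_exclude", []) if t in produced]
        if missing or unexpected:
            failures.append(
                {"id": case["id"], "missing": missing, "unexpected": unexpected}
            )
    return failures
```

Running this after every prompt change turns "did I reintroduce an old bug?" into a list that is either empty or names the exact case that broke.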
Record Provenance and Confidence
Attach Source to Every Triple
Beyond the source span used for verification, store which document and chunk produced each triple. When two sources disagree—one article says a company has three founders, another says two—provenance lets you resolve the conflict by examining origins rather than guessing. A graph without provenance is a graph you cannot audit or trust under scrutiny.
Capture Uncertainty Honestly
When the source hedges—"may reduce," "is believed to," "preliminary results suggest"—instruct the model to record that uncertainty in a confidence or status field rather than flattening it into a firm assertion. A graph that distinguishes established facts from tentative ones is far more useful than one that treats every claim as certain, a distinction that proved decisive in Three Real Extraction Jobs, From Contracts to Clinical Notes.
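Provenance and uncertainty both amount to extra fields on each stored triple. The record below is a sketch under assumptions: the field names, the "stated" and "tentative" status labels, and the `founder_count` relation are illustrative, and `find_conflicts` is only meaningful for relations that should have a single value.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    object: str
    source_span: str         # exact supporting text, used for verification
    document_id: str         # provenance: which document produced the triple
    chunk_id: int            # provenance: which chunk within that document
    status: str = "stated"   # assumed labels: "stated" or "tentative"


def find_conflicts(triples):
    """Group triples by (subject, relation) and keep groups whose objects disagree.

    Intended for single-valued relations only; multi-valued relations
    would be flagged spuriously.
    """
    by_claim = defaultdict(list)
    for triple in triples:
        by_claim[(triple.subject, triple.relation)].append(triple)
    return {
        claim: group
        for claim, group in by_claim.items()
        if len({t.object for t in group}) > 1
    }
```

With `document_id` and `chunk_id` on every record, a disagreement such as two versus three founders can be traced back to the two passages that produced it, and the `status` field keeps hedged claims from being read as settled ones.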
Match Effort to Stakes
Spend Where Errors Are Expensive
Not every project needs maximal rigor. A throwaway analysis can skip the gold-standard set; a compliance or medical graph cannot. Decide early how costly a wrong edge is, then invest in grounding, verification, and measurement proportionally. The practices here are a toolkit, and the skill is applying the right subset to your actual risk profile rather than ritually doing everything everywhere.
Right-Size the Schema to the Stakes
High-stakes graphs benefit from narrower, more precisely defined schemas, because every additional type is another surface for error. Low-stakes exploratory graphs can afford a broader schema that casts a wide net, since the cost of a stray edge is low. Let the consequences of being wrong dictate how tight you draw the vocabulary, rather than reaching for the same breadth on every project.
Keep Prompts Maintainable
Separate the Stable Scaffold From Variable Inputs
Structure the prompt so the durable parts (schema, grounding rule, output contract, worked example) stay fixed while the document text is the only thing that changes per call. This separation makes the prompt easy to version, review, and improve. It is the same modular discipline that makes a step-by-step extraction pipeline maintainable as it grows.
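In code, this separation can be a fixed template with a single placeholder. The scaffold wording below is an assumed, abbreviated example (it omits the worked example for brevity), and `SCAFFOLD` and `build_prompt` are hypothetical names; `string.Template` is used so the literal braces in the JSON contract need no escaping.

```python
from string import Template

# Stable scaffold: reviewed and versioned as one unit. Only $document varies.
SCAFFOLD = Template("""\
You extract knowledge-graph triples.

Schema (closed vocabulary):
- acquired: subject organization purchased a controlling stake in object organization

Grounding rule: extract only facts stated in the document, and quote the supporting text.

Output contract: return only valid JSON of the form
{"triples": [{"subject": "...", "relation": "...", "object": "...", "source_span": "..."}]}

Document:
$document
""")


def build_prompt(document_text: str) -> str:
    """Fill the one variable slot; everything else is identical on every call."""
    return SCAFFOLD.substitute(document=document_text)
```

Because substituted values are not rescanned for placeholders, a document containing a dollar sign passes through unchanged.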
Version Your Prompts Alongside Your Schema
When you change a relation definition or tighten the grounding rule, record it as a versioned change tied to the schema it belongs to. A graph assembled under prompt version two may differ subtly from version one, and knowing which version produced which triples is essential when you debug a quality regression months later.
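One lightweight way to know which version produced which triples is to stamp each triple with an identifier derived from the schema version and the prompt text. This is a sketch, not a prescribed scheme: the `prompt_fingerprint` and `stamp` helpers, the `prompt_version` field, and the version string format are assumptions.

```python
import hashlib


def prompt_fingerprint(scaffold: str, schema_version: str) -> str:
    """Derive a short identifier that changes whenever the scaffold or schema version does."""
    digest = hashlib.sha256(f"{schema_version}\n{scaffold}".encode("utf-8")).hexdigest()
    return f"{schema_version}+{digest[:12]}"


def stamp(triples, version: str):
    """Return copies of the triples with the producing prompt version attached."""
    return [{**triple, "prompt_version": version} for triple in triples]
```

A hash catches edits that nobody remembered to record, while the human-readable schema version in front keeps the identifier easy to search for when debugging months later.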
Frequently Asked Questions
What is the single most important practice?
Leading with a closed, operationally defined schema. Nearly every downstream problem—label drift, fragmentation, inconsistent output—traces back to an absent or vague schema. Get the schema right and most other practices become refinements rather than rescues.
How strict should my output format really be?
Very strict. Specify exact JSON structure and field names, require only valid JSON with no commentary, and validate on receipt. Looseness here is the difference between an automated pipeline and one that needs constant manual cleanup.
Should I always require source spans?
For any extraction you intend to trust, yes. Source spans discourage fabrication and make verification mechanical. The token cost is small relative to the confidence they provide. Drop them only for throwaway experiments.
When is inference acceptable in extraction?
Only when inferring relationships is an explicit goal, and even then keep it a separate, clearly labeled task. Inferred edges mixed silently with stated ones corrupt analysis, because consumers cannot tell which facts are grounded and which are guesses.
How do I balance precision and recall?
Decide which error is costlier for your use case. A compliance graph favors precision; a discovery tool may tolerate lower precision for higher recall. Tune the grounding rule and schema breadth toward that balance, and measure both metrics every iteration.
Why iterate on only one variable at a time?
Because changing several elements at once hides which change caused which effect. Isolating one variable per iteration tells you exactly what helped, so your prompt improves predictably instead of drifting through trial and error.
Key Takeaways
- Consistency is what makes a graph useful, and consistency comes from constraint applied at every stage.
- Lead with a closed schema whose relations are defined operationally—when each applies, not just its name.
- Ground every triple with a required source span and forbid inference unless it is an explicit, separate goal.
- Engineer a strict, parseable output contract and validate it on receipt rather than later.
- Treat entity resolution as core infrastructure, anchoring to a reference list whenever one exists.
- Build a gold-standard set early, track precision and recall together, and iterate on one variable at a time with a growing regression set.