Knowledge-graph extraction fails in quiet ways. The model returns confident, well-formatted output, you load it into a graph, and only weeks later do you notice that half the entities are missing, relationships are mislabeled, and a handful of edges describe facts that appear nowhere in the source. By then the bad data is tangled into everything downstream.
The reassuring news is that these failures are predictable. They cluster into a small set of recurring mistakes, almost all rooted in prompt design rather than model capability. Once you can name them, you can prevent them. This article walks through seven of the most damaging, explaining why each happens, what it costs, and how to correct it.
If you are building extraction you intend to trust, treat this as a pre-mortem. Each mistake below has cost real teams real rework. Knowing them in advance is far cheaper than discovering them in production.
Mistake 1: No Schema, So Labels Drift
Why It Happens
A prompt that says "extract entities and relationships" leaves the model to invent labels. It will, and it will invent different ones each time—"founded," "co-founded," "established"—for what should be a single relationship.
The Cost and the Fix
Your graph fragments into synonym nodes and edges that never join, so queries miss most of what they should find. The fix is a closed schema: enumerate the exact entity and relation types, define each in a line, and instruct the model to use only those. Every later step in the extraction process, from parsing to resolution to measurement, depends on labels staying consistent.
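A minimal sketch of what a closed schema might look like in practice. The type names, the definitions, and both helper functions are illustrative assumptions, not part of any library or of a specific prompt from this article.

```python
# Hypothetical closed schema: every name and definition here is illustrative.
ENTITY_TYPES = {
    "Person": "A named individual human.",
    "Organization": "A company, agency, or institution.",
}
RELATION_TYPES = {
    "FOUNDED": "Person created Organization, alone or with others.",
    "EMPLOYED_BY": "Person works for Organization.",
}


def build_schema_prompt() -> str:
    """Render the schema as prompt text that forbids labels outside it."""
    lines = ["Use ONLY the following types. Do not invent new labels.", "", "Entity types:"]
    lines += [f"- {name}: {desc}" for name, desc in ENTITY_TYPES.items()]
    lines += ["", "Relation types:"]
    lines += [f"- {name}: {desc}" for name, desc in RELATION_TYPES.items()]
    return "\n".join(lines)


def off_schema_labels(triples: list[dict]) -> list[str]:
    """Return the relation labels in model output that are not in the schema."""
    return sorted({t["relation"] for t in triples if t["relation"] not in RELATION_TYPES})
```

Checking the output against the same dictionaries that generated the prompt keeps the two from drifting apart: a label such as "co-founded" shows up in `off_schema_labels` instead of silently becoming a new edge type.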
Mistake 2: Letting the Model Infer Beyond the Text
Why It Happens
Models are trained to be helpful, so they fill gaps. Tell them to find employment relationships and they will infer that a quoted executive is employed by the company even when the text never says so.
The Cost and the Fix
Your graph fills with plausible fabrications that look like facts and corrupt any analysis built on them. The fix is an explicit grounding rule: extract only relationships actually stated in the text, and omit anything that requires inference. Pair it with a required source span so every edge points at its evidence.
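One way to enforce the source-span requirement mechanically is sketched here. The field name `source_span` and the function `split_by_grounding` are assumptions made for illustration. The check only confirms that the quoted span appears verbatim in the text; it cannot confirm that the span actually expresses the relationship, so a manual sample is still needed.

```python
def split_by_grounding(triples: list[dict], source_text: str) -> tuple[list[dict], list[dict]]:
    """Separate triples whose span is quoted verbatim from the source from those whose span is not.

    A triple with a missing or empty span counts as ungrounded.
    """
    grounded, ungrounded = [], []
    for triple in triples:
        span = triple.get("source_span", "").strip()
        if span and span in source_text:
            grounded.append(triple)
        else:
            ungrounded.append(triple)
    return grounded, ungrounded
```

Triples that land in the ungrounded list are candidates for rejection or review, since the model either paraphrased its evidence or had none.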
Mistake 3: No Output Contract, So Parsing Breaks
Why It Happens
Without a precise format instruction, the model wraps its answer in commentary, uses inconsistent field names, or mixes prose with data. Each response looks reasonable but differs structurally from the last.
The Cost and the Fix
Your parser fails intermittently, and you lose data or spend hours writing brittle cleanup logic. The fix is a strict output contract: specify exact JSON structure and field names, demand only valid JSON with no surrounding text, and validate immediately on receipt.
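Validation on receipt can be as small as the following sketch, which uses only Python's standard `json` module. The envelope shape (an object with a `triples` list) and the four field names are assumptions chosen for this example, not a format the article prescribes.

```python
import json

# Assumed contract for this sketch: {"triples": [{subject, relation, object, source_span}, ...]}
REQUIRED_FIELDS = {"subject", "relation", "object", "source_span"}


def parse_triples(raw: str) -> list[dict]:
    """Parse a model response, raising ValueError on any contract violation."""
    try:
        data = json.loads(raw)  # fails on commentary wrapped around the JSON
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("triples"), list):
        raise ValueError('expected an object with a "triples" list')
    for i, triple in enumerate(data["triples"]):
        if not isinstance(triple, dict) or set(triple) != REQUIRED_FIELDS:
            raise ValueError(f"triple {i} does not have exactly the required fields")
        if not all(isinstance(v, str) and v for v in triple.values()):
            raise ValueError(f"triple {i} has an empty or non-string field")
    return data["triples"]
```

Failing loudly on the first violation is a deliberate choice in this sketch: a rejected response can be retried or logged, whereas a leniently repaired one hides how often the contract is being broken.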
Mistake 4: Ignoring Entity Resolution
Why It Happens
The same entity appears under many surface forms—"IBM," "International Business Machines," "the company." Each extraction call sees only its local text and produces whatever form the text used.
The Cost and the Fix
Your graph holds three nodes for one company, so the facts about it never connect and queries return a fraction of the truth. The fix is a dedicated resolution pass that canonicalizes names and merges variants into a single node, as detailed in Turning Unstructured Text Into Connected Entity Graphs.
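A resolution pass can start as a simple alias lookup, as in this sketch. The alias table and function names are hypothetical, and real systems often add fuzzy matching or a model-assisted merge step. Note that a lookup like this cannot resolve context-dependent mentions such as "the company", which need the surrounding text to identify.

```python
import re

# Hypothetical alias table mapping normalized surface forms to one canonical name.
ALIASES = {
    "ibm": "IBM",
    "international business machines": "IBM",
    "international business machines corporation": "IBM",
}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def canonical(name: str) -> str:
    """Return the canonical name if the surface form is known, else the name unchanged."""
    return ALIASES.get(normalize(name), name.strip())


def resolve(triples: list[dict]) -> list[dict]:
    """Rewrite subject and object of each triple to canonical names."""
    return [
        {**t, "subject": canonical(t["subject"]), "object": canonical(t["object"])}
        for t in triples
    ]
```

Unknown names pass through unchanged, so the pass is safe to run on every triple and the alias table can grow as new variants are discovered.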
Mistake 5: Splitting Documents Without Overlap
Why It Happens
Long documents must be chunked to fit the context window, and the easiest split is a hard cut at a fixed length. Relationships described across that cut get severed.
The Cost and the Fix
Facts that span a boundary vanish from your graph, a silent recall loss you will not notice without measurement. The fix is to overlap chunks by a sentence or two and, where needed, to add a linking pass that connects entities across chunk boundaries using consistent identifiers.
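Sentence-level overlap can be sketched as follows. The function name, its defaults, and the naive regex sentence splitter are assumptions for illustration; a production pipeline would likely use a proper sentence tokenizer and size chunks by tokens rather than sentence count.

```python
import re


def chunk_sentences(text: str, max_sentences: int = 5, overlap: int = 2) -> list[str]:
    """Split text into chunks of up to max_sentences, repeating `overlap` sentences between neighbors."""
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must be non-negative and smaller than max_sentences")
    # Naive splitter: break after ., !, or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = max_sentences - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + max_sentences]))
        if start + max_sentences >= len(sentences):
            break  # the last sentence is already covered
    return chunks
```

Because neighboring chunks share sentences, the same triple may be extracted twice; deduplicating on subject, relation, and object after extraction handles that.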
Mistake 6: Trusting Output Without Measurement
Why It Happens
The output is fluent and well-formatted, so it feels correct. Teams skip building a gold-standard set because it takes effort, then have no way to know how much they are missing.
The Cost and the Fix
You ship a graph with unknown precision and recall and discover the gaps only when downstream results look wrong. The fix is a small gold-standard set of manually labeled documents and routine precision and recall measurement, the same discipline behind Ship-Ready Verification Steps for Graph Extraction Prompts.
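Measuring against a gold set takes very little code. This sketch assumes triples are represented as `(subject, relation, object)` tuples and compared by exact match after resolution; both the representation and the function name are illustrative.

```python
Triple = tuple[str, str, str]


def precision_recall(predicted: list[Triple], gold: list[Triple]) -> tuple[float, float]:
    """Exact-match precision and recall of predicted triples against a gold set."""
    pred, truth = set(predicted), set(gold)
    true_positives = len(pred & truth)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

Low precision points toward fabricated or mislabeled edges, while low recall points toward dropped entities or facts lost at chunk boundaries, so the two numbers together suggest which mistake to look for first.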
Mistake 7: An Overstuffed Schema From Day One
Why It Happens
Enthusiasm leads teams to extract every entity and relation type imaginable in the first prompt. The model now juggles dozens of types and applies them inconsistently.
The Cost and the Fix
Accuracy drops across the board because attention is spread thin, and debugging becomes impossible because everything is failing a little. The fix is to scope the schema to the questions the graph must answer, start narrow, and expand deliberately once the core extraction is reliable—an approach reinforced throughout Entities, Relations, and Triples: Graph Extraction From Scratch.
Catching These Before Production
Run a Pre-Mortem on Your Prompt
Before scaling, read your prompt against this list. Does it have a closed schema? A grounding rule? A strict output contract? A plan for entity resolution and chunk overlap? A gold-standard set? A schema scoped to the questions the graph must answer? Each missing element is a known failure waiting to happen.
Test on Adversarial Inputs
Deliberately feed the prompt hard cases: ambiguous pronouns, entities under multiple names, relationships split across sentences, and facts that are implied but not stated. How the prompt handles these reveals which mistakes still lurk.
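Hard cases are easier to rerun if they live in a small fixture. The sketch below is one possible shape: the case texts, the relation labels, and the `extract` callable (standing in for whatever function wraps your prompt and parser) are all hypothetical.

```python
from typing import Callable

Triple = tuple[str, str, str]

# Hypothetical hard cases; relation labels are illustrative.
CASES = [
    {
        "name": "implied_not_stated",
        "text": "Jane Doe praised Acme's new product.",
        "expected": set(),  # no employment is stated, so nothing should be extracted
    },
    {
        "name": "split_across_sentences",
        "text": "Jane Doe started a company in 2010. She named it Acme.",
        "expected": {("Jane Doe", "FOUNDED", "Acme")},
    },
]


def failing_cases(extract: Callable[[str], set[Triple]], cases: list[dict] = CASES) -> list[str]:
    """Return the names of cases where the extractor's output differs from the expected triples."""
    return [case["name"] for case in cases if extract(case["text"]) != case["expected"]]
```

Each failing name maps back to a mistake in the list: an extra edge on the implied case suggests a missing grounding rule, and a miss on the split case suggests a pronoun or chunking problem.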
How These Mistakes Compound
Small Errors Multiply at Scale
Each mistake on its own is survivable on a handful of documents. The danger is that they compound. Label drift fragments entities, missing resolution multiplies those fragments, and no measurement means you never see the damage accumulating. Even a 5 percent error rate at each stage adds up, because every stage inherits the errors of the one before it, and by the time the graph reaches a few thousand documents many queries can return partial or wrong answers. Catching the mistakes early, before scale amplifies them, is the entire point of treating extraction as engineering rather than a one-off prompt.
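A rough way to see the compounding is to assume, as a simplification, that each stage independently corrupts a fixed fraction of the facts that reach it. The function below is a back-of-the-envelope model under that assumption, not a measurement of any real pipeline.

```python
def surviving_fraction(error_rate: float, stages: int) -> float:
    """Fraction of facts still correct after `stages` independent stages, each with `error_rate`."""
    if not 0.0 <= error_rate <= 1.0 or stages < 0:
        raise ValueError("error_rate must be in [0, 1] and stages must be non-negative")
    return (1.0 - error_rate) ** stages
```

Under this model, a 5 percent error rate leaves about 86 percent of facts correct after three stages and about 77 percent after five. A query that needs several correct edges at once fares worse still, since every edge it touches must have survived.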
Why Fluency Hides the Damage
The reason these errors stay invisible is that the model's output never looks broken. Fabricated edges read like real facts, drifted labels look like reasonable word choices, and dropped entities leave no trace at all. Nothing in the output signals a problem; only querying the assembled graph or measuring against a gold set reveals it. This is why teams that rely on eyeballing a few examples ship broken graphs—the surface always looks fine. The discipline of measurement is what pierces the illusion, the same lesson dramatized in How a Research Team Mapped 4,000 Papers Into One Graph.
Frequently Asked Questions
Why does my extraction look great in demos but fail at scale?
Demos use short, clean text where most failure modes do not surface. At scale you hit long documents, duplicate entities, inconsistent labels, and edge cases. The fix is to test on realistic, messy, lengthy inputs and to measure precision and recall rather than eyeballing a few examples.
How do I know if the model is inventing relationships?
Require a source span for every triple and verify a sample against the text. If a triple's span does not actually express the relationship, the model inferred or invented it. A grounding rule that forbids inference dramatically reduces these fabricated edges.
Is a missing schema really that damaging?
Yes. Without a closed schema the model invents inconsistent labels, and your graph fragments into synonym nodes and edges that never connect. It is the single most common and most destructive mistake in extraction prompts.
Should I worry about entity resolution from the start?
Plan for it from the start even if you implement it as a later pass. Duplicate entities are all but guaranteed once you process more than one document, and they quietly erode your graph's usefulness by splitting the facts about one entity across several nodes. A resolution step is not optional at scale.
How big does my gold-standard set need to be?
Even a handful of carefully labeled documents is enough to start measuring precision and recall and to catch systematic errors. Grow it as you encounter new edge cases. A small, honest gold set beats no measurement by a wide margin.
Can a better model eliminate these mistakes?
A stronger model helps at the margins, but most of these failures come from prompt design—missing schemas, no grounding rule, no output contract, no resolution. Fix the prompt and process first; model upgrades give diminishing returns on problems the prompt created.
Key Takeaways
- Extraction usually fails quietly: well-formatted output hides missing entities, drifting labels, and fabricated edges.
- The most damaging mistake is no closed schema, which fragments the graph into synonyms that never join.
- Let the model infer beyond the text and you fill the graph with plausible fabrications; a grounding rule plus source spans prevents this.
- Skipping any one of the output contract, entity resolution, or chunk overlap causes silent data loss at scale.
- Trusting fluent output without precision and recall measurement means shipping a graph with unknown quality.
- An overstuffed schema spreads attention thin; scope to the questions you must answer and expand deliberately.