Pulling Clean Structured Data Out of Language Models

When teams start using language models to pull information out of documents, emails, transcripts, or messy spreadsheets, the same handful of questions surface again and again. Why does the model sometimes invent fields that were never in the source? How do you guarantee the output is valid JSON every time? What do you do when a document has the data you need but expresses it in three different ways?

These are not abstract concerns. They are the practical obstacles that stand between a promising demo and a production pipeline that processes thousands of records without a human babysitting it. The answers are knowable, and most of them come down to a few well-understood techniques rather than secret tricks.

This article collects the questions we hear most often from teams building extraction workflows and answers each one directly. Where a topic deserves deeper treatment, you will find links to companion pieces that go further.

What Exactly Counts as Data Extraction Prompting?

Data extraction prompting is the practice of instructing a language model to read unstructured or semi-structured input and return specific, structured output. The input might be a contract, a customer support ticket, a PDF invoice, or a block of free-text notes. The output is usually a defined set of fields: names, dates, amounts, categories, or relationships.

How is it different from summarization?

Summarization asks the model to compress meaning. Extraction asks it to locate and transcribe specific values without interpretation. A summary of an invoice might say "a recent purchase of office supplies." An extraction would return the vendor, line items, subtotal, tax, and total exactly as they appear. The distinction matters because extraction tasks demand fidelity to the source, while summaries tolerate paraphrase.

Does the source need to be clean?

No, and that is the point. Models handle inconsistent formatting, typos, and varied phrasing far better than regular expressions or rigid parsers. The tradeoff is that you give up the deterministic guarantees of traditional parsing, which is why validation matters so much.

Why Does the Model Make Up Data That Is Not There?

Fabricated values, often called hallucinations, are the single biggest fear teams bring to extraction work. The model returns a phone number or a date that looks plausible but appears nowhere in the source.

What causes it?

Most fabrication happens for two reasons. First, the prompt implies that every field must be filled, so the model fills them rather than leaving blanks. Second, the field is genuinely absent and the model defaults to its training priors instead of admitting the gap.

How do you reduce it?

Explicitly permit null values. Tell the model to return null when a field is not present in the source. An empty string also works, as long as you pick one convention and use it everywhere.
Ask the model to quote the supporting text. Requiring a short verbatim snippet for each extracted value pushes the model to ground its answer in the document, and gives you a string you can check against the source.
Keep the input window focused. Long, irrelevant context increases the odds the model pulls from the wrong place.
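The supporting-quote idea can also be enforced in code rather than left to the model's good behavior. The sketch below is a minimal illustration, assuming the model returns each field as a value plus a quote; that response shape and the function name drop_ungrounded are hypothetical choices for this example, not a standard.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wraps do not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_ungrounded(source: str, extracted: dict) -> dict:
    """Null out any field whose supporting quote is not found in the source.

    `extracted` maps field name -> {"value": ..., "quote": ...}; this shape
    is an assumption made for the example.
    """
    haystack = _normalize(source)
    checked = {}
    for field, item in extracted.items():
        quote = _normalize(item.get("quote") or "")
        grounded = bool(quote) and quote in haystack
        checked[field] = item["value"] if grounded else None
    return checked


source = "Invoice 1042 from Acme Corp.\nTotal due: $310.00"
extracted = {
    "vendor": {"value": "Acme Corp", "quote": "from Acme Corp."},
    "phone": {"value": "555-0100", "quote": "Call 555-0100"},
}
print(drop_ungrounded(source, extracted))
# {'vendor': 'Acme Corp', 'phone': None}
```

The phone number is dropped because its quote never appears in the source. A substring check like this only confirms that the quote exists, not that the value was read from it correctly, so it complements an evaluation set rather than replacing one.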

The 7 Common Mistakes with Prompting for Data Extraction (and How to Avoid Them) piece covers fabrication patterns in more depth, including the subtle ones that pass casual review.

How Do I Get Reliable JSON Output?

Structured output is the workhorse format for extraction because downstream systems can consume it directly. But early attempts often produce JSON wrapped in prose, trailing commas, or fields that drift from the schema.

Should I describe the schema in the prompt or use a tool?

Both approaches work, and the best one depends on your model and platform. Describing the schema inline, with an example, gives the model a clear target. Many modern model APIs also offer a structured-output or function-calling mode that constrains generation to a schema you define. When that mode is available, prefer it, because it eliminates an entire class of parsing failures. Keep in mind that a constrained mode guarantees the shape of the output, not the correctness of the values inside it.
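For the inline approach, one option is to write the schema once in JSON Schema form and paste it into the prompt. The sketch below assumes a small invoice schema; the field names, the INVOICE_SCHEMA constant, and the schema_prompt helper are all illustrative. The same dictionary could be handed to a structured-output mode if your platform accepts JSON Schema, but check its documentation for the exact parameter.

```python
import json

# Hypothetical invoice schema; every field allows null so the model can
# report a missing value instead of inventing one.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": ["string", "null"]},
        "invoice_date": {"type": ["string", "null"], "description": "ISO 8601 date"},
        "total": {"type": ["number", "null"]},
    },
    "required": ["vendor", "invoice_date", "total"],
    "additionalProperties": False,
}


def schema_prompt(document: str) -> str:
    """Build an extraction prompt that embeds the schema inline."""
    return (
        "Extract the fields described by this JSON Schema. "
        "Use null for any field that is not present in the document. "
        "Respond with a single JSON object and nothing else.\n\n"
        f"Schema:\n{json.dumps(INVOICE_SCHEMA, indent=2)}\n\n"
        f"Document:\n{document}"
    )


print(schema_prompt("Invoice from Acme Corp, total $310.00"))
```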

What if I cannot use a constrained mode?

Provide one complete example of valid output, specify that the response must be JSON and nothing else, and validate the result programmatically. If parsing fails, retry with the error message appended to the prompt. The Best Practices That Actually Work guide walks through a resilient parse-and-retry loop.
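A parse-and-retry loop of the kind described above might look like the following sketch. It is not the loop from the companion guide. The call_model parameter is a placeholder for whatever function sends a prompt to your model and returns its text reply, and the required-field check is a deliberately simple stand-in for fuller schema validation.

```python
import json
from typing import Callable


def extract_json(prompt: str, call_model: Callable[[str], str],
                 required: set[str], max_attempts: int = 3) -> dict:
    """Call the model, parse the reply as JSON, and retry with the error appended."""
    attempt_prompt = prompt
    last_error = "no attempts made"
    for _ in range(max_attempts):
        reply = call_model(attempt_prompt)
        try:
            data = json.loads(reply)
            if not isinstance(data, dict):
                raise ValueError("top-level value must be a JSON object")
            missing = required - set(data)
            if missing:
                raise ValueError(f"missing fields: {sorted(missing)}")
            return data
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = str(exc)
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {last_error}\n"
                "Return only the corrected JSON object."
            )
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts: {last_error}")


# Stand-in model: the first reply wraps the JSON in prose, the second is clean.
replies = iter(['Sure! {"vendor": "Acme"}', '{"vendor": "Acme", "total": "310.00"}'])
result = extract_json("Extract vendor and total.", lambda _prompt: next(replies),
                      {"vendor", "total"})
print(result)
```

Capping the number of attempts matters: a document the model cannot handle should fail loudly and land in a review queue rather than loop forever.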

How Many Examples Should I Include in the Prompt?

The question of zero-shot versus few-shot is one of the most practical you will face.

When is zero-shot enough?

For simple, common field types such as dates, names, and amounts, a clear instruction with a schema often performs well with no examples at all. Start here, because every example you add costs tokens and latency.

When do examples help?

Add examples when the task has ambiguity that instructions alone cannot resolve. If you need a specific date format, a particular categorization scheme, or consistent handling of edge cases, one or two well-chosen examples teach the pattern faster than paragraphs of rules. Choose examples that cover the tricky cases, not the obvious ones.
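As a rough illustration of adding examples to a prompt, the sketch below assembles a few-shot prompt from input and output pairs. The build_prompt helper and the Input/Output layout are assumptions for this example; any consistent layout works as long as the examples and the real document use the same one.

```python
import json


def build_prompt(instruction: str, examples: list[tuple[str, dict]], document: str) -> str:
    """Assemble instruction, worked examples, and the document to extract from."""
    parts = [instruction]
    for source, output in examples:
        parts.append(f"Input:\n{source}\nOutput:\n{json.dumps(output)}")
    parts.append(f"Input:\n{document}\nOutput:")
    return "\n\n".join(parts)


# One example aimed at a tricky case: an ambiguous, informally written date.
examples = [
    ("Let's meet on the 7th of March, 2024.", {"date": "2024-03-07"}),
]
prompt = build_prompt(
    "Extract the meeting date as YYYY-MM-DD. Use null if no date is given.",
    examples,
    "See you 3/4/24 at noon.",
)
print(prompt)
```

Passing an empty examples list yields the zero-shot version of the same prompt, which makes it easy to compare the two on your evaluation set.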

How Do I Handle Documents Longer Than the Context Window?

Long documents create two problems: they may exceed the model's input limit, and even within the limit, relevant details can get lost in a sea of irrelevant text.

What is the standard approach?

Chunk the document into sections, extract from each chunk, then merge the results. For documents with clear structure, split on natural boundaries such as headings or pages. For continuous text, use overlapping windows so a value that straddles a boundary is not cut in half.
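For continuous text, overlapping windows can be sketched in a few lines. This version counts characters for simplicity; a production pipeline would more likely count tokens with the model's own tokenizer, and the default sizes here are arbitrary placeholders.

```python
def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Choose an overlap at least as long as the longest value you expect to extract, so that any value cut by one boundary appears whole in a neighboring chunk.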

How do I avoid duplicates after merging?

Deduplicate on a stable key, such as an identifier or a normalized combination of fields. When the same value appears in multiple chunks, keep the one with the strongest supporting context. The Real-World Examples and Use Cases article shows a full chunk-and-merge flow on a real contract set.
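A merge step along these lines might look like the sketch below. The party, date, and quote field names are hypothetical, and quote length is used as a crude proxy for "strongest supporting context"; substitute whatever signal fits your documents.

```python
def normalize_key(record: dict) -> tuple:
    """Build a stable key from the fields that identify a record."""
    party = " ".join(str(record.get("party", "")).lower().split())
    date = str(record.get("date", "")).strip()
    return (party, date)


def merge_records(records: list[dict]) -> list[dict]:
    """Keep one record per key, preferring the longest supporting quote."""
    best: dict[tuple, dict] = {}
    for record in records:
        key = normalize_key(record)
        current = best.get(key)
        if current is None or len(record.get("quote", "")) > len(current.get("quote", "")):
            best[key] = record
    return list(best.values())


records = [
    {"party": "Acme  Corp", "date": "2024-01-05", "quote": "Acme Corp"},
    {"party": "acme corp", "date": "2024-01-05",
     "quote": "between Acme Corp and the Buyer, dated 2024-01-05"},
    {"party": "Globex", "date": "2024-02-01", "quote": "Globex"},
]
merged = merge_records(records)
print(len(merged))
# 2
```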

How Do I Know the Extraction Is Accurate?

Trusting an extraction pipeline requires measurement, not optimism.

What should I measure?

Build a labeled evaluation set of representative documents with known correct answers. Then measure precision (of the values the model returned, how many were correct) and recall (of the values that should have been found, how many the model caught). These two numbers tell you whether your pipeline tends toward fabrication or omission.
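Field-level precision and recall for a single document can be computed as in the sketch below. It assumes exact-match scoring and that both the prediction and the label use None for absent fields. Returning 1.0 when there is nothing to score is a convention chosen for this example, not a universal rule.

```python
def precision_recall(predicted: dict, expected: dict) -> tuple[float, float]:
    """Score one document. Both dicts map field name -> value, None meaning absent."""
    returned = {f: v for f, v in predicted.items() if v is not None}
    wanted = {f: v for f, v in expected.items() if v is not None}
    correct = sum(1 for f, v in returned.items() if wanted.get(f) == v)
    precision = correct / len(returned) if returned else 1.0
    recall = correct / len(wanted) if wanted else 1.0
    return precision, recall


predicted = {"vendor": "Acme", "total": "310.00", "phone": "555-0100"}
expected = {"vendor": "Acme", "total": "310.00", "phone": None,
            "date": "2024-01-05", "po_number": "77"}
p, r = precision_recall(predicted, expected)
print(round(p, 2), round(r, 2))
# 0.67 0.5
```

Here the fabricated phone number lowers precision, while the missed date and PO number lower recall. Exact matching is strict; normalize dates and amounts before comparing if formatting differences should not count as errors.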

How big does the eval set need to be?

A set as small as fifty carefully labeled documents can reveal most systematic problems. The goal is not statistical perfection but catching the failure modes that matter for your use case. Re-run the eval whenever you change the prompt or switch models.

Frequently Asked Questions

Can I extract data from images and scanned PDFs?

Yes, using models with vision capability or by running optical character recognition first and then extracting from the resulting text. Vision-capable models often handle layout-heavy documents like tables and forms better than a text-only pipeline, because they preserve spatial relationships that plain OCR loses.

Is a language model better than regular expressions for extraction?

It depends on the input. For rigidly formatted data where every record looks identical, regular expressions are faster, cheaper, and deterministic. For varied, messy, or natural-language input, models are far more robust. Many strong pipelines combine the two, using regex for the predictable parts and models for the rest.
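One way to combine the two is to try a regular expression first and call the model only when it fails. The sketch below assumes a made-up invoice ID format (INV- followed by six digits) and a model_fallback callable that stands in for your model call.

```python
import re
from typing import Callable, Optional

# Hypothetical fixed format for illustration: "INV-" followed by six digits.
INVOICE_RE = re.compile(r"\bINV-\d{6}\b")


def extract_invoice_id(text: str,
                       model_fallback: Callable[[str], Optional[str]]) -> Optional[str]:
    """Use the cheap, deterministic regex first; fall back to the model otherwise."""
    match = INVOICE_RE.search(text)
    if match:
        return match.group(0)
    return model_fallback(text)


calls = []


def fake_model(text: str) -> Optional[str]:
    """Stand-in for a real model call; records that it was invoked."""
    calls.append(text)
    return None


print(extract_invoice_id("Ref INV-004217 attached", fake_model))
# INV-004217
print(extract_invoice_id("invoice number is in the scan", fake_model))
# None
```

Only the second call reaches the model, so well-formatted records cost nothing beyond the regex match.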

How do I keep extraction costs under control?

Use the smallest model that meets your accuracy bar, trim irrelevant context before sending input, prefer zero-shot when it works, and cache results for documents you may process more than once. Batch processing, where supported, also reduces per-record overhead.

What is the best way to handle fields that are sometimes missing?

Design your schema to allow nulls, instruct the model to return null for absent fields rather than guessing, and treat a high rate of nulls as a signal to inspect your source data rather than a failure of the prompt.

Should I trust extraction output without human review?

It depends on the stakes. For high-stakes use cases, route low-confidence results to human review and let high-confidence results flow through automatically. A confidence signal, such as whether the model could quote supporting text, makes this triage practical.
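A triage rule based on supporting quotes can be very small. The sketch below assumes each field arrives as a value plus a quote, which is a hypothetical response shape, and sends a record to review whenever any non-null value lacks a quote. Real pipelines often add further signals, such as schema validation results.

```python
def route(record: dict) -> str:
    """Return "auto" only when every non-null value comes with a supporting quote."""
    for item in record.values():
        if item["value"] is not None and not item.get("quote"):
            return "review"
    return "auto"


confident = {
    "vendor": {"value": "Acme Corp", "quote": "from Acme Corp."},
    "phone": {"value": None, "quote": None},
}
shaky = {
    "vendor": {"value": "Acme Corp", "quote": "from Acme Corp."},
    "total": {"value": "310.00", "quote": ""},
}
print(route(confident), route(shaky))
# auto review
```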

Key Takeaways

Extraction prompting transcribes specific values with fidelity, unlike summarization, which compresses meaning.
Fabrication usually comes from prompts that force every field to be filled; permit nulls and require supporting quotes.
Prefer constrained structured-output modes for reliable JSON, and validate with a parse-and-retry loop when you cannot.
Start zero-shot and add examples only to resolve genuine ambiguity around format or categorization.
Chunk long documents, merge with stable deduplication keys, and measure precision and recall on a labeled eval set before trusting the pipeline.

Agency Script Editorial
