Pull a price, a date, and a vendor name out of an invoice and you face a deceptively simple question: how should you ask the model to do it? You can write a loose natural-language instruction, hand the model a strict schema, lean on a handful of worked examples, or fine-tune a model on labeled documents. Each path produces working output on a demo. The differences only show up later, when volume climbs, document formats drift, and someone has to maintain the thing.
The mistake teams make is treating this as a search for the single best technique. There is no best technique. There is only the approach that fits the accuracy you need, the budget you have, and the rate at which your inputs change. A one-off research task and a pipeline processing fifty thousand contracts a month deserve completely different answers, even when the extraction looks identical on the surface.
This article lays out the real competing approaches, names the axes that actually move the decision, and gives you a rule you can apply without re-litigating the question every time.
The Approaches Actually in Play
Most extraction work falls into one of four buckets, and conflating them is where the trouble starts.
Loose natural-language prompting
You describe what you want in plain English: "Pull out the total amount, the invoice date, and the supplier." Fast to write, forgiving of weird inputs, but the output shape wanders. You get "Jan 3, 2024" one call and "2024-01-03" the next. Good for exploration, poor for anything downstream code has to parse.
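To make the cost of wandering output concrete, here is a minimal sketch of the defensive parsing that loose prompting pushes onto downstream code. The `KNOWN_FORMATS` tuple and the `normalize_date` helper are hypothetical names, and the format list is an assumption about what a model might return, not an exhaustive set.

```python
from datetime import date, datetime

# Hypothetical list of date shapes a loosely prompted model might emit.
# Month names are parsed using the default (English/C) locale.
KNOWN_FORMATS = ("%Y-%m-%d", "%b %d, %Y", "%d %B %Y", "%m/%d/%Y")


def normalize_date(text: str) -> date:
    """Try each known format in turn; raise ValueError if none matches."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")
```

Every new shape the model invents means another entry in that tuple, which is the maintenance cost a schema avoids.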
Schema-constrained prompting
You define the exact output structure — field names, types, allowed values — and force the model to fill it. Structured outputs, JSON mode, and tool-call schemas all live here. You trade a little flexibility for output you can validate and ingest without defensive parsing.
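As an illustration of what "validate and ingest" can look like, the sketch below checks a model response against a fixed three-field shape using only the standard library. The `Invoice` type, the field names, and the `parse_invoice` function are assumptions made for this example; a real pipeline would more likely rely on the provider's structured-output feature plus a schema library.

```python
import json
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class Invoice:
    vendor: str
    invoice_date: date
    total: Decimal


def parse_invoice(raw: str) -> Invoice:
    """Validate a model response against the expected shape, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    expected = {"vendor", "invoice_date", "total"}
    if set(data) != expected:
        raise ValueError(f"expected fields {sorted(expected)}, got {sorted(data)}")

    vendor = data["vendor"]
    if not isinstance(vendor, str) or not vendor.strip():
        raise ValueError("vendor must be a non-empty string")

    try:
        invoice_date = date.fromisoformat(data["invoice_date"])
    except (TypeError, ValueError) as exc:
        raise ValueError("invoice_date must be an ISO date (YYYY-MM-DD)") from exc

    try:
        total = Decimal(str(data["total"]))
    except InvalidOperation as exc:
        raise ValueError("total must be a decimal number") from exc
    if not total.is_finite():
        raise ValueError("total must be a finite number")

    return Invoice(vendor=vendor.strip(), invoice_date=invoice_date, total=total)
```

A response that passes becomes a typed record, and one that fails raises a single error type the caller can log or retry on.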
Few-shot example prompting
You show the model two to five worked examples of input and correct extraction, then give it the real input. This pulls accuracy up sharply on ambiguous fields and unusual formats, at the cost of longer prompts and the work of curating examples.
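One way to assemble such a prompt is sketched below. The two example documents, the instruction wording, and the `build_few_shot_prompt` helper are invented for illustration. In practice the examples would be drawn from your own hard cases.

```python
import json

# Hypothetical worked examples: (input text, correct extraction).
EXAMPLES = [
    (
        "Invoice from Acme Corp dated 3 January 2024. Amount due: $1,250.00",
        {"vendor": "Acme Corp", "invoice_date": "2024-01-03", "total": "1250.00"},
    ),
    (
        "Globex GmbH | Rechnung 14.02.2024 | Gesamt 980,50 EUR",
        {"vendor": "Globex GmbH", "invoice_date": "2024-02-14", "total": "980.50"},
    ),
]


def build_few_shot_prompt(document: str, examples=EXAMPLES) -> str:
    """Place the worked examples before the real input, in one consistent layout."""
    parts = ["Extract vendor, invoice_date (YYYY-MM-DD), and total as JSON."]
    for text, extraction in examples:
        parts.append(f"Document:\n{text}\nExtraction:\n{json.dumps(extraction)}")
    # The real input ends with an open "Extraction:" slot for the model to fill.
    parts.append(f"Document:\n{document}\nExtraction:")
    return "\n\n".join(parts)
```

Keeping the examples as data rather than as text pasted into a template makes it easier to swap them out as new hard cases appear.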
Fine-tuned extraction
You collect labeled documents and train a model specifically for your extraction task. Highest ceiling on accuracy and lowest per-call cost at scale, but it carries data-labeling effort, training overhead, and a model that quietly goes stale when your documents change.
The Axes That Actually Decide It
Pick the wrong axis to optimize and you will engineer a solution to a problem you do not have.
Accuracy tolerance
The first question is how much a wrong field costs. Extracting tags for a content recommender tolerates noise; extracting payment amounts for an accounting system does not. High-stakes fields push you toward schema constraints plus validation and, eventually, fine-tuning. Low-stakes fields are fine with loose prompting.
Volume and cost
At ten documents a week, per-call token cost is irrelevant and engineering time dominates — use the simplest thing that works. At a million documents a month, the prompt length you accepted casually now drives a five-figure bill, and a fine-tuned model with a short prompt pays for itself.
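The arithmetic behind that claim can be sketched in a few lines. The token counts and the price per million input tokens used here are hypothetical placeholders, not quotes from any provider, and output tokens are ignored for simplicity.

```python
def monthly_input_cost(
    docs_per_month: int,
    prompt_tokens: int,
    doc_tokens: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Estimate the monthly input-token spend for one extraction call per document."""
    total_tokens = docs_per_month * (prompt_tokens + doc_tokens)
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


# Hypothetical figures: 1,000-token documents at $3 per million input tokens.
long_prompt = monthly_input_cost(1_000_000, 3_000, 1_000, 3.0)   # few-shot prompt
short_prompt = monthly_input_cost(1_000_000, 300, 1_000, 3.0)    # short schema prompt
low_volume = monthly_input_cost(43, 3_000, 1_000, 3.0)           # about ten a week
```

Under these assumed numbers, the long prompt costs about $12,000 a month against about $3,900 for the short one, while the low-volume case costs well under a dollar either way.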
Input variability
Inputs that arrive in two clean formats are easy to schema-constrain. Inputs that arrive in two hundred messy formats reward few-shot prompting and, past a threshold, fine-tuning on representative samples. Measure your format spread before committing.
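A rough way to measure format spread is to count how many distinct formats it takes to cover most of your volume. The sketch below assumes each document already carries some format label, such as a vendor or template identifier. How you obtain that label is outside this example, and `formats_needed_for_coverage` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def formats_needed_for_coverage(format_labels: Iterable[str], coverage: float = 0.9) -> int:
    """Return how many of the most common formats cover `coverage` of all documents."""
    counts = Counter(format_labels)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= coverage:
            return n
    return len(counts)
```

A small result suggests a schema and a handful of examples will go a long way. A result in the dozens or hundreds points toward broader few-shot coverage or a labeled training set.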
Maintenance burden
Someone owns this after launch. Prompts are editable by anyone who can read. Fine-tuned models require a retraining loop, a labeled dataset, and a person who understands it. Count that cost honestly — it is the one teams forget.
Latency and throughput
The approaches also differ sharply in speed. Few-shot prompts with long examples and multi-pass verification add latency per document, which is irrelevant for an overnight batch but disqualifying for an interactive flow where a user waits on the result. A short schema-constrained prompt against a fast model wins when responsiveness matters. Weigh this axis only when timing is a real constraint, but weigh it honestly when it is: an approach that is accurate but too slow for its use case is the wrong choice regardless of its numbers.
A Decision Rule You Can Reuse
Run the candidate task through these questions in order and stop at the first clear answer.
- Is this a one-off or low-volume task? Use schema-constrained prompting. Skip examples and fine-tuning entirely. The engineering time saved dwarfs any marginal accuracy.
- Is volume high but accuracy tolerant? Schema-constrained prompting with a short prompt. Optimize token cost, accept minor errors, add light validation.
- Is accuracy critical and inputs varied? Schema constraints plus few-shot examples chosen to cover your hard cases. Measure field-level accuracy before scaling.
- Is volume high, accuracy critical, and the input distribution stable enough to label? Fine-tune. The per-call savings and accuracy ceiling justify the labeling investment only here.
The honest default for most teams is the third option — schema plus a few examples — because it captures most of the accuracy gain without the maintenance tail of a trained model. Reach for fine-tuning when you have evidence, not a hunch, that prompting has plateaued below your target.
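The ordered questions above can be written down as a small function, which makes the "stop at the first clear answer" behavior explicit. The boolean inputs and the `choose_approach` name are assumptions for this sketch. The final fallback to schema plus examples is one reading of the "honest default", used for cases the four questions do not cover.

```python
def choose_approach(
    low_volume: bool,
    accuracy_critical: bool,
    inputs_varied: bool,
    distribution_stable: bool,
) -> str:
    """Walk the four questions in order and return the first clear answer."""
    if low_volume:
        return "schema-constrained prompting"
    # From here on, volume is high.
    if not accuracy_critical:
        return "schema-constrained prompting with a short prompt and light validation"
    if inputs_varied:
        return "schema constraints plus few-shot examples"
    if distribution_stable:
        return "fine-tuning"
    # Not covered by the four questions: fall back to the stated default.
    return "schema constraints plus few-shot examples"
```

Because the checks run in order, a high-volume, accuracy-critical task with varied inputs lands on schema plus examples before fine-tuning is ever considered.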
Where Teams Get the Trade-off Wrong
Two failure patterns recur. The first is premature fine-tuning: a team labels thousands of documents and trains a model before they have proven that good schema-constrained prompting falls short. They spend weeks recovering accuracy they could have had in an afternoon. The second is permanent prototyping: a loose prompt that worked in a demo gets shipped, and six months later half the engineering team is writing parsers to clean up output that a schema would have made clean from the start.
The way out of both is to treat the choice as reversible and evidence-driven. Start with the cheapest approach that could plausibly meet your accuracy bar, measure field-level performance on real inputs, and only escalate when the numbers say you must. If you are still proving the basic loop works, begin with "Your Fastest Credible Path to a First Extraction Result". Pair the decision with "How to Measure Prompting for Data Extraction: Metrics That Matter" so the escalation rests on data, and keep "The Complete Guide to Prompting for Data Extraction" close for the underlying techniques.
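Field-level performance can be measured with very little machinery. The sketch below scores each field by exact match against hand-labeled answers. Exact match is a simplifying assumption (real evaluations often normalize values first), and `field_accuracy` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List


def field_accuracy(predictions: List[dict], gold: List[dict]) -> Dict[str, float]:
    """Return the exact-match accuracy of each field across all labeled documents."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, truth in zip(predictions, gold):
        for field, expected in truth.items():
            total[field] += 1
            # A missing field counts as wrong.
            if pred.get(field) == expected:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}
```

Reporting one number per field, rather than one per document, shows which fields are dragging the result down and where examples or validation would help most.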
Frequently Asked Questions
Should I always use a strict schema?
For anything that feeds downstream code, yes — schema constraints turn parsing from a guessing game into a validation step. The exception is open exploration, where you do not yet know what fields exist and a loose prompt helps you discover the shape of the data.
When is fine-tuning actually worth it?
Only when three things are true at once: high volume, demanding accuracy, and a stable enough input distribution that you can build a labeled dataset that stays representative. If any one is missing, prompting almost always wins on total cost.
Can I mix approaches?
Yes, and you often should. A common pattern is schema constraints for output shape plus a few examples for the genuinely ambiguous fields, leaving the easy fields to the schema alone. Mixing lets you spend complexity only where it earns its keep.
How do I know when prompting has plateaued?
Track field-level accuracy as you add examples and refine instructions. When two or three rounds of prompt improvement stop moving the number, you have likely hit the ceiling of what prompting can give you on that input distribution — that is your signal to consider fine-tuning.
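That check can be made mechanical. The sketch below assumes you record one accuracy figure per round of prompt revision. The `rounds` and `min_gain` thresholds are illustrative defaults, not recommended values, and `has_plateaued` is a hypothetical helper.

```python
from typing import Sequence


def has_plateaued(accuracy_by_round: Sequence[float], rounds: int = 3, min_gain: float = 0.005) -> bool:
    """True if the best of the last `rounds` results beats the prior result by less than `min_gain`."""
    if len(accuracy_by_round) < rounds + 1:
        return False  # not enough history to judge
    baseline = accuracy_by_round[-(rounds + 1)]
    best_recent = max(accuracy_by_round[-rounds:])
    return best_recent - baseline < min_gain
```

For a history such as `[0.80, 0.88, 0.91, 0.912, 0.911, 0.913]`, the last three rounds add less than half a point over 0.91, so the function reports a plateau.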
Key Takeaways
- There is no single best extraction approach; the right one depends on accuracy tolerance, volume, input variability, maintenance burden, and, where timing matters, latency.
- The four real options are loose prompting, schema constraints, few-shot examples, and fine-tuning — each with a distinct cost profile.
- Schema constraints plus a few targeted examples is the honest default for most production work.
- Fine-tune only with evidence that prompting has plateaued and the input distribution is stable enough to label.
- Treat the choice as reversible: start cheap, measure field-level accuracy on real inputs, and escalate only when the numbers demand it.