How One Team Tamed a Flaky Document Extraction Pipeline

This is the story of a single system: an internal tool at a mid-sized operations team that read incoming purchase orders—PDFs and emails in dozens of formats—and turned them into structured records for an order-management system. The names and exact numbers are generalized, but the arc is real and the decisions are the kind any team faces when structured output meets messy reality.

The point of a case study is not to admire a finished system. It is to watch the decisions unfold in sequence, including the wrong turns, so the reasoning transfers. We will move through the situation as the team found it, the choices they made, how they executed, what changed measurably, and what they would tell their past selves.

For the principles underneath these decisions, the best practices guide is the reference. Here the focus is the journey.

The Situation

The team had shipped a first version quickly. It prompted a model with "extract the order details as JSON" and parsed whatever came back. In the demo it looked like magic. In production it was a steady source of pain.

The Symptoms

Roughly one order in twenty failed to process, usually because the model wrapped its JSON in an explanation or omitted a field the downstream system required.
Failures were silent. A malformed response threw an error that got logged and forgotten, so orders simply vanished until a customer called asking where their shipment was.
Quantities and prices were occasionally transposed or miscalculated when the model tried to compute order totals.

The first version had none of the four layers a robust pipeline needs—no schema enforcement, no validation, no retry path, no measurement. It worked until it did not.
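The first two symptoms are easy to reproduce. The following is a minimal sketch, assuming the first version handed the raw reply straight to a JSON parser and only logged errors. The function name `parse_order_naive` and both sample replies are hypothetical, not taken from the team's code.

```python
import json
import logging

logger = logging.getLogger("po_extract_v1")  # hypothetical logger name


def parse_order_naive(raw_reply: str):
    """Hypothetical reconstruction of a parse-whatever-comes-back step."""
    try:
        return json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        # The error is logged and nothing else happens: the order is dropped.
        logger.error("could not parse model reply: %s", exc)
        return None


# A reply wrapped in an explanatory sentence is not valid JSON.
wrapped_reply = 'Here are the order details:\n{"vendor": "ACME", "items": []}'
clean_reply = '{"vendor": "ACME", "items": []}'

dropped = parse_order_naive(wrapped_reply)   # None, with only a log line as evidence
parsed = parse_order_naive(clean_reply)      # a normal dict
```

Nothing downstream learns that `dropped` is `None` for a reason, which is how an order can vanish without anyone noticing.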

The Decision

The team's lead made a call that shaped everything: treat the model's output as untrusted input and rebuild the pipeline around that assumption, rather than patching the prompt and hoping. This reframing turned a series of one-off fixes into a coherent design.

They committed to four changes, drawn straight from the step-by-step approach: a real schema, the strongest enforcement their provider offered, full validation, and a retry loop with logging.

The Execution

Defining the Schema

They wrote the order schema as a typed object in code—line items as an array, each with description, quantity, and unit price, plus order-level fields for vendor and date. Crucially, they removed the total fields from the model's responsibility. Totals would be computed in code from the extracted quantities and prices.
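A sketch of what such a typed schema might look like, with the total derived in code rather than extracted. The class and field names, the integer quantity, and the use of `Decimal` for money are assumptions for illustration; the article does not show the team's actual types.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:
    description: str
    quantity: int          # assumption: whole-unit quantities
    unit_price: Decimal    # Decimal avoids binary floating-point rounding on money


@dataclass(frozen=True)
class Order:
    vendor: str
    order_date: str                      # ISO 8601 date string, e.g. "2024-03-01"
    line_items: tuple[LineItem, ...]
    # Deliberately no "total" field: the model is never asked for one.


def order_total(order: Order) -> Decimal:
    """Compute the order total deterministically from extracted values."""
    return sum(
        (item.quantity * item.unit_price for item in order.line_items),
        Decimal("0"),
    )


example = Order(
    vendor="ACME",
    order_date="2024-03-01",
    line_items=(
        LineItem("Widget", 3, Decimal("19.99")),
        LineItem("Bracket", 2, Decimal("5.50")),
    ),
)
total = order_total(example)  # Decimal("70.97")
```

Because the total is a function of the extracted fields, a wrong total can only come from a wrong quantity or price, and those are the fields validation can check.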

Uncertain fields became optional with explicit "null if absent" descriptions, which stopped the model from inventing vendor codes when an email did not include one.
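One way to express a "null if absent" field is a nullable type plus an explicit description, shown here as a JSON Schema fragment written as a Python dict. The field names and wording are hypothetical, and keeping the key required while allowing null is an assumption; whether a given provider's strict mode expects that shape should be checked against its documentation.

```python
# Hypothetical fragment of an order schema in JSON Schema form.
ORDER_SCHEMA_FRAGMENT = {
    "type": "object",
    "properties": {
        "vendor": {
            "type": "string",
            "description": "Vendor name as written in the document header.",
        },
        "vendor_code": {
            # Nullable: the model has a legitimate way to say "not present".
            "type": ["string", "null"],
            "description": "Vendor code exactly as printed; null if absent. Do not guess.",
        },
    },
    # Assumption: the key is always present, so absence is always explicit.
    "required": ["vendor", "vendor_code"],
    "additionalProperties": False,
}


def nullable_fields(schema: dict) -> list[str]:
    """List the properties that are allowed to be null."""
    return [
        name
        for name, spec in schema["properties"].items()
        if isinstance(spec["type"], list) and "null" in spec["type"]
    ]
```

The design point is that a model given no sanctioned way to report "missing" tends to fill the slot with something plausible, so the schema offers null as that sanctioned way.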

Wiring Enforcement and Validation

They switched on their provider's strict schema mode, which immediately ended the "explanatory sentence before the JSON" failures. Then they added two validation passes: structural validation against the schema, and semantic checks—quantities positive, dates within a sane window, vendor present in their known-vendor list.
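The semantic pass can be sketched as a function that returns a list of field-tagged error messages. This assumes structural validation has already run, so every key exists with the right type. The vendor names, the date window, and the record shape are all hypothetical.

```python
from datetime import date, timedelta

KNOWN_VENDORS = {"ACME Corp", "Globex"}   # hypothetical approved-vendor list
MAX_PAST = timedelta(days=365)            # hypothetical "sane window" bounds
MAX_FUTURE = timedelta(days=30)


def semantic_errors(record: dict, today: date) -> list[str]:
    """Return business-rule violations; an empty list means the record passes."""
    errors: list[str] = []

    for index, item in enumerate(record["line_items"]):
        if item["quantity"] <= 0:
            errors.append(
                f"line_items[{index}].quantity: must be positive, got {item['quantity']}"
            )

    try:
        order_date = date.fromisoformat(record["order_date"])
    except ValueError:
        errors.append(f"order_date: {record['order_date']!r} is not an ISO date")
    else:
        if not (today - MAX_PAST <= order_date <= today + MAX_FUTURE):
            errors.append(f"order_date: {order_date} is outside the accepted window")

    if record["vendor"] not in KNOWN_VENDORS:
        errors.append(f"vendor: '{record['vendor']}' is not in the approved list")

    return errors
```

Prefixing each message with the field name keeps the errors usable in two places: as feedback to the model and as a key for failure logs.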

Building the Retry Loop

The biggest behavioral change was the retry loop. On any validation failure, the system re-called the model with the specific error appended: "Vendor 'ACME' is not in the approved list; re-check the document header." Most failures that survived enforcement resolved on this first informed retry. Exhausted retries routed to a human review queue instead of vanishing.
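The loop described above can be sketched as follows. The model call, validator, and review queue are passed in as plain callables so the sketch runs without any provider SDK; every name here, the three-attempt limit, and the stand-in functions are assumptions for illustration.

```python
from typing import Callable, Optional

Record = dict


def extract_with_retries(
    document_text: str,
    call_model: Callable[[str, Optional[str]], Record],
    validate: Callable[[Record], list[str]],
    send_to_review: Callable[[str, list[str]], None],
    max_attempts: int = 3,
) -> Optional[Record]:
    """Retry with the specific validation errors; escalate when attempts run out."""
    feedback: Optional[str] = None
    errors: list[str] = []
    for _ in range(max_attempts):
        record = call_model(document_text, feedback)
        errors = validate(record)
        if not errors:
            return record
        # Feed the exact errors back so the next attempt is informed, not blind.
        feedback = "Previous attempt failed validation: " + "; ".join(errors)
    # Exhausted retries never vanish: they go to a human.
    send_to_review(document_text, errors)
    return None


# --- Hypothetical stand-ins so the sketch runs without a real model ---
APPROVED = {"ACME Corp"}
review_queue: list[tuple[str, list[str]]] = []


def fake_model(document_text, feedback):
    # Pretends the model corrects the vendor once it sees the error message.
    return {"vendor": "ACME Corp" if feedback else "ACME"}


def stubborn_model(document_text, feedback):
    # Pretends the model never recovers, to exercise the review path.
    return {"vendor": "ACME"}


def check_vendor(record):
    if record["vendor"] in APPROVED:
        return []
    return [f"Vendor '{record['vendor']}' is not in the approved list; re-check the document header."]


def enqueue(document_text, errors):
    review_queue.append((document_text, errors))
```

With these stand-ins, `extract_with_retries("PO text", fake_model, check_vendor, enqueue)` succeeds on the second attempt, while the same call with `stubborn_model` returns `None` and leaves one entry in `review_queue`.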

Adding Measurement

Finally, every failure and retry was logged with the field that failed. This was the unglamorous change that paid off most in day-to-day maintenance, because it turned debugging from guesswork into reading a table.
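"Reading a table" can be as simple as counting logged failures by field. A minimal sketch, assuming failures are stored as small dictionaries in memory; the entry shape and function names are hypothetical, and a real system would more likely write to a database or log store.

```python
from collections import Counter


def log_failure(log: list[dict], order_id: str, field: str, attempt: int, message: str) -> None:
    """Append one structured failure entry, keyed by the field that failed."""
    log.append(
        {"order_id": order_id, "field": field, "attempt": attempt, "message": message}
    )


def failures_by_field(log: list[dict]) -> list[tuple[str, int]]:
    """Summarize the log as (field, count) rows, most frequent first."""
    return Counter(entry["field"] for entry in log).most_common()


failure_log: list[dict] = []
log_failure(failure_log, "PO-1001", "vendor", 1, "not in the approved list")
log_failure(failure_log, "PO-1002", "vendor", 1, "not in the approved list")
log_failure(failure_log, "PO-1003", "order_date", 2, "outside the accepted window")

summary = failures_by_field(failure_log)
```

Adding a vendor or layout key to each entry would allow the same grouping by document source instead of by field.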

The Outcome

After the rebuild, the measured picture changed in concrete ways:

The silent-failure problem disappeared entirely. Nothing vanished; unresolvable cases landed in the human queue where they were visible and handled.
The malformed-JSON failures dropped to near zero once strict enforcement and informed retries were in place.
The arithmetic errors vanished, because the model no longer did arithmetic—code did.
The weekly failure logs revealed that one vendor's documents, with an unusual layout, accounted for a large share of the remaining retries. The team wrote a targeted field description for that layout and removed most of those too.

The system went from a quiet liability to a dependable part of the workflow, and the operations staff stopped fielding "where is my order" calls caused by vanished records.

What It Cost to Get There

The rebuild was not free, and being honest about the cost is part of the lesson. The team spent roughly two weeks on the new pipeline, most of it not on the model call but on the surrounding machinery—the validation rules, the retry loop, the human-review queue, and the logging. The model integration itself was a small fraction of the work.

That ratio surprised them and tends to surprise most teams. The first version had been fast precisely because it was only the model call. The reliable version was slower to build because reliability lives in the layers around the call, not in the call itself. A useful rule of thumb emerged: budget the model integration as the easy 20 percent and the pipeline that makes it trustworthy as the hard 80 percent.

They also had to bring the operations staff into the design of the human-review queue, since those staff would now handle the cases the system could not. That collaboration turned out to be valuable—the staff knew which vendor layouts were unusual and which fields mattered most, knowledge that fed directly back into better field descriptions and validation rules. Reliability work that seemed purely technical ended up depending on domain expertise the engineers did not have.

The Lessons

Asked what they would tell their earlier selves, the team named three things. First, the prompt was never the real problem; the missing pipeline layers were. Second, measurement should have come first—they spent weeks guessing at failures the logs would have shown in a day. Third, taking arithmetic away from the model was the highest-leverage single change, eliminating a whole class of subtle errors for almost no effort.

None of these are surprising in hindsight. They map exactly onto the common mistakes that catch most teams. The value of watching them play out in one system is seeing how a single reframing—output is untrusted input—reorganizes a pile of symptoms into a clear plan.

Frequently Asked Questions

What was the single highest-impact change in this rebuild?

Removing arithmetic from the model's job. The team had the model extract raw quantities and prices and moved all total calculations into deterministic code. This eliminated an entire class of intermittent, hard-to-spot errors—transposed and miscalculated totals—for almost no engineering effort, which is why it ranked above even the schema enforcement change.

Why did silent failures cause more damage than visible ones?

Because a visible failure gets handled, while a silent one corrupts the workflow invisibly. When malformed responses threw errors that were logged and ignored, orders simply disappeared, and the problem only surfaced when a customer called. Routing unresolvable cases to a human queue made every failure visible and therefore actionable.

Did strict schema enforcement remove the need for validation?

No. Enforcement ended the malformed-JSON failures but did nothing for semantic correctness—whether a vendor was on the approved list or a date was sane. The team still needed validation for those business rules. Enforcement and validation solved different problems, and the rebuild needed both.

How did logging change the team's work?

It turned debugging from speculation into reading data. Once every failure was logged with the field that failed, the team could see that a single vendor's unusual layout caused a large share of remaining retries, and fix it precisely. Before logging, they were guessing at causes; after, they were reading them off a table.

Could they have fixed the original system by improving the prompt?

Only marginally. A better prompt would have reduced some failures but left the system without enforcement, validation, retries, or visibility, so the intermittent failures and silent drops would have persisted. The real fix was structural (adding the missing pipeline layers), not a matter of wording. Prompt tuning was a small contributor at best.

Key Takeaways

The original system failed because it lacked all four pipeline layers, not because of a bad prompt.
Reframing model output as untrusted input turned scattered fixes into a coherent rebuild.
Strict enforcement ended malformed-JSON failures; validation handled the separate problem of business rules.
Removing arithmetic from the model eliminated a whole class of subtle errors for minimal effort.
Measurement should come first—logging revealed precise causes the team had spent weeks guessing at.

Agency Script Editorial
