This is the story of a delivery team that shipped an AI feature, watched it work most of the time, and slowly realized that "most of the time" was costing them more than they understood. The arc runs from the first confusing failure through the practice they built to the measurable change that followed. Names and specifics are generalized so the narrative stays useful rather than reporting on any single client.
The value of a case study is in the decisions, not the outcome. A robustness rate going up is satisfying, but the interesting part is what the team chose to do at each fork, and why. We will move through the situation, the decision, the execution, and the result, then pull out the transferable lessons.
The method this team eventually adopted mirrors the sequence in Build a Repeatable Robustness Test in One Afternoon. Here we watch it land in a real workflow.
The Situation: Intermittent Failures Nobody Could Reproduce
The team had built an extraction pipeline that pulled structured data from client documents and fed it into a downstream system. In development it was solid. In production, roughly one document in twenty came back malformed.
The Symptom Was Maddening
The failures were not consistent. The same document that failed at 9 a.m. sometimes succeeded on a retry. Engineers could not reproduce the issue on demand, which made it easy to dismiss as flaky infrastructure rather than a prompt problem.
The Cost Was Hidden
Each malformed extraction triggered a manual cleanup by an account manager. Individually small, the cleanups added up to several hours a week and a quiet erosion of trust in the feature. No one had connected those hours to the prompt.
The Decision: Treat It as Sensitivity, Not Bad Luck
The turning point came when a developer noticed that retries succeeded more often when the document was short. That hint reframed the problem from random failure to prompt sensitivity.
Reframing Changed the Investigation
Once the team hypothesized that input characteristics drove the failures, the path forward was clear: build a benchmark of real documents, including the ones that had failed, and test systematically. They decided to invest a few days in a proper robustness test rather than continue patching symptoms.
Buy-In Came From the Cost Math
The account manager's cleanup hours, finally tallied, justified the investment easily. Framing robustness work in terms of recovered hours, not technical elegance, secured the time to do it right.
The Execution: Building the Test
The team assembled a benchmark of sixty real documents spanning short, long, clean, and messy cases, deliberately including every document that had failed in production.
Defining Correctness First
Before running anything, they wrote an explicit success criterion: valid JSON, all required fields present, no fields invented from absent data. Because the criterion was machine-checkable, scoring sixty documents across several prompt variations was fast.
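A machine-checkable criterion of this kind might look like the sketch below. The field names are hypothetical, since the team's real schema is not given here, and the "no invented fields" check is only a crude proxy that looks for each string value verbatim in the source document.

```python
import json

# Hypothetical required fields; the real schema is not stated in this case study.
REQUIRED_FIELDS = {"invoice_number", "client_name", "total"}


def check_extraction(raw_output: str, document_text: str) -> list[str]:
    """Return a list of criterion violations; an empty list means success."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        problems.append(f"missing required fields: {sorted(missing)}")

    # Crude proxy for "invented" data: every non-empty string value should
    # appear verbatim in the document. A real check would likely need to
    # normalize dates, currency formats, and whitespace first.
    for field, value in data.items():
        if isinstance(value, str) and value and value not in document_text:
            problems.append(f"value for {field!r} not found in document")
    return problems
```

Returning a list of violations rather than a single boolean keeps the scoring fast while still recording which part of the criterion each document failed.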
Varying One Dimension at a Time
They generated variations that isolated single changes (instruction position, format anchoring, delimiter style) so each failure pointed at a specific cause. This discipline, central to Opinions Earned the Hard Way on Prompt Robustness, kept the diagnosis clean.
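One way to enforce the one-change-at-a-time rule is to generate every variant from a fixed baseline. The sketch below is illustrative only: the three dimension names come from the text, but the option values and the `PromptConfig` type are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PromptConfig:
    # Option values are hypothetical stand-ins for the team's real settings.
    instruction_position: str = "before"  # "before" or "after" the document
    format_anchor: str = "none"           # "none" or "schema_and_example"
    delimiter_style: str = "none"         # "none" or "labeled"


DIMENSIONS = {
    "instruction_position": ["before", "after"],
    "format_anchor": ["none", "schema_and_example"],
    "delimiter_style": ["none", "labeled"],
}


def single_dimension_variants(baseline: PromptConfig) -> list[tuple[str, PromptConfig]]:
    """Return (label, config) pairs that each differ from baseline in one field."""
    variants = []
    for name, options in DIMENSIONS.items():
        for option in options:
            if option != getattr(baseline, name):
                variants.append((f"{name}={option}", replace(baseline, **{name: option})))
    return variants
```

Because every variant differs from the baseline in exactly one field, a change in the pass rate for a variant can be attributed to that field alone.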
The Findings
Two fragilities emerged. First, on long documents the extraction instruction lost influence over the output, mirroring the classifier pattern in Six Real Scenarios Where a Tiny Edit Broke the Output. Second, at production temperature the model occasionally wrapped its JSON in explanatory prose.
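The second fragility is easy to tell apart from other malformed output with a small classifier. The following is a minimal sketch, not the team's tooling; the category labels are made up for illustration.

```python
import json


def classify_output(raw_output: str) -> str:
    """Label an output as clean JSON, JSON wrapped in prose, or malformed."""
    text = raw_output.strip()
    try:
        json.loads(text)
        return "clean_json"
    except json.JSONDecodeError:
        pass

    # Look for a parseable object between the first "{" and the last "}".
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        try:
            json.loads(text[start:end + 1])
            return "json_wrapped_in_prose"
        except json.JSONDecodeError:
            pass
    return "malformed"
```

Counting these labels per prompt variation separates format drift from outputs that are broken in some other way.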
The Result: A Measurable Stabilization
The fixes followed directly from the findings. The team moved the extraction instruction to follow the document text, anchored the output with an explicit JSON schema and example, and added labeled delimiters around the document section.
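The three fixes can be combined in a single prompt builder, sketched below. The schema, the example, the tag-style delimiters, and the instruction wording are all assumptions made for illustration; the source does not show the team's actual prompt.

```python
import json

# Hypothetical schema and example output; not the team's real ones.
SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "client_name": {"type": "string"},
        "total": {"type": ["number", "null"]},
    },
    "required": ["invoice_number", "client_name", "total"],
}
EXAMPLE = {"invoice_number": "INV-001", "client_name": "Example Co", "total": 100.0}


def build_prompt(document_text: str) -> str:
    """Assemble a prompt with labeled delimiters, instruction last, and a format anchor."""
    return "\n".join([
        # Labeled delimiters mark where the document starts and ends.
        "<document>",
        document_text.strip(),
        "</document>",
        "",
        # The instruction follows the document text instead of preceding it.
        "Extract the fields described by this JSON schema from the document above.",
        json.dumps(SCHEMA, indent=2),
        "",
        # The explicit schema and example anchor the output format.
        "Example of the expected output shape:",
        json.dumps(EXAMPLE),
        "",
        "Respond with a single JSON object and nothing else. "
        "Use null for any field the document does not contain.",
    ])
```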
The Numbers Moved
On the sixty-document benchmark, the malformed rate fell from roughly one in twenty to under one in two hundred. In production over the following weeks, the account manager's cleanup time dropped to near zero. The recovered hours paid back the testing investment within the first month.
The Practice Outlasted the Fix
More durably, the team kept the benchmark. When the model provider pushed an update months later, a scheduled re-run flagged a small regression before any client saw it. The test had become an instrument, not a one-time gate: the standing discipline described in The Prompt Sensitivity and Robustness Testing Checklist for 2026.
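A scheduled re-run only needs a stored baseline and a comparison. As a rough sketch, assuming results are kept as a mapping from a document id to whether its output met the criterion (the storage format and the `tolerance` parameter are assumptions):

```python
def malformed_rate(results: dict[str, bool]) -> float:
    """Fraction of documents whose output failed the success criterion."""
    if not results:
        raise ValueError("results must not be empty")
    failures = sum(1 for passed in results.values() if not passed)
    return failures / len(results)


def is_regression(baseline: dict[str, bool], current: dict[str, bool],
                  tolerance: float = 0.0) -> bool:
    """Flag the run if the malformed rate rose by more than the tolerance."""
    return malformed_rate(current) > malformed_rate(baseline) + tolerance
```

A nonzero tolerance trades sensitivity for fewer false alarms when outputs vary from run to run at production temperature.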
What Almost Went Wrong Along the Way
The clean arc above hides two near-misses that are themselves instructive, because they show how easily a robustness effort can produce a false result.
The First Benchmark Was Too Clean
The team's initial benchmark draft consisted mostly of well-formed documents, because those were the easiest to collect from recent successful runs. Had they tested only against that set, the robustness rate would have looked reassuringly high and the real fragilities would have stayed hidden. A reviewer caught the omission and insisted the failed documents go in. That single intervention is what made the test honest, and it underscores why a benchmark must deliberately include the ugly cases rather than the convenient ones.
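The reviewer's check can be automated so that a too-clean benchmark is caught before any test runs. This sketch assumes each benchmark entry carries an `id` and a list of `tags` such as "short" or "messy"; both field names are hypothetical.

```python
from collections import Counter


def audit_benchmark(benchmark: list[dict], production_failure_ids: set[str]) -> dict:
    """Report production failures missing from the benchmark and the tag mix."""
    ids = {doc["id"] for doc in benchmark}
    tag_counts = Counter(tag for doc in benchmark for tag in doc["tags"])
    return {
        # Every document that failed in production should be in the benchmark.
        "missing_failures": sorted(production_failure_ids - ids),
        # A skewed tag mix (for example, almost no "messy") suggests a too-clean set.
        "tag_counts": dict(tag_counts),
    }
```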
A Fix That Quietly Broke Something Else
When the team first anchored the output format with a strict schema, the change fixed the prose-wrapping problem but suppressed an optional field on a subset of documents. Because they re-ran the full benchmark rather than only the previously failing cases, the regression surfaced immediately. They relaxed the schema to mark the field optional and re-ran again. Without the full re-run, that regression would have shipped, trading one failure mode for another, the exact trap described in 7 Pitfalls That Quietly Wreck Robustness Testing.
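The difference between a full re-run and a re-run of only the failing cases shows up in a per-document comparison. A minimal sketch, assuming each run is stored as a mapping from a hypothetical document id to a pass/fail flag:

```python
def newly_failing(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Documents that passed before the change but fail after it."""
    # A document missing from the new run is treated as failing.
    return sorted(doc for doc, ok in before.items() if ok and not after.get(doc, False))


def newly_passing(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Documents that failed before the change but pass after it."""
    return sorted(doc for doc, ok in before.items() if not ok and after.get(doc, False))
```

Checking only the documents that failed before a fix can show every one of them passing while `newly_failing` is non-empty; only the full comparison exposes that trade.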
Lessons Worth Transferring
The team's experience generalizes in a few specific ways that apply well beyond extraction pipelines.
Reframe Flakiness as Sensitivity
"Random" failures that correlate with input characteristics are usually prompt sensitivity in disguise. The reframe is what unlocks systematic investigation instead of endless retries.
Tie the Work to Recovered Cost
Robustness work competes for time against feature work. Quantifying the hidden cost (here, manual cleanup hours) is what justifies the investment to people who control the schedule.
Keep the Benchmark Alive
The largest long-term payoff was not the initial fix but the standing test that caught a silent model regression later. The benchmark, built once, kept earning.
Frequently Asked Questions
What was the single most important decision in this case?
Reframing the intermittent failures as prompt sensitivity rather than infrastructure flakiness. That shift redirected the team from fruitless retries toward a systematic benchmark, which is the only thing that surfaced the real causes. Everything downstream (the fixes, the recovered hours, the standing test) flowed from correctly naming the problem.
How did the team justify spending days on testing?
They tallied the hidden cost: the account manager's weekly cleanup hours triggered by malformed extractions. Once that cost was concrete, the few days of testing were trivially justified against the hours being lost every week. Framing robustness work in terms of recovered cost, rather than technical correctness, is what secured the time.
Why include the previously failed documents in the benchmark?
Those documents were proof of specific failure modes the team needed to fix and guard against. Including them ensured the fixes actually addressed real failures, not hypothetical ones, and that those exact failures could not return without a benchmark run flagging them. Production failures are the highest-value benchmark inputs because reality already confirmed they are hard.
Could the fixes have been found without a structured test?
Possibly by luck, but not reliably. The structured test, with one-dimension-at-a-time variations, is what let the team attribute failures to specific causes: instruction position and temperature-driven format drift. Ad hoc patching might have stumbled onto one fix while missing the other, and would not have produced the reusable benchmark that later caught the model regression.
What made the benchmark valuable after the initial fix?
It became a standing instrument that ran on a schedule and after changes. When the model provider pushed a silent update, a scheduled re-run caught a regression before any client experienced it. That ongoing protection, not the one-time stabilization, was the benchmark's largest long-term payoff, and it cost almost nothing to maintain once built.
How transferable is this to non-extraction use cases?
Highly transferable, because the lessons are about method, not domain. Reframing correlated flakiness as sensitivity, tying the work to recovered cost, and keeping the benchmark alive apply to classifiers, summarizers, agents, and more. The specific fragilities differ by task, but the practice of benchmarking, diagnosing, and re-testing carries across all of them.
Key Takeaways
- Intermittent "random" failures that correlate with input characteristics are usually prompt sensitivity, and reframing them that way unlocks systematic investigation.
- Quantifying the hidden cost of failures (here, weekly manual cleanup hours) is what justifies investing time in robustness work.
- A machine-checkable success criterion let the team score sixty documents across variations quickly and attribute failures to specific causes.
- Targeted fixes (instruction repositioning, format anchoring, explicit delimiters) cut the malformed rate roughly tenfold and reduced the cleanup burden to near zero.
- The benchmark's largest payoff was long-term: a scheduled re-run later caught a silent model regression before any client was affected.