Rebuilding a Failing Report Pipeline by Splitting It Up

Frameworks and best practices are useful, but they hide the messy reality of applying them. To make decomposition concrete, here is the story of a single team working through a single hard problem, told as an arc: the situation they faced, the decisions they made, how they executed, what they measured, and what they learned.

The team was a four-person content operations group at a mid-sized agency. Their job was to produce client-facing monthly performance reports, each summarizing campaign data into a readable narrative with recommendations. They had automated the first draft with a single large prompt, and it was failing them in ways that were eroding trust with clients.

We have changed identifying details, but the technical arc is faithful to how the work actually unfolded. The point is not that their solution will be yours, but that their reasoning process is one you can borrow.

The Situation

The single-prompt report generator worked about sixty percent of the time. The other forty percent produced reports that mixed up metrics between campaigns, made recommendations that contradicted the data, or simply ran out of room and cut off mid-section.

Why the single prompt was failing

The prompt was asking the model to do everything at once: read raw campaign data, identify what mattered, write a narrative, and produce recommendations. Each of these is a distinct job, and crowding them into one call meant the model formed conclusions before it had finished reading the data.

The cost of the status quo

An editor had to review every report, and roughly half needed substantial rework. The automation that was supposed to save time was saving very little, and a few errors had reached clients, which was the real motivation to fix it.

The Decision

The team decided to decompose, but they made one disciplined choice first: they kept the single prompt running as a baseline so they could prove the new pipeline was actually better. This baseline discipline came straight from our best practices guide.

Mapping the failures to cuts

They studied the failures and found three distinct problems: data confusion, contradiction between narrative and data, and truncation. Each mapped to a different reasoning phase, which told them where to cut. Data confusion meant extraction needed isolation. Contradiction meant analysis needed to happen before writing. Truncation meant the output was too large for one pass.

Designing the pipeline

They settled on four steps: extract metrics into a structured table, analyze the table to identify findings, draft the narrative from the findings, and review for consistency. They resisted adding more steps, applying the principle of keeping the coarsest decomposition that works.

The Execution

Building the handoffs

The first version passed prose between steps and immediately hit trouble: the draft step missed findings buried in the analysis paragraph. They switched to structured handoffs, passing a JSON object of findings, each with its supporting metric, between analysis and drafting. The data confusion vanished.

Adding a validation checkpoint

The extraction step fed every downstream step, so it was the high-leverage boundary. They added a quick verification that checked the extracted table against the raw data before passing it forward. This single checkpoint caught most of the remaining errors before they could compound, a pattern explored in our common mistakes guide.

Tuning the recombination

The review step was redesigned to actively check that every recommendation traced back to a finding and that no metric was attributed to the wrong campaign. It became a real editing pass rather than a rubber stamp.

The Outcome

What they measured

They tracked the share of reports that passed editorial review without rework, comparing the pipeline against the single-prompt baseline over a full month of reports. The metrics they chose followed our metrics guide.

The result

The pipeline raised first-pass acceptance from roughly sixty percent to the low nineties. Editor time per report dropped by more than half because the remaining issues were minor polish rather than structural rework. No reports with the old confusion-or-contradiction errors reached clients during the measurement period.

The cost side

The pipeline used roughly four times the tokens of the single prompt and added a few seconds of latency per report. For client-facing work where errors damaged trust, that trade was easy to justify, a calculation we unpack in the trade-offs piece.

The Lessons

Failures are a map

The team's best move was reading their failures carefully before designing the pipeline. The failure pattern told them exactly where to cut, which kept them from over-decomposing or cutting in the wrong places.

The baseline kept them honest

Because they kept the single prompt running, they could prove the pipeline earned its added cost. Without that comparison, they would have assumed the pipeline helped without evidence.

Structured handoffs did the heavy lifting

The switch from prose to structured handoffs solved more problems than any individual prompt rewrite. It was the unglamorous plumbing decision that mattered most.

What the Team Would Do Differently

Instrument from the first version

The team added proper measurement only after the rebuild was mostly done, which meant their early comparisons against the baseline relied partly on impressions. In hindsight, they wished they had logged review outcomes and per-step state from the very first pipeline run. The instrumentation that proved the pipeline worked would have been more convincing if it had been in place from the start rather than bolted on.

Design the handoff before the prompts

Their first version sank time into wording each step's prompt before they had settled what the steps would pass to each other. The prose-handoff failure forced a rework that better sequencing would have avoided. The lesson they took was to design the handoff contracts first, then write the prompts to produce and consume those contracts, rather than the reverse.

Resist the urge to add a fifth step

Midway through, someone proposed a separate fact-checking step on top of the four. They held off, folding the check into the existing review step instead, and the pipeline stayed lean. They later agreed this restraint was correct: the four steps already mapped to the three failure types plus recombination, and a fifth would have added cost and a boundary without addressing any observed failure.

Frequently Asked Questions

Why keep the single prompt running after deciding to decompose?

Keeping the baseline let the team prove the pipeline was actually better rather than assuming it. It also gave them a fallback and a continuous comparison point as both the baseline model and the pipeline evolved. Abandoning the baseline would have meant flying blind on whether the added complexity was justified.

How did the team avoid over-decomposing?

They mapped each observed failure to a distinct reasoning phase and added a step only for phases that were genuinely failing. Three failure types meant a small number of steps, not a dozen. They explicitly resisted the urge to subdivide further until a specific remaining failure demanded it.

What was the single most impactful change?

Switching from prose handoffs to structured handoffs between steps. It eliminated the data-confusion errors that had been the most damaging failure, and it made the pipeline debuggable because they could inspect the structured object at each boundary. The wording of individual prompts mattered far less.

Was the four-times token cost worth it?

For this team it clearly was, because the reports were client-facing and errors damaged trust and required expensive editor rework. The token cost was trivial next to the editor time saved and the reputational risk avoided. The same trade might not pay off for low-stakes internal drafts.

How long did the rebuild take?

The core rebuild took a few focused days, most of which went into diagnosing the failures and designing the handoffs rather than writing prompts. The validation checkpoint and recombination tuning came in a second iteration after watching the first version run on real reports.

Could a better single prompt have solved this without decomposition?

The team tried, and a better single prompt improved things at the margin but never eliminated the truncation or the tendency to conclude before reading all the data. Those were structural limits of doing everything in one pass, which is precisely the situation where decomposition earns its keep.

Key Takeaways

Read your failures before designing a pipeline; the failure pattern tells you where to cut.
Keep the single-prompt baseline running so you can prove the pipeline actually earned its added cost.
Structured handoffs between steps eliminated the most damaging errors and made the pipeline debuggable.
A single validation checkpoint at the high-leverage extraction boundary stopped errors from compounding.
First-pass acceptance rose from about sixty to the low nineties, more than justifying a four-times token cost for client-facing work.

The Situation

Why the single prompt was failing

The cost of the status quo

The Decision

Mapping the failures to cuts

Designing the pipeline

The Execution

Building the handoffs

Adding a validation checkpoint

Tuning the recombination

The Outcome

What they measured

The result

The cost side

The Lessons

Failures are a map

The baseline kept them honest

Because they kept the single prompt running, they could prove the pipeline earned its added cost. Without that comparison, they would have assumed the pipeline helped without evidence.

Structured handoffs did the heavy lifting

The switch from prose to structured handoffs solved more problems than any individual prompt rewrite. It was the unglamorous plumbing decision that mattered most.

What the Team Would Do Differently

Instrument from the first version

Design the handoff before the prompts

Resist the urge to add a fifth step

Frequently Asked Questions

Why keep the single prompt running after deciding to decompose?

How did the team avoid over-decomposing?

What was the single most impactful change?

Was the four-times token cost worth it?

How long did the rebuild take?

Could a better single prompt have solved this without decomposition?

Key Takeaways

Read your failures before designing a pipeline; the failure pattern tells you where to cut.
Keep the single-prompt baseline running so you can prove the pipeline actually earned its added cost.
Structured handoffs between steps eliminated the most damaging errors and made the pipeline debuggable.
A single validation checkpoint at the high-leverage extraction boundary stopped errors from compounding.
First-pass acceptance rose from about sixty to the low nineties, more than justifying a four-times token cost for client-facing work.

Rebuilding a Failing Report Pipeline by Splitting It Up

The Situation

Why the single prompt was failing

The cost of the status quo

The Decision

Mapping the failures to cuts

Designing the pipeline

The Execution

Building the handoffs

Adding a validation checkpoint

Tuning the recombination

The Outcome

What they measured

The result

The cost side

The Lessons

Failures are a map

The baseline kept them honest

Structured handoffs did the heavy lifting

What the Team Would Do Differently

Instrument from the first version

Design the handoff before the prompts

Resist the urge to add a fifth step

Frequently Asked Questions

Why keep the single prompt running after deciding to decompose?

How did the team avoid over-decomposing?

What was the single most impactful change?

Was the four-times token cost worth it?

How long did the rebuild take?

Could a better single prompt have solved this without decomposition?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

Case Study: Large Language Models in Practice

Thirty-Second Wins Breed False Confidence With LLMs

Ready to certify your AI capability?

Rebuilding a Failing Report Pipeline by Splitting It Up

The Situation

Why the single prompt was failing

The cost of the status quo

The Decision

Mapping the failures to cuts

Designing the pipeline

The Execution

Building the handoffs

Adding a validation checkpoint

Tuning the recombination

The Outcome

What they measured

The result

The cost side

The Lessons

Failures are a map

The baseline kept them honest

Structured handoffs did the heavy lifting

What the Team Would Do Differently

Instrument from the first version

Design the handoff before the prompts

Resist the urge to add a fifth step

Frequently Asked Questions

Why keep the single prompt running after deciding to decompose?

How did the team avoid over-decomposing?

What was the single most impactful change?

Was the four-times token cost worth it?

How long did the rebuild take?

Could a better single prompt have solved this without decomposition?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

Case Study: Large Language Models in Practice

Thirty-Second Wins Breed False Confidence With LLMs

Ready to certify your AI capability?