
When a Reporting Pipeline Kept Quoting the Wrong Totals

Agency Script Editorial

Editorial Team

May 24, 2020 · 9 min read

A small analytics team had built something that looked like a win. They used a language model to turn raw campaign data into client-ready summaries — a few paragraphs of narrative with the key figures woven in. It saved hours each week, the prose read well, and clients liked the clarity. Then a client noticed that a spend total in one report did not match their own records. It was off by a few hundred dollars. Small, but it was a number the team had presented as fact.

This is the story of how they traced the problem, what they got wrong in their first attempts to fix it, and the workflow they eventually landed on. It is a useful arc because the failure was not dramatic — no obviously broken output, just a quietly wrong number inside otherwise excellent work. That is the most dangerous kind, and the most common.

The names and exact figures are illustrative, but the shape of the problem and the sequence of fixes reflect how numerical reliability actually gets built into an AI workflow. The lessons generalize well beyond reporting.

The Situation: Wrong Numbers Hiding in Good Prose

The team's setup was straightforward and, in retrospect, exactly the kind of design that produces silent errors.

How the Workflow Was Built

Raw data went into a single prompt that asked the model to write a summary including totals, percentages, and period-over-period changes. The model computed everything inline while writing the narrative. The output was fluent and looked authoritative, which was precisely the problem.

Why the Error Went Unnoticed

Because the wrong total sat inside well-written prose, nothing flagged it. The team had been reading the reports for tone and clarity, not auditing each figure. The polish of the writing had been quietly standing in for correctness. The danger of numbers computed inside narrative is covered in Where Numerical Reasoning Prompts Earn Their Keep.

The Decision: Treat It as a Process Problem, Not a Model Problem

Their first instinct was wrong, and recognizing that was the turning point.

The Failed First Fix

They tried prompting harder — adding "double-check all calculations and ensure totals are accurate" to the instruction. Accuracy improved slightly and then errors crept back. Appeals to accuracy did not change the underlying behavior, only nudged it. This matched the lesson in 7 Mistakes That Wreck Numerical Reasoning Prompts.

The Reframe

The breakthrough was deciding the model should not be computing the numbers at all. The arithmetic belonged somewhere deterministic; the model's job was to interpret and narrate. Once they separated those two responsibilities, the path forward was clear.

The Execution: Separating Calculation From Narration

They rebuilt the workflow in stages rather than as one prompt.

Stage One: Compute the Numbers

They had the model identify which figures the report needed and express each one as a small calculation in code. That code was then executed to produce exact values. No total was ever computed in prose again. This single change eliminated the arithmetic errors that had started the whole problem.
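A minimal sketch of what that deterministic calculation step might look like. The row layout, field names, and sample values are assumptions made for illustration, not the team's actual data. The point is only that totals and period changes come from executed code, using exact decimal arithmetic for currency.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical campaign rows; field names and amounts are illustrative.
rows = [
    {"campaign": "Search", "spend": "1240.50", "prior_spend": "1100.00"},
    {"campaign": "Social", "spend": "830.25", "prior_spend": "910.75"},
    {"campaign": "Display", "spend": "412.10", "prior_spend": "400.00"},
]


def compute_figures(rows):
    """Compute report figures exactly, outside any prose generation."""
    total = sum((Decimal(r["spend"]) for r in rows), Decimal("0"))
    prior = sum((Decimal(r["prior_spend"]) for r in rows), Decimal("0"))
    if prior == 0:
        # A percentage change is undefined without a prior-period baseline.
        raise ValueError("prior-period total is zero")
    change_pct = ((total - prior) / prior * 100).quantize(
        Decimal("0.1"), rounding=ROUND_HALF_UP
    )
    return {
        "total_spend": total,
        "prior_total_spend": prior,
        "change_pct": change_pct,
    }


figures = compute_figures(rows)
# total_spend is Decimal("2482.85"); change_pct is Decimal("3.0")
```

Parsing amounts as `Decimal` from strings avoids the binary floating-point rounding that can shift a currency total by a cent.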

Stage Two: Verify Before Narrating

Each computed figure passed a quick sanity check — within expected bounds, consistent with the prior period — before anything was written. Only verified numbers were handed to the narration step. The structured separation drew directly on The FRAME Method for Numerical Reasoning Prompts.
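The sanity check described above can be sketched as a small gate function. The figure names, the bounds, and the idea of expressing "consistent with the prior period" as a limit on the percentage change are all assumptions for illustration. A real pipeline would set its own thresholds.

```python
from decimal import Decimal


def verify_figures(figures, bounds):
    """Return a list of problems; an empty list means the figures may be narrated."""
    problems = []
    for name, (low, high) in bounds.items():
        value = figures.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not (low <= value <= high):
            problems.append(f"{name}: {value} outside [{low}, {high}]")
    return problems


# Hypothetical computed figures and expected ranges.
figures = {"total_spend": Decimal("2482.85"), "change_pct": Decimal("3.0")}
bounds = {
    "total_spend": (Decimal("0"), Decimal("10000")),
    # A swing beyond +/-50% versus the prior period is treated as suspicious.
    "change_pct": (Decimal("-50"), Decimal("50")),
}

problems = verify_figures(figures, bounds)
if problems:
    raise ValueError("figures failed verification: " + "; ".join(problems))
```

Returning a list of problems rather than a single boolean makes a failed report easier to diagnose, since every out-of-range figure is named at once.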

Stage Three: Narrate From Confirmed Figures

Finally, the model wrote the summary using the already-verified numbers as fixed inputs it was not allowed to recalculate. The prose stayed as good as before, but now it described numbers that had been computed and checked outside the narrative.
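One way to hand the narration step its figures as fixed inputs is to format them before the prompt is built, then state plainly that nothing may be recalculated. The sketch below only assembles the prompt text. The wording and the figure labels are assumptions, and the call to the model itself is deliberately left out.

```python
def build_narration_prompt(verified_figures):
    """Build a narration prompt from pre-formatted, already-verified figures."""
    lines = [f"- {label}: {value}" for label, value in verified_figures.items()]
    return (
        "Write a short client summary of campaign performance.\n"
        "Use ONLY the figures listed below, exactly as written.\n"
        "Do not calculate, round, or derive any new numbers.\n"
        "Verified figures:\n" + "\n".join(lines)
    )


# Hypothetical figures, formatted as display strings before narration.
verified_figures = {
    "Total spend": "$2,482.85",
    "Change vs. prior period": "+3.0%",
}

prompt = build_narration_prompt(verified_figures)
print(prompt)
```

Passing display-ready strings, rather than raw numbers, means the model never has to format or round a value itself.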

The Outcome: A Dependable Report

The rebuilt workflow held up where the original had quietly failed.

What Changed Measurably

Numerical discrepancies in client reports fell from an occasional embarrassment to effectively zero over the following months. The figures matched client records because they were computed deterministically and verified before use. Crucially, the time savings survived: the pipeline was a few steps longer but still far faster than manual reporting.

What It Cost

The rebuild took a couple of days and added modest complexity: more stages, a bit of code, a verification gate. The team judged that a worthwhile trade for figures they could actually stand behind. The economics echoed the practices in Field Practices That Make Model Math Dependable.

The Lessons: What Generalizes

The specifics were about reporting, but the takeaways apply to any numerical AI workflow.

Polish Is Not Correctness

The root cause was mistaking fluent output for accurate output. Any workflow where a model computes numbers inside free-form text is vulnerable to the same silent failure, regardless of domain.

Separate Responsibilities Deliberately

The fix was not a cleverer prompt but a cleaner division of labor: deterministic computation, explicit verification, then narration from fixed inputs. That separation is the durable lesson, and it transfers to anything where a model handles numbers as part of a larger task.

The Broader Rollout: From One Fix to a Standard

The reporting fix worked so well that the team treated it as a template for the rest of their AI-assisted work, and that generalization is where the real payoff came.

Auditing Other Workflows for the Same Pattern

Once they understood the failure, they went looking for it elsewhere. Any workflow where the model computed a number inside free-form output was a suspect. They found several — a proposal generator that calculated project totals in prose, an email drafter that quoted percentages, a dashboard summarizer that derived period changes inline. Each had the same latent risk, and each got the same treatment: pull the arithmetic out, verify it, narrate from fixed inputs.

  • They inventoried every place numbers met narrative. Listing the suspects made the scope of the risk visible rather than letting it stay hidden.
  • They prioritized by stakes. Workflows producing client-facing or contractual numbers got rebuilt first; internal-only ones could wait.
  • They standardized the pattern. The same three-stage structure became the default for any numerical AI work, so new workflows started correct instead of being fixed later.
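The inventory and prioritization steps above can be sketched as a short script. The workflow names come from the examples in this section, but the audience assigned to each one is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    computes_numbers_in_prose: bool
    audience: str  # "client", "contractual", or "internal"


# Client-facing and contractual numbers are rebuilt first; internal ones wait.
PRIORITY = {"client": 0, "contractual": 0, "internal": 1}


def rebuild_queue(workflows):
    """List the suspect workflows, highest stakes first."""
    suspects = [w for w in workflows if w.computes_numbers_in_prose]
    return sorted(suspects, key=lambda w: PRIORITY[w.audience])


# Hypothetical inventory; audience labels are assumed.
inventory = [
    Workflow("dashboard summarizer", True, "internal"),
    Workflow("proposal generator", True, "contractual"),
    Workflow("email drafter", True, "client"),
    Workflow("meeting notes tidier", False, "internal"),
]

for workflow in rebuild_queue(inventory):
    print(workflow.name)
```

Because Python's sort is stable, workflows with the same priority keep their inventory order.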

Making the Practice Stick

A fix only lasts if the team keeps applying it after the original pain fades. They captured the pattern as a reusable structure — the kind of encoding described in The FRAME Method for Numerical Reasoning Prompts — and made it part of how new AI features were reviewed. New work that computed numbers in prose was flagged before it shipped, so the silent-error class stopped recurring rather than being rediscovered each time.

Frequently Asked Questions

Why did adding "double-check the math" not fix the problem?

Because the model's difficulty with arithmetic is structural, not a matter of effort it could supply when prompted. The instruction nudged behavior slightly but did not change the mechanism producing the errors, so they returned. The real fix was removing arithmetic from the model's job entirely and computing it deterministically instead.

Could the team have caught the error earlier?

Yes, with a verification step on every figure before reports went out. The original workflow had no such step — the team reviewed for tone, not numerical accuracy, and the polished prose masked the wrong number. A designed verification gate, rather than ad hoc reading, is what would have caught it.
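A designed gate of this kind could be as simple as scanning the finished report for any number that does not match a verified figure. The sketch below is one hedged way to do that. The regular expression and function name are illustrative, and a real check would need an allowlist for legitimate numbers such as dates or quarter labels.

```python
import re
from decimal import Decimal

# Matches figures such as 2,482.85 or 3.0 (digits, optional commas and decimals).
NUMBER_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?")


def unverified_numbers(narrative, verified_values):
    """Return number tokens in the narrative that match no verified value."""
    allowed = {Decimal(str(v)) for v in verified_values}
    flagged = []
    for token in NUMBER_PATTERN.findall(narrative):
        value = Decimal(token.replace(",", ""))
        if value not in allowed:
            flagged.append(token)
    return flagged


verified = [Decimal("2482.85"), Decimal("3.0")]
good = "Total spend was $2,482.85, up 3.0% from the prior period."
bad = "Total spend was $2,842.85, up 3.0% from the prior period."

print(unverified_numbers(good, verified))  # []
print(unverified_numbers(bad, verified))   # ['2,842.85']
```

Comparing values as `Decimal` rather than as strings means "3" and "3.0" are treated as the same figure.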

Did the rebuild slow the workflow down?

It added a few stages and a small amount of code, so each report took marginally longer to generate. But the pipeline remained dramatically faster than manual reporting, and the added reliability was worth the modest overhead. The time savings that justified the original automation survived the rebuild intact.

Does this apply outside of reporting?

The pattern applies to any workflow where a model produces numbers as part of a larger output — quotes, projections, analyses, summaries with embedded figures. Wherever arithmetic happens inside free-form generation, the same silent-error risk exists, and the same fix of separating computation, verification, and narration addresses it.

What was the single most important change?

Moving the arithmetic out of the model and into deterministic code. That one change eliminated the class of errors that had started the whole problem. The verification gate and the narration-from-fixed-inputs steps reinforced it, but computing the numbers deterministically was the move that mattered most.

Key Takeaways

  • A fluent AI report quietly carried wrong totals because numbers were computed inside the narrative, where polish masked errors.
  • Prompting the model to be more accurate produced only temporary improvement because the arithmetic weakness is structural.
  • The fix was a clean division of labor: compute numbers in deterministic code, verify each figure, then narrate from fixed inputs.
  • Discrepancies fell to near zero while the time savings that justified the automation survived the rebuild.
  • The durable lesson is that polish is not correctness, and any workflow computing numbers inside free-form text shares this risk.
