
How One Support Team Cut Wrong Answers in Half

Agency Script Editorial

Editorial Team

August 16, 2022 · 8 min read

The clearest way to understand grounding is to watch a team live through it. This case follows a mid-sized software company's support organization as it moved from an ungrounded assistant that embarrassed them to a grounded one that earned the team's trust. The company and figures are composite, assembled from patterns that recur across these projects, but the arc is faithful to how these efforts actually unfold: a painful trigger, a decision, a messy execution, a measurable result, and a set of lessons that outlast the project.

What makes this case worth studying is not that grounding worked, but how the team discovered where it could go wrong and corrected course. The failures along the way are the instructive part.

We pick up the story at the point where the existing approach broke down.

The Situation: An Assistant Nobody Trusted

The Breaking Point

The support team had deployed a model to draft answers to customer tickets. It was fast and articulate, and it was wrong often enough that agents stopped using it. The model answered from its general training, which meant it confidently described features the product did not have and quoted policies that did not exist. Each fabricated answer that slipped through cost an agent a correction and the customer a moment of doubt.

Why the Old Approach Failed

Nothing in the setup connected the model to the company's actual documentation. It was guessing, fluently, every time. The team realized the problem was not the model's intelligence but its ignorance of their specific reality. The product changed constantly, the policies were updated by a different department, and the model had no window into any of it. Asking it to draft accurate answers was like asking a brilliant stranger to speak for the company without ever showing them the company's handbook.

The Cost Nobody Had Measured

What made the situation urgent was that the cost had been invisible. Every fabricated answer that an agent caught and rewrote burned a few minutes, and the ones that slipped through cost far more in customer confidence and follow-up tickets. When the support lead finally tallied the rework, the assistant was creating nearly as much work as it saved. That number, more than any complaint, justified the project.

The Decision: Ground It in the Real Docs

Choosing Grounding Over Fine-Tuning

The team weighed fine-tuning the model on their documentation against grounding it with retrieved context. Fine-tuning was expensive and would need redoing every time the docs changed, which was weekly. Grounding kept the model fixed and supplied facts at query time, so updates meant simply re-indexing. They chose grounding. The reasoning tracks the comparison in Grounding Prompts with Retrieved Context: Trade-offs, Options, and How to Decide.
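To make "re-indexing" concrete, here is a minimal Python sketch of one way to find which documents need it after an update. It assumes each document has a stable ID and that the index keeps a content hash per document. The function names and the hashing scheme are illustrative assumptions, not details of the team's system.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint a document so changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def docs_to_reindex(current_docs: dict, indexed_hashes: dict):
    """Compare the live docs against the hashes stored at last indexing.

    Returns (changed_or_new_ids, removed_ids), both sorted.
    """
    changed = [
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    removed = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return sorted(changed), sorted(removed)


# Hypothetical example data: one edited doc, one new doc, one deleted doc.
indexed = {
    "refunds": content_hash("Refunds within 14 days."),
    "sso": content_hash("SSO setup."),
}
current = {
    "refunds": "Refunds within 30 days.",
    "pricing": "Plans and pricing.",
}
changed, removed = docs_to_reindex(current, indexed)
```

Only the changed and new documents get re-chunked and re-embedded, and the removed ones are dropped from the index. The model itself is never touched, which is the point of the comparison with fine-tuning.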

Setting a Measurable Goal

Rather than chase a vague sense of improvement, the team defined success concretely: cut the rate of factually wrong answers, as judged by agents, by at least half, without slowing response time enough to matter.
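A goal like this can be written down as a check that runs after every change. The sketch below is one possible encoding in Python. The EvalRun structure, the one-second slowdown allowance, and all of the numbers are assumptions made up for illustration, since the case does not state exact figures.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    total: int             # tickets in the test set
    wrong: int             # answers agents judged factually wrong
    median_seconds: float  # median response time

    @property
    def wrong_rate(self) -> float:
        return self.wrong / self.total


def meets_goal(baseline: EvalRun, candidate: EvalRun,
               max_slowdown_seconds: float = 1.0) -> bool:
    """True if wrong answers are at least halved without a meaningful slowdown."""
    halved = candidate.wrong_rate <= baseline.wrong_rate / 2
    slowdown = candidate.median_seconds - baseline.median_seconds
    return halved and slowdown <= max_slowdown_seconds


# Hypothetical numbers, for illustration only.
baseline = EvalRun(total=200, wrong=60, median_seconds=2.0)
candidate = EvalRun(total=200, wrong=24, median_seconds=2.4)
goal_met = meets_goal(baseline, candidate)
```

Setting the threshold before the first build keeps the team from redefining success after seeing the results.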

The Execution: Three Rounds of Correction

Round One, Retrieval Was Returning Noise

The first build chunked documents by fixed length and retrieved the top ten passages. Answers barely improved. When the team finally inspected the retrieved chunks, they found that the right answer was often not among them and that the ten passages were mostly irrelevant filler. Retrieval, not the model, was the bottleneck. This is the first failure named in 7 Common Mistakes with Grounding Prompts with Retrieved Context.
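Inspecting retrieval can be as simple as asking, for each test question, whether the section that holds the answer appears among the top k results. The Python sketch below measures that with a toy keyword-overlap retriever. The documents, questions, and scoring function are all hypothetical stand-ins for whatever search the real system uses.

```python
import re

# Hypothetical help-center sections, keyed by section ID.
DOCS = {
    "account/password": "Reset your password from the sign-in page.",
    "billing/invoices": "Invoices are emailed on the first day of each month.",
    "billing/refunds": "Refunds are issued within 14 days of purchase.",
}


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, k: int) -> list:
    """Toy retriever: rank sections by number of words shared with the question."""
    q = tokenize(question)
    ranked = sorted(DOCS, key=lambda sid: len(q & tokenize(DOCS[sid])), reverse=True)
    return ranked[:k]


def hit_rate_at_k(cases: list, k: int) -> float:
    """Fraction of cases whose expected section is among the top k results."""
    hits = sum(1 for question, expected in cases if expected in retrieve(question, k))
    return hits / len(cases)


# Each case pairs a question with the section an agent says answers it.
CASES = [
    ("How do I reset my password?", "account/password"),
    ("When are invoices sent?", "billing/invoices"),
    ("Can I get my money back?", "billing/refunds"),
]

rate_top1 = hit_rate_at_k(CASES, k=1)
rate_top3 = hit_rate_at_k(CASES, k=3)
```

The third question shares no words with the refunds section, so the toy retriever misses it at k=1. That is the kind of failure no prompt wording can fix, because the model never sees the right passage.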

Round Two, Better Chunks and Fewer of Them

They re-chunked documents along section boundaries with a small overlap and cut retrieval to the top four passages. Answer quality jumped. The smaller, cleaner context let the model focus, and relevant material stopped getting buried. This was the turning point.
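A minimal Python sketch of section-boundary chunking with overlap follows. It assumes the source documents mark sections with lines that start with "## ", and it implements overlap by carrying the last few words of the previous section into the next chunk. Both choices are assumptions for illustration, since the case does not describe the team's document format or overlap size.

```python
import re

# Assumed convention: a section starts at a line beginning with "## ".
HEADING = re.compile(r"^## (.+)$", re.MULTILINE)


def chunk_by_section(text: str, overlap_words: int = 20) -> list:
    """Return one chunk per section, prefixed with a short tail of the previous one.

    Text before the first heading is ignored in this sketch.
    """
    matches = list(HEADING.finditer(text))
    chunks = []
    previous_tail = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body_words = text[match.end():end].split()
        chunks.append({
            "section": match.group(1).strip(),
            "text": " ".join(previous_tail + body_words),
        })
        # Guard against overlap_words == 0, because words[-0:] is the whole list.
        previous_tail = body_words[-overlap_words:] if overlap_words > 0 else []
    return chunks


# Hypothetical two-section document.
doc = (
    "## Refunds\n"
    "Refunds are issued within 14 days.\n"
    "## Invoices\n"
    "Invoices are emailed monthly.\n"
)
chunks = chunk_by_section(doc, overlap_words=2)
```

Because each chunk keeps its section title, the same title can later serve as the citation label, and retrieval can be limited to the top four chunks rather than ten.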

Round Three, Adding Citations and a Refusal Clause

Finally they instructed the model to cite the document section behind each claim and to say plainly when the docs did not cover a question. Agents could now verify answers at a glance, and the assistant stopped inventing answers to questions the documentation never addressed.
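The instruction can be sketched as a prompt template plus a small check on the answer. The wording of the prompt, the refusal sentence, and the bracketed citation format in this Python example are assumptions for illustration, not the team's actual prompt.

```python
import re

# Hypothetical refusal sentence the model is told to use verbatim.
REFUSAL = "The documentation does not cover this question."


def build_grounded_prompt(question: str, passages: list) -> str:
    """Assemble a prompt that demands citations and permits a refusal."""
    context = "\n\n".join(f"[{p['section']}]\n{p['text']}" for p in passages)
    return (
        "Answer the customer question using only the documentation excerpts below.\n"
        "After each claim, cite the section it came from in square brackets.\n"
        f"If the excerpts do not cover the question, reply exactly: {REFUSAL}\n\n"
        f"Documentation excerpts:\n{context}\n\n"
        f"Question: {question}\n"
    )


def unknown_citations(answer: str, passages: list) -> set:
    """Return cited section names that were never supplied to the model."""
    allowed = {p["section"] for p in passages}
    cited = set(re.findall(r"\[([^\]]+)\]", answer))
    return cited - allowed


# Hypothetical passage and model answer.
passages = [{"section": "Refunds", "text": "Refunds are issued within 14 days."}]
prompt = build_grounded_prompt("How long do refunds take?", passages)
answer = "Refunds are issued within 14 days [Refunds]. Shipping is free [Shipping]."
suspect = unknown_citations(answer, passages)
```

A citation to a section that was never retrieved is a cheap signal that a claim may be invented, which is the same check agents were doing by eye.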

The Cultural Hurdle

The technical changes were the easy part. Harder was convincing agents to give the assistant a second chance after it had burned them. The team ran a two-week pilot with a handful of volunteers, shared the before-and-after numbers openly, and let skeptics keep a list of any answer that still went wrong. The list stayed short, and word spread. By the time the assistant rolled out to the whole team, the volunteers had become its advocates, which mattered more than any internal memo could have.

The Outcome: Trust Restored, Measured

The Numbers That Mattered

Against their standing test set of real tickets, the rate of factually wrong answers fell by well over half, comfortably beating the target. Just as important, agents began using the assistant again, because a cited answer they could verify in seconds was worth drafting from. Response time rose only marginally, well within the acceptable range.

The Softer Win

Beyond the metric, the refusal clause changed the team's relationship with the tool. An assistant that admits ignorance is one agents can rely on, because they learn exactly when to step in. The honesty bought more trust than any accuracy gain alone.

What the Refusals Revealed

The refusals carried information of their own. When the team reviewed the questions the assistant declined to answer, clusters emerged: whole categories of customer questions the documentation simply did not cover. Rather than seeing these as gaps in the tool, the team treated them as a backlog for the documentation writers. Each cluster of refusals became a request for a new help article. Over time the refusal rate dropped, not because the assistant started guessing, but because the documentation grew to match what customers actually asked.
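Turning refusals into a documentation backlog can start with a simple count. The Python sketch below assumes each refused question has already been tagged with a topic, for example by the agent who handled the ticket. The topic labels, the sample questions, and the minimum-count threshold are hypothetical.

```python
from collections import Counter


def refusal_backlog(refused: list, min_count: int = 2) -> list:
    """Rank topics by how often the assistant declined to answer them.

    `refused` is a list of (topic, question) pairs. Topics seen fewer than
    `min_count` times are left out as likely one-offs.
    """
    counts = Counter(topic for topic, _question in refused)
    return [(topic, n) for topic, n in counts.most_common() if n >= min_count]


# Hypothetical refusal log.
refused = [
    ("sso", "Does SSO work with our identity provider?"),
    ("export", "Can I export reports as CSV?"),
    ("sso", "How do I enforce SSO for all users?"),
    ("sso", "Is SSO included in the basic plan?"),
    ("export", "Is there a bulk export option?"),
    ("billing", "Can I pay by wire transfer?"),
]
backlog = refusal_backlog(refused)
```

Each entry in the resulting list is a candidate help article, ordered by how many customers were left without an answer.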

The Lessons That Generalized

Inspect Retrieval First, Always

The single biggest delay came from tuning prompts before checking whether retrieval returned the right chunks. Once they looked at retrieval directly, progress accelerated. This habit anchors the workflow in Build a Grounded Prompt Pipeline in Eight Concrete Steps.

Less Context, Cited and Honest

Fewer, better passages plus citations plus permission to decline did most of the work. The team's instinct to add more context had been making things worse, not better.

Treat the Source Material as Part of the System

A lesson the team did not expect was how much the documentation itself mattered. Several early failures traced back not to retrieval or prompting but to a help article that was simply wrong or out of date. Grounding had quietly turned the assistant into an auditor of the company's own docs, surfacing inconsistencies that humans had tolerated for years. The support and documentation teams began meeting regularly, because improving the docs now directly improved the assistant. Grounding had made the quality of the source material everyone's concern, not just the writers'.

Frequently Asked Questions

Why not just fine-tune the model on the documentation?

Because the documentation changed weekly, and fine-tuning would have required expensive retraining each time. Grounding let the team update answers by simply re-indexing the changed documents, with no retraining at all.

What was the most surprising finding?

That reducing the amount of retrieved context improved answers. The team assumed more context was safer, but the extra passages diluted the model's focus and buried the relevant material.

How did citations change agent behavior?

Agents could verify an answer in seconds by checking the cited section, so they trusted and used the assistant again. Citations also exposed any fabrication instantly, because invented claims had no source.

Was a single test set really enough to guide the project?

Yes, and it was essential. The standing set of real tickets turned vague impressions into measurable progress and caught regressions after each change, which made the three rounds of correction possible.

Key Takeaways

  • An ungrounded assistant fabricated answers because nothing connected it to the company's real documentation.
  • The team chose grounding over fine-tuning so weekly doc changes meant re-indexing, not retraining.
  • The turning point was better chunking and fewer retrieved passages, not prompt wording.
  • Citations plus a refusal clause restored agent trust and exposed fabrication at a glance.
  • Inspecting retrieval first and measuring against a standing test set drove the whole improvement.
