AGENCYSCRIPT

How a Support Bot Stopped Inventing Refund Policies

Agency Script Editorial

Editorial Team

June 8, 2023 · 7 min read

The most instructive lessons about sampling control rarely come from documentation. They come from watching a real team hit a wall, misdiagnose it, and eventually find the actual cause. This is one such story, reconstructed as a narrative arc: the situation, the wrong turns, the decision that worked, and what it cost and saved.

The team in question ran a customer-support assistant for a mid-sized subscription product. Names and specifics are generalized, but the shape of the problem is one we have seen many times. What makes it worth telling is not that the fix was clever — it was almost embarrassingly simple — but that the path to it was full of plausible wrong turns.

If you have ever had a model behave inconsistently and blamed everything except the setting, this story will feel familiar.

The Situation

The assistant answered routine account questions: billing dates, plan differences, how to cancel. It worked well in testing and shipped to a fraction of live traffic.

The Symptom

Within two weeks, support tickets started arriving that referenced answers the company had never authorized. The assistant had told one customer about a refund window that did not exist and described a plan tier that had been discontinued. The answers were fluent, confident, and wrong.

The Stakes

These were not cosmetic errors. A confidently stated but false refund policy creates real obligations and erodes trust. The team faced pressure to either fix the assistant quickly or pull it entirely.

The Wrong Turns

The first instinct was to blame the model's knowledge, so the team poured effort into the prompt and the retrieved context.

Rewriting the Prompt

They added stern instructions: "Only answer from the provided context. Never speculate." It helped marginally, but the invented answers kept appearing, just less often. The improvement was real but insufficient, and it masked the actual cause.

Suspecting the Retrieval

Next they audited the retrieval layer, assuming the model was getting bad context. The context was fine. The model was being handed correct information and still occasionally improvising around it. This ruled out the obvious culprit and forced a harder look.

The Decision

A review of the configuration surfaced the overlooked detail: the assistant was running at a temperature of 0.9, a value carried over from an earlier content-generation project.

The Insight

At 0.9, the model sampled less-likely tokens more often, which for a support assistant meant occasionally phrasing speculation as fact. The high temperature was not making the assistant smarter; it was giving it license to wander away from the provided context. This was a textbook version of the common mistake of carrying a creative setting into a reliability task.
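To make the mechanism concrete, here is a minimal sketch of how temperature reshapes a next-token distribution. It assumes the usual formulation in which raw scores are divided by the temperature before a softmax; the three scores are made up for illustration and do not come from the team's assistant.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Temperature 0 is usually handled as greedy decoding, not by this formula.
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens (illustrative only).
logits = [4.0, 2.0, 1.0]

for temperature in (0.35, 0.9):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

With these made-up scores, the most likely token takes a larger share of the probability at 0.35 than at 0.9, so the less-likely candidates are drawn less often.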

The Change

Following a structured sweep like the one in our step-by-step process, the team tested the assistant across temperatures from 0.0 to 0.9 against a fixed set of real customer questions. Quality and adherence to context peaked in the 0.3 to 0.4 range — natural-sounding but tightly anchored to the provided facts. They set the temperature to 0.35 and held top-p at 1.0.
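A sweep of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the team's actual harness: `generate` and `is_grounded` are hypothetical callables you would supply (a call to your model and a check that the answer stays within the context), and the sketch scores only adherence to context, leaving tone to a separate human review.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (customer question, retrieved context)


def sweep_temperatures(
    cases: List[Case],
    temperatures: List[float],
    generate: Callable[[str, str, float], str],   # hypothetical model call
    is_grounded: Callable[[str, str], bool],      # hypothetical adherence check
    samples_per_case: int = 5,
) -> Dict[float, float]:
    """Return the fraction of grounded answers at each temperature."""
    results: Dict[float, float] = {}
    for temperature in temperatures:
        grounded = 0
        total = 0
        for question, context in cases:
            # Sample several times: at nonzero temperature one answer proves little.
            for _ in range(samples_per_case):
                answer = generate(question, context, temperature)
                grounded += int(is_grounded(answer, context))
                total += 1
        results[temperature] = grounded / total if total else 0.0
    return results
```

Holding the question set fixed across temperatures is what makes the scores comparable. Sampling each question more than once matters because an invented answer that appears one time in ten is easy to miss in a single pass.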

The Outcome

The change took minutes to deploy. Its effects were visible within the week.

What Improved

  • Tickets referencing unauthorized answers dropped to near zero over the following two weeks.
  • The assistant's tone remained natural; customers did not perceive it as more robotic.
  • The team regained confidence to expand the assistant to more of their traffic.

What It Cost

The fix itself cost almost nothing. The expensive part was the two weeks spent rewriting prompts and auditing retrieval before anyone looked at the temperature. That lost time is the real lesson.

What the Team Changed in Their Process

Fixing one assistant was not the point; preventing the next two weeks of waste was. The team adjusted how they worked.

A Standing Diagnostic Order

They wrote a short diagnostic order for any inconsistent output: check the sampling setting first, then the prompt, then the retrieved context. The order they had followed by accident put the setting last; moving it to the front meant the cheapest, fastest check now came first. This single reordering would have caught the original problem on day one.

Settings Reviewed at Project Handoff

The high temperature arrived because a setting was inherited silently from another project. The team made sampling settings an explicit item in any project handoff, so an inherited number is reviewed rather than assumed correct. They cross-referenced each handoff against the band guidance in the examples guide to confirm the inherited value matched the new task type.
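A handoff check like this could be automated along the following lines. This is a sketch only: the function name and the band table are hypothetical. The support band reflects the 0.3 to 0.4 range this team's sweep found, the content-generation band is a placeholder, and neither is taken from the examples guide.

```python
from typing import Dict, Tuple

# Hypothetical bands. Replace with values from your own sweeps or guidance.
BANDS: Dict[str, Tuple[float, float]] = {
    "support": (0.3, 0.4),             # range this team's sweep favored
    "content_generation": (0.7, 1.0),  # placeholder, not from the source
}


def review_inherited_temperature(
    task_type: str,
    inherited: float,
    bands: Dict[str, Tuple[float, float]] = BANDS,
) -> str:
    """Compare an inherited temperature with the band recorded for the new task."""
    if task_type not in bands:
        return f"no band recorded for '{task_type}': run a sweep before launch"
    low, high = bands[task_type]
    if low <= inherited <= high:
        return f"{inherited} is inside {low}-{high}: confirm with a short sweep"
    return f"{inherited} is outside {low}-{high}: re-tune before launch"


# The situation in this story: 0.9 inherited by a support assistant.
print(review_inherited_temperature("support", 0.9))
```

The check is meant to force a conversation at handoff, not to replace a sweep. A value inside the band still deserves a quick test against real questions.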

A Shared Record of Live Settings

They built a single record of every live assistant's task, temperature, and prompt version. When the next model upgrade landed, they could see at a glance which settings might need a fresh sweep, rather than discovering drift through complaints. This record became the backbone of their working checklist.
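One way such a record might look in code is shown below. The task, temperature, and prompt version fields come from the description above; the `swept_on_model` field and the assistant names are assumptions added for illustration, so that a model upgrade can be compared against the last sweep.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LiveSetting:
    assistant: str
    task: str
    temperature: float
    prompt_version: str
    swept_on_model: str  # assumed field: model version the temperature was last swept against


def needs_fresh_sweep(records: List[LiveSetting], current_model: str) -> List[str]:
    """List assistants whose temperature was tuned against a different model version."""
    return [r.assistant for r in records if r.swept_on_model != current_model]


# Hypothetical entries for illustration.
records = [
    LiveSetting("support-bot", "support", 0.35, "v7", "model-2023-05"),
    LiveSetting("blog-drafter", "content_generation", 0.9, "v3", "model-2023-03"),
]

print(needs_fresh_sweep(records, current_model="model-2023-05"))
```

A spreadsheet would serve the same purpose. What matters is that the record lives in one place and is updated whenever a setting changes.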

The Lessons

Stepping back, the episode reinforces a few durable principles.

Check the Setting Early

When a model behaves inconsistently, the sampling setting belongs near the top of the diagnostic checklist, not the bottom. The team would have saved two weeks by checking it first. The best-practices guide now lists this as a standing diagnostic step.

Settings Do Not Travel

A temperature tuned for content generation was actively harmful for support. Carrying settings across projects without re-tuning is a quiet, recurring source of failure. The examples guide shows just how far the right setting shifts between task types.

Document So It Does Not Recur

The team added the assistant's task, temperature, and prompt version to a shared record, so the next person to touch it would not undo the fix by accident.

What Generalizes Beyond This One Team

It would be easy to read this as a story about one chatbot. The mechanics generalize to a wide range of situations.

Any Fact-Bound Generative Task

The same dynamic appears wherever a model is supposed to answer from provided material rather than from its own latitude: document question answering, policy lookups, internal knowledge assistants, and compliance-sensitive summaries. In all of these, a higher temperature increases the chance the model phrases something it is not entitled to assert as a fact. The remedy is the same — anchor low enough that the model stays tied to its source, then verify the tone is still acceptable.

The Diagnostic Lesson Travels Furthest

More broadly, the episode is a lesson about diagnostic order. The team's two-week detour happened because they investigated the expensive, complex causes before the cheap, simple one. This pattern repeats across engineering: a one-line setting gets overlooked precisely because it seems too trivial to be the culprit. Building a habit of checking the simplest controls first, as the best-practices guide recommends, pays off well beyond sampling parameters.

A Note on Measurement

The team could only confirm the fix because they were tracking tickets that referenced unauthorized answers. Without that signal, the improvement would have been invisible and unprovable. Whatever the task, having a concrete outcome metric — even a rough one — is what turns a plausible fix into a verified one.
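A rough version of that signal can be computed from a ticket export. The sketch below assumes each ticket carries an opened date and a flag marking whether it referenced an unauthorized answer; both field choices are hypothetical, and the flag would in practice come from manual tagging or triage.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple


def weekly_unauthorized_rate(
    tickets: Iterable[Tuple[date, bool]],
) -> Dict[Tuple[int, int], float]:
    """Return, per ISO (year, week), the share of tickets flagged as unauthorized-answer cases."""
    totals: Counter = Counter()
    flagged: Counter = Counter()
    for opened, is_flagged in tickets:
        iso = opened.isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)
        totals[week] += 1
        flagged[week] += int(is_flagged)
    return {week: flagged[week] / totals[week] for week in sorted(totals)}
```

Comparing the weekly rate before and after a change gives a simple check on whether the fix worked, even when total ticket volume moves for unrelated reasons.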

Frequently Asked Questions

Why did a high temperature cause invented answers?

At a high temperature the model is more willing to choose less-likely tokens, which for a fact-bound assistant means occasionally phrasing speculation as fact. Lowering the temperature kept the model anchored to the context it was given.

Could a better prompt alone have fixed this?

It helped but did not fully solve the problem. The prompt reduced the frequency of invented answers, while the temperature change addressed the root cause. The two work together; neither alone was sufficient here.

How did the team know 0.35 was right?

They ran a structured sweep across temperatures against real customer questions and watched where adherence to context peaked while the tone stayed natural. The setting was chosen from evidence, not from a guess.

Did lowering temperature make the bot sound robotic?

No. Customers did not perceive a meaningful tone change. There is a common fear that lower temperature ruins voice, but the difference between 0.35 and 0.9 was invisible to users while the reliability gain was large.

What is the single biggest takeaway?

Check the sampling setting early when output is inconsistent. The team lost two weeks chasing the prompt and retrieval before examining a number that took minutes to fix.

Key Takeaways

  • A support assistant invented policies because it ran at a high temperature carried over from a content project.
  • Prompt rewrites and retrieval audits reduced but did not resolve the problem, because the root cause was the setting.
  • A structured sweep against real questions showed quality peaked in the 0.3 to 0.4 range, and the team adopted 0.35.
  • Invented answers dropped to near zero with no perceptible loss of natural tone.
  • Check sampling settings early, never assume settings travel between projects, and document the fix so it sticks.
