The most instructive lessons about sampling control rarely come from documentation. They come from watching a real team hit a wall, misdiagnose it, and eventually find the actual cause. This is one such story, reconstructed as a narrative arc: the situation, the wrong turns, the decision that worked, and what it cost and saved.
The team in question ran a customer-support assistant for a mid-sized subscription product. Names and specifics are generalized, but the shape of the problem is one we have seen many times. What makes it worth telling is not that the fix was clever — it was almost embarrassingly simple — but that the path to it was full of plausible wrong turns.
If you have ever had a model behave inconsistently and blamed everything except the sampling setting, this story will feel familiar.
The Situation
The assistant answered routine account questions: billing dates, plan differences, how to cancel. It worked well in testing and shipped to a fraction of live traffic.
The Symptom
Within two weeks, support tickets started arriving that referenced answers the company had never authorized. The assistant had told one customer about a refund window that did not exist and described a plan tier that had been discontinued. The answers were fluent, confident, and wrong.
The Stakes
These were not cosmetic errors. A confidently stated but false refund policy creates real obligations and erodes trust. The team faced pressure to either fix the assistant quickly or pull it entirely.
The Wrong Turns
The first instinct was to blame the model's knowledge, so the team poured effort into the prompt and the retrieved context.
Rewriting the Prompt
They added stern instructions: "Only answer from the provided context. Never speculate." It helped marginally, but the invented answers kept appearing, just less often. The improvement was real but insufficient, and because it looked like progress, it masked the actual cause.
Suspecting the Retrieval
Next they audited the retrieval layer, assuming the model was getting bad context. The context was fine. The model was being handed correct information and still occasionally improvising around it. This ruled out the obvious culprit and forced a harder look.
The Decision
A review of the configuration surfaced the overlooked detail: the assistant was running at a temperature of 0.9, a value carried over from an earlier content-generation project.
The Insight
At 0.9, the sampling distribution was flat enough that the model picked lower-probability tokens more often, which for a support assistant meant occasionally phrasing speculation as fact. The high temperature was not making the assistant smarter; it was giving it license to wander away from the provided context. This was a textbook version of the common mistake of carrying a creative setting into a reliability task.
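To make the mechanism concrete, here is a minimal sketch of temperature-scaled softmax. The three logits are made-up numbers chosen only for illustration (one token well supported by the context, two that are not); they are not taken from the team's assistant or from any real model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Temperature 0 is usually handled separately as greedy (argmax) decoding.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: index 0 is the context-supported token.
logits = [4.0, 2.0, 1.0]

p_low = softmax_with_temperature(logits, 0.35)
p_high = softmax_with_temperature(logits, 0.9)

# Probability mass that lands on anything other than the top token.
tail_low = 1.0 - p_low[0]
tail_high = 1.0 - p_high[0]

print(f"T=0.35: mass off the top token = {tail_low:.4f}")
print(f"T=0.90: mass off the top token = {tail_high:.4f}")
```

With these illustrative logits, the mass on the less-likely tokens is well under 1 percent at 0.35 and above 10 percent at 0.9. The exact figures depend entirely on the logits, but the direction of the effect is what matters: the same model, given the same context, departs from its most likely continuation far more often at the higher setting.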
The Change
Using a structured sweep like the one in our step-by-step process, the team tested the assistant across temperatures from 0.0 to 0.9 against a fixed set of real customer questions. Quality and adherence to context peaked in the 0.3 to 0.4 range, where answers sounded natural but stayed tightly anchored to the provided facts. They set the temperature to 0.35 and held top-p at 1.0.
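A sweep of this kind can be sketched as a small harness. This is an illustration under stated assumptions, not the team's actual tooling: `generate` and `score` are hypothetical callables you would supply yourself (a wrapper around your model client, and a grader that rates adherence to the provided context), and `samples_per_question` is an assumed knob for repeating each question, since sampled output varies from run to run.

```python
from statistics import mean
from typing import Callable, Dict, List

def sweep_temperatures(
    generate: Callable[[str, float], str],   # hypothetical: (question, temperature) -> answer
    score: Callable[[str, str], float],      # hypothetical: (question, answer) -> quality score
    questions: List[str],
    temperatures: List[float],
    samples_per_question: int = 3,
) -> Dict[float, float]:
    """Return the mean score at each temperature over a fixed question set."""
    results: Dict[float, float] = {}
    for temperature in temperatures:
        scores = [
            score(question, generate(question, temperature))
            for question in questions
            for _ in range(samples_per_question)  # repeat: sampling is stochastic
        ]
        results[temperature] = mean(scores)
    return results

def best_temperature(results: Dict[float, float]) -> float:
    """Pick the temperature with the highest mean score."""
    return max(results, key=results.get)

# The grid described in the text: 0.0 through 0.9 in steps of 0.1.
TEMPERATURE_GRID = [round(0.1 * i, 1) for i in range(10)]
```

In use, you would call `sweep_temperatures(my_generate, my_score, questions, TEMPERATURE_GRID)` and inspect the whole result rather than only `best_temperature`, because a flat plateau such as the 0.3 to 0.4 range above is more informative than a single peak. Holding the question set fixed across temperatures is what makes the scores comparable.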
The Outcome
The change took minutes to deploy. Its effects were visible within the week.
What Improved
- Tickets referencing unauthorized answers dropped to near zero over the following two weeks.
- The assistant's tone remained natural; customers did not perceive it as more robotic.
- The team regained confidence to expand the assistant to more of their traffic.
What It Cost
The fix itself cost almost nothing. The expensive part was the two weeks spent rewriting prompts and auditing retrieval before anyone looked at the temperature. That lost time is the real lesson.
What the Team Changed in Their Process
Fixing one assistant was not the point; preventing the next two weeks of waste was. The team adjusted how they worked.
A Standing Diagnostic Order
They wrote a short diagnostic order for any inconsistent output: check the sampling setting first, then the prompt, then the retrieved context. Moving the setting from the last check, where it had landed by accident, to the first meant the cheapest, fastest check now came first. This single reordering would have caught the original problem on day one.
Settings Reviewed at Project Handoff
The high temperature arrived because a setting was inherited silently from another project. The team made sampling settings an explicit item in any project handoff, so an inherited number is reviewed rather than assumed correct. They cross-referenced each handoff against the band guidance in the examples guide to confirm the inherited value matched the new task type.
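A handoff check like this can be as small as a lookup. The sketch below is an assumption about how such a check might look, not the team's implementation: the function name and the `support_qa` label are hypothetical, and the only band included is the 0.3 to 0.4 range this team found in its own sweep. Bands for other task types would have to come from your own sweeps or guidance.

```python
from typing import Dict, Tuple

Band = Tuple[float, float]  # (low, high), inclusive

def review_inherited_temperature(
    task_type: str, inherited: float, bands: Dict[str, Band]
) -> str:
    """Compare an inherited temperature against the recorded band for the new task."""
    band = bands.get(task_type)
    if band is None:
        # No recorded band is itself a finding: the value has never been validated.
        return f"no band recorded for {task_type!r}: run a sweep before shipping"
    low, high = band
    if low <= inherited <= high:
        return f"{inherited} is inside the {low}-{high} band for {task_type!r}"
    return f"{inherited} is outside the {low}-{high} band for {task_type!r}: re-tune before shipping"

# Only band known from this story; the task label is a hypothetical name.
bands = {"support_qa": (0.3, 0.4)}

# The inherited content-generation value from the story, checked against support.
print(review_inherited_temperature("support_qa", 0.9, bands))
```

Treating a missing band as a reason to sweep, rather than as a pass, keeps the check from silently approving a number nobody has tested.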
A Shared Record of Live Settings
They built a single record of every live assistant's task, temperature, and prompt version. When the next model upgrade landed, they could see at a glance which settings might need a fresh sweep, rather than discovering drift through complaints. This record became the backbone of their working checklist.
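One way to picture such a record is a list of small typed entries. This is a sketch, not the team's actual schema: the text names only task, temperature, and prompt version, so the `assistant` and `swept_on_model` fields are assumptions added here to make the "which settings need a fresh sweep" question answerable, and every name and version string below is a placeholder.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class LiveSetting:
    assistant: str        # assumption: an identifier for the deployed assistant
    task: str
    temperature: float
    prompt_version: str
    swept_on_model: str   # assumption: the model version the last sweep ran against

def needs_fresh_sweep(records: List[LiveSetting], current_model: str) -> List[LiveSetting]:
    """Return records whose temperature was tuned against a different model version."""
    return [r for r in records if r.swept_on_model != current_model]

# Placeholder entries; the names and version strings are illustrative only.
records = [
    LiveSetting("support-assistant", "support_qa", 0.35, "v7", "model-a"),
    LiveSetting("blog-drafter", "content_generation", 0.9, "v3", "model-b"),
]

stale = needs_fresh_sweep(records, current_model="model-b")
for record in stale:
    print(f"{record.assistant}: temperature {record.temperature} was tuned on {record.swept_on_model}")
```

Whether this lives in code, a spreadsheet, or a config repository matters less than having one place where a setting and the evidence behind it sit side by side.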
The Lessons
Stepping back, the episode reinforces a few durable principles.
Check the Setting Early
When a model behaves inconsistently, the sampling setting belongs near the top of the diagnostic checklist, not the bottom. The team would have saved two weeks by checking it first. The best-practices guide now lists this as a standing diagnostic step.
Settings Do Not Travel
A temperature tuned for content generation was actively harmful for support. Carrying settings across projects without re-tuning is a quiet, recurring source of failure. The examples guide shows just how far the right setting shifts between task types.
Document So It Does Not Recur
The team added the assistant's task, temperature, and prompt version to a shared record, so the next person to touch it would not undo the fix by accident.
What Generalizes Beyond This One Team
It would be easy to read this as a story about one chatbot. The mechanics generalize to a wide range of situations.
Any Fact-Bound Generative Task
The same dynamic appears wherever a model is supposed to answer from provided material rather than from its own latitude: document question answering, policy lookups, internal knowledge assistants, and compliance-sensitive summaries. In all of these, a higher temperature increases the chance that the model states as fact something it is not entitled to assert. The remedy is the same: set the temperature low enough that the model stays tied to its source, then verify the tone is still acceptable.
The Diagnostic Lesson Travels Furthest
More broadly, the episode is a lesson about diagnostic order. The team's two-week detour happened because they investigated the expensive, complex causes before the cheap, simple one. This pattern repeats across engineering: a one-line setting gets overlooked precisely because it seems too trivial to be the culprit. Building a habit of checking the simplest controls first, as the best-practices guide recommends, pays off well beyond sampling parameters.
A Note on Measurement
The team could only confirm the fix because they were tracking tickets that referenced unauthorized answers. Without that signal, the improvement would have been invisible and unprovable. Whatever the task, having a concrete outcome metric — even a rough one — is what turns a plausible fix into a verified one.
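Even a rough metric can be computed in a few lines. The sketch below assumes you can count, per week, the total tickets and the tickets that reference an unauthorized answer. The type and function names are hypothetical, and no real figures from the team are included.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class WeeklyCount:
    week: str
    total_tickets: int
    flagged_tickets: int  # tickets referencing an unauthorized answer

def flagged_rate(weeks: List[WeeklyCount]) -> float:
    """Share of tickets in the period that reference an unauthorized answer."""
    total = sum(w.total_tickets for w in weeks)
    if total == 0:
        return 0.0
    return sum(w.flagged_tickets for w in weeks) / total

def relative_reduction(before: List[WeeklyCount], after: List[WeeklyCount]) -> Optional[float]:
    """Fractional drop in the flagged rate, or None if there was nothing to reduce."""
    rate_before = flagged_rate(before)
    if rate_before == 0:
        return None
    return 1.0 - flagged_rate(after) / rate_before
```

Comparing rates rather than raw counts matters when traffic changes between periods, as it would if the assistant is expanded to more users after the fix.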
Frequently Asked Questions
Why did a high temperature cause invented answers?
At a high temperature the model is more willing to choose less-likely tokens, which for a fact-bound assistant means occasionally phrasing speculation as fact. Lowering the temperature kept the model anchored to the context it was given.
Could a better prompt alone have fixed this?
It helped but did not fully solve the problem. The prompt reduced the frequency of invented answers, while the temperature change addressed the root cause. The two work together; neither alone was sufficient here.
How did the team know 0.35 was right?
They ran a structured sweep across temperatures against real customer questions and watched where adherence to context peaked while the tone stayed natural. The setting was chosen from evidence, not from a guess.
Did lowering temperature make the bot sound robotic?
No. Customers did not perceive a meaningful tone change. There is a common fear that lower temperature ruins voice, but here the tone difference between 0.9 and 0.35 went unnoticed by users while the reliability gain was large.
What is the single biggest takeaway?
Check the sampling setting early when output is inconsistent. The team lost two weeks chasing the prompt and retrieval before examining a number that took minutes to fix.
Key Takeaways
- A support assistant invented policies because it ran at a high temperature carried over from a content project.
- Prompt rewrites reduced the problem but did not resolve it, and the retrieval audit found the context was fine, because the root cause was the setting.
- A structured sweep against real questions showed quality peaked in the 0.3 to 0.4 range, and the team adopted 0.35.
- Invented answers dropped to near zero with no perceptible loss of natural tone.
- Check sampling settings early, never assume settings travel between projects, and document the fix so it sticks.