The fastest way to understand token budgeting is to follow a team through a real problem from start to finish. This is a composite account of a support automation team whose language model bill grew faster than their traffic for three straight months, until nobody could explain the gap between usage and cost. It follows them through diagnosis, decision, execution, and the measured outcome, and ends with the lessons they took away.
The situation is common enough to be instructive: a feature that worked well in testing, shipped without a deliberate token budget, and slowly turned into the largest line item on the engineering invoice. What makes the story useful is not that they fixed it (many teams do) but the specific decisions they made and the order in which they made them.
The names and exact figures are illustrative, but the shape of the problem and the moves that resolved it reflect patterns we see repeatedly. Read it as a worked example of the discipline applied under real pressure.
The Situation
The team ran a support assistant that answered customer questions using both conversation history and retrieved help-center articles.
Symptoms
Over a quarter, traffic grew about 30 percent while the model bill nearly doubled. Long support sessions occasionally produced incoherent answers, as if the assistant had forgotten earlier context. Nobody could point to a cause because nobody was measuring where tokens went.
The Pressure
Finance flagged the line item, and leadership asked for a plan to cut it by half without degrading the support experience. The team had two weeks. The full discipline they reached for is laid out in Spending Tokens Like Money: A Working Manual for LLM Budgets.
The Diagnosis
Before touching anything, the team instrumented the system to see where tokens actually went.
Instrumenting the Prompt
They logged token counts for each prompt component (system prompt, retrieved articles, conversation history, and user message), plus output tokens, across a sample of real sessions. The breakdown was revealing.
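A minimal sketch of that kind of per-component logging might look like the following. The function names are hypothetical, and `approx_tokens` is a crude whitespace count standing in for whatever tokenizer the model provider actually uses; the point is only the shape of the breakdown, not the exact numbers.

```python
from collections import Counter


def approx_tokens(text: str) -> int:
    """Assumption: a whitespace word count stands in for a real tokenizer."""
    return len(text.split())


def component_breakdown(components: dict[str, str]) -> dict[str, int]:
    """Return approximate token counts for each named piece of one request."""
    return {name: approx_tokens(text) for name, text in components.items()}


def aggregate_shares(breakdowns: list[dict[str, int]]) -> dict[str, float]:
    """Sum counts across sampled requests and return each component's share."""
    totals: Counter = Counter()
    for breakdown in breakdowns:
        totals.update(breakdown)  # adds counts key by key
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: count / grand_total for name, count in totals.items()}
```

Logging one breakdown per request and aggregating over a sample is enough to show which component dominates, which is the question the team needed answered before changing anything.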
What They Found
Retrieved articles were sent as full documents and accounted for nearly half of input tokens. Conversation history was unbounded and grew without limit across long sessions, occasionally overflowing the window and silently dropping the customer's original question. Output was uncapped, and a tail of very long answers drove a disproportionate share of cost. The diagnostic approach matches Cut Your Token Costs This Afternoon: An Ordered Routine.
The Decisions
With the breakdown in hand, the targets were obvious, and the team prioritized by expected return.
Retrieval First
Because retrieved articles were the single largest consumer, the team started there. The decision was to chunk articles, rerank chunks against the question, and include only the top three chunks rather than full documents.
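The chunk, rerank, and keep-the-top-three decision can be sketched as below. This is an illustration under stated assumptions, not the team's implementation: the chunk size is arbitrary, and `overlap_score` is a simple word-overlap measure swapped in for a real reranking model. All function names are hypothetical.

```python
def chunk_article(text: str, max_words: int = 120) -> list[str]:
    """Split an article into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def overlap_score(question: str, chunk: str) -> float:
    """Assumption: fraction of question words found in the chunk.
    A production system would likely use a trained reranker instead."""
    question_words = set(question.lower().split())
    chunk_words = set(chunk.lower().split())
    if not question_words:
        return 0.0
    return len(question_words & chunk_words) / len(question_words)


def select_context(question: str, articles: list[str],
                   top_k: int = 3, max_words: int = 120) -> list[str]:
    """Chunk every article, rank chunks against the question, keep the top_k."""
    chunks = [c for article in articles for c in chunk_article(article, max_words)]
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
    return ranked[:top_k]
```

Whatever scoring function is used, the input cost of retrieval becomes bounded by `top_k` times the chunk size instead of by the length of the longest help-center article.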
History Second
The unbounded history both cost money and caused the incoherence, so it was the next target. The decision was to keep the last four turns verbatim, summarize older turns into a running record, and cap history at a fixed token budget.
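One way to express that history policy is sketched here. The `summarize` callable is a placeholder for whatever produces the running record (for example, a model call), the 800-token budget is an illustrative figure that does not come from the team, and `approx_tokens` is again a whitespace stand-in for a real tokenizer.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    """Assumption: a whitespace word count stands in for a real tokenizer."""
    return len(text.split())


def truncate_to_tokens(text: str, limit: int) -> str:
    """Keep at most `limit` approximate tokens from the start of the text."""
    return " ".join(text.split()[:max(limit, 0)])


def build_history(turns: list[str],
                  summarize: Callable[[list[str]], str],
                  keep_last: int = 4,
                  budget: int = 800) -> list[str]:
    """Return [summary of older turns] + recent verbatim turns, within budget."""
    recent = turns[-keep_last:] if keep_last > 0 else []
    older = turns[:-keep_last] if keep_last > 0 else list(turns)

    # If the verbatim turns alone exceed the budget, demote the oldest of
    # them into the summarized portion until the rest fits.
    while recent and sum(approx_tokens(t) for t in recent) > budget:
        older = older + [recent[0]]
        recent = recent[1:]

    remaining = budget - sum(approx_tokens(t) for t in recent)
    summary = truncate_to_tokens(summarize(older), remaining) if older else ""
    return ([summary] if summary else []) + recent
```

Because the cap is enforced in code, a long session can no longer overflow the window unnoticed. What the summary must preserve, such as the customer's original question, remains a quality decision that the summarizer itself has to get right.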
Output Third
Finally, they decided to cap output length and ask for more concise, structured answers, since the long tail of verbose responses fell on output tokens, the pricier side of the ledger.
The Execution
The team rolled the changes out carefully rather than all at once, to isolate effects and catch regressions.
Staged Rollout
They shipped retrieval changes first to a fraction of traffic, compared token counts and answer quality against the baseline, and confirmed both improved before expanding. History and output changes followed the same pattern.
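Shipping to a fraction of traffic is often done with deterministic bucketing, so that a given session always sees the same variant. The sketch below is one common approach and an assumption about mechanics the account does not describe; `in_rollout` and the feature name in the usage note are hypothetical.

```python
import hashlib


def in_rollout(session_id: str, feature: str, fraction: float) -> bool:
    """Deterministically assign a session to the new behaviour.

    The session id is hashed together with the feature name, so different
    features get independent buckets and the same session always lands in
    the same bucket for a given feature.
    """
    digest = hashlib.sha256(f"{feature}:{session_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < round(fraction * 10_000)
```

Calling `in_rollout(session_id, "retrieval_v2", 0.10)` would route roughly a tenth of sessions to the new retrieval path. Raising the fraction later keeps every session that was already included, which keeps before-and-after comparisons clean as the rollout expands.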
Guarding Quality
For each change they compared answers before and after on a fixed set of real questions. When the first history summary lost some account details, they adjusted what the summary preserved and re-verified. This is where the work became engineering rather than arithmetic, the same tension covered in Token Budget Management and Optimization: Real-World Examples and Use Cases.
The Outcome
The measured result met the target and produced a few unexpected benefits.
The Numbers
Input tokens per request fell by more than half, driven mostly by the retrieval change. Output cost dropped once length was capped. Overall cost per request came down enough to bring the bill back below its level from the start of the quarter, despite higher traffic.
The Surprises
Answer quality went up, not down. The focused retrieval context removed distracting noise, and the deliberate history summary stopped the assistant from forgetting the customer's original problem. Cutting cost and improving the experience turned out to point in the same direction.
What They Kept
They moved every limit into central configuration and added a monthly review of token telemetry, so the gains would not quietly erode. That working tool resembles The Token Budget Management and Optimization Checklist for 2026.
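Central configuration of this kind can be as small as a single immutable object that every prompt-assembly path reads from. In the sketch below, the top-three and last-four values come from the decisions described earlier; the history budget and output cap are illustrative numbers, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenLimits:
    """Single source of truth for every token limit in the assistant."""
    retrieval_top_k: int = 3            # from the retrieval decision
    history_verbatim_turns: int = 4     # from the history decision
    history_token_budget: int = 800     # illustrative assumption
    max_output_tokens: int = 400        # illustrative assumption


LIMITS = TokenLimits()


def violations(history_tokens: int, output_tokens: int,
               limits: TokenLimits = LIMITS) -> list[str]:
    """Name the components in one logged request that exceeded their limit."""
    found = []
    if history_tokens > limits.history_token_budget:
        found.append("history")
    if output_tokens > limits.max_output_tokens:
        found.append("output")
    return found
```

Running a check like `violations` over logged requests gives the monthly review something concrete to look at: a list of where components have started creeping past their limits.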
The Lessons They Carried Forward
The team came out of the two weeks with more than a smaller bill. They came out with a set of habits they applied to every feature that followed.
Measurement Is Not Optional
Their biggest regret was shipping the feature without any token instrumentation in the first place. For three months the cost grew and nobody could say why, because the data did not exist. After this, every new LLM feature shipped with per-component token logging from day one. The cost of instrumentation was trivial next to the cost of flying blind.
Cost and Quality Are Not Always Opposed
Going in, the team assumed cutting cost meant accepting worse answers, and they braced leadership for that trade. The opposite happened. Focused retrieval and a deliberate history summary improved the experience while reducing tokens. The lesson was not that this always happens, but that the trade-off should be measured rather than assumed. Sometimes the cheaper design is also the better one.
Order of Attack Matters
They saved the most by going after the largest consumer first. Had they started with the system prompt, which was already lean, they would have spent effort for little return. Prioritizing by measured size, rather than by what felt wasteful, directed their limited time to where it paid. This prioritization is the spine of Cut Your Token Costs This Afternoon: An Ordered Routine.
Gains Need Guardrails
The final habit was distrust of their own discipline. They knew that without enforcement, the components would creep back. Centralized limits and a monthly review were not bureaucracy; they were the only thing standing between the savings and their slow reversal.
Frequently Asked Questions
Why did the bill grow faster than traffic?
Because two components grew with use rather than staying fixed. Unbounded history got more expensive every turn, and full-document retrieval sent large amounts of text per request. Both compounded as sessions lengthened and traffic rose.
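The compounding effect of unbounded history is easy to see with a little arithmetic. The sketch below uses made-up numbers (500 fixed input tokens per request, 200 tokens added per turn) and a hypothetical function name; with a cap it simply stops counting older turns and ignores the size of any summary that replaces them.

```python
from typing import Optional


def session_input_tokens(turns: int, fixed_per_request: int, tokens_per_turn: int,
                         max_history_turns: Optional[int] = None) -> int:
    """Total input tokens across a session where each request resends history.

    Request k (1-based) carries the fixed part plus the k-1 earlier turns,
    optionally capped at max_history_turns.
    """
    total = 0
    for k in range(1, turns + 1):
        carried = k - 1
        if max_history_turns is not None:
            carried = min(carried, max_history_turns)
        total += fixed_per_request + carried * tokens_per_turn
    return total


# Unbounded history: doubling the session length more than triples the tokens.
print(session_input_tokens(10, 500, 200))      # 14000
print(session_input_tokens(20, 500, 200))      # 48000

# History capped at four turns: the total grows roughly in line with length.
print(session_input_tokens(10, 500, 200, 4))   # 11000
print(session_input_tokens(20, 500, 200, 4))   # 24000
```

With unbounded history the total grows with the square of the session length, which is why the bill can outpace traffic even when the number of requests rises only modestly.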
Why start with retrieval instead of history?
Retrieval was the single largest token consumer, so equal effort there returned the most. The team prioritized by measured size, and the breakdown put retrieved articles at the top.
How did quality improve while cost fell?
Focused retrieval removed distracting irrelevant text, helping the model find the right answer, and a deliberate history summary preserved the customer's original problem instead of dropping it. Both changes happened to improve answers while reducing tokens.
Was a staged rollout necessary?
It was prudent. Shipping changes to a fraction of traffic first let the team compare against the baseline and catch a history-summary regression before it reached everyone. Isolating each change made the effects legible.
How did they prevent the savings from reversing?
By centralizing every limit in configuration and instituting a monthly review of token telemetry. Enforcement and regular review keep components from creeping back to their old sizes.
Key Takeaways
- A bill that grows faster than traffic usually points to components that scale with use, like unbounded history and full-document retrieval.
- Instrument prompt assembly to see per-component token usage before deciding what to change.
- Prioritize optimization by measured size, attacking the largest consumer first for the best return.
- Roll out changes in stages and verify quality against a fixed set of real cases at each step.
- Centralize limits and review telemetry regularly so the savings do not erode over time.