Reading about token management is one thing. Sitting down with a live feature that costs too much and fixing it is another. This guide covers the second. It is an ordered routine you can run today against a real prompt, moving step by step from measurement to enforced limits, without relying on intuition about where the tokens are going.
The routine works on any LLM feature: a chatbot, a retrieval-augmented question answerer, a summarizer, a classification pipeline. The specifics of each step change with the feature, but the sequence does not. Measure first, find the largest consumers, cut the cheapest waste, enforce the new limits, and verify you did not break quality. Skip a step and you risk optimizing the wrong thing or shipping a regression.
Have your codebase open and a representative sample of real prompts on hand. The whole routine takes an afternoon for a single feature, and most of that time is measurement and verification rather than code changes.
Step One: Capture a Baseline
You cannot tell whether you improved anything if you never measured where you started.
Instrument the Prompt Assembly
Find the place in your code where the final prompt is built. Right before the request is sent, count the tokens in each component separately (system prompt, retrieved context, conversation history, and user message) using your provider's tokenizer. Log those numbers.
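A minimal sketch of that instrumentation, assuming the prompt is assembled from four named strings. The names `rough_token_count` and `profile_prompt` are hypothetical, and the whitespace counter is only a stand-in: swap in your provider's real tokenizer before trusting the numbers.

```python
from typing import Callable, Dict


def rough_token_count(text: str) -> int:
    """Crude stand-in tokenizer (assumption): counts whitespace-separated words.

    Replace this with your provider's tokenizer; word counts only
    approximate real token counts.
    """
    return len(text.split())


def profile_prompt(
    components: Dict[str, str],
    count_tokens: Callable[[str], int] = rough_token_count,
) -> Dict[str, int]:
    """Return the token count of each prompt component plus a total."""
    counts = {name: count_tokens(text) for name, text in components.items()}
    counts["total"] = sum(counts.values())
    return counts


# Example components; in a real feature these come from the prompt builder.
components = {
    "system_prompt": "You are a concise support assistant.",
    "retrieved_context": "Refunds are issued within 14 days of purchase.",
    "history": "User: Hi. Assistant: Hello, how can I help?",
    "user_message": "Can I get a refund?",
}

profile = profile_prompt(components)
print(profile)  # log this per request, right before sending
```

Passing the counter in as a parameter keeps the profiling code unchanged when you switch from the rough counter to the real tokenizer.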
Run a Representative Sample
Send a few dozen real or realistic requests through the instrumented path. Record the per-component token counts and the output token count for each. You now have a profile of where tokens actually go, which is almost always different from where you assumed they went.
Note the Cost Per Request
Multiply input and output tokens by their respective prices and write down the average cost per request. This is the number every later step will be judged against. The broader rationale for starting here is laid out in Spending Tokens Like Money: A Working Manual for LLM Budgets.
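The arithmetic can be sketched as follows. The prices and token counts here are made-up illustrative values, not any provider's actual rates, and the function names are hypothetical.

```python
def cost_per_request(input_tokens, output_tokens,
                     input_price_per_million, output_price_per_million):
    """Cost of one request, given prices quoted per million tokens."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


def average_cost(samples, input_price, output_price):
    """Average cost over (input_tokens, output_tokens) pairs."""
    costs = [cost_per_request(i, o, input_price, output_price)
             for i, o in samples]
    return sum(costs) / len(costs)


# Hypothetical prices in dollars per million tokens; use your provider's rates.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

# (input_tokens, output_tokens) for each request in the baseline sample.
samples = [(2000, 300), (3500, 450), (1500, 150)]

baseline_cost = average_cost(samples, INPUT_PRICE, OUTPUT_PRICE)
print(f"average cost per request: ${baseline_cost:.4f}")
```

With these example numbers the three requests cost 0.0105, 0.01725, and 0.00675 dollars, for an average of 0.0115 dollars per request.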
Step Two: Identify the Largest Consumers
With a baseline in hand, the optimization targets pick themselves.
Sort Components by Size
Rank your four components by average token count. In most retrieval systems the retrieved context dominates. In long chat sessions the history dominates. In simple features a bloated system prompt is often the culprit. Whatever leads the list is where you start.
Separate Fixed From Variable
Some components are fixed per request, like the system prompt. Others grow with use, like history or retrieved documents. Variable components are usually the better target because their growth is what makes costs unpredictable over time.
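One way to do the ranking and the fixed-versus-variable split in a single pass is to compute each component's average and its spread across the sample. This is a sketch under assumptions: the per-request profiles are dictionaries of token counts, the numbers are invented, and `summarize_components` is a hypothetical helper.

```python
from statistics import mean


def summarize_components(profiles):
    """Return (name, average, spread) rows sorted by average, largest first.

    Spread is max minus min across the sample; a spread of zero means the
    component is fixed per request.
    """
    rows = []
    for name in profiles[0]:
        values = [p[name] for p in profiles]
        rows.append((name, mean(values), max(values) - min(values)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Invented per-request token counts from a baseline run.
profiles = [
    {"system_prompt": 400, "retrieved_context": 2200, "history": 600, "user_message": 40},
    {"system_prompt": 400, "retrieved_context": 1800, "history": 1400, "user_message": 60},
]

rows = summarize_components(profiles)
for name, average, spread in rows:
    kind = "fixed" if spread == 0 else "variable"
    print(f"{name:18} avg={average:7.1f} spread={spread:5d} {kind}")
```

In this example retrieved context leads by average size, while history has the widest spread, which makes both candidates worth a closer look.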
Check the Output Side
Look at the distribution of output lengths, not just the average. If a minority of responses run very long, capping output length may save more than any input change. Output tokens are usually priced higher than input tokens, so a small cut there pays well.
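A small sketch of looking at the distribution rather than the average, using only the standard library. The output lengths are invented, and `output_length_summary` is a hypothetical helper.

```python
from statistics import mean, quantiles


def output_length_summary(output_tokens):
    """Summarize output lengths with mean, median, 95th percentile, and max."""
    # n=100 yields 99 cut points; index 49 is the median, index 94 is p95.
    cuts = quantiles(output_tokens, n=100, method="inclusive")
    return {
        "mean": mean(output_tokens),
        "p50": cuts[49],
        "p95": cuts[94],
        "max": max(output_tokens),
    }


# Invented sample: most answers are short, one runs very long.
output_tokens = [120, 130, 140, 150, 160, 170, 180, 190, 200, 1900]

summary = output_length_summary(output_tokens)
print(summary)
```

When the mean sits well above the median, as it does here, a few long responses are carrying much of the output cost, and an output cap is likely to pay off.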
Step Three: Cut the Cheapest Waste First
Not all reductions cost the same effort. Start with the ones that are nearly free.
Prune the System Prompt
Read your system prompt line by line and delete anything that no longer changes behavior. Old instructions added for problems that no longer exist are pure waste paid on every request. This is the single fastest win available.
Cap the Output Length
Set a maximum output token count appropriate to the feature. If answers should be a paragraph, do not allow space for an essay. This prevents runaway generations immediately and predictably.
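As an illustration, the cap can be attached where the request is built. Everything named here is an assumption: the cap values are placeholders, the payload shape is generic, and `max_tokens` is only a common name for the field, so check your provider's API reference for the real parameter.

```python
# Hypothetical per-feature caps; tune them from your measured output lengths.
OUTPUT_CAPS = {"classification": 16, "short_answer": 200, "summary": 400}


def build_request(prompt, feature, cap_field="max_tokens"):
    """Attach the feature's output cap to a generic request payload.

    Assumption: cap_field is a placeholder for whatever your provider
    calls its maximum-output-tokens parameter.
    """
    if feature not in OUTPUT_CAPS:
        # Fail loudly rather than send an uncapped request.
        raise KeyError(f"no output cap configured for feature {feature!r}")
    return {"prompt": prompt, cap_field: OUTPUT_CAPS[feature]}


request = build_request("Summarize this ticket in one paragraph.", "short_answer")
print(request)
```

Raising an error for a feature with no configured cap is a deliberate choice in this sketch: a missing cap is exactly the condition that allows runaway generations.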
Trim and Rerank Retrieved Context
If you include retrieved documents, rerank them by relevance and keep only the top few. Strip boilerplate, navigation, and repeated headers before the text reaches the prompt. A large share of the context can often go with no measurable loss in answer quality, but confirm that against your sample rather than assuming it. These reductions echo the practices in Token Budget Management and Optimization: Best Practices That Actually Work.
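The rerank-and-keep-top-few step can be sketched like this. The word-overlap score is a toy stand-in for a real reranking model, the token counter is the same rough whitespace approximation, and `relevance` and `select_context` are hypothetical names.

```python
import re


def relevance(query, passage):
    """Toy relevance score (assumption): share of query words found in the passage.

    A production system would use a reranking model or embedding similarity.
    """
    query_words = set(re.findall(r"\w+", query.lower()))
    passage_words = set(re.findall(r"\w+", passage.lower()))
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


def select_context(query, passages, top_k=2, token_budget=50,
                   count_tokens=lambda text: len(text.split())):
    """Keep the top_k most relevant passages that fit inside token_budget."""
    ranked = sorted(passages, key=lambda p: relevance(query, p), reverse=True)
    kept, used = [], 0
    for passage in ranked[:top_k]:
        cost = count_tokens(passage)
        if used + cost > token_budget:
            break  # stop before the context budget is exceeded
        kept.append(passage)
        used += cost
    return kept


passages = [
    "Home | Products | Contact | Login",
    "Refunds take up to 14 days to reach your account.",
    "Our company was founded in 1998.",
    "To request refunds, open the Orders page and choose Refund.",
]

context = select_context("how long do refunds take", passages)
print(context)
```

Here the navigation line and the company history score zero and fall out of the prompt, while the two refund passages survive.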
Step Four: Compress the History
Conversation history is the component that grows without bound, so it needs its own treatment.
Keep Recent Turns Verbatim
Detail matters most in the most recent exchanges. Keep the last few turns exactly as they were said so the model does not lose immediate context.
Summarize Older Turns
Replace older turns with a compact running summary that preserves decisions, established facts, and open questions. Regenerate the summary periodically as the conversation grows. This keeps history small while retaining what the model needs to stay coherent.
Set a Hard History Cap
Define a maximum number of tokens history is allowed to occupy. When the summary plus recent turns would exceed it, summarize more aggressively. A hard cap turns an unbounded cost into a predictable one.
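The three pieces of this step (recent turns verbatim, a summary of older turns, and a hard cap) fit together roughly as below. This is a sketch under stated assumptions: the summarizer is a placeholder that merely truncates, where a real one would call a model, and the token counter is a rough whitespace count. All function names are hypothetical.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer (assumption): whitespace word count."""
    return len(text.split())


def truncating_summary(turns, max_tokens):
    """Placeholder summarizer (assumption): keeps the first max_tokens words.

    A real system would ask a model for a summary that preserves decisions,
    established facts, and open questions.
    """
    words = " ".join(turns).split()
    return " ".join(words[:max_tokens])


def build_history(turns, keep_recent=2, history_cap=40,
                  summarize_fn=truncating_summary):
    """Return a summary of older turns plus recent turns kept verbatim.

    Assumes any single turn fits within history_cap on its own.
    """
    split = max(len(turns) - keep_recent, 0)
    older, recent = list(turns[:split]), list(turns[split:])
    # If the verbatim turns alone exceed the cap, fold the oldest into the summary.
    while len(recent) > 1 and sum(count_tokens(t) for t in recent) > history_cap:
        older.append(recent.pop(0))
    summary_budget = history_cap - sum(count_tokens(t) for t in recent)
    summary = summarize_fn(older, summary_budget) if older and summary_budget > 0 else ""
    return {"summary": summary, "recent": recent}


turns = [
    "User: I need help choosing a laptop for video editing.",
    "Assistant: What is your budget and preferred screen size?",
    "User: Around 1500 dollars and a 15 inch screen.",
    "Assistant: Do you prefer a particular operating system?",
    "User: No preference.",
]

history = build_history(turns, keep_recent=2, history_cap=20)
print(history)
```

The summary only ever receives whatever budget is left after the verbatim turns, so the combined history stays at or under the cap however long the conversation runs.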
Step Five: Enforce and Verify
A reduction that is not enforced will drift back, and a reduction that breaks quality is not a win.
Move Limits Into Configuration
Put every cap (system prompt budget, context budget, history budget, output budget) in one configuration location. This makes the whole budget visible and tunable rather than scattered across the code.
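A single configuration object is one way to hold the four budgets. The class name, field names, and numbers below are illustrative assumptions rather than recommended values.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TokenBudget:
    """All token caps for one feature, in one place (values are examples)."""
    system_prompt: int
    retrieved_context: int
    history: int
    output: int


def over_budget(budget, counts):
    """Return the names of budgeted components whose counts exceed their cap."""
    limits = asdict(budget)
    return sorted(name for name, used in counts.items()
                  if name in limits and used > limits[name])


BUDGET = TokenBudget(system_prompt=300, retrieved_context=1500,
                     history=800, output=400)

# Measured counts for one request; user_message has no cap in this sketch.
counts = {"system_prompt": 280, "retrieved_context": 2100,
          "history": 650, "output": 420, "user_message": 50}

violations = over_budget(BUDGET, counts)
print(violations)
```

Freezing the dataclass keeps the limits from being changed at runtime by accident; a budget change then has to go through the configuration itself.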
Re-run the Baseline Sample
Send the same representative sample through the optimized path and compare token counts and cost per request against your original numbers. Confirm the reduction is real and measure its size.
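The comparison itself is simple arithmetic, sketched here with invented averages for the two runs. `compare_runs` is a hypothetical helper.

```python
def compare_runs(baseline, optimized):
    """Report before, after, and percentage change for each shared metric."""
    report = {}
    for key in baseline:
        before, after = baseline[key], optimized[key]
        change = (after - before) / before * 100 if before else 0.0
        report[key] = {"before": before, "after": after,
                       "change_pct": round(change, 1)}
    return report


# Invented per-request averages over the same representative sample.
baseline = {"input_tokens": 3000, "output_tokens": 400, "cost_usd": 0.015}
optimized = {"input_tokens": 1800, "output_tokens": 300, "cost_usd": 0.0099}

report = compare_runs(baseline, optimized)
for metric, row in report.items():
    print(f"{metric:14} {row['before']:>8} -> {row['after']:>8} ({row['change_pct']:+.1f}%)")
```

With these example numbers, input tokens fall by 40%, output tokens by 25%, and cost per request by 34%.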
Check Quality Did Not Regress
Compare a set of answers before and after. If the optimized version is producing worse answers, you cut something the model needed; restore it and find savings elsewhere. The trade-off between cost and quality is exactly what makes this work engineering rather than arithmetic, a point explored in Case Study: Token Budget Management and Optimization in Practice.
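A rough way to make the before-and-after comparison repeatable is to list the facts each answer must contain and flag cases that passed before the change but fail after it. Substring matching is a deliberately crude grader chosen for this sketch, and the cases and function names are invented; human review or a stronger evaluation may be needed for nuanced answers.

```python
def passes(answer, required_facts):
    """Crude grader (assumption): every required fact appears in the answer."""
    text = answer.lower()
    return all(fact.lower() in text for fact in required_facts)


def find_regressions(cases):
    """Return ids of cases that passed before optimization but fail after."""
    return [case["id"] for case in cases
            if passes(case["before"], case["required"])
            and not passes(case["after"], case["required"])]


# Invented cases: answers from the original and the optimized prompt paths.
cases = [
    {"id": "refund-window", "required": ["14 days"],
     "before": "Refunds are issued within 14 days.",
     "after": "Refunds take 14 days."},
    {"id": "return-label", "required": ["prepaid label"],
     "before": "We email you a prepaid label.",
     "after": "We will help with your return."},
]

regressions = find_regressions(cases)
print(regressions)
```

Each flagged id points at an answer to inspect by hand, along with the component that was most recently trimmed.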
Step Six: Lock In the Gains
Finishing the routine once is satisfying, but the savings will not survive on their own. A short final step turns a one-time cleanup into a durable change.
Write Down the New Budget
Record the limits you settled on (the output cap, the history budget, the retrieval limit) and why, next to the feature in your documentation. The next person to touch this code, possibly you in six months, needs to know the limits are deliberate and what would break if they are raised carelessly.
Add a Guard at Review Time
If your team reviews code, add a note that changes to this prompt should respect the documented budget. A reviewer catching a reintroduced full-document retrieval or a removed output cap is far cheaper than discovering it on a bill weeks later. The cost of the check is a sentence in a review; the cost of missing it is open-ended.
Schedule the Next Pass
Set a recurring reminder to re-run this routine against the feature; monthly is reasonable for active ones. Traffic shifts, prompts accumulate, and a feature that was lean in spring drifts by autumn. A scheduled pass keeps the drift small and catches new waste while it is still cheap to fix. The recurring-review habit is captured as a tool in The Token Budget Management and Optimization Checklist for 2026.
Frequently Asked Questions
Where should I start if everything looks expensive?
Start with measurement, then attack the single largest component first. Optimizing the biggest consumer gives the largest return for the same effort, and the baseline tells you which component that is.
How do I cap output length?
Almost every API exposes a maximum output tokens parameter. Set it to a value appropriate for the answers your feature should produce. This is usually a one-line change and one of the highest-leverage steps in the routine.
Will trimming retrieved context hurt accuracy?
It can if you trim relevant material, which is why you rerank by relevance and keep the top passages rather than cutting blindly. Verify against your sample afterward to confirm quality held.
How often should I repeat this routine?
Re-run it whenever a feature's cost grows faster than its usage, and on a regular schedule; monthly is reasonable for active features. Prompts accumulate cruft and traffic patterns shift, so periodic passes keep budgets honest.
What if quality drops after optimizing?
Restore the component you most recently cut and confirm quality recovers. Then look for savings in a different component. Quality regressions almost always trace to removing context the model genuinely needed.
Key Takeaways
- Capture a per-component token baseline on a representative sample before changing anything.
- Identify the largest consumer and start there; variable components like history and retrieval usually offer the most.
- Cut the cheapest waste first: prune the system prompt, cap output length, and trim and rerank retrieved context.
- Compress history by keeping recent turns verbatim, summarizing older ones, and enforcing a hard token cap.
- Move all limits into configuration, re-run the baseline to confirm savings, and verify quality did not regress.