Cut Your Token Costs This Afternoon: An Ordered Routine

Agency Script Editorial

Editorial Team

September 8, 2022 · 8 min read
Tags: token budget management and optimization, token budget management and optimization how to, token budget management and optimization guide, prompt engineering

Reading about token management is one thing. Sitting down with a live feature that costs too much and fixing it is another. This guide is the second thing. It is an ordered routine you can run today against a real prompt, moving step by step from measurement to enforced limits, without relying on intuition about where the tokens are going.

The routine works on any LLM feature: a chatbot, a retrieval-augmented question answerer, a summarizer, a classification pipeline. The specifics of each step change with the feature, but the sequence does not. Measure first, find the largest consumers, cut the cheapest waste, enforce the new limits, and verify you did not break quality. Skip a step and you risk optimizing the wrong thing or shipping a regression.

Have your codebase open and a representative sample of real prompts on hand. The whole routine takes an afternoon for a single feature, and most of that time is measurement and verification rather than code changes.

Step One: Capture a Baseline

You cannot tell whether you improved anything if you never measured where you started.

Instrument the Prompt Assembly

Find the place in your code where the final prompt is built. Right before the request is sent, count the tokens in each component separately (system prompt, retrieved context, conversation history, and user message) using your provider's tokenizer. Log those numbers.
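A minimal sketch of per-component counting, assuming a hypothetical `count_tokens` helper. The whitespace split below is only a stand-in so the example runs on its own; in real code you would swap in your provider's tokenizer, since counts differ between models. The component names and sample texts are made up for illustration.

```python
import json
import logging

logger = logging.getLogger("token_budget")


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): replace with your provider's tokenizer."""
    return len(text.split())


def profile_prompt(components: dict[str, str]) -> dict[str, int]:
    """Count tokens per named prompt component and log the breakdown."""
    counts = {name: count_tokens(text) for name, text in components.items()}
    counts["total_input"] = sum(counts.values())
    logger.info("prompt_profile %s", json.dumps(counts))
    return counts


# Call this right before the request is sent, with the exact strings
# that will be concatenated into the final prompt.
profile = profile_prompt({
    "system_prompt": "You are a concise support assistant.",
    "retrieved_context": "Refunds are processed within five business days.",
    "history": "User: hi Assistant: hello",
    "user_message": "How long do refunds take?",
})
```

Logging one structured line per request keeps the later aggregation simple: each line already carries the full breakdown.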

Run a Representative Sample

Send a few dozen real or realistic requests through the instrumented path. Record the per-component token counts and the output token count for each. You now have a profile of where tokens actually go, which is almost always different from where you assumed they went.

Note the Cost Per Request

Multiply input and output tokens by their respective prices and write down the average cost per request. This is the number every later step will be judged against. The broader rationale for starting here is laid out in Spending Tokens Like Money: A Working Manual for LLM Budgets.
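The arithmetic can be sketched as follows. The prices and token counts are placeholders chosen for illustration, not any provider's actual rates; substitute the figures from your own price sheet and sample run.

```python
from statistics import mean

# Hypothetical prices in dollars per million tokens (assumption).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request in dollars."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


# (input_tokens, output_tokens) pairs logged from the sample run (made-up numbers).
sample = [(4_000, 300), (6_000, 500), (5_000, 400)]

# Average cost per request: the baseline every later step is judged against.
baseline_cost = mean(request_cost(i, o) for i, o in sample)
```

With these placeholder numbers the baseline works out to about $0.021 per request.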

Step Two: Identify the Largest Consumers

With a baseline in hand, the optimization targets pick themselves.

Sort Components by Size

Rank your four components by average token count. In most retrieval systems the retrieved context dominates. In long chat sessions the history dominates. In simple features a bloated system prompt is often the culprit. Whatever leads the list is where you start.

Separate Fixed From Variable

Some components are fixed per request, like the system prompt. Others grow with use, like history or retrieved documents. Variable components are usually the better target because their growth is what makes costs unpredictable over time.
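One way to rank components and flag the variable ones is sketched below. The sample counts are invented, and the 25% spread threshold is an arbitrary illustration rather than a recommended value.

```python
from statistics import mean, pstdev

# Per-request token counts for each component from the sample (made-up numbers).
samples = {
    "system_prompt": [800, 800, 800, 800],
    "retrieved_context": [3000, 5000, 2000, 6000],
    "history": [200, 900, 1600, 2500],
    "user_message": [40, 60, 50, 50],
}

# Rank components by average size, largest first.
ranked = sorted(samples, key=lambda name: mean(samples[name]), reverse=True)

# Treat a component as variable when its standard deviation is a large
# share of its mean. The 0.25 threshold is an assumption for illustration.
variable = [
    name for name in ranked
    if pstdev(samples[name]) / mean(samples[name]) > 0.25
]
```

Here `ranked` starts with `retrieved_context`, and `variable` contains only `retrieved_context` and `history`, the two components whose size changes meaningfully from request to request.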

Check the Output Side

Look at the distribution of output lengths, not just the average. If a minority of responses run very long, capping output length may save more than any input change. Output tokens are usually priced higher than input tokens, so a small cut there pays well.
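A small sketch of why the average alone can mislead, using invented output counts and a hypothetical cap of 300 tokens. It estimates the share of output tokens a cap would remove from this sample.

```python
from statistics import mean, median

# Output token counts from the sample run (made-up numbers).
output_tokens = [120, 150, 130, 140, 160, 135, 145, 125, 155, 1400]

avg = mean(output_tokens)        # pulled upward by the single long response
typical = median(output_tokens)  # closer to what most responses look like

# Estimate the effect of a cap by clipping each response to it.
cap = 300  # hypothetical cap, chosen for illustration
capped = [min(tokens, cap) for tokens in output_tokens]
saved_share = 1 - sum(capped) / sum(output_tokens)
```

In this sample the mean is 266 tokens while the median is 142.5, and a single long response accounts for most of the gap. Clipping at 300 would remove roughly 41% of all output tokens.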

Step Three: Cut the Cheapest Waste First

Not all reductions cost the same effort. Start with the ones that are nearly free.

Prune the System Prompt

Read your system prompt line by line and delete anything that no longer changes behavior. Old instructions added for problems that no longer exist are pure waste paid on every request. This is the single fastest win available.

Cap the Output Length

Set a maximum output token count appropriate to the feature. If answers should be a paragraph, do not allow space for an essay. This prevents runaway generations immediately and predictably.

Trim and Rerank Retrieved Context

If you include retrieved documents, rerank them by relevance and keep only the top few. Strip boilerplate, navigation, and repeated headers before the text reaches the prompt. You can often cut a large share of the context, sometimes around half, with no loss in answer quality; confirm that against your sample. These reductions echo the practices in Token Budget Management and Optimization: Best Practices That Actually Work.
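A sketch of trimming and selection under a token budget. It assumes relevance scores already exist (from whatever reranker you use), a hypothetical list of boilerplate lines, and the same stand-in tokenizer as a placeholder for your provider's.

```python
import re


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): replace with your provider's tokenizer."""
    return len(text.split())


# Hypothetical whole-line patterns that carry no answer content.
BOILERPLATE = re.compile(r"home|sign in|privacy policy|share|menu", re.IGNORECASE)


def strip_boilerplate(text: str) -> str:
    """Drop empty, boilerplate, and repeated lines, keeping the original order."""
    seen, kept = set(), []
    for line in text.splitlines():
        line = line.strip()
        if not line or BOILERPLATE.fullmatch(line) or line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


def select_context(passages: list[tuple[float, str]], top_k: int, budget: int) -> list[str]:
    """Keep the highest-scoring passages that fit within the token budget."""
    selected, used = [], 0
    best = sorted(passages, key=lambda p: p[0], reverse=True)[:top_k]
    for _score, text in best:
        cleaned = strip_boilerplate(text)
        cost = count_tokens(cleaned)
        if used + cost > budget:
            continue  # skip passages that would exceed the budget
        selected.append(cleaned)
        used += cost
    return selected


# (relevance score, passage text) pairs; scores and texts are made up.
passages = [
    (0.42, "Home\nOur company was founded in 2010."),
    (0.91, "Sign in\nRefunds take five business days.\nRefunds take five business days."),
    (0.77, "Refund requests need an order number."),
]
context = select_context(passages, top_k=2, budget=50)
```

The lowest-scoring passage is dropped by `top_k`, and the navigation line and the duplicated sentence are removed before tokens are counted.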

Step Four: Compress the History

Conversation history is the component that grows without bound, so it needs its own treatment.

Keep Recent Turns Verbatim

Detail matters most in the most recent exchanges. Keep the last few turns exactly as they were said so the model does not lose immediate context.

Summarize Older Turns

Replace older turns with a compact running summary that preserves decisions, established facts, and open questions. Regenerate the summary periodically as the conversation grows. This keeps history small while retaining what the model needs to stay coherent.

Set a Hard History Cap

Define a maximum number of tokens history is allowed to occupy. When the summary plus recent turns would exceed it, summarize more aggressively. A hard cap turns an unbounded cost into a predictable one.
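The three rules in this step can be sketched together. The summarizer here is a stand-in that simply truncates; a real one would presumably call a model and preserve decisions, facts, and open questions. The function names, defaults, and tokenizer are all assumptions.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): replace with your provider's tokenizer."""
    return len(text.split())


def truncating_summarizer(turns: list[str], max_tokens: int) -> str:
    """Stand-in summarizer (assumption): keeps the first max_tokens words only."""
    words = " ".join(turns).split()
    return " ".join(words[:max_tokens])


def compress_history(
    turns: list[str],
    summarize: Callable[[list[str], int], str],
    keep_recent: int = 4,
    cap: int = 600,
) -> list[str]:
    """Keep recent turns verbatim and summarize older ones under a hard token cap."""
    split = max(len(turns) - keep_recent, 0)
    older, recent = list(turns[:split]), list(turns[split:])
    recent_tokens = sum(count_tokens(t) for t in recent)

    # If the verbatim turns alone exceed the cap, move the oldest into the summary pool.
    while recent and recent_tokens > cap:
        moved = recent.pop(0)
        older.append(moved)
        recent_tokens -= count_tokens(moved)

    # Whatever budget is left after the verbatim turns goes to the summary.
    summary_budget = cap - recent_tokens
    if older and summary_budget > 0:
        return [summarize(older, summary_budget)] + recent
    return recent


# Ten turns of 52 tokens each (made-up content).
turns = [f"turn {i} " + "word " * 50 for i in range(10)]
history = compress_history(turns, truncating_summarizer, keep_recent=3, cap=200)
total = sum(count_tokens(t) for t in history)
```

Passing the remaining budget to the summarizer is what makes the cap hard: the summary shrinks as the verbatim turns grow, so the total never exceeds `cap`.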

Step Five: Enforce and Verify

A reduction that is not enforced will drift back, and a reduction that breaks quality is not a win.

Move Limits Into Configuration

Put every cap (system prompt budget, context budget, history budget, output budget) in one configuration location. This makes the whole budget visible and tunable rather than scattered across the code.
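One possible shape for that single location is a small frozen dataclass. The class name, field names, and default values below are illustrative assumptions, not recommended limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    """All caps for one feature, in tokens. Values are illustrative only."""
    system_prompt: int = 500
    retrieved_context: int = 2000
    history: int = 1000
    max_output: int = 400

    @property
    def max_input(self) -> int:
        # The user message is not capped here; add a field if your feature needs one.
        return self.system_prompt + self.retrieved_context + self.history

    def violations(self, counts: dict[str, int]) -> list[str]:
        """Names of components whose measured token count exceeds its cap."""
        limits = {
            "system_prompt": self.system_prompt,
            "retrieved_context": self.retrieved_context,
            "history": self.history,
        }
        return [name for name, limit in limits.items() if counts.get(name, 0) > limit]


SUPPORT_BOT_BUDGET = TokenBudget()
over = SUPPORT_BOT_BUDGET.violations(
    {"system_prompt": 450, "retrieved_context": 2600, "history": 900}
)
```

Making the dataclass frozen means a budget cannot be changed quietly at runtime; raising a limit requires an explicit edit in one visible place.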

Re-run the Baseline Sample

Send the same representative sample through the optimized path and compare token counts and cost per request against your original numbers. Confirm the reduction is real and measure its size.
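The comparison itself is simple arithmetic, sketched here with invented before-and-after averages.

```python
def percent_reduction(before: float, after: float) -> float:
    """Reduction from before to after, as a percentage of before."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (before - after) / before * 100


# Averages over the same sample, before and after optimization (made-up numbers).
baseline = {"input_tokens": 5000, "output_tokens": 400, "cost_per_request": 0.021}
optimized = {"input_tokens": 3000, "output_tokens": 300, "cost_per_request": 0.0135}

report = {
    key: round(percent_reduction(baseline[key], optimized[key]), 1)
    for key in baseline
}
```

With these numbers input tokens fall by 40%, output tokens by 25%, and cost per request by about 35.7%.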

Check Quality Did Not Regress

Compare a set of answers before and after. If the optimized version is producing worse answers, you cut something the model needed. Restore it and find savings elsewhere. The trade-off between cost and quality is exactly what makes this work engineering rather than arithmetic, a point explored in Case Study: Token Budget Management and Optimization in Practice.

Step Six: Lock In the Gains

Finishing the routine once is satisfying, but the savings will not survive on their own. A short final step turns a one-time cleanup into a durable change.

Write Down the New Budget

Record the limits you settled on and why (the output cap, the history budget, the retrieval limit) next to the feature in your documentation. The next person to touch this code, possibly you in six months, needs to know the limits are deliberate and what would break if they are raised carelessly.

Add a Guard at Review Time

If your team reviews code, add a note that changes to this prompt should respect the documented budget. A reviewer catching a reintroduced full-document retrieval or a removed output cap is far cheaper than discovering it on a bill weeks later. The cost of the check is a sentence in a review; the cost of missing it is open-ended.
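A review note can also be backed by an automated check that runs with your tests. This is a sketch under assumptions: the budget values are illustrative, `max_output_tokens` is a hypothetical parameter name, and the tokenizer is a stand-in.

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): replace with your provider's tokenizer."""
    return len(text.split())


# Documented budget for this feature (illustrative values).
SYSTEM_PROMPT_BUDGET = 500
MAX_OUTPUT_TOKENS = 400


def budget_problems(system_prompt: str, request_params: dict) -> list[str]:
    """Return readable budget violations; an empty list means the change is fine."""
    problems = []
    used = count_tokens(system_prompt)
    if used > SYSTEM_PROMPT_BUDGET:
        problems.append(f"system prompt uses {used} tokens, budget is {SYSTEM_PROMPT_BUDGET}")
    # "max_output_tokens" is a hypothetical key; use your provider's parameter name.
    cap = request_params.get("max_output_tokens")
    if cap is None:
        problems.append("output cap is missing")
    elif cap > MAX_OUTPUT_TOKENS:
        problems.append(f"output cap {cap} exceeds budget {MAX_OUTPUT_TOKENS}")
    return problems


ok = budget_problems("You are a concise support assistant.", {"max_output_tokens": 300})
regressed = budget_problems("word " * 600, {})
```

A test that fails when `budget_problems` returns anything catches a removed output cap or a bloated system prompt before it reaches the bill.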

Schedule the Next Pass

Set a recurring reminder to re-run this routine against the feature; monthly is reasonable for active features. Traffic shifts, prompts accumulate, and a feature that was lean in spring drifts by autumn. A scheduled pass keeps the drift small and catches new waste while it is still cheap to fix. The recurring-review habit is captured as a tool in The Token Budget Management and Optimization Checklist for 2026.

Frequently Asked Questions

Where should I start if everything looks expensive?

Start with measurement, then attack the single largest component first. Optimizing the biggest consumer gives the largest return for the same effort, and the baseline tells you which component that is.

How do I cap output length?

Most LLM APIs expose a parameter that sets the maximum number of output tokens; the name varies by provider. Set it to a value appropriate for the answers your feature should produce. This is usually a one-line change and one of the highest-leverage steps in the routine.

Will trimming retrieved context hurt accuracy?

It can if you trim relevant material, which is why you rerank by relevance and keep the top passages rather than cutting blindly. Verify against your sample afterward to confirm quality held.

How often should I repeat this routine?

Re-run it whenever a feature's cost grows faster than its usage, and on a regular schedule; monthly is reasonable for active features. Prompts accumulate cruft and traffic patterns shift, so periodic passes keep budgets honest.

What if quality drops after optimizing?

Restore the component you most recently cut and confirm quality recovers. Then look for savings in a different component. Quality regressions almost always trace to removing context the model genuinely needed.

Key Takeaways

  • Capture a per-component token baseline on a representative sample before changing anything.
  • Identify the largest consumer and start there; variable components like history and retrieval usually offer the most.
  • Cut the cheapest waste first: prune the system prompt, cap output length, and trim and rerank retrieved context.
  • Compress history by keeping recent turns verbatim, summarizing older ones, and enforcing a hard token cap.
  • Move all limits into configuration, re-run the baseline to confirm savings, and verify quality did not regress.
