Almost everyone who builds with language models eventually hits the same wall. A prototype that felt cheap in a notebook turns into a line item that nobody budgeted for once real traffic arrives. The questions that follow are predictable, and they are almost always asked in a panic rather than during design. That is the wrong time to learn the answers.
This article collects the questions we hear most often from teams trying to bring their token spend under control. The goal is not to make you an expert in tokenizer internals. It is to give you accurate, usable answers to the things that actually change a bill: where tokens come from, what you can cut without hurting output quality, and how to reason about the tradeoffs instead of guessing.
Read it top to bottom or jump to the question that is keeping you up. Either way, you should walk away able to make a decision rather than collect more opinions. If you want the full reference instead of the highlights, the Complete Guide to Token Budget Management and Optimization covers the territory end to end.
What Actually Counts as a Token
A token is a chunk of text the model processes as a unit. It is not a word and not a character. Common English words are often a single token, while rare words, code, and punctuation can split into several.
The Practical Rule
For English prose, you can estimate roughly four characters per token, or about three-quarters of a word per token. That estimate is good enough for planning. It falls apart for code, JSON, non-Latin scripts, and heavily formatted text, all of which tend to produce more tokens for the same amount of text.
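As a rough illustration of the planning estimate, a character-based approximation might look like the sketch below. The function name and the constant are illustrative, and the result is only meaningful for English prose.

```python
import math

# Rule of thumb for English prose only; code, JSON, and non-Latin
# scripts usually need more tokens than this suggests.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a planning-grade token estimate, not an exact count."""
    if not text:
        return 0
    return math.ceil(len(text) / CHARS_PER_TOKEN)


prompt = "Summarize the customer's last three support tickets in two sentences."
print(estimate_tokens(prompt))
```

Use an estimate like this for capacity planning and switch to the model's real tokenizer when a hard limit is at stake.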
Why It Matters
Every token you send and receive is metered. That includes the system prompt, the conversation history, the documents you paste in, and the model's reply. People consistently underestimate the input side because the prompt feels free once it is written. It is not. A 2,000-token system prompt sent on every request is a recurring cost, not a one-time one.
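To make the recurring cost concrete, here is a small back-of-the-envelope sketch. The request volume and the price per million input tokens are made-up numbers for illustration, not any provider's actual rates.

```python
def recurring_prompt_cost(
    prompt_tokens: int,
    requests_per_day: int,
    price_per_million_input: float,
    days: int = 30,
) -> float:
    """Cost of resending the same prompt on every request over a period."""
    total_tokens = prompt_tokens * requests_per_day * days
    return total_tokens / 1_000_000 * price_per_million_input


# Hypothetical workload: 2,000-token system prompt, 10,000 requests a day,
# and an assumed input price of 3.00 per million tokens.
monthly = recurring_prompt_cost(2_000, 10_000, 3.00)
print(f"{monthly:.2f}")
```

Under those assumptions the system prompt alone accounts for 600 million input tokens a month, before any user content is added.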
Where Does My Spend Actually Go
Most teams assume the model's output is the expensive part. Usually it is the opposite.
Input Dominates More Often Than You Think
In retrieval-heavy and agentic workloads, input tokens frequently outnumber output tokens by ten to one or more. Long system prompts, retrieved chunks, full conversation transcripts, and tool schemas all pile onto the input side of every single call.
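A quick sketch shows how a ten-to-one volume ratio can outweigh a higher output rate. Both prices here are assumed values chosen for illustration, with output priced at five times input.

```python
def split_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> tuple[float, float]:
    """Return (input_cost, output_cost) for a single request."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost, output_cost


# Hypothetical request: 10,000 input tokens, 1,000 output tokens,
# with assumed prices of 3.00 (input) and 15.00 (output) per million.
input_cost, output_cost = split_cost(10_000, 1_000, 3.00, 15.00)
print(f"input {input_cost:.3f} vs output {output_cost:.3f}")
```

Even with output priced five times higher in this example, the input side costs twice as much because there is ten times more of it.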
The Usual Culprits
- Uncapped conversation history resent in full on every turn
- Oversized retrieval that returns twenty chunks when three would answer the question
- Verbose system prompts that repeat instructions the model already follows
- Few-shot examples left in place long after the model stopped needing them
If you only audit one thing, audit what you are sending, not what you are getting back.
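One way to run that audit is to break a typical request into its input components and rank them. The component names and token counts below are hypothetical; in practice they would come from your own logs.

```python
def input_breakdown(components: dict[str, int]) -> list[tuple[str, int, float]]:
    """Rank input components by token count and attach each one's share."""
    total = sum(components.values())
    rows = [
        (name, tokens, tokens / total if total else 0.0)
        for name, tokens in components.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical token counts for one request.
request = {
    "system_prompt": 2_000,
    "conversation_history": 9_000,
    "retrieved_chunks": 6_000,
    "few_shot_examples": 1_500,
    "user_message": 500,
}

for name, tokens, share in input_breakdown(request):
    print(f"{name:<22} {tokens:>6} {share:6.1%}")
```

The largest rows are where a cut is most likely to matter.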
Should I Use a Cheaper Model or a Smaller Prompt
This is the most common false choice. You can usually do both, and they solve different problems.
Model Choice Sets the Floor
A smaller or cheaper model lowers the per-token rate. That helps every request uniformly but caps the difficulty of tasks you can handle reliably. Routing simple requests to a cheap model and hard ones to a capable model is one of the highest-leverage moves available.
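A routing rule can start out very simple. The sketch below is one possible heuristic, not a recommended policy: the model names, the length threshold, and the `needs_multistep_reasoning` flag are all placeholders you would replace with signals from your own workload.

```python
def choose_model(
    prompt: str,
    needs_multistep_reasoning: bool,
    cheap_model: str = "small-model",      # placeholder name
    capable_model: str = "large-model",    # placeholder name
    max_cheap_prompt_chars: int = 2_000,   # assumed threshold
) -> str:
    """Send short, simple requests to the cheap model and the rest upward."""
    if needs_multistep_reasoning or len(prompt) > max_cheap_prompt_chars:
        return capable_model
    return cheap_model


print(choose_model("Classify this ticket as billing or technical.", False))
print(choose_model("Reconcile these three contracts and list conflicts.", True))
```

Whatever rule you choose, check it against the same evaluation set you use for prompt changes so that misrouted hard requests show up early.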
Prompt Size Sets the Volume
Trimming the prompt lowers how many tokens you pay for at whatever rate you are charged. A bloated prompt on a cheap model can still cost more than a tight prompt on an expensive one. Treat rate and volume as separate dials.
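The two dials multiply, which a few lines of arithmetic make visible. The rates and prompt sizes below are invented for the comparison.

```python
def request_cost(tokens: int, price_per_million: float) -> float:
    """Cost is volume (tokens) times rate (price per token)."""
    return tokens / 1_000_000 * price_per_million


# Assumed figures: a 12,000-token prompt at 0.50 per million tokens
# versus a 1,500-token prompt at 3.00 per million tokens.
bloated_on_cheap = request_cost(12_000, 0.50)
tight_on_expensive = request_cost(1_500, 3.00)

print(f"bloated prompt, cheap model:   {bloated_on_cheap:.4f}")
print(f"tight prompt, expensive model: {tight_on_expensive:.4f}")
```

With these numbers the bloated prompt costs more per request despite a rate six times lower.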
How Do I Cut Tokens Without Hurting Quality
The fear is that every cut degrades output. In practice, a lot of token spend buys nothing.
Start With Dead Weight
Remove instructions the model already obeys, duplicate context, and stale examples. Test the change against a fixed set of real inputs. If quality holds, the tokens were waste.
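The test itself can be a small harness that compares the average score of the current prompt against the trimmed one. In this sketch `run_baseline`, `run_trimmed`, and `score` are hypothetical callables you would supply (your model calls and your quality metric), and the tolerance is an arbitrary starting point.

```python
from statistics import mean
from typing import Callable


def trim_holds_quality(
    inputs: list[str],
    run_baseline: Callable[[str], str],
    run_trimmed: Callable[[str], str],
    score: Callable[[str, str], float],
    tolerance: float = 0.02,
) -> bool:
    """True if the trimmed prompt scores within `tolerance` of the baseline."""
    if not inputs:
        raise ValueError("need at least one evaluation input")
    baseline_score = mean(score(x, run_baseline(x)) for x in inputs)
    trimmed_score = mean(score(x, run_trimmed(x)) for x in inputs)
    return trimmed_score >= baseline_score - tolerance
```

Keeping the input set fixed between runs is what makes the comparison meaningful.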
Summarize Instead of Replaying
For long conversations, replace old turns with a running summary. You keep the thread of the discussion while collapsing thousands of tokens into a few hundred.
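A minimal sketch of that idea keeps the most recent turns verbatim and folds everything older into one summary entry. The `summarize` argument is a stand-in for whatever you use to produce the summary, typically another model call.

```python
from typing import Callable


def compact_history(
    turns: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace all but the last `keep_recent` turns with a single summary."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    summary = "Summary of earlier conversation: " + summarize(older)
    return [summary] + recent


# Stub summarizer for illustration; a real one would call a model.
def fake_summary(old_turns: list[str]) -> str:
    return f"{len(old_turns)} earlier turns condensed"


history = [f"turn {i}" for i in range(1, 11)]
print(compact_history(history, keep_recent=3, summarize=fake_summary))
```

How many recent turns to keep is a quality decision, so it belongs in the same evaluation loop as any other prompt change.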
Tighten Retrieval
Return fewer, better chunks. Re-ranking and tighter chunk sizes often improve answers and cut tokens at the same time, because the model is no longer wading through irrelevant text.
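One way to enforce "fewer, better" is to cap retrieval by both chunk count and token budget after scoring. The sketch assumes chunks arrive with relevance scores from your retriever or re-ranker, and it uses the four-characters-per-token estimate in place of a real tokenizer.

```python
import math


def select_chunks(
    scored_chunks: list[tuple[float, str]],
    max_chunks: int,
    token_budget: int,
) -> list[str]:
    """Keep the highest-scoring chunks that fit the count and token limits."""
    selected: list[str] = []
    used = 0
    for _score, text in sorted(scored_chunks, key=lambda c: c[0], reverse=True):
        if len(selected) >= max_chunks:
            break
        cost = math.ceil(len(text) / 4)  # rough estimate, prose only
        if used + cost > token_budget:
            continue  # skip chunks that would overflow; try smaller ones
        selected.append(text)
        used += cost
    return selected
```

Skipping an oversized chunk rather than stopping lets a smaller, lower-ranked chunk fill the remaining budget; whether that is desirable depends on how much you trust the scores.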
For a structured approach to these moves, the Token Budget Management and Optimization Playbook lays out which lever to pull and when.
Do Caching and Batching Really Help
Yes, and they are underused because they require a small change in how you structure requests.
Prompt Caching
If a large, stable block of context is reused across requests, prompt caching lets you pay full price for that block once and a steeply discounted rate on later requests that reuse it. Put the stable content at the front and the variable content at the end so the cached prefix stays intact. Exact discounts and cache lifetimes vary by provider.
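The ordering rule is easy to express in code. This sketch only shows prompt assembly; how a cache is actually enabled (markers, minimum sizes, expiry) differs between providers and is not modeled here. The constant names and contents are placeholders.

```python
# Placeholder content; in practice these are long and rarely change.
SYSTEM_INSTRUCTIONS = "You are a support assistant for an online store."
REFERENCE_DOCS = "Return policy: items can be returned within 30 days."

# Build the stable part once so every request shares a byte-identical prefix.
STABLE_PREFIX = "\n\n".join([SYSTEM_INSTRUCTIONS, REFERENCE_DOCS])


def build_prompt(user_question: str) -> str:
    """Stable content first, per-request content last."""
    return STABLE_PREFIX + "\n\n" + "Question: " + user_question


first = build_prompt("Can I return shoes I wore once?")
second = build_prompt("How long do refunds take?")

# Both prompts share the full stable prefix, so a prefix cache can reuse it.
print(first.startswith(STABLE_PREFIX) and second.startswith(STABLE_PREFIX))
```

Anything that changes per request, such as a timestamp or user name, should stay out of the prefix, because a single differing character early in the prompt can invalidate the cached portion after it.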
Batching
For non-interactive work like overnight summarization or evaluation runs, batch APIs trade latency for a meaningful discount. If a job does not need an instant answer, batching it is free money.
How Should I Measure and Set a Budget
You cannot manage what you do not measure, and most teams do not measure tokens per request until something breaks.
Track Cost Per Outcome
Raw token counts are noisy. Tie spend to a unit that matters to the business: cost per resolved ticket, per generated draft, per qualified lead. That framing makes tradeoffs legible to people who do not care about tokens.
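The calculation is simple division, but writing it down forces you to pick the outcome. The spend and ticket counts here are invented.

```python
from typing import Optional


def cost_per_outcome(total_spend: float, outcomes: int) -> Optional[float]:
    """Spend divided by business outcomes; None when there were no outcomes."""
    if outcomes <= 0:
        return None
    return total_spend / outcomes


# Hypothetical month: 420.00 in model spend, 1,200 resolved tickets.
print(cost_per_outcome(420.00, 1_200))
```

Returning `None` for zero outcomes keeps a quiet week from showing up as a misleading zero or a division error on a dashboard.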
Set Caps Before You Need Them
Define a maximum context size and a maximum output length per use case. Caps prevent the slow creep that turns a healthy prototype into an expensive surprise. The repeatable workflow shows how to bake these limits into a process rather than relying on memory.
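Caps are easier to keep when they live in code rather than in a document. The sketch below is one possible shape; the use-case names and limits are illustrative, and the returned value would be passed as the output-length limit on whatever client you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenCaps:
    max_context_tokens: int
    max_output_tokens: int


# Illustrative limits per use case.
CAPS = {
    "support_reply": TokenCaps(max_context_tokens=6_000, max_output_tokens=400),
    "doc_summary": TokenCaps(max_context_tokens=12_000, max_output_tokens=250),
}


def enforce_caps(use_case: str, context_tokens: int) -> int:
    """Reject oversized contexts and return the output cap to request."""
    caps = CAPS[use_case]
    if context_tokens > caps.max_context_tokens:
        raise ValueError(
            f"{use_case}: context of {context_tokens} tokens "
            f"exceeds cap of {caps.max_context_tokens}"
        )
    return caps.max_output_tokens


print(enforce_caps("support_reply", 5_000))
```

Failing loudly on an oversized context is a deliberate choice here; trimming silently is also reasonable, as long as the trim is logged.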
What About Streaming, Latency, and the User Experience
Cost is not the only reason to care about tokens. Token count drives latency too, and the two concerns often pull in the same direction.
Fewer Tokens Usually Means Faster Replies
The model has to process every input token before it starts generating, and generate every output token one at a time. A leaner prompt and a tighter expected output both shorten the wait. Optimizing tokens frequently improves perceived speed as a side effect, which is a rare case where the cheap choice is also the better experience.
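A crude two-term model captures why both sides of the request affect the wait. The throughput figures are assumptions for illustration only; real numbers depend on the model, the hardware, and current load.

```python
def estimate_latency_seconds(
    input_tokens: int,
    output_tokens: int,
    prefill_tokens_per_second: float = 4_000.0,  # assumed input processing rate
    decode_tokens_per_second: float = 50.0,      # assumed generation rate
) -> float:
    """Time to read the prompt plus time to generate the reply."""
    prefill = input_tokens / prefill_tokens_per_second
    decode = output_tokens / decode_tokens_per_second
    return prefill + decode


before = estimate_latency_seconds(8_000, 500)
after = estimate_latency_seconds(2_000, 200)
print(before, after)
```

Under these assumed rates, generation dominates the total, which is why an output cap often moves latency more than a prompt trim of similar size.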
When To Cap Output Length
If your use case produces long answers that users skim anyway, capping the output length saves money and reduces latency without hurting much. Ask whether the extra length is being read or ignored. A summarizer that returns three paragraphs when one would do pays twice, in cost and in latency, for output nobody finishes.
The Tradeoff To Watch
Cutting too aggressively can truncate genuinely useful detail. The right move is to test caps against real inputs and what users actually expect, not to pick a number that feels frugal. Let the evaluation set, not your wallet, decide where the line sits.
How Do I Decide What To Optimize First
Teams often spread effort evenly across everything, which means spending as much time on the lowest-value targets as on the highest.
Follow the Money
Sort your use cases by total spend and start at the top. The use case that accounts for half your bill deserves more attention than the dozen that share the other half. A small percentage cut on your largest line item beats a large cut on a trivial one.
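The comparison can be sketched in a few lines. The use cases, monthly spend, and achievable cut percentages are all hypothetical.

```python
def ranked_savings(
    spend_by_use_case: dict[str, float],
    cut_fraction_by_use_case: dict[str, float],
) -> list[tuple[str, float, float]]:
    """Rank use cases by spend and attach the savings from each planned cut."""
    ranked = sorted(spend_by_use_case.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, spend, spend * cut_fraction_by_use_case.get(name, 0.0))
        for name, spend in ranked
    ]


# Hypothetical monthly spend and the cut you think each use case can take.
spend = {"support_assistant": 5_000.0, "release_notes": 200.0, "internal_search": 800.0}
cuts = {"support_assistant": 0.10, "release_notes": 0.50}

for name, total, saved in ranked_savings(spend, cuts):
    print(f"{name:<18} spend {total:>8.2f}  saves {saved:>7.2f}")
```

In this example a 10 percent cut on the largest use case saves five times what a 50 percent cut saves on the smallest.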
Prefer Reversible Changes
Begin with changes you can undo quickly, like trimming a prompt or capping retrieval. Save structural changes, like re-architecting how conversation state is assembled, for when the easy wins are exhausted. This keeps risk low while you learn how your system responds. The "best practices that actually work" collection ranks these moves by leverage so you know where to start.
Frequently Asked Questions
Is it worth optimizing tokens at low volume?
At a few hundred requests a day, probably not for cost reasons alone. But the habits you build at low volume pay off when traffic grows, and tighter prompts often run faster and more reliably. Optimize structure early, optimize aggressively later.
Will trimming my system prompt make the model dumber?
Only if you remove instructions the model actually needs. Much of what sits in long system prompts is redundant or aspirational. Cut, test against real inputs, and keep what demonstrably moves quality.
Does counting tokens require a special library?
You can count exactly with the tokenizer that matches your model. Many providers publish one or offer a token-counting endpoint. For planning, the four-characters-per-token estimate is close enough for English prose. Use exact counts when you are enforcing hard caps.
How do input and output prices compare?
Output tokens usually cost more per token than input tokens, but input volume is often far larger, so input frequently dominates total spend. Check both rates and both volumes before deciding where to focus.
Can I just raise my spending limit and move on?
You can, until the next jump in traffic forces the question again. Raising the limit treats the symptom. Measuring cost per outcome and capping context treats the cause, and it scales.
Key Takeaways
- A token is a sub-word unit; estimate roughly four characters per token for English prose and count exactly when enforcing caps.
- Input tokens usually dominate spend, so audit what you send before what you receive.
- Model choice sets the rate and prompt size sets the volume; tune both independently.
- The cheapest cuts are dead weight: redundant instructions, stale examples, and oversized retrieval.
- Prompt caching and batching offer real discounts for stable context and non-interactive jobs.
- Measure cost per business outcome and set context and output caps before traffic forces the issue.