Most "best practices" lists for AI cost are a wall of platitudes: monitor your usage, choose the right model, optimize your prompts. True, useless, and impossible to act on. This article is the opinionated version. Every practice here comes with the reasoning behind it, a sense of how much it's worth, and an honest note on when it doesn't apply.
These are ordered by real-world payoff, not by how clever they sound. If you only do the first three, you'll capture most of the available savings. The later ones are for teams squeezing the last 20 percent out of a mature workload.
A note on philosophy: cost optimization is not about being cheap. It's about not paying for capability you aren't using. A small model that solves your problem is not a compromise; it's the correct tool, and paying ten times more for an indistinguishable result is the actual mistake.
Practice 1: Default to the Smallest Model That Works
The biggest savings in the entire field come from one habit: starting with a small model and only upgrading when quality demonstrably fails. The price spread within a model family is enormous, often 10x to 30x, and most production tasks (classification, extraction, routing, simple summarization) run perfectly well on the cheap end.
The reasoning is that capability is not free, and you pay for the model's full power on every request whether or not the task needs it. Treat model selection as a per-task decision, not a global one.
When to ignore it: Genuinely hard reasoning, long-horizon agents, or anything where a wrong answer is expensive enough to justify the premium. For those, the flagship earns its price.
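As a rough illustration of "start small, upgrade only on demonstrated failure", the sketch below picks a tier per task from measured pass rates. The tier names, the `pick_tier` helper, and the 0.95 threshold are assumptions made for this example, not part of any provider's API; the pass rates would come from your own evaluation set.

```python
# Hypothetical tier names, ordered cheapest first; substitute your provider's models.
TIERS = ["small", "mid", "flagship"]


def pick_tier(pass_rates: dict[str, float], threshold: float = 0.95) -> str:
    """Return the cheapest tier whose measured pass rate meets the threshold.

    pass_rates maps tier name -> fraction of evaluation cases that passed
    for one specific task. Tiers without a measurement are treated as failing.
    """
    for tier in TIERS:
        if pass_rates.get(tier, 0.0) >= threshold:
            return tier
    # Nothing met the bar: fall back to the most capable tier.
    return TIERS[-1]


# Selection is per task, so each task gets its own measurements.
print(pick_tier({"small": 0.97, "mid": 0.99, "flagship": 0.99}))  # small
print(pick_tier({"small": 0.80, "mid": 0.96, "flagship": 0.99}))  # mid
```

Running this once per task, rather than once for the whole application, is what keeps an easy classification job from inheriting the flagship chosen for a hard reasoning job.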
Practice 2: Cache Everything Stable
If any part of your prompt repeats across requests (and for chatbots and agents, a large part almost always does), cache it. Prompt caching bills the repeated prefix at 75 to 90 percent off, and for workloads with a big stable system prompt, it routinely cuts total spend by 40 to 70 percent.
The reasoning is simple: you're sending the same instructions and knowledge thousands of times, and there's no reason to pay full price for identical text. Structure your prompts so the stable content comes first and the variable content comes last, which maximizes what can be cached.
When to ignore it: Workloads where every prompt is genuinely unique with no shared prefix. There, caching adds complexity for no gain.
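To get a feel for the size of the effect, here is a back-of-the-envelope sketch of input cost with and without a cached prefix. The function names, token counts, and price are hypothetical, and the model is deliberately simplified: the first request pays full price for the prefix and every later request is a cache hit. Real providers may add cache-write surcharges or expiry windows that this ignores.

```python
def uncached_input_cost(prefix_tokens, variable_tokens, requests, price_per_mtok):
    """Input cost when every request pays full price for the whole prompt."""
    return (prefix_tokens + variable_tokens) * requests * price_per_mtok / 1_000_000


def cached_input_cost(prefix_tokens, variable_tokens, requests, price_per_mtok,
                      cached_rate=0.10):
    """Input cost when the stable prefix is cached after the first request.

    cached_rate is the fraction of full price billed on a cache hit
    (0.10 corresponds to 90 percent off). Assumption: one full-price
    request warms the cache, and all later requests hit it.
    """
    prefix_billed = prefix_tokens * (1 + (requests - 1) * cached_rate)
    variable_billed = variable_tokens * requests
    return (prefix_billed + variable_billed) * price_per_mtok / 1_000_000


# Hypothetical workload: 4,000-token stable prefix, 1,000 variable tokens,
# 1,000 requests, at an illustrative 3.00 per million input tokens.
before = uncached_input_cost(4000, 1000, 1000, 3.0)
after = cached_input_cost(4000, 1000, 1000, 3.0)
print(f"input cost saving: {1 - after / before:.0%}")  # 72%
```

This covers input tokens only. Output is not cached, so the saving on the total bill will be smaller than the input-side figure.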
Practice 3: Treat Context as a Budget, Not a Bucket
Every token in your context window is billed on every request. The instinct to "add more context to improve quality" is a slow, compounding cost leak. Discipline here means retrieving the fewest, most relevant chunks and summarizing rather than appending conversation history.
The reasoning is that quality improvements from context are sharply diminishing: the first few relevant chunks do most of the work, and the rest is expensive padding. Measure quality as you trim and you'll usually find a sweet spot far smaller than your default. Our Common Mistakes guide shows how badly this leak compounds when ignored.
Concrete context tactics
- Retrieve fewer chunks and rank them by relevance, not recency.
- Summarize old conversation turns instead of carrying them verbatim.
- Drop boilerplate from retrieved documents before sending.
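The first tactic can be sketched as a small selection step that ranks retrieved chunks by relevance and stops at a fixed token budget. The `Chunk` type, the `select_chunks` helper, and the scores below are hypothetical; the relevance values would come from whatever retriever or reranker you already use.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    relevance: float  # higher is more relevant; produced by your retriever
    tokens: int       # token count of the chunk


def select_chunks(chunks, token_budget, max_chunks=3):
    """Pick the most relevant chunks that fit within the token budget."""
    ranked = sorted(chunks, key=lambda c: c.relevance, reverse=True)
    chosen, used = [], 0
    for chunk in ranked:
        if len(chosen) == max_chunks:
            break
        # Skip any chunk that would push the context over budget.
        if used + chunk.tokens <= token_budget:
            chosen.append(chunk)
            used += chunk.tokens
    return chosen


chunks = [
    Chunk("a", 0.9, 300),
    Chunk("b", 0.2, 100),
    Chunk("c", 0.8, 500),
    Chunk("d", 0.7, 200),
]
picked = select_chunks(chunks, token_budget=600, max_chunks=2)
print([c.text for c in picked])  # ['a', 'd']
```

Lowering `token_budget` and `max_chunks` step by step while you re-measure quality is one way to find the smallest context that still works.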
Practice 4: Constrain Output Aggressively
Output tokens typically cost three to five times as much as input tokens, which makes them the most expensive tokens on your bill. Set explicit maximum token limits, instruct the model to be concise, and prefer structured outputs over prose wherever a machine consumes the result.
The reasoning is that unconstrained models are verbose by nature: they restate the question, add preambles, and explain themselves. None of that is free, and most of it is unnecessary. A tight output specification removes the waste without harming the useful content.
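The arithmetic behind this is worth seeing once. The sketch below assumes output is billed at four times the input rate (inside the three-to-five range) and uses hypothetical token counts and an illustrative price; `request_cost` is a helper written for this example.

```python
def request_cost(input_tokens, output_tokens, input_price_per_mtok,
                 output_multiplier=4.0):
    """Cost of one request when output is billed at a multiple of the input rate."""
    output_price_per_mtok = input_price_per_mtok * output_multiplier
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Same 1,000-token prompt, illustrative price of 3.00 per million input tokens.
verbose = request_cost(1000, 800, 3.0)        # unconstrained, chatty answer
concise = request_cost(1000, 200, 3.0)        # capped and told to be brief
trimmed_input = request_cost(400, 800, 3.0)   # cut 600 input tokens instead

print(f"saving from trimming 600 output tokens: {1 - concise / verbose:.0%}")  # 57%
print(f"saving from trimming 600 input tokens: {1 - trimmed_input / verbose:.0%}")  # 14%
```

Under these assumptions, removing 600 output tokens saves about four times as much as removing 600 input tokens.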
Practice 5: Move Tolerant Work to Batch
Any workload that doesn't need an instant response (nightly enrichment, bulk classification, offline document processing) belongs on a batch API at roughly half the standard rate. The output is identical; only latency changes.
The reasoning is that you're often paying a real-time premium for work that no human is waiting on. Audit your pipelines for anything that runs on a schedule or in the background and shift it. This is pure savings with no quality trade-off. The Framework article helps you decide which workloads qualify.
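A pipeline audit of this kind can be as simple as a list of workloads with a flag for whether anyone is waiting on the answer. The sketch below is illustrative: the `Workload` type, the workload names, and the monthly costs are made up, and the 50 percent discount is the "roughly half" figure from above rather than any specific provider's price.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    monthly_cost: float
    needs_instant_response: bool


def batch_savings(workloads, batch_discount=0.5):
    """Return the names of batchable workloads and the estimated monthly saving."""
    movable = [w for w in workloads if not w.needs_instant_response]
    saving = sum(w.monthly_cost for w in movable) * batch_discount
    return [w.name for w in movable], saving


workloads = [
    Workload("support-chat", 1000.0, needs_instant_response=True),
    Workload("nightly-enrichment", 400.0, needs_instant_response=False),
    Workload("bulk-classification", 200.0, needs_instant_response=False),
]
print(batch_savings(workloads))
# (['nightly-enrichment', 'bulk-classification'], 300.0)
```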
Practice 6: Instrument Spend Before You Need To
Log token counts, model used, and a feature tag on every request from day one. The reasoning is that you cannot optimize what you cannot see, and the moment a bill spikes is the worst time to start building observability.
Set budget alerts at 70 percent of your budget so you have time to react before you overspend, not after. Per-feature attribution turns a scary total into a clear map of where money goes and which optimization will pay off most.
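A minimal sketch of what that logging and attribution might look like follows. The `UsageRecord` fields mirror the items listed above (token counts, model, feature tag); the record type, helper names, and sample values are assumptions for illustration, and in practice the records would be written to whatever logging or metrics store you already run.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageRecord:
    feature: str        # feature tag attached to the request
    model: str          # model that served the request
    input_tokens: int
    output_tokens: int
    cost: float         # computed from your provider's current prices


def spend_by_feature(records):
    """Total cost per feature, largest spender first."""
    totals = defaultdict(float)
    for record in records:
        totals[record.feature] += record.cost
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


def over_alert_threshold(total_spend, budget, threshold=0.70):
    """True once spend reaches the alert threshold (70 percent of budget by default)."""
    return total_spend >= budget * threshold


records = [
    UsageRecord("search", "small", 1200, 150, 1.5),
    UsageRecord("chat", "mid", 3000, 400, 0.5),
    UsageRecord("search", "small", 900, 100, 2.0),
]
print(spend_by_feature(records))          # {'search': 3.5, 'chat': 0.5}
print(over_alert_threshold(750.0, 1000.0))  # True
```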
Practice 7: Re-evaluate Quarterly
Model prices and capabilities move fast. A model that was your flagship choice last quarter may now be matched by something cheaper, or a new small model may now handle a task that previously needed a mid-tier model. The reasoning is that the price-performance frontier shifts, and a workload you optimized six months ago may be leaving money on the table today.
Set a recurring calendar reminder to re-run your cost estimates and test newer, cheaper models against your existing tasks. Our Checklist is a useful tool for this periodic review.
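One way to structure that quarterly test is to score each candidate on the same evaluation set as the model currently in production, then keep only the candidates that are cheaper and not meaningfully worse. The `EvalResult` type, the model names, the numbers, and the one-point tolerance below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    model: str
    pass_rate: float             # fraction of evaluation cases passed
    cost_per_1k_requests: float  # estimated from current prices


def cheaper_replacements(current, candidates, tolerance=0.01):
    """Candidates that cost less than the current model and score within
    `tolerance` of its pass rate, cheapest first."""
    viable = [
        c for c in candidates
        if c.cost_per_1k_requests < current.cost_per_1k_requests
        and c.pass_rate >= current.pass_rate - tolerance
    ]
    return sorted(viable, key=lambda c: c.cost_per_1k_requests)


current = EvalResult("current-mid", 0.96, 12.0)
candidates = [
    EvalResult("new-small", 0.955, 2.0),
    EvalResult("new-mid", 0.97, 8.0),
    EvalResult("old-small", 0.90, 1.5),
    EvalResult("new-flagship", 0.99, 40.0),
]
print([c.model for c in cheaper_replacements(current, candidates)])
# ['new-small', 'new-mid']
```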
Sequencing These Practices
If you're starting from an unoptimized workload, apply these in order of payoff rather than all at once, so you can measure each effect cleanly. Begin with model tier: it's the largest lever and the change most likely to surprise you. Add caching next, since it's low-effort and high-return. Then trim context and constrain output, measuring quality as you go. Move batchable work last among the active changes. Throughout, lean on instrumentation so each step's saving is attributable. Doing them sequentially rather than in a single sweeping refactor means that if quality dips, you know exactly which change caused it and can dial it back without unwinding everything.
Frequently Asked Questions
If I only adopt one practice, which should it be?
Default to the smallest model that works. The price spread within a model family is so large, often 10x to 30x, that correct model selection captures more savings than every other practice combined. Start small, measure quality, and upgrade only where it genuinely fails.
Does aggressive cost optimization hurt quality?
Done correctly, no. The goal is to stop paying for capability you aren't using, not to ship worse results. Each practice here is paired with a quality check: test the smaller model, measure before and after trimming context, verify constrained output still meets the need.
How much can these practices save in total?
For a typical unoptimized workload, combining smaller models, caching, context trimming, and output limits commonly reduces spend by half to two-thirds. The exact figure depends on your starting point, but most teams that have never optimized are leaving very large savings unclaimed.
Why re-evaluate quarterly instead of setting it and forgetting it?
Because the price-performance frontier moves quickly. New models appear, prices drop, and capabilities that once required a flagship migrate down to cheaper tiers. A workload optimized six months ago may now be over-provisioned. A quarterly review keeps you on the current frontier.
Is prompt caching worth the added complexity?
For any workload with a stable, repeated prefix (which describes most chatbots and agents), yes, emphatically. The complexity is minor, usually a parameter on your existing calls, and the payoff is a 40 to 70 percent cut in spend. Only fully unique-prompt workloads should skip it.
Key Takeaways
- Default to the smallest model that works; it's the largest single saving available.
- Cache every stable prefix to bill repeated content at a steep discount.
- Treat context as a budget: diminishing quality returns make extra context expensive padding.
- Constrain output, which is billed at three to five times the input rate.
- Move latency-tolerant work to half-price batch APIs.
- Instrument spend from day one and re-evaluate against newer, cheaper models quarterly.