TIER: A Reusable Way to Reason About AI Spend

Optimizing AI cost one decision at a time works until you have a dozen workloads and no consistent way to reason about them. What you need is a framework — a reusable mental model that takes any workload and tells you which model tier, pricing structure, and optimizations fit, and in what order to apply them. This article introduces one we call TIER.

TIER stands for the four stages you walk through for any workload: Tier the task, Inventory the tokens, Establish the structure, and Refine in production. The value of a named framework is repeatability. Once your team internalizes these four stages, cost reasoning stops being a one-off heroic effort and becomes a routine part of how you design AI features.

We'll define each stage, explain what decision it produces, and note when the stage matters most. Use it as a guide for your thinking, not a rigid sequence, though the order does reflect dependency, since each stage informs the next.

Stage 1: Tier the Task

The first decision, and the highest-leverage one, is which model tier the task actually needs. This comes first because it dominates everything downstream — the per-token rate you'll multiply by in later stages depends entirely on it.

Classify the task by genuine difficulty, not by ambition:

Trivial (classification, routing, tagging, short extraction) → smallest, cheapest model.
Moderate (summarization, standard generation, most production reasoning) → mid-tier model.
Hard (complex multi-step reasoning, long-horizon agents, high-stakes correctness) → flagship model.

The discipline here is resisting the reflex to over-provision. Test the task on a tier below your instinct; if quality holds, you've captured the largest single saving (often a several-fold drop in per-token price, depending on the provider) before doing anything else. This is the core lesson of our Best Practices article.
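The "test a tier below your instinct" rule can be sketched as a small selection routine. This is a minimal illustration, not a prescribed implementation: the tier names, the `evaluate` callback, and the 0.9 quality threshold are hypothetical stand-ins for whatever evaluation harness and quality bar you already use.

```python
from typing import Callable, Sequence

# Tiers ordered cheapest first. Placeholder names, not real model IDs.
TIERS: Sequence[str] = ("small", "mid", "flagship")


def cheapest_adequate_tier(
    evaluate: Callable[[str], float],
    threshold: float = 0.9,
    tiers: Sequence[str] = TIERS,
) -> str:
    """Return the cheapest tier whose evaluation score meets the threshold.

    `evaluate` is assumed to run your own test set against a tier and
    return a quality score between 0 and 1.
    """
    for tier in tiers:
        if evaluate(tier) >= threshold:
            return tier
    # Nothing met the bar: fall back to the most capable tier.
    return tiers[-1]


# Made-up scores standing in for a real evaluation run.
fake_scores = {"small": 0.82, "mid": 0.93, "flagship": 0.96}
chosen = cheapest_adequate_tier(fake_scores.__getitem__)
print(chosen)
```

With these invented scores the routine settles on the mid tier, because the small tier misses the threshold and the flagship is never needed.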

When this stage matters most

Always, but especially for high-volume workloads where the tier choice is multiplied across millions of requests. A wrong tier on a high-volume task is the single most expensive mistake in the framework.

Stage 2: Inventory the Tokens

With a tier chosen, count what each request actually consumes. This stage produces your per-request cost and reveals which side of the bill — input or output — dominates.

Inventory three things:

Stable input: instructions, system prompts, knowledge that repeats across requests. This is your caching target.
Variable input: retrieved context, conversation history, the user's actual data. This is your trimming target.
Output: how much the model generates. This is your constraint target, since output tokens are typically billed at several times the input rate (often three to five times, depending on the provider and model).

Separating stable from variable input is the key insight of this stage, because the two get optimized differently — stable input is cached, variable input is trimmed. The mechanics of counting are in our Step-by-Step Approach.
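As a rough sketch of the inventory, the three buckets can be priced separately to see which one dominates. The token counts and per-million-token rates below are illustrative assumptions, not any provider's actual prices, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenInventory:
    stable_input: int    # system prompt, instructions (caching target)
    variable_input: int  # retrieved context, history (trimming target)
    output: int          # generated tokens (constraint target)


@dataclass
class Rates:
    # Dollars per million tokens. Illustrative values only.
    input_per_m: float
    output_per_m: float


def request_cost(inv: TokenInventory, rates: Rates) -> dict:
    """Break one request's cost into the three inventory buckets."""
    stable = inv.stable_input * rates.input_per_m / 1_000_000
    variable = inv.variable_input * rates.input_per_m / 1_000_000
    output = inv.output * rates.output_per_m / 1_000_000
    return {
        "stable_input": stable,
        "variable_input": variable,
        "output": output,
        "total": stable + variable + output,
    }


inv = TokenInventory(stable_input=2_000, variable_input=3_000, output=500)
rates = Rates(input_per_m=1.0, output_per_m=4.0)  # hypothetical prices
cost = request_cost(inv, rates)
dominant = max(("stable_input", "variable_input", "output"), key=cost.get)
print(cost, dominant)
```

In this made-up profile, variable input is the largest bucket, which would point to trimming as the first optimization to try.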

Stage 3: Establish the Structure

Now choose the pricing structure that fits the workload's shape. This is where you decide between pay-as-you-go, batch, committed throughput, or self-hosting. The decision turns on two questions: how latency-sensitive is the workload, and how steady is its volume?

Variable volume, real-time → pay-as-you-go. The default; you pay only for what you use.
Any volume, latency-tolerant → batch processing at roughly half price.
High, steady volume, real-time → committed/provisioned throughput for predictable cost and latency.
Very high, steady volume, with infra capability → self-hosting an open model to eliminate per-token fees.

Most workloads land on pay-as-you-go or batch. Committed throughput and self-hosting are for mature, high-volume systems where the commitment pays off. The trade-offs play out concretely in our Real-World Examples.

The decisive question

If no human or real-time process waits on the result, batch is almost always correct. That single question resolves most structure decisions instantly.
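The four mappings and the decisive question can be written as a short decision function. This is a simplified sketch: the volume labels and the function name are hypothetical, and checking latency tolerance first is an assumption that mirrors the "batch is almost always correct" rule rather than a complete treatment (a latency-tolerant workload at very high volume might still justify self-hosting).

```python
from typing import Literal

Volume = Literal["variable", "high_steady", "very_high_steady"]


def choose_structure(
    latency_tolerant: bool,
    volume: Volume,
    can_run_infra: bool = False,
) -> str:
    """Map the two Stage 3 questions onto a pricing structure."""
    # The decisive question: if nothing waits on the result, use batch.
    if latency_tolerant:
        return "batch"
    # Self-hosting needs both very high steady volume and infra capability.
    if volume == "very_high_steady" and can_run_infra:
        return "self-hosting"
    # High, steady, real-time volume suits a throughput commitment.
    if volume in ("high_steady", "very_high_steady"):
        return "committed throughput"
    # The default for variable, real-time traffic.
    return "pay-as-you-go"


print(choose_structure(latency_tolerant=False, volume="variable"))
print(choose_structure(latency_tolerant=True, volume="high_steady"))
```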

Stage 4: Refine in Production

The first three stages are design-time; this one is continuous. Once live, you apply the optimizations the earlier stages identified and keep them tuned as reality diverges from your estimates.

The refinement loop:

Enable caching on the stable input from Stage 2, and verify it's actually hitting.
Trim variable input until quality begins to drop, then stop.
Constrain output to the minimum that meets the need.
Instrument spend by feature so the next pass through the earlier stages stays grounded in real data.
Re-tier periodically as cheaper models gain capability.

This stage never ends. Prices and models shift, your usage drifts, and new workloads arrive. Refinement is what keeps the earlier decisions valid over time, and it's the discipline that prevented disaster in our Case Study.
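Two items in the loop, verifying that caching is actually hitting and instrumenting spend by feature, can be sketched with a small in-memory tracker. The class, its method names, and the idea that each response reports a cached-token count are assumptions for illustration; real providers expose usage data under their own field names, and a production system would write to a metrics store rather than a dictionary.

```python
from collections import defaultdict


class SpendTracker:
    """Accumulate cost and cache statistics per feature (in memory)."""

    def __init__(self) -> None:
        self.cost = defaultdict(float)
        self.input_tokens = defaultdict(int)
        self.cached_tokens = defaultdict(int)

    def record(
        self, feature: str, cost: float, input_tokens: int, cached_tokens: int
    ) -> None:
        """Log one request's cost and how many input tokens came from cache."""
        self.cost[feature] += cost
        self.input_tokens[feature] += input_tokens
        self.cached_tokens[feature] += cached_tokens

    def cache_hit_rate(self, feature: str) -> float:
        """Share of input tokens served from cache for a feature."""
        total = self.input_tokens[feature]
        return self.cached_tokens[feature] / total if total else 0.0


tracker = SpendTracker()
# First request misses the cache; the second reuses the stable prompt.
tracker.record("support_chat", cost=0.004, input_tokens=5_000, cached_tokens=0)
tracker.record("support_chat", cost=0.003, input_tokens=5_000, cached_tokens=2_000)
print(tracker.cost["support_chat"], tracker.cache_hit_rate("support_chat"))
```

A hit rate that stays near zero for a feature with a large stable prompt is the signal that caching is configured but not actually taking effect.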

Applying TIER End to End

To see the framework in motion, take a support chatbot. Tier: question-answering over a knowledge base is moderate, so a mid-tier model. Inventory: a large stable system prompt (cache it), retrieved knowledge and history (trim them), and short answers (already constrained). Structure: users wait in real time, volume is variable, so pay-as-you-go. Refine: enable caching on the system prompt, trim retrieval to the top relevant chunks, monitor per-conversation cost.

Four stages, four decisions, one coherent cost design — produced in minutes once the framework is habit. That repeatability is the entire point.

Contrast that with an agent workload, where the framework produces a very different answer. Tier: multi-step autonomous reasoning is hard, so a flagship model for the difficult decisions — but inventory will reveal sub-steps that can drop to a smaller tier. Inventory: the stable system prompt is cacheable, but the variable input grows with every step as history accumulates, flagging summarization as the priority trim. Structure: an interactive agent runs in real time, so pay-as-you-go, unless it's a background agent, in which case batch reopens. Refine: cache the system prompt, summarize prior steps aggressively, and route routine sub-steps to a cheaper model. Same four stages, completely different cost design, because the framework adapts to the workload's shape rather than imposing a fixed recipe.
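To make the agent's growing history concrete, the arithmetic can be sketched as follows. The step count, token sizes, and summary cap are invented numbers, and the model is deliberately simple: it assumes every step resends the system prompt plus all prior history, and it ignores the tokens spent producing the summaries themselves.

```python
from typing import Optional


def cumulative_input_tokens(
    steps: int,
    system_tokens: int,
    tokens_per_step: int,
    summary_cap: Optional[int] = None,
) -> int:
    """Total input tokens sent across an agent run.

    Each step resends the system prompt plus the history so far. With
    `summary_cap`, history is assumed to be summarized down to at most
    that many tokens before each step.
    """
    total = 0
    history = 0
    for _ in range(steps):
        carried = history if summary_cap is None else min(history, summary_cap)
        total += system_tokens + carried
        history += tokens_per_step
    return total


raw = cumulative_input_tokens(steps=20, system_tokens=1_000, tokens_per_step=500)
capped = cumulative_input_tokens(
    steps=20, system_tokens=1_000, tokens_per_step=500, summary_cap=2_000
)
print(raw, capped)  # 115000 55000
```

Under these assumptions the uncapped run sends roughly twice the input tokens of the summarized one, and the gap widens with every additional step because raw history grows with the square of the step count.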

That adaptability is what separates a framework from a checklist. A checklist gives the same answer to every workload; TIER gives the right answer to each one, derived from its actual difficulty, token profile, and latency needs.

Frequently Asked Questions

Why is tiering the task the first stage?

Because the model tier sets the per-token rate that every later calculation multiplies, it dominates total cost more than any other single decision. Getting the tier right first means the inventory and structure stages operate on the correct rates. A wrong tier makes all the downstream optimization marginal by comparison.

What's the difference between trimming and caching?

They target different parts of your input. Caching applies to stable content that repeats across requests — system prompts, instructions — billing it at a discount. Trimming applies to variable content — retrieved context, history — by reducing how much you send. Stage 2 separates the two precisely so each gets the right treatment.
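The difference shows up clearly in a small calculation. The sketch below is an assumption-laden illustration: the 90% cache discount, the 50% trim, and the rate are hypothetical, every request is treated as a cache hit, and any extra charge a provider may apply when writing to the cache is ignored.

```python
def input_cost(
    stable: int,
    variable: int,
    rate_per_m: float,
    cache_discount: float = 0.0,
    trim_fraction: float = 0.0,
) -> float:
    """Input cost for one request after caching and trimming.

    cache_discount: fraction taken off the rate for cached stable tokens.
    trim_fraction: fraction of variable tokens removed before sending.
    """
    stable_cost = stable * rate_per_m * (1 - cache_discount)
    variable_cost = variable * (1 - trim_fraction) * rate_per_m
    return (stable_cost + variable_cost) / 1_000_000


baseline = input_cost(2_000, 3_000, rate_per_m=1.0)
cached = input_cost(2_000, 3_000, rate_per_m=1.0, cache_discount=0.9)
trimmed = input_cost(2_000, 3_000, rate_per_m=1.0, trim_fraction=0.5)
both = input_cost(2_000, 3_000, rate_per_m=1.0, cache_discount=0.9, trim_fraction=0.5)
print(baseline, cached, trimmed, both)
```

Caching lowers the price of tokens you still send, while trimming lowers the number of tokens sent, so the two savings stack rather than compete.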

When should I consider self-hosting in the structure stage?

Only at very high, steady volume and with the infrastructure capability to run and scale GPUs reliably. Self-hosting trades per-token fees for fixed operational cost, which pays off only above a meaningful utilization threshold. For most workloads, pay-as-you-go or batch is cheaper and far less operational burden.
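The utilization threshold can be estimated with a simple break-even calculation. The figures here are hypothetical, and the sketch assumes a single blended per-token API price and a flat monthly hosting cost; it leaves out staffing, idle capacity, and the ceiling on what a given cluster can serve.

```python
def breakeven_tokens_per_month(
    fixed_monthly_cost: float, api_price_per_m: float
) -> float:
    """Monthly token volume at which a fixed hosting cost equals API spend."""
    return fixed_monthly_cost / api_price_per_m * 1_000_000


# Hypothetical: $10,000/month for hosting versus $2 per million tokens via API.
threshold = breakeven_tokens_per_month(10_000, 2.0)
print(f"{threshold:,.0f} tokens per month")
```

Below that volume the API is cheaper under these assumptions; above it, self-hosting starts to pay off only if the hardware stays busy.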

Is the TIER order strict?

The order reflects dependency — each stage informs the next — so following it is sensible, but it's a thinking tool, not a rigid gate. In practice you'll loop back: production data from Stage 4 often sends you back to re-tier in Stage 1. Treat it as a cycle anchored by a logical starting order.

How is this different from just following a checklist?

A checklist tells you what to verify; the framework tells you how to reason and in what order, so you can handle novel workloads a checklist didn't anticipate. The two complement each other — use TIER to design and a checklist to verify the design before launch.

Key Takeaways

TIER is a four-stage framework: Tier the task, Inventory the tokens, Establish the structure, Refine in production.
Tiering the task first captures the largest savings, since model choice dominates cost.
Inventorying tokens separates stable input (cache it) from variable input (trim it) and output (constrain it).
Structure choice turns on latency tolerance and volume steadiness; batch wins when nothing waits in real time.
Refinement is continuous — prices, models, and usage drift, so optimizations need ongoing tuning.
Once habitual, TIER turns cost design from a heroic effort into a routine, repeatable process.

Agency Script Editorial
