Same Token Rate, Five Wildly Different Bills

Pricing rules stay abstract until you watch them play out on a real workload. The same per-token rate can produce a trivial bill or a runaway one depending entirely on the shape of the application — how big the prompts are, how long the answers run, how often it's called, and whether anyone bothered to optimize.

This article walks through five concrete use cases, each with a different cost profile, and shows where the money actually goes. For each, we describe the workload, identify the dominant cost driver, and note what made it cheap or expensive. The numbers are illustrative ranges, not quotes from a specific vendor, but the relationships between them carry over to real workloads.

Read these as patterns to recognize in your own projects. Once you can spot which cost driver dominates a given workload, the right optimization becomes obvious.

Use Case 1: A Customer Support Chatbot

A chatbot answers customer questions using a knowledge base. Each turn sends a system prompt (instructions plus retrieved knowledge, maybe 3,000 tokens), some conversation history (1,000 tokens), and the user's message (50 tokens). The model replies with a few hundred tokens.

Dominant cost driver: Input, specifically the repeated system prompt and knowledge. The user types almost nothing, but every turn carries thousands of tokens of context.

What made it work: Prompt caching on the stable system prompt and knowledge base. Because that 3,000-token prefix is identical across turns, caching it at a steep discount cut the chatbot's bill by more than half. Without caching, this workload is needlessly expensive. The mechanics are covered in our Complete Guide.
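To make the caching effect concrete, here is a minimal sketch of the input-side arithmetic for one turn. The token counts come from the workload above; the per-token rate and the 90% cache discount are assumptions chosen for illustration, and the sketch ignores output tokens and any cache-write surcharge a provider may apply.

```python
def turn_input_cost(prefix_tokens, history_tokens, user_tokens,
                    input_rate, cached_discount=0.0):
    """Input cost of one chatbot turn, in the same currency as input_rate.

    input_rate is the price per input token. cached_discount is the
    fraction taken off the stable prefix when it is read from cache
    (0.0 means no caching).
    """
    prefix_cost = prefix_tokens * input_rate * (1.0 - cached_discount)
    fresh_cost = (history_tokens + user_tokens) * input_rate
    return prefix_cost + fresh_cost


# Hypothetical rate: 3.00 per million input tokens (not a vendor quote).
RATE = 3.00 / 1_000_000

uncached = turn_input_cost(3000, 1000, 50, RATE)
cached = turn_input_cost(3000, 1000, 50, RATE, cached_discount=0.9)
saving = 1 - cached / uncached

print(f"uncached: {uncached:.6f}  cached: {cached:.6f}  saving: {saving:.0%}")
```

Under these assumed numbers the input cost per turn falls by about two-thirds. The saving on the whole bill is smaller once output tokens are counted, so treat the result as a shape, not a forecast.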

Use Case 2: A High-Volume Document Classifier

A pipeline labels incoming documents into categories — say, routing support tickets or tagging content. Each request sends a short instruction and one document (around 800 tokens) and gets back a single label (5 tokens).

Dominant cost driver: Volume. Each request is cheap, but the pipeline runs hundreds of thousands of times.

What made it work: Two things. First, this task is simple, so it runs on a small model at a fraction of the flagship price. Using a top-tier model here would have multiplied the bill 10x or more for no quality gain, which is exactly the kind of mistake flagged in our Common Mistakes guide. Second, classification rarely needs a real-time answer, so moving it to a batch API cut the rate roughly in half on top of that. A small model on batch is the textbook cheap-and-correct setup.
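The two savings multiply, which a short calculation shows. The sketch below assumes 500,000 requests, a flagship model priced at ten times the small one, and a 50% batch discount; all rates are hypothetical placeholders rather than any vendor's prices.

```python
def pipeline_cost(requests, input_tokens, output_tokens,
                  input_rate_per_m, output_rate_per_m, batch_discount=0.0):
    """Total cost of a pipeline; rates are prices per million tokens."""
    per_request = (input_tokens * input_rate_per_m
                   + output_tokens * output_rate_per_m) / 1_000_000
    return requests * per_request * (1.0 - batch_discount)


REQUESTS = 500_000  # assumed volume ("hundreds of thousands")

# Hypothetical rates: flagship 10.0 in / 30.0 out, small model 1.0 in / 3.0 out.
flagship_realtime = pipeline_cost(REQUESTS, 800, 5, 10.0, 30.0)
small_realtime = pipeline_cost(REQUESTS, 800, 5, 1.0, 3.0)
small_batch = pipeline_cost(REQUESTS, 800, 5, 1.0, 3.0, batch_discount=0.5)

print(f"flagship, real-time: {flagship_realtime:,.2f}")
print(f"small, real-time:    {small_realtime:,.2f}")
print(f"small, batch:        {small_batch:,.2f}")
```

With these inputs the flagship real-time run costs twenty times the small-model batch run: a factor of ten from the model tier and a factor of two from batching.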

Use Case 3: An Autonomous Agent

An agent completes multi-step tasks — researching, calling tools, reasoning across several turns. A single task might involve 15 model calls, each carrying the growing history of everything the agent has done so far.

Dominant cost driver: Accumulating context across steps, multiplied by reasoning that demands a capable model.

What made it expensive: Agents are the costliest common pattern because two expensive factors compound. The context grows with each step, so later calls are large, and the reasoning often requires a flagship model. A naive agent can cost dollars per task.
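The growth in context is easy to underestimate, so here is a rough sketch of it. It assumes each call resends the full history and that every step adds a fixed number of tokens; the 2,000-token starting context and 1,500 tokens added per step are invented for illustration.

```python
def agent_input_tokens(steps, base_tokens, growth_per_step):
    """Total input tokens for a task when every call resends the whole history.

    Call i (0-indexed) carries base_tokens plus everything added by the
    i steps that came before it.
    """
    return sum(base_tokens + i * growth_per_step for i in range(steps))


# Hypothetical task: 15 calls, 2,000-token start, 1,500 tokens added per step.
total_15 = agent_input_tokens(15, 2000, 1500)
total_30 = agent_input_tokens(30, 2000, 1500)

print(f"15 steps: {total_15:,} input tokens")
print(f"30 steps: {total_30:,} input tokens")
```

Under these assumptions 15 steps consume 187,500 input tokens and 30 steps consume 712,500. Doubling the number of steps nearly quadruples the input, because the total grows with the square of the step count rather than linearly.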

What helped: Caching the stable parts of the agent's context, summarizing earlier steps rather than carrying them verbatim, and using a smaller model for the routine sub-steps while reserving the flagship for the genuinely hard decisions. Even so, agents demand the most careful cost design of anything here.

Use Case 4: A Nightly Enrichment Pipeline

A batch job runs every night, enriching a database — generating summaries, extracting structured fields, or scoring records. Volume is high but nothing is time-sensitive; results are needed by morning, not instantly.

Dominant cost driver: Total volume, but with maximum flexibility on latency.

What made it cheap: This is the ideal batch workload. Because nothing waits on the result in real time, the entire job runs on a batch API at roughly half price. Pair that with a model tier matched to the task difficulty and the cost is about as low as hosted AI gets for the volume. Workloads like this are why the Framework article emphasizes classifying jobs by latency tolerance.

Use Case 5: A RAG Search Assistant

A retrieval-augmented assistant answers questions over a large document corpus. For each query it retrieves relevant chunks and feeds them to the model. The quality is sensitive to how much context is retrieved.

Dominant cost driver: Retrieved context size, which directly inflates input tokens on every query.

What made it fail (then succeed): The first version retrieved 20 chunks per query "to be safe," ballooning input tokens and the bill. Measuring quality against chunk count revealed that 5 well-ranked chunks produced answers indistinguishable from 20. Cutting retrieval to those 5 chunks slashed input cost roughly fourfold with no quality loss — a direct illustration of treating context as a budget. Our Best Practices article expands on this discipline.
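One way to run that measurement is to score answer quality at several chunk counts and keep the smallest count that stays within a tolerance of the best score. The sketch below assumes you already have such scores from your own evaluation; the values shown are made up, and the function name and the 0.01 tolerance are illustrative choices.

```python
def smallest_sufficient_k(quality_by_k, tolerance=0.01):
    """Return the smallest chunk count whose quality is within
    `tolerance` of the best observed quality."""
    best = max(quality_by_k.values())
    for k in sorted(quality_by_k):
        if quality_by_k[k] >= best - tolerance:
            return k


# Hypothetical evaluation scores (0 to 1) at each retrieval size.
scores = {3: 0.78, 5: 0.90, 10: 0.905, 20: 0.905}

chosen_k = smallest_sufficient_k(scores)
context_ratio = 20 / chosen_k  # retrieved context at 20 chunks vs. the chosen size

print(f"chosen chunk count: {chosen_k}")
print(f"20 chunks carry {context_ratio:.0f}x the retrieved context")
```

With these example scores the function picks 5 chunks, and retrieving 20 would carry four times the retrieved context for no measurable gain. The tolerance is a product decision: tighten it if small quality differences matter for your users.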

Reading the Pattern Across All Five

Notice the throughline: in every case, the dominant cost driver was different, and the right fix followed from identifying it. Chatbots are input-heavy, so caching wins. Classifiers are volume-heavy and simple, so small models and batch win. Agents compound context and difficulty, so they need the most care. Batch jobs are latency-flexible, so batch APIs win. RAG is context-sensitive, so retrieval discipline wins.

Diagnose the driver first, then optimize. Applying the wrong fix — caching a workload with no repeated content, or batching something a user is waiting on — wastes effort.

A quick way to find the driver: look at where the tokens concentrate. If a large prefix repeats on every call, you're input-heavy and caching wins. If each call is small but you make millions of them, you're volume-heavy and model tier plus batch win. If a few calls carry enormous context, you're context-heavy and trimming wins. If the model writes far more than it reads, you're output-heavy and constraints win. The dominant driver is almost always visible in a single representative request, which is why estimating that one request — as our other articles stress — pays off so reliably. Spend a few minutes reading your own token profile before reaching for any optimization, and you'll apply the right one the first time.
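The checklist above can be written as a small heuristic over one representative request. This is only a sketch: the function name, the order of the checks, and every threshold (half the input, 10,000 tokens, 100,000 calls per day) are assumptions to tune against your own traffic, not established cutoffs.

```python
def dominant_driver(repeated_prefix, other_input, output, calls_per_day):
    """Guess the dominant cost driver from one representative request.

    All token arguments are counts for a single call. Thresholds are
    illustrative assumptions.
    """
    total_input = repeated_prefix + other_input
    if output > total_input:
        return "output-heavy: constrain output length"
    if repeated_prefix >= 0.5 * total_input:
        return "input-heavy: cache the stable prefix"
    if total_input >= 10_000:
        return "context-heavy: trim context"
    if calls_per_day >= 100_000:
        return "volume-heavy: use a smaller model and batch"
    return "no single dominant driver"


# Profiles loosely modeled on the use cases above; call volumes are invented.
print(dominant_driver(3000, 1050, 300, 20_000))   # support chatbot
print(dominant_driver(0, 800, 5, 500_000))        # document classifier
print(dominant_driver(200, 12_000, 400, 5_000))   # RAG with heavy retrieval
```

The value lies in the order of questions rather than the exact numbers: check what the tokens in a single request look like before looking at how many requests you send.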

Frequently Asked Questions

Why are agents so much more expensive than chatbots?

Agents compound two expensive factors: context that grows with every step, and reasoning that often requires a flagship model. A single agent task can involve a dozen or more calls, each larger than the last. Chatbots, by contrast, have a stable cacheable prefix and usually one call per turn, making them far cheaper to run.

When is a batch API the obvious choice?

Whenever no human or real-time process is waiting on the result. Nightly enrichment, bulk classification, and offline document processing all qualify. The output is identical to the real-time API; you simply trade instant latency for roughly half the price, which is free savings for background work.

How did reducing retrieved chunks save so much in the RAG example?

Because every retrieved chunk is input tokens billed on every query. Retrieving 20 chunks instead of 5 means roughly four times the context cost per query. When quality testing showed 5 well-ranked chunks were as good as 20, the extra 15 were pure waste, so cutting them cut input cost without harming answers.

Can these examples' numbers be applied to my workload directly?

Use the relationships, not the absolute numbers. The ranges here are illustrative, and real prices vary by provider and model. What transfers is the diagnosis: identify whether your workload is input-heavy, volume-heavy, context-heavy, or latency-flexible, and apply the matching optimization.

What's the cheapest possible workload shape?

A simple task, on a small model, run through a batch API, with minimal context. The nightly enrichment pipeline is close to this ideal. The more your workload resembles that profile — low difficulty, no real-time requirement, lean prompts — the lower its cost per unit of work.

Key Takeaways

Chatbots are input-heavy; prompt caching on the stable prefix is the winning move.
High-volume classifiers thrive on small models plus batch processing.
Agents are the costliest pattern because context and difficulty compound — design carefully.
Nightly pipelines are ideal batch workloads at roughly half price.
RAG cost is driven by retrieved context; trim chunks to the quality sweet spot.
Diagnose the dominant cost driver first, then apply the matching optimization.


Agency Script Editorial
