Latency Is Not a Knob You Poke at After Launch

Most teams treat inference latency as a knob to turn after launch. Something feels slow, someone files a ticket, an engineer pokes at batch sizes for an afternoon, and the number gets a little better. That works until it doesn't, and when it stops working you are usually staring at a churned enterprise account or a support queue full of "the assistant froze again."

A playbook fixes the timing. Instead of reacting to the slow week, you run a fixed set of plays with defined triggers and a single accountable owner per play. The point is not to memorize every optimization technique. The point is to know which move to make, when to make it, and who pulls the trigger so latency never becomes an emergency.

This is the operating layer that sits on top of the mechanics. If you need the mechanics themselves, The Complete Guide to AI Inference and Latency covers them. Here we assume you know what a tokens-per-second number means and focus on running the system.

How to Read This Playbook

Every play below follows the same shape: a trigger that tells you to run it, an owner who is accountable, and a sequence of moves. Treat the triggers as thresholds you actually wire into dashboards, not vibes. A play with no measurable trigger is a wish.
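To make "measurable trigger" concrete, here is a minimal sketch of a trigger that fires only when a metric stays above its threshold for several consecutive windows. The Trigger class, its field names, and the sample numbers are illustrative assumptions, not part of any particular dashboard tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trigger:
    """Fires when a metric stays above a threshold for N consecutive windows."""
    name: str
    threshold: float          # assumed unit: seconds
    consecutive_windows: int  # e.g. 2 hourly windows

    def fired(self, window_values: List[float]) -> bool:
        """window_values holds one metric value per window, oldest first."""
        if len(window_values) < self.consecutive_windows:
            return False
        recent = window_values[-self.consecutive_windows:]
        return all(value > self.threshold for value in recent)


# Hypothetical example: p95 TTFT target of 1.0s, breached two hours in a row.
routing_trigger = Trigger("p95_ttft", threshold=1.0, consecutive_windows=2)
print(routing_trigger.fired([0.7, 1.3, 1.1]))  # True: last two windows breach
print(routing_trigger.fired([1.3, 0.9, 1.1]))  # False: breach was not sustained
```

Requiring consecutive breaches keeps one noisy hour from paging the owner, at the cost of reacting a window later.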

The three latency numbers that drive every play

Before you run anything, instrument these three and nothing else for triage:

Time to first token (TTFT) — how long the user waits before anything appears. This is the perceived-speed number.
Inter-token latency — the gap between streamed tokens. Smoothness lives here.
End-to-end p95 and p99 — total request time at the tail, not the average. Averages hide the requests that actually make people angry.

If you only watch averages, you will ship a system that feels fine in demos and falls apart for the 5 percent of users on a bad route or a long prompt.
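As a rough sketch of how these three numbers fall out of raw timestamps, assume you record when each request was sent and when each streamed token arrived. The function names and the sample data below are made up for illustration, and the percentile uses the simple nearest-rank method rather than whatever your metrics backend implements.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def request_metrics(sent_at: float, token_times: List[float]) -> Dict[str, float]:
    """Derive TTFT, mean inter-token latency, and total time for one streamed request."""
    if not token_times:
        raise ValueError("no tokens received")
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "ttft": token_times[0] - sent_at,
        "inter_token": sum(gaps) / len(gaps) if gaps else 0.0,
        "total": token_times[-1] - sent_at,
    }


# Illustrative data: 95 requests finish in 1s, 5 requests take 9s.
totals = [1.0] * 95 + [9.0] * 5
print(sum(totals) / len(totals))  # 1.4, the average looks healthy
print(percentile(totals, 95))     # 1.0
print(percentile(totals, 99))     # 9.0, the tail does not
```

In this made-up sample the average and even the p95 look fine while the p99 is nine times worse, which is why the plays key off tail percentiles.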

Play 1: The Routing Play

Trigger: p95 TTFT exceeds your target (commonly 800ms to 1.2s for chat) for two consecutive hours, or a single model is carrying more than 80 percent of traffic, much of which doesn't need a model that large.

Owner: Platform engineer on call.

Not every request needs your largest model. The routing play sends short, simple, or low-stakes requests to a smaller, faster model and reserves the heavyweight model for requests that genuinely need it. A classification step or a length heuristic decides the route.

The trade-off is real: aggressive routing saves latency and cost but risks quality regressions on edge cases that look simple but aren't. Mitigate it by logging routed requests and sampling them for quality weekly. The failure mode to watch is silent degradation, where everything is fast and nobody notices answers got worse.
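A length-and-keyword heuristic is about the simplest router you can build. The sketch below assumes two hypothetical model names, a made-up keyword list, and an arbitrary 40-word cutoff; a real deployment would tune all three against sampled quality reviews, or replace the heuristic with a trained classifier.

```python
import random
from dataclasses import dataclass
from typing import Optional

SMALL_MODEL = "small-fast-model"  # hypothetical model name
LARGE_MODEL = "large-model"       # hypothetical model name

# Assumed hints that a request needs the heavyweight model.
HARD_HINTS = ("analyze", "compare", "prove", "refactor", "step by step")


@dataclass
class Route:
    model: str
    reason: str  # logged so routed requests can be audited later


def choose_route(prompt: str, max_simple_words: int = 40) -> Route:
    """Send short prompts without hard-task hints to the small model."""
    text = prompt.lower()
    if any(hint in text for hint in HARD_HINTS):
        return Route(LARGE_MODEL, "hard-keyword")
    if len(text.split()) > max_simple_words:
        return Route(LARGE_MODEL, "long-prompt")
    return Route(SMALL_MODEL, "short-simple")


def should_sample_for_review(rate: float = 0.05,
                             rng: Optional[random.Random] = None) -> bool:
    """Pick a small fraction of routed requests for the weekly quality check."""
    return (rng or random).random() < rate


print(choose_route("What time does support open?"))
print(choose_route("Analyze this contract for termination risk."))
```

Logging the reason alongside the model is what makes the weekly sample useful: when quality slips, you can see which rule sent the request down the fast path.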

Play 2: The Caching Play

Trigger: You see repeated or near-repeated prompts, or your system prompt is large and stable across requests.

Owner: Application engineer who owns the prompt layer.

Caching is the highest-leverage latency move most teams skip. There are two tiers:

Prompt/prefix caching — providers cache the processed system prompt and reuse it, cutting TTFT dramatically when your prefix is large and constant. Structure prompts so the stable part comes first.
Semantic response caching — for requests that repeat in meaning, return a stored answer instead of re-running inference. This is the move behind a lot of the wins in AI Inference and Latency: Real-World Examples and Use Cases.

The failure mode is staleness. A cached answer that's wrong because the underlying data changed is worse than a slow correct one. Set a TTL tied to how fast your data moves, and invalidate on writes.
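The TTL and invalidate-on-write rules can be sketched with a small in-memory cache. Note the simplification: this version matches prompts exactly after normalizing case and whitespace, whereas true semantic caching would compare embeddings. The class name, the source tag, and the injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def normalize(prompt: str) -> str:
    """Collapse case and whitespace so trivially different prompts share a key."""
    return " ".join(prompt.lower().split())


class ResponseCache:
    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # key -> (stored_at, source, answer)
        self._entries: Dict[str, Tuple[float, str, str]] = {}

    def get(self, prompt: str) -> Optional[str]:
        key = normalize(prompt)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, _source, answer = entry
        if self.clock() - stored_at >= self.ttl_seconds:
            del self._entries[key]  # expired: force a fresh inference
            return None
        return answer

    def put(self, prompt: str, answer: str, source: str) -> None:
        """source names the underlying data the answer was built from."""
        self._entries[normalize(prompt)] = (self.clock(), source, answer)

    def invalidate_source(self, source: str) -> None:
        """Call this from the write path whenever that data changes."""
        stale = [k for k, (_, s, _a) in self._entries.items() if s == source]
        for key in stale:
            del self._entries[key]
```

Tagging each entry with its data source lets a single write drop every answer that depended on it, instead of waiting for the TTL to expire.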

Play 3: The Batching and Concurrency Play

Trigger: GPU utilization is low while a queue is forming, or throughput plateaus under load.

Owner: Infrastructure engineer.

If you self-host, continuous (in-flight) batching lets the server pack multiple requests through the GPU without making the first arrival wait for a full batch. This is the single biggest throughput lever for self-hosted inference. If you're on a managed API, the equivalent is tuning client-side concurrency so you saturate the provider's throughput without hitting rate limits.

The trade-off is the classic latency-versus-throughput tension. Bigger batches raise throughput and per-request latency at the same time. Pick the side that matches your product: a chat UX favors latency, an offline summarization pipeline favors throughput.
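For the managed-API side, client-side concurrency is usually capped with a semaphore. The sketch below uses Python's asyncio; call_model stands in for whatever async client call you actually make, and the limit of 4 is an arbitrary example, not a recommended value.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def run_with_limit(prompts: List[str],
                         call_model: Callable[[str], Awaitable[str]],
                         max_in_flight: int = 8) -> List[str]:
    """Run all prompts, never allowing more than max_in_flight calls at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(prompt: str) -> str:
        async with semaphore:
            return await call_model(prompt)

    # gather preserves input order in its results.
    return await asyncio.gather(*(one(p) for p in prompts))


async def demo() -> Tuple[List[str], int]:
    """Fake model call that records the peak number of concurrent requests."""
    state = {"active": 0, "peak": 0}

    async def fake_call(prompt: str) -> str:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stands in for network and inference time
        state["active"] -= 1
        return prompt.upper()

    results = await run_with_limit([f"p{i}" for i in range(20)], fake_call,
                                   max_in_flight=4)
    return results, state["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(len(results), "responses, peak concurrency", peak)
```

Raise max_in_flight until throughput stops improving or rate-limit errors appear, then back off; that ceiling is the number to write down next to the play.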

Play 4: The Output-Shaping Play

Trigger: Inter-token latency is fine but total time is high, and responses are long.

Owner: Product engineer who owns prompts and UX.

Total generation time scales with output length, full stop. The fastest token is the one you never generate. Plays here include capping max_tokens to what the UI can actually show, prompting for concise answers, and streaming so the user reads while the model writes. Streaming doesn't make inference faster, but it cuts perceived latency more than almost anything else.
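A back-of-the-envelope model shows why the output cap matters. The sketch assumes generation time is roughly linear in output tokens, which ignores effects such as per-token cost growing with context length, and the numbers are invented for illustration.

```python
def estimated_total_seconds(ttft: float, inter_token: float,
                            output_tokens: int) -> float:
    """Approximate total time: wait for the first token, then one gap per later token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft + (output_tokens - 1) * inter_token


# Hypothetical service: 0.5s TTFT and 30ms between tokens.
long_answer = estimated_total_seconds(0.5, 0.03, output_tokens=800)
capped_answer = estimated_total_seconds(0.5, 0.03, output_tokens=200)
print(round(long_answer, 2), round(capped_answer, 2))
```

Under these assumed numbers, cutting the output from 800 to 200 tokens takes the request from about 24 seconds to about 6.5 seconds without touching the model or the infrastructure.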

The discipline this requires is covered in Building a Repeatable Workflow for AI Inference and Latency, because output shaping only sticks if it's part of the standard prompt review.

Play 5: The Capacity and Fallback Play

Trigger: A provider incident, a rate-limit spike, or a planned launch with a traffic surge.

Owner: Platform lead.

Latency under stress is a different problem than latency at rest. This play covers provisioned throughput for predictable spikes, a secondary provider or model for failover, and graceful degradation, meaning you serve a smaller model or a cached answer rather than a spinner. The worst failure mode in production is not slowness; it is a hung request that never resolves. Always set hard timeouts and define what happens when they fire.
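Here is one way the timeout-and-degrade rule might look with asyncio. The primary and fallback callables are placeholders for your own client calls, and treating ConnectionError as the provider-failure signal is an assumption; substitute whatever exceptions your client raises.

```python
import asyncio
from typing import Awaitable, Callable

ModelCall = Callable[[str], Awaitable[str]]


async def answer_with_fallback(
    prompt: str,
    primary: ModelCall,
    fallback: ModelCall,
    timeout_s: float = 5.0,
    degraded_reply: str = "Sorry, this is taking too long. Please try again.",
) -> str:
    """Try the primary, then the fallback, each under a hard timeout."""
    for call in (primary, fallback):
        try:
            # wait_for cancels the call if it exceeds the timeout.
            return await asyncio.wait_for(call(prompt), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            continue
    # Both attempts failed: return a defined answer instead of hanging.
    return degraded_reply


async def hung_model(prompt: str) -> str:
    await asyncio.sleep(3600)  # simulates a request that never resolves
    return "never reached"


async def small_model(prompt: str) -> str:
    return f"small: {prompt}"


if __name__ == "__main__":
    print(asyncio.run(answer_with_fallback("hi", hung_model, small_model,
                                           timeout_s=0.05)))
```

The worst case here is two timeouts back to back, so pick timeout_s with that doubled budget in mind, or give the fallback a shorter limit than the primary.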

Sequencing: The Order You Run These

Run the plays in cost-to-implement order, not in order of technical interest:

Output shaping — cheapest, often biggest perceived win.
Caching — high leverage, moderate effort.
Routing — needs a classifier and monitoring.
Batching/concurrency — infrastructure work, self-host heavy.
Capacity/fallback — operational maturity, do before any big launch.

Resist the urge to start with the most technically interesting play. Teams that begin with custom batching kernels before capping output tokens are optimizing the wrong end. For the full set of process habits, AI Inference and Latency: Best Practices That Actually Work is the companion to this sequencing.

Frequently Asked Questions

How often should I run these plays?

Treat triggers as continuous and reviews as scheduled. The plays fire automatically when a threshold breaks, but you should also do a monthly latency review where the owners walk through each play's metrics even if nothing tripped. Drift is slow and you want to catch it before a trigger does.

Which play gives the fastest return?

Output shaping, almost always. Capping output length and streaming responses takes hours, not weeks, and improves the number users actually feel. Caching is a close second if your prompts repeat. Both beat infrastructure work for early-stage products.

Do I need to self-host to use this playbook?

No. Routing, caching, output shaping, and fallback all work on managed APIs. Only the in-flight batching play assumes you control the inference server. Most teams should exhaust the managed-API plays before considering self-hosting.

What's the most common mistake teams make here?

Optimizing the average instead of the tail. A great p50 with a terrible p99 means a real slice of your users has a bad experience every session. See 7 Common Mistakes with AI Inference and Latency (and How to Avoid Them) for the rest.

How do I assign owners without a big team?

One person can own multiple plays as long as each play has exactly one accountable name. The danger is shared ownership, where everyone assumes someone else is watching the dashboard. Even a two-person team should write down who owns what.

Key Takeaways

A playbook turns latency from an emergency into a routine of named plays with triggers and owners.
Instrument TTFT, inter-token latency, and p95/p99 before doing anything else.
Run the plays in cost order: output shaping, caching, routing, batching, then capacity.
The fastest token is the one you never generate, so cap output and stream by default.
Watch the tail, not the average, and always set hard timeouts so requests never hang.

Agency Script Editorial
