Most teams measure the wrong latency number and then spend weeks optimizing the wrong thing. They watch average response time, see it sitting at 800 milliseconds, and conclude the system feels fine. Meanwhile a tenth of their users wait four seconds for a streamed answer that never seems to start. Averages hide the tail, and the tail is where users abandon, retry, and churn.
AI inference latency is not a single number. It is a distribution of several distinct timing components, each with its own cause and its own fix. If you cannot name which component is slow, you cannot make it fast. This article defines the metrics that matter for large language model and other generative inference, explains how to instrument each one, and shows you how to read the signal so the next optimization sprint targets the bottleneck instead of a vanity statistic.
The Latency Metrics That Actually Matter
For token-streaming models, two metrics dominate user perception, and a third governs your cost ceiling.
Time to First Token (TTFT)
TTFT is the wall-clock time from request submission to the first token arriving back. This is the number users feel as "responsiveness." A 200ms TTFT feels instant. A 2-second TTFT feels broken even when the complete answer finishes no later than it otherwise would. TTFT is dominated by prompt length, queue depth, and the prefill phase, in which the model processes your entire input before generating anything.
Inter-Token Latency (ITL) and Tokens Per Second
Once generation starts, ITL measures the gap between consecutive output tokens. The inverse, tokens per second, tells you how fast the answer "reads out." For chat interfaces, roughly 25 to 30 tokens per second or more outpaces fluent reading and feels smooth. Below 10, users watch a stutter.
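To make the two definitions concrete, here is a minimal sketch that derives TTFT, mean ITL, and tokens per second from per-token arrival timestamps. The names `StreamTimings` and `summarize_stream` are hypothetical, and the timestamps are made up for illustration; the sketch assumes you already record one timestamp per received token.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StreamTimings:
    ttft_s: float                        # request sent -> first token received
    mean_itl_s: Optional[float]          # mean gap between consecutive tokens
    tokens_per_second: Optional[float]   # inverse of the mean gap


def summarize_stream(request_sent_s: float, token_times_s: Sequence[float]) -> StreamTimings:
    """Derive TTFT, mean ITL, and tokens/sec from per-token arrival timestamps."""
    if not token_times_s:
        raise ValueError("no tokens were received")
    ttft = token_times_s[0] - request_sent_s
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    if not gaps or sum(gaps) == 0:
        # ITL is undefined with fewer than two tokens (or identical timestamps).
        return StreamTimings(ttft, None, None)
    mean_itl = sum(gaps) / len(gaps)
    return StreamTimings(ttft, mean_itl, 1.0 / mean_itl)


# Hypothetical timestamps in seconds: request sent at 0.0, five tokens received.
timings = summarize_stream(0.0, [0.25, 0.30, 0.35, 0.40, 0.45])
print(timings)
```

Note that the tokens-per-second figure here deliberately excludes the wait for the first token, so a slow prefill does not get blamed on decode speed.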
End-to-End Latency and Throughput
End-to-end latency is the full request duration. Throughput is requests or tokens served per second across the whole system. These two are in tension: batching more requests raises throughput but can raise TTFT for individual users. Knowing both lets you make that trade-off deliberately instead of by accident.
Why Percentiles Beat Averages
Never report inference latency as a mean. Report p50, p95, and p99.
- p50 (median) tells you the typical experience.
- p95 tells you what your unlucky users feel: one request in twenty is at least this slow, which adds up quickly at any real traffic volume.
- p99 exposes the pathological tail: cold starts, evicted KV cache, oversized prompts, retries.
A system with a 400ms median and a 6-second p99 has a real problem that the average completely conceals. Tail latency compounds in multi-step agent workflows: chain five sequential calls and, if the calls are independent, the chance that at least one lands at or beyond its p99 is 1 - 0.99^5, about 4.9%, so the workflow hits the tail roughly five times as often as a single call does. If you put one thing on your dashboards, make it the p95 and p99 of TTFT.
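The gap between the mean and the percentiles is easy to see on a small example. The sketch below uses a nearest-rank percentile (one of several common definitions, chosen here for simplicity) on made-up TTFT samples where a tenth of requests are slow. The helper names `percentile` and `chance_of_tail_hit` are hypothetical, and the tail calculation assumes the chained calls are independent.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def chance_of_tail_hit(n_calls: int, tail_fraction: float = 0.01) -> float:
    """Probability that at least one of n independent calls lands in the slowest tail_fraction."""
    return 1 - (1 - tail_fraction) ** n_calls


# Hypothetical TTFT samples in seconds: 90 fast requests, 10 slow ones.
ttft_s = [0.4] * 90 + [4.0] * 10

summary = {
    "mean": mean(ttft_s),
    "p50": percentile(ttft_s, 50),
    "p95": percentile(ttft_s, 95),
    "p99": percentile(ttft_s, 99),
}
print(summary)
print(chance_of_tail_hit(5))
```

On this data the mean lands well under a second while p95 and p99 both sit at four seconds, which is exactly the situation a mean-only dashboard fails to show.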
How to Instrument Each Component
You cannot fix what you do not segment. Instrument latency in layers.
Client-Side Timing
Capture the timestamp at request initiation and at first-byte receipt in the client. This includes network and TLS overhead that server metrics miss. For real user experience, this is the ground truth.
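A minimal sketch of client-side timing, assuming your client exposes the response as an iterable of text chunks. `timed_stream` is a hypothetical helper, and `fake_stream` is a stand-in for a real streaming client, with sleeps simulating network, queueing, and generation delays.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def timed_stream(start_request: Callable[[], Iterable[str]]) -> Tuple[List[str], float, float]:
    """Start a request, consume its stream, and return (chunks, ttft_s, total_s)."""
    t0 = time.perf_counter()  # taken before the request is issued, so setup cost is included
    first_chunk_at = None
    chunks: List[str] = []
    for chunk in start_request():
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        chunks.append(chunk)
    t_end = time.perf_counter()
    if first_chunk_at is None:
        raise RuntimeError("stream ended before any chunk arrived")
    return chunks, first_chunk_at - t0, t_end - t0


def fake_stream() -> Iterator[str]:
    """Stand-in for a real streaming client; the delays are simulated."""
    time.sleep(0.05)  # simulated network + queue + prefill
    for piece in ["Hello", ",", " world"]:
        yield piece
        time.sleep(0.01)  # simulated inter-token gap


chunks, ttft_s, total_s = timed_stream(fake_stream)
print(f"ttft={ttft_s:.3f}s total={total_s:.3f}s text={''.join(chunks)!r}")
```

`time.perf_counter` is used rather than wall-clock time because it is monotonic and intended for measuring short intervals.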
Server and Model Timing
Inside the serving layer, record:
- Queue time — how long the request waited before a worker picked it up. Rising queue time means you are capacity-bound, not model-bound.
- Prefill time — duration to process the input prompt. Scales with input token count.
- Decode time — duration of token generation. Scales with output token count.
Splitting prefill from decode is the single most useful instrumentation decision you can make. A slow TTFT caused by prefill points to prompt size or batching policy; a slow TTFT caused by queue time points to under-provisioned capacity. The fixes are completely different.
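One way to capture this split is to record four timestamps per request and derive the components from them. The sketch below is an assumption-laden simplification: `RequestTimeline` and its field names are hypothetical, and it treats prefill and decode as two clean phases, which may not hold exactly on servers that interleave them across batched requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTimeline:
    """Monotonic-clock timestamps (seconds) recorded by the serving layer."""
    enqueued_s: float      # request accepted and placed in the queue
    started_s: float       # a worker picked the request up; prefill begins
    first_token_s: float   # prefill finished and the first token was produced
    finished_s: float      # last token produced

    @property
    def queue_s(self) -> float:
        return self.started_s - self.enqueued_s

    @property
    def prefill_s(self) -> float:
        return self.first_token_s - self.started_s

    @property
    def decode_s(self) -> float:
        return self.finished_s - self.first_token_s

    @property
    def server_ttft_s(self) -> float:
        # Server-side TTFT is queue wait plus prefill; network time is not included.
        return self.queue_s + self.prefill_s

    def ttft_dominated_by(self) -> str:
        return "queue" if self.queue_s > self.prefill_s else "prefill"


# Hypothetical request: 0.9 s waiting, 0.2 s prefill, 2.0 s decode.
timeline = RequestTimeline(enqueued_s=10.00, started_s=10.90, first_token_s=11.10, finished_s=13.10)
print(timeline.queue_s, timeline.prefill_s, timeline.decode_s, timeline.ttft_dominated_by())
```

In this made-up request the TTFT is over a second, yet most of it is queue wait, so shrinking the prompt would barely help.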
Token Accounting
Log input and output token counts on every request. Latency without token counts is uninterpretable, because a 3-second response on 4,000 output tokens is excellent while the same 3 seconds on 50 tokens is alarming. This connects directly to the cost story covered in The ROI of AI Inference and Latency: Building the Business Case.
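The comparison above becomes obvious once latency is normalized by output tokens. This sketch shows one possible normalization and a structured log line that carries the token counts with it; the function names and log field names are hypothetical, not a standard schema.

```python
import json


def ms_per_output_token(latency_s: float, output_tokens: int) -> float:
    """Normalize end-to-end latency by the number of generated tokens."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return latency_s * 1000.0 / output_tokens


def request_log_line(request_id: str, latency_s: float, input_tokens: int, output_tokens: int) -> str:
    """Build one JSON log line; the field names are illustrative."""
    return json.dumps(
        {
            "request_id": request_id,
            "latency_s": latency_s,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "ms_per_output_token": ms_per_output_token(latency_s, output_tokens),
        },
        sort_keys=True,
    )


long_answer = ms_per_output_token(3.0, 4000)   # the "excellent" case from the text
short_answer = ms_per_output_token(3.0, 50)    # the "alarming" case from the text
print(long_answer, short_answer)
print(request_log_line("req-123", 3.0, 1200, 50))
```

The same three seconds works out to well under a millisecond per token in one case and tens of milliseconds per token in the other.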
Reading the Signal: Diagnosing From the Numbers
Once you have the components, diagnosis becomes mechanical.
- High TTFT, low queue time, large prompts → prefill-bound. Trim context, cache prompt prefixes, or reduce retrieved chunks.
- High TTFT, high queue time → capacity-bound. Add replicas or raise batch concurrency.
- Low TTFT, slow tokens per second → decode-bound. Consider a smaller or quantized model, speculative decoding, or better hardware.
- Good p50, terrible p99 → cold starts or cache eviction. Keep instances warm and pin hot prompts.
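The four rules above can be sketched as a small function. Everything numeric here is an illustrative assumption: the threshold values, the choice of which percentile feeds each rule, and the `LatencySnapshot` and `Thresholds` names are hypothetical and should be tuned to your own service.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LatencySnapshot:
    ttft_p50_s: float
    ttft_p95_s: float
    ttft_p99_s: float
    queue_p95_s: float
    input_tokens_p95: int
    tokens_per_second_p50: float


@dataclass(frozen=True)
class Thresholds:
    # Illustrative assumptions; tune per use case.
    high_ttft_s: float = 1.0
    high_queue_s: float = 0.3
    large_prompt_tokens: int = 4000
    slow_tokens_per_second: float = 10.0
    tail_ratio: float = 5.0  # p99 / p50 above this counts as a terrible tail


def diagnose(s: LatencySnapshot, t: Optional[Thresholds] = None) -> List[str]:
    """Map a snapshot of latency metrics to the likely bottleneck(s)."""
    t = t or Thresholds()
    findings: List[str] = []
    high_ttft = s.ttft_p95_s > t.high_ttft_s
    high_queue = s.queue_p95_s > t.high_queue_s
    if high_ttft and high_queue:
        findings.append("capacity-bound")
    elif high_ttft and s.input_tokens_p95 >= t.large_prompt_tokens:
        findings.append("prefill-bound")
    if not high_ttft and s.tokens_per_second_p50 < t.slow_tokens_per_second:
        findings.append("decode-bound")
    if s.ttft_p50_s > 0 and s.ttft_p99_s / s.ttft_p50_s > t.tail_ratio:
        findings.append("tail: cold starts or cache eviction")
    return findings


# Hypothetical snapshot: slow first token, short queue, very long prompts.
snapshot = LatencySnapshot(1.5, 2.0, 2.5, 0.05, 6000, 40.0)
print(diagnose(snapshot))
```

A function like this will not replace judgment, but it forces the team to write down what "high" and "slow" mean before the next incident.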
The same disciplined diagnosis underpins the patterns in AI Inference and Latency: Best Practices That Actually Work, and the avoidable errors in 7 Common Mistakes with AI Inference and Latency almost always trace back to a missing metric.
Setting Targets and Budgets
Metrics are useless without thresholds. Define a latency budget per use case:
- Interactive chat: TTFT p95 under 1 second, 25+ tokens/sec.
- Autocomplete / inline suggestions: TTFT p95 under 300ms; speed beats quality here.
- Batch or async generation: end-to-end latency matters and TTFT is irrelevant; optimize throughput and cost.
Write these targets into your service-level objectives and alert when p95 breaches them, not when the average does. A budget turns a vague "feels slow" complaint into a concrete, testable engineering goal. For the full instrumentation walkthrough, pair this with A Step-by-Step Approach to AI Inference and Latency.
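As a sketch of what "alert on p95, not the average" looks like in code, the example below encodes the budgets listed above and checks a batch of samples against them. The `LatencyBudget` type, the `BUDGETS` table layout, and the `breaches` function are hypothetical; the percentile uses a simple nearest-rank definition, and the samples are made up.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class LatencyBudget:
    ttft_p95_max_s: Optional[float]         # None: TTFT is not part of this budget
    min_tokens_per_second: Optional[float]  # None: no decode-rate target


# Targets taken from the list above; batch has no TTFT target.
BUDGETS: Dict[str, LatencyBudget] = {
    "interactive_chat": LatencyBudget(ttft_p95_max_s=1.0, min_tokens_per_second=25.0),
    "autocomplete": LatencyBudget(ttft_p95_max_s=0.3, min_tokens_per_second=None),
    "batch": LatencyBudget(ttft_p95_max_s=None, min_tokens_per_second=None),
}


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    return ordered[math.ceil(95 * len(ordered) / 100) - 1]


def breaches(use_case: str, ttft_samples_s: Sequence[float], tokens_per_second: float) -> List[str]:
    """Return a human-readable message for every budget the samples violate."""
    budget = BUDGETS[use_case]
    problems: List[str] = []
    if budget.ttft_p95_max_s is not None and p95(ttft_samples_s) > budget.ttft_p95_max_s:
        problems.append(f"TTFT p95 {p95(ttft_samples_s):.2f}s exceeds {budget.ttft_p95_max_s:.2f}s")
    if budget.min_tokens_per_second is not None and tokens_per_second < budget.min_tokens_per_second:
        problems.append(f"{tokens_per_second:.0f} tokens/sec is below {budget.min_tokens_per_second:.0f}")
    return problems


# Hypothetical window: the mean TTFT is 0.54 s, but two of twenty requests took 1.8 s.
samples = [0.4] * 18 + [1.8] * 2
alerts = breaches("interactive_chat", samples, tokens_per_second=32.0)
print(alerts)
```

In this window a mean-based alert with the same one-second threshold would stay silent, while the p95 check fires.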
Connecting Metrics to Quality and Cost
Latency metrics in isolation lie by omission. A system can be blazingly fast because it switched to a model that gives worse answers, and a latency dashboard alone would call that a win. To avoid optimizing yourself into a corner, instrument three families of metrics together and read them as one picture.
Pair Latency With Quality Signals
Every latency improvement should be checked against quality. In production, the practical quality signals are behavioral: thumbs-down rates, retry rates, escalation to a human or to a larger model, and conversation abandonment. When you ship a latency change, watch these in parallel. A 40% TTFT improvement that doubles the retry rate is a regression wearing a disguise, and only the joint view reveals it. This is the discipline that keeps the failure modes in The Hidden Risks of AI Inference and Latency from creeping in unnoticed.
Pair Latency With Cost Per Request
Token counts are the bridge between latency and money. Cost per request equals tokens per request times your effective cost per token, and that same token count drives prefill and decode time. This means many optimizations move both metrics at once: trimming a system prompt lowers prefill latency and per-request cost together. Track cost per request on the same dashboard as p95 latency so the trade-offs are visible in one glance rather than discovered in a monthly invoice.
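The arithmetic is simple enough to sketch directly. The example below assumes input and output tokens are priced separately, which is a common arrangement but an assumption here; with a single effective rate, set both prices equal. The prices, token counts, and the `TokenPrices` and `cost_per_request` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrices:
    # Hypothetical prices in dollars per one million tokens.
    input_per_million: float
    output_per_million: float


def cost_per_request(input_tokens: int, output_tokens: int, prices: TokenPrices) -> float:
    """Cost in dollars: tokens per request times the cost per token, summed over both directions."""
    return (
        input_tokens * prices.input_per_million
        + output_tokens * prices.output_per_million
    ) / 1_000_000


prices = TokenPrices(input_per_million=2.0, output_per_million=8.0)

# Trimming a 1,500-token system prompt to 500 tokens, same 300-token answer.
before = cost_per_request(1500, 300, prices)
after = cost_per_request(500, 300, prices)
print(f"before=${before:.4f} after=${after:.4f} saved={100 * (1 - after / before):.0f}%")
```

The 1,000 input tokens removed here are also 1,000 fewer tokens to prefill, which is why this kind of change tends to move cost and TTFT in the same direction.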
Build a Single Health View
The mature setup shows, per service and per use case: p50/p95/p99 of TTFT and end-to-end, tokens per second, average input and output tokens, cost per request, and at least one quality signal. When all of these sit on one screen, diagnosis stops being archaeology. You see that p99 spiked, that it correlates with longer inputs, that cost per request rose with it, and you know exactly where to look. That unified view is what separates teams who manage latency from teams who merely react to it.
Frequently Asked Questions
What is the single most important inference latency metric?
Time to first token at the 95th percentile (p95). It is what users perceive as responsiveness, and the percentile captures the unlucky-but-common experience that the average hides. If you track only one number, track that one.
Why shouldn't I use average latency?
Averages mask the tail. A system can have a healthy mean while one in twenty requests is painfully slow. Users remember the slow ones. Percentiles (p50, p95, p99) reveal the distribution and the failure modes that averages erase.
How do I separate prompt processing from generation time?
Instrument prefill time and decode time separately inside your serving layer. Prefill scales with input length; decode scales with output length. Splitting them tells you whether to shrink your prompt or change your model and hardware.
What is a good tokens-per-second rate for chat?
Roughly 25 to 30 tokens per second or higher feels smooth, because it outpaces fluent reading speed. Below 10 tokens per second users see visible stutter. Inline autocomplete needs much faster first-token times but fewer tokens overall.
Does network latency count as inference latency?
For the user, yes — they feel total time. Measure client-side first-byte timing to capture network and TLS overhead, but keep it as a separate component so you do not blame the model for a slow connection.
Key Takeaways
- Inference latency is a distribution of components, not one number; name the component before you optimize.
- TTFT and inter-token latency drive user perception; throughput governs cost.
- Always report p50, p95, and p99 — averages hide the tail where users abandon.
- Instrument queue, prefill, and decode time separately to make diagnosis mechanical.
- Log token counts alongside latency, or the numbers are uninterpretable.
- Set per-use-case latency budgets and alert on p95 breaches, not on the mean.