Once your model fits in memory and your jobs run reliably, the basics are behind you and the easy wins are gone. The next tier of performance and cost savings comes from the parts of the system nobody prints on a spec sheet: how the KV cache is managed, how GPUs talk to each other, how requests are scheduled, and where the long tail of latency hides. This is the territory where a team that understands its workload deeply can run two or three times cheaper than one that simply buys more cards.
This guide is for practitioners who already know how to size and provision compute. It goes into the edge cases, the second-order effects, and the expert nuances that separate a competently run fleet from an optimized one. The throughline is that at this level, software and topology decisions dominate hardware choice.
The KV Cache Is the Hidden Memory Hog
For transformer inference, the key-value cache is the memory structure that grows with every token of context and every concurrent request. At the basic level you treat model weights as your memory budget. At the advanced level you realize the KV cache can consume as much memory as the weights, and that managing it is where serving capacity is won or lost.
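To make the scaling concrete, here is a back-of-the-envelope estimate in Python. It assumes a standard attention layout that stores one key vector and one value vector per token, per layer, per KV head. The model shape (32 layers, 8 KV heads of dimension 128, FP16 cache) is hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   concurrent_requests, bytes_per_element=2):
    """Estimate KV cache size: one key and one value vector per token, per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens * concurrent_requests

# Hypothetical model shape; substitute the values from your own model config.
per_request = kv_cache_bytes(32, 8, 128, context_tokens=8192, concurrent_requests=1)
fleet = kv_cache_bytes(32, 8, 128, context_tokens=8192, concurrent_requests=16)
print(f"{per_request / 2**30:.2f} GiB per request, {fleet / 2**30:.2f} GiB for 16 requests")
```

Under these assumptions a single 8K-token request holds 1 GiB of cache, so sixteen of them hold 16 GiB, which is roughly what 8 billion FP16 parameters would occupy.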
Paged and Shared Cache
Naive serving allocates a contiguous cache block per request sized for the maximum context, wasting memory on short requests. Paged attention allocates the cache in small blocks on demand, dramatically increasing how many concurrent requests fit on a card. Adopting a serving engine that does this is often a larger throughput win than upgrading hardware.
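The difference can be sketched with a toy admission model. This is an illustration of the idea, not any serving engine's allocator: the function names, the 16-token block size, and the request lengths are all assumptions, and it ignores the fact that real requests keep growing during decode.

```python
import math

def contiguous_slots(budget_tokens, max_context):
    """Requests that fit when each one reserves the full maximum context."""
    return budget_tokens // max_context

def paged_slots(budget_tokens, request_lengths, block_size=16):
    """Requests that fit when cache is handed out in fixed-size blocks on demand."""
    free_blocks = budget_tokens // block_size
    admitted = 0
    for length in request_lengths:
        needed = math.ceil(length / block_size)
        if needed > free_blocks:
            break
        free_blocks -= needed
        admitted += 1
    return admitted

# Hypothetical traffic: mostly short requests, a few long ones.
lengths = [300, 1200, 150, 4000, 700, 90, 2500, 500] * 4
print(contiguous_slots(65536, 8192))  # reserve 8192 tokens per request
print(paged_slots(65536, lengths))    # allocate only what each request uses
```

With a 65,536-token cache budget, reserving the maximum context admits 8 requests, while block-level allocation admits all 32 in this sample because most of them are far shorter than the maximum.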
Cache Reuse Across Requests
When many requests share a common prefix, such as a long system prompt, that portion of the cache can be computed once and shared. At scale this prefix caching cuts both latency and memory. Recognizing the opportunity requires knowing your traffic shape, which is why instrumentation matters more here than anywhere.
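One common way to detect a shared prefix is to key each cache block by a hash of all tokens up to and including that block. The sketch below assumes that approach. `PrefixCache`, `block_keys`, and the 4-token block size are hypothetical names and values for illustration, and the stored object is a stand-in for real K/V tensors.

```python
import hashlib

BLOCK = 4  # tokens per cache block (tiny, for illustration)

def block_keys(token_ids):
    """Chain-hash full blocks so each key identifies the whole prefix up to it."""
    keys, running = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK
    for start in range(0, full, BLOCK):
        running.update(repr(token_ids[start:start + BLOCK]).encode())
        keys.append(running.hexdigest())
    return keys

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # key -> placeholder for cached K/V tensors

    def admit(self, token_ids):
        """Return (reused_blocks, computed_blocks) for one request."""
        reused = computed = 0
        for key in block_keys(token_ids):
            if key in self.blocks:
                reused += 1
            else:
                self.blocks[key] = object()
                computed += 1
        return reused, computed

system_prompt = list(range(12))          # stand-in for a long shared system prompt
cache = PrefixCache()
first = cache.admit(system_prompt + [100, 101, 102, 103])
second = cache.admit(system_prompt + [200, 201, 202, 203])
print(first, second)
```

The first request computes all four blocks. The second reuses the three blocks covering the system prompt and computes only its own final block, which is the saving the paragraph above describes.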
Multi-GPU Topology Is a First-Class Concern
When a model or a workload spans multiple GPUs, how those GPUs are wired together stops being a detail and becomes a primary constraint. Advanced practitioners reason about topology before throughput.
- Tensor parallelism splits a single layer across GPUs and is extremely communication-heavy, so it demands the fastest possible interconnect within a node. Spread it across slow links and communication dominates.
- Pipeline parallelism splits layers across GPUs and tolerates slower links, but introduces bubbles where some GPUs idle waiting for others.
- Data parallelism replicates the model and is the simplest, but multiplies memory cost.
The art is matching the parallelism strategy to the interconnect you have. A choice that is optimal on a high-bandwidth node is wasteful across a slower fabric. Our trade-offs guide introduces this kind of decision; advanced work pushes it to its limits.
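Two first-order formulas help when reasoning about these strategies: the bandwidth term of a ring all-reduce, which dominates tensor-parallel communication, and the idle fraction of a simple fill-and-drain pipeline schedule. The sketch below uses them with assumed numbers. The message size and link speeds are hypothetical, and the models ignore latency, compute and communication overlap, and smarter schedules.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, link_bytes_per_s):
    """Bandwidth term of a ring all-reduce; ignores latency and overlap."""
    return 2 * (num_gpus - 1) / num_gpus * message_bytes / link_bytes_per_s

def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a simple fill-and-drain pipeline schedule."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

msg = 64 * 2**20                              # hypothetical 64 MiB per all-reduce
fast = ring_allreduce_seconds(msg, 8, 400e9)  # assumed fast intra-node link
slow = ring_allreduce_seconds(msg, 8, 25e9)   # assumed slower inter-node link
print(f"all-reduce: {fast * 1e3:.2f} ms on the fast link, {slow * 1e3:.2f} ms on the slow link")
print(f"bubble, 4 stages and 16 microbatches: {pipeline_bubble_fraction(4, 16):.1%}")
```

Because tensor parallelism performs this kind of collective inside every layer, a 16x slower link multiplies a cost that is paid many times per token. The pipeline bubble, by contrast, shrinks as the number of microbatches grows.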
Disaggregated Serving: Splitting Prefill From Decode
A subtle but powerful technique starts from recognizing that the two phases of generation have opposite resource profiles. Prefill, processing the prompt, is compute-bound and bursty. Decode, generating tokens one at a time, is memory-bandwidth-bound and steady. Running both on the same hardware means the two phases compete, and hardware sized for one is poorly matched to the other.
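A rough roofline-style check shows why the profiles differ. Each weight is read once per forward pass but used for about 2 FLOPs per token in that pass, so the FLOPs per byte of weights read scale with the number of tokens processed together. The sketch below assumes FP16 weights and a hypothetical accelerator. It ignores KV cache and activation traffic, so treat it as a first approximation only.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_weight=2):
    """FLOPs per byte of weights read, at about 2 FLOPs per parameter per token."""
    return 2 * tokens_per_pass / bytes_per_weight

def bound(tokens_per_pass, peak_flops, mem_bytes_per_s, bytes_per_weight=2):
    """Classify a pass by comparing its intensity with the hardware ridge point."""
    ridge = peak_flops / mem_bytes_per_s
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_weight)
    return "compute" if intensity >= ridge else "memory-bandwidth"

# Hypothetical accelerator: 1e15 FLOP/s peak and 3e12 bytes/s of memory bandwidth.
PEAK, BW = 1e15, 3e12
print(bound(2048, PEAK, BW))  # prefill over a 2048-token prompt
print(bound(8, PEAK, BW))     # one decode step for a batch of 8 requests
```

With these assumed figures the prefill pass lands on the compute side of the ridge and the decode step on the memory-bandwidth side.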
Disaggregated serving places prefill and decode on separate, differently sized resources so each runs at its natural efficiency. This adds orchestration complexity but can sharply improve utilization and cost per token at scale. It is not worth the complexity for small deployments, but for high-volume serving it is among the highest-leverage moves available. Weigh the operational cost against the savings using the framework in The ROI of AI Compute and GPU Requirements.
Chasing the Latency Long Tail
At the basic level you optimize average latency. At the advanced level you discover that the average is a comforting lie and the tail is what hurts users. The p99 latency, the slowest one percent of requests, is often several times the median, and that tail is where complaints and SLA breaches live.
The tail usually comes from a few sources: a long request blocking a batch, a cold cache miss, garbage collection or memory pressure, or contention from a noisy neighbor. Diagnosing it requires per-request tracing, not aggregate dashboards. The fixes are specific, such as request-length-aware batching that prevents a single long generation from stalling everything behind it. Tackling the tail is unglamorous and high-value, and it is where the metrics discipline from How to Measure AI Compute and GPU Requirements earns its keep.
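The sketch below illustrates both points with synthetic data: how a mean can look healthy while the p99 does not, and a deliberately simple form of length-aware batching that groups requests by their declared generation cap. The latency values, the `max_new_tokens` field, and the static sort-then-chunk policy are assumptions for illustration; production engines typically use more dynamic scheduling.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def length_aware_batches(requests, batch_size):
    """Sort by declared generation cap so long requests batch with long ones."""
    ordered = sorted(requests, key=lambda r: r["max_new_tokens"])
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

# Synthetic per-request latencies in milliseconds, for illustration only.
latencies_ms = [120] * 97 + [900, 1500, 4000]
mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms, percentile(latencies_ms, 50), percentile(latencies_ms, 99))

requests = [{"id": i, "max_new_tokens": n}
            for i, n in enumerate([32, 2048, 64, 48, 1800, 1900])]
batches = length_aware_batches(requests, batch_size=3)
for batch in batches:
    print([r["max_new_tokens"] for r in batch])
```

In this sample the mean is about 180 ms and the median is 120 ms, yet the p99 is 1500 ms. The batching helper keeps the three short requests together so none of them waits behind a 2048-token generation.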
Precision and Quantization at the Edge of Quality
Basic quantization to FP8 is now routine. Advanced work pushes lower, to INT4 and mixed schemes, and that is where quality becomes a live risk. The expert skill is knowing which parts of a model tolerate aggressive quantization and which must stay at higher precision.
Outlier weights and certain layers degrade quality disproportionately when quantized hard, so mixed-precision schemes keep the sensitive parts precise while crushing the rest. Validating that a quantized model still meets quality bars on your actual task, not a generic benchmark, is non-negotiable. A model that looks fine on a standard benchmark can fail badly on your domain. This is the failure mode that turns a cost win into a quality incident, so treat quantization changes with the same rigor as a model change.
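A toy example shows why outliers matter. The sketch below applies symmetric INT4 quantization to one small group of weights, first uniformly and then with the largest-magnitude weight held at full precision. The weight values and function names are made up for illustration, and weight reconstruction error is only a proxy: it does not replace validating quality on your actual task.

```python
def quantize_int4(weights):
    """Symmetric INT4 round trip: snap to integers in [-7, 7], then rescale."""
    scale = max((abs(w) for w in weights), default=0.0) / 7 or 1.0
    return [round(w / scale) * scale for w in weights]

def quantize_mixed(weights, keep_top=1):
    """Keep the largest-magnitude weights at full precision, INT4 the rest."""
    by_magnitude = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    outliers = set(by_magnitude[:keep_top])
    rest = [w for i, w in enumerate(weights) if i not in outliers]
    dequantized = iter(quantize_int4(rest))
    return [w if i in outliers else next(dequantized) for i, w in enumerate(weights)]

def mean_abs_error(original, approx):
    return sum(abs(x - y) for x, y in zip(original, approx)) / len(original)

# Hypothetical weight group with a single large outlier.
group = [0.02, -0.05, 0.03, 0.01, -0.04, 0.06, -0.01, 8.0]
uniform_err = mean_abs_error(group, quantize_int4(group))
mixed_err = mean_abs_error(group, quantize_mixed(group))
print(f"uniform INT4 error: {uniform_err:.4f}, mixed error: {mixed_err:.4f}")
```

With uniform quantization the outlier sets the scale, so every small weight rounds to zero. Protecting the outlier lets the remaining weights use the full INT4 range and cuts the error by more than an order of magnitude in this example.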
When More Hardware Is Genuinely the Answer
Advanced practice is mostly about extracting more from fixed hardware, but discipline cuts both ways: sometimes the honest answer is that you are saturated and need more capacity. The signal is a high, real Model FLOPs Utilization combined with a memory wall you cannot engineer around. If the software is already efficient and the cards are genuinely busy doing useful math, scaling out is correct.
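Model FLOPs Utilization can be estimated from numbers you likely already collect. The sketch below uses the common approximation of about 2 FLOPs per parameter per generated or processed token for a forward pass, which ignores attention FLOPs. The parameter count, throughput, and peak figure are hypothetical.

```python
def model_flops_utilization(params, tokens_per_second, peak_flops,
                            flops_per_param_token=2):
    """Fraction of peak FLOP/s spent on useful model math (forward pass only)."""
    achieved = flops_per_param_token * params * tokens_per_second
    return achieved / peak_flops

# Hypothetical: 8B-parameter model, 20,000 tokens/s per card, 1e15 FLOP/s peak.
mfu = model_flops_utilization(8e9, 20_000, 1e15)
print(f"MFU: {mfu:.0%}")
```

Under these assumptions the card is at 32% MFU, which would suggest there is still software headroom before buying more hardware is the right call.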
The mistake is reaching for that conclusion first. Exhaust the software and topology levers, confirm with instrumentation that the hardware is the bottleneck, and only then buy. The teams that stay efficient at scale are the ones who treat more hardware as the last lever, not the first.
Frequently Asked Questions
How much memory does the KV cache actually use?
It scales with the number of concurrent requests and the length of their contexts, and at scale it can rival or exceed the model weights themselves. This is why paged attention and prefix caching matter so much; they let far more concurrent requests share a card than naive allocation allows.
When should I use tensor parallelism versus pipeline parallelism?
Use tensor parallelism only when you have very fast interconnect within a node, because it is communication-heavy. Use pipeline parallelism when links are slower, accepting some idle bubbles. Match the strategy to your actual interconnect; a choice that is optimal on fast hardware wastes resources on slow fabric.
Is disaggregated serving worth the complexity?
Only at high volume. Splitting prefill and decode onto separate resources improves utilization because the two phases have opposite resource profiles, but it adds real orchestration overhead. For small deployments the complexity outweighs the gain; for large serving fleets it is among the highest-leverage optimizations.
How low can I quantize without hurting quality?
It depends on the model and task, and the only reliable answer comes from validating on your actual workload. FP8 is usually safe; INT4 and below need mixed-precision schemes that protect sensitive layers and outlier weights. Always test against your domain rather than a generic benchmark.
Why does p99 latency matter more than average?
Average latency hides the slow tail of requests where users actually experience pain and SLAs break. The p99 is often several times the median, driven by cache misses or by long requests blocking batches. You cannot see or fix the tail with aggregate dashboards; it requires per-request tracing.
Key Takeaways
- The KV cache can consume as much memory as model weights; paged attention and prefix caching are major wins.
- Match your parallelism strategy to your interconnect topology, not the other way around.
- Disaggregated serving pays off at high volume by giving prefill and decode their natural resources.
- Optimize the latency long tail with per-request tracing; the average hides the real pain.
- Push quantization carefully and validate quality on your actual task, never a generic benchmark.