MISER: Being Stingy With Inference Compute

One-off latency fixes work until they do not. The third time you debug a slow AI feature from scratch, you realize you need a reusable model — a repeatable way to reason about any inference system rather than improvising each time. This article offers one. We call it MISER, because reducing inference latency is fundamentally about being miserly with compute: do less, and do it closer to where it is needed.

MISER stands for Measure, Isolate, Shrink, Edge, Reassess. The five stages run in order, and the last loops back to the first. It is deliberately simple. A framework you can hold in your head beats a more "complete" one you have to look up, and the stages map cleanly onto how inference latency actually decomposes.

Use it for diagnosing a slow feature, for designing a new one, or as a shared vocabulary so a team can argue about the right stage rather than talking past each other.

Stage 1: Measure

Everything starts with real numbers. You cannot reason about latency you have not instrumented.

What this stage requires

Latency split into segments: network, queue, time to first token (TTFT), inter-token, total.
Percentiles (p50, p95, p99), never averages, because the tail is what matters.
Data collected under realistic concurrent load, not single requests.
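The requirements above can be sketched as a small report over per-request timings. This is a minimal illustration, not a prescribed implementation: the record schema (a dict of segment name to milliseconds), the segment keys, and the choice of the nearest-rank percentile method are all assumptions made for the example.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(records, segments=("network", "queue", "ttft", "inter_token", "total")):
    """Build a p50/p95/p99 report per latency segment.

    `records` is a list of dicts mapping segment name -> latency in ms
    (a hypothetical schema; adapt it to whatever your tracing emits).
    """
    report = {}
    for seg in segments:
        values = [r[seg] for r in records if seg in r]
        if values:  # skip segments that were never instrumented
            report[seg] = {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
    return report
```

Feed it records captured while the system is under concurrent load; the same function run on single, sequential requests will understate queueing.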

If you skip Measure, every later stage is a guess. This is the same foundation built in A Step-by-Step Approach to AI Inference and Latency. The output of this stage is a clear picture of where time goes.

Stage 2: Isolate

With data in hand, find the single dominant cost. There is almost always one segment that dominates, and the framework forces you to name it before acting.

High TTFT with a short prompt points to queueing or cold start.
High TTFT with a long prompt points to prefill and oversized context.
Slow per-token streaming points to decode and model size.
A spiky tail alone points to batching and concurrency.

Isolate is the discipline stage. Its entire purpose is to stop you from fixing a symptom that is not the bottleneck. Do not leave it until you can point to one cause.
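The four symptom rules in this stage can be written down as a first-pass triage function. Treat it as a sketch: every threshold below (the TTFT and inter-token budgets, what counts as a long prompt, and the p99/p50 ratio that counts as a spiky tail) is an assumed placeholder to be replaced with your own targets.

```python
def isolate(ttft_ms, inter_token_ms, prompt_tokens, tail_ratio, *,
            ttft_budget_ms=600, inter_token_budget_ms=50,
            long_prompt_tokens=4000, spiky_tail_ratio=5.0):
    """Name one likely dominant cost from measured symptoms.

    tail_ratio is p99 / p50 of total latency. All keyword thresholds
    are illustrative assumptions, not recommended values.
    """
    if ttft_ms > ttft_budget_ms:
        if prompt_tokens >= long_prompt_tokens:
            return "prefill / oversized context"
        return "queueing or cold start"
    if inter_token_ms > inter_token_budget_ms:
        return "decode / model size"
    # Only reached when TTFT and streaming look fine: "a spiky tail alone".
    if tail_ratio > spiky_tail_ratio:
        return "batching / concurrency"
    return "no dominant cost identified"
```

The function returns exactly one label on purpose. If the data does not support a single answer, that is a signal to go back to Measure rather than to fix several things at once.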

Stage 3: Shrink

Now reduce the dominant cost — and the verb is deliberate, because most latency wins come from shrinking something: the model, the context, or the work done per request.

Shrink the model — right-size to the smallest model meeting quality, or quantize when decode is the issue.
Shrink the context — trim prompts, cap history, retrieve fewer documents.
Shrink the work — cache prompt prefixes and full responses so repeated computation disappears.

The unifying idea is subtraction. The fastest inference is the inference you do not have to compute. This stage draws directly on AI Inference and Latency: Best Practices That Actually Work.
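Two of these subtractions are easy to sketch in application code: capping conversation history and caching full responses. The example below assumes chat messages are dicts with a "role" key and uses a simple exact-match LRU cache; both the message shape and the `compute` callback are hypothetical stand-ins for your own client. Prompt-prefix caching is not shown, since it typically lives inside the serving stack rather than the application.

```python
from collections import OrderedDict

def cap_history(messages, max_turns):
    """Keep a leading system message (if any) plus the most recent max_turns messages."""
    if messages and messages[0].get("role") == "system":
        head, rest = messages[:1], messages[1:]
    else:
        head, rest = [], list(messages)
    if max_turns <= 0:
        return head
    return head + rest[-max_turns:]

class ResponseCache:
    """Exact-match LRU cache so a repeated prompt skips inference entirely."""

    def __init__(self, max_entries=1024):
        self._store = OrderedDict()
        self._max = max_entries

    def get_or_compute(self, prompt, compute):
        if prompt in self._store:
            self._store.move_to_end(prompt)   # mark as recently used
            return self._store[prompt]
        result = compute(prompt)              # the expensive inference call
        self._store[prompt] = result
        if len(self._store) > self._max:
            self._store.popitem(last=False)   # evict least recently used
        return result
```

An exact-match cache only helps when identical prompts recur, so it suits FAQ-style traffic more than free-form chat.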

Stage 4: Edge

"Edge" is about location and serving — moving compute closer and serving it smarter.

What this stage covers

Co-locate the application and inference server in the same region to cut network round trips.
Enable continuous batching so requests join in-flight batches instead of waiting.
Pre-warm connections and capacity to avoid cold-start tail spikes.
Tune batch windows to the workload — tight for interactive, wide for batch.

Edge handles the latency that is not about the model or the prompt but about how and where requests are served. It is often where the tail latency hides once Shrink has handled the bulk.
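To see why the batch window should be tight for interactive traffic, it helps to compute how long each request waits before its batch is dispatched. The toy model below assumes a fixed window that opens on the first arrival and closes `window_ms` later; it is not continuous batching, and real servers use their own scheduling policies, so treat the numbers as intuition only.

```python
def window_batches(arrivals_ms, window_ms):
    """Group arrival timestamps into batches under a fixed window.

    A batch opens at its first arrival and accepts requests that arrive
    before the window elapses (a simplified, assumed policy).
    """
    batches = []
    for t in sorted(arrivals_ms):
        if batches and t < batches[-1][0] + window_ms:
            batches[-1].append(t)
        else:
            batches.append([t])
    return batches

def added_wait(arrivals_ms, window_ms):
    """Per-request wait (ms) between arrival and batch dispatch."""
    waits = []
    for batch in window_batches(arrivals_ms, window_ms):
        close = batch[0] + window_ms
        waits.extend(close - t for t in batch)
    return waits

arrivals = [0, 3, 12, 14, 40]
tight = added_wait(arrivals, 10)   # three small batches, short waits
wide = added_wait(arrivals, 50)    # one large batch, long waits
```

In this model the worst-case added wait equals the window itself, which is why a wide window that is efficient for offline jobs shows up directly in interactive tail latency.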

Stage 5: Reassess

Re-run the exact measurement from Stage 1 and compare. Did the dominant cost shrink? Did you hit your target? Did anything regress — quality, cost, a different segment?

Reassess closes the loop. If the target is met, you stop and add perceived-speed polish like streaming. If not, you return to Isolate with fresh data, because the dominant cost has likely moved. Latency optimization is iterative; the second bottleneck only becomes visible once the first is gone.
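The decision Reassess makes can be expressed as a comparison of two measurement snapshots. This sketch assumes each snapshot is a dict of lower-is-better metrics (latency segments, cost per request); the metric names, the `tolerance` parameter, and the returned labels are hypothetical. Quality scores, which are higher-is-better, would need a separate check.

```python
def reassess(before, after, target_ms, metric="ttft_p95", tolerance=0.0):
    """Compare snapshots and decide whether to stop or return to Isolate.

    before/after: dicts of lower-is-better metrics (assumed schema).
    tolerance: fractional increase allowed in other metrics before it
    counts as a regression (0.05 means 5%).
    """
    regressions = [
        name for name in after
        if name != metric and name in before
        and after[name] > before[name] * (1 + tolerance)
    ]
    target_met = after[metric] <= target_ms
    return {
        "improved": after[metric] < before[metric],
        "target_met": target_met,
        "regressions": regressions,
        "next": "stop" if target_met and not regressions else "isolate",
    }

result = reassess(
    before={"ttft_p95": 3000, "cost_per_request": 1.0},
    after={"ttft_p95": 700, "cost_per_request": 0.8},
    target_ms=600,
)
```

With these example numbers the metric improved but the target is still missed, so `result["next"]` is "isolate": the loop continues with fresh data.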

When to Apply Each Stage

You do not always run all five. For a brand-new feature, run Measure and Isolate early on a prototype, then design Shrink and Edge in from the start. For an existing slow feature, run the full loop. For a system that was fast and has since degraded, start at Measure; the answer usually surfaces in Isolate, and a common one is that traffic grew and Edge-stage batching no longer keeps up.

The framework's value is that it gives the same map regardless of entry point. You always know which stage you are in and what its output should be.

A Walkthrough of the Full Loop

To see MISER in motion, run it on a slow retrieval-augmented assistant.

Measure: instrument each latency segment and discover p95 TTFT is 3 seconds, far over the 600 ms target.
Isolate: the prompt is long because retrieval returns twenty documents — this is a prefill problem driven by oversized context, not decode.
Shrink: cut retrieval to the five most relevant documents and cache the static system prompt as a prefix.
Edge: co-locate the app with the inference server and enable continuous batching to handle concurrency.
Reassess: re-measure; p95 TTFT is now 700 ms, better but still over target. Loop back to Isolate. The new dominant cost is queueing at peak, which a larger warm-capacity pool resolves on the next pass.

The loop did its job. The first pass killed the context bottleneck; the second revealed and fixed queueing. You never touched the model, because Isolate never pointed there. That restraint is the framework working as intended.

Why the Loop Beats a Linear Process

A linear checklist assumes you fix everything once and finish. Real systems do not cooperate. Fixing the dominant cost almost always exposes a second one that was hidden behind it, and a third behind that. MISER's loop is built for this reality.

The compounding payoff

Each pass through the loop is cheaper than the last, because your instrumentation is already in place and you have learned where this particular system hides its costs. The first loop is the expensive one; subsequent loops are fast. Teams that adopt the framework find that latency stops being a recurring fire drill and becomes routine maintenance — the same shift toward habit that AI Inference and Latency: Best Practices That Actually Work argues for.

Frequently Asked Questions

Why a named framework instead of just a checklist?

A checklist tells you what to do; a framework tells you how to think and in what order. MISER gives a team shared language — "we are stuck in Isolate" is a more useful statement than a list of unchecked boxes. The two complement each other.

What if Isolate shows two equal bottlenecks?

True ties are rare, so attack whichever is larger, even by a small margin, then loop back through Reassess. Fixing the bigger cost often changes the picture entirely, sometimes making the second one irrelevant. Avoid fixing two segments at once, because you lose the ability to attribute the improvement.

Does the framework apply to non-LLM inference?

Yes. Measure, Isolate, Shrink, Edge, Reassess work for any inference system — image models, recommendation systems, classifiers. The specifics of Shrink and Edge change, but the loop and the discipline of measuring before changing are universal.

How is Shrink different from Edge?

Shrink reduces the amount of work — smaller model, less context, cached results. Edge changes where and how that work is served — location, batching, warm capacity. They address different latency sources, which is why the framework separates them rather than lumping all optimization together.

When do I stop looping?

When Reassess shows you have met your latency target without an unacceptable regression in quality or cost. At that point you switch from reducing real latency to improving perceived speed. There is always more you could shave, but the target tells you when more is not worth it.

Key Takeaways

MISER — Measure, Isolate, Shrink, Edge, Reassess — is a reusable loop for any inference latency problem.
Measure first with segmented percentiles under load; never act on guesses.
Isolate forces you to name the single dominant cost before fixing anything.
Shrink reduces work: smaller models, less context, more caching.
Edge moves and serves compute smarter through co-location, batching, and warm capacity.
Reassess closes the loop and tells you when to stop or where to iterate next.

Agency Script Editorial
