A GPU reporting 90 percent utilization can still be wasting half its capacity. The headline utilization number that monitoring dashboards show by default counts any cycle where the GPU is doing something, even if that something is waiting on memory or stalling on a tiny kernel. Teams see a high number, conclude the hardware is saturated, and buy more cards when the real problem is a pipeline that starves the silicon.
Measuring AI compute well means picking metrics that map to the question you are actually asking. Are you deciding whether to scale out? Tuning a serving stack? Justifying a budget? Each question wants a different KPI. This piece defines the metrics that matter, explains how to instrument them, and shows how to read the signal so you stop reacting to numbers that do not mean what you think.
The Four Metrics That Tell the Real Story
There is no single number for compute health. These four, read together, give you an honest picture.
Throughput at a Fixed Latency
For inference, the metric that matters is how many requests or tokens you can serve per second without breaching your latency target. Throughput alone is meaningless because you can always push it higher by letting latency degrade. Always quote throughput with the latency ceiling it was measured at, such as "120 tokens per second at a p95 latency under 200 milliseconds." This pairing is what you optimize and what you size capacity against.
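As a rough illustration of quoting the two numbers together, the sketch below summarizes a load test as tokens per second plus the p95 latency it was measured at. The function names, the nearest-rank percentile method, and the sample numbers are assumptions made for this example, not output from any particular serving stack.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def throughput_at_latency(latencies_ms, tokens_per_request, window_seconds, p95_ceiling_ms):
    """Return (tokens per second, p95 latency in ms, whether the ceiling held)."""
    p95 = percentile(latencies_ms, 95)
    tokens_per_second = sum(tokens_per_request) / window_seconds
    return tokens_per_second, p95, p95 <= p95_ceiling_ms


# Made-up load test: 20 requests of 60 tokens each, completed in a 10 second window.
latencies = [150] * 19 + [400]
tps, p95, ok = throughput_at_latency(latencies, [60] * 20, 10, p95_ceiling_ms=200)
print(f"{tps:.0f} tokens/s at p95 {p95} ms, within target: {ok}")
```

A throughput figure only counts toward capacity planning when the last value is true. If it is false, the run shows what the system can do with degraded latency, which is a different question.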
Model FLOPs Utilization (MFU)
MFU is the fraction of the GPU's theoretical peak compute that your workload actually achieves. Unlike the default utilization gauge, it is honest about waste. A training job at 35 percent MFU is using only about a third of the hardware's peak; the rest goes to data loading, communication, or memory stalls. Some of that overhead is unavoidable, so MFU in the 40 to 55 percent range is healthy for large training jobs; below 30 percent signals a pipeline problem, not a hardware shortage.
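One way to estimate MFU for a dense transformer training job is sketched below. It relies on the common approximation of about 6 FLOPs per parameter per token for a forward and backward pass, which ignores attention cost and is an assumption of this example rather than something your framework necessarily reports. The parameter count, token rate, and peak figure are hypothetical.

```python
def training_mfu(param_count, tokens_per_second, gpu_count, peak_flops_per_gpu):
    """Estimate MFU as achieved model FLOPs per second over aggregate peak FLOPs per second."""
    # Assumption: roughly 6 FLOPs per parameter per token (forward plus backward pass).
    achieved_flops_per_second = 6 * param_count * tokens_per_second
    peak_flops_per_second = gpu_count * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical job: 1B parameters, 20,000 tokens/s, one GPU with a 300 TFLOP/s peak.
mfu = training_mfu(1e9, 20_000, 1, 300e12)
print(f"MFU: {mfu:.0%}")
```

Use the peak figure for the numeric precision you actually train in, since vendors quote different peaks for different precisions.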
Memory Footprint and Headroom
Track peak memory, not average. A job that averages 60 percent VRAM but spikes to 98 percent during a long sequence is one batch away from an out-of-memory crash. Headroom is a metric, not an afterthought. Aim for a consistent buffer, with 10 percent as a reasonable starting point, so that a slightly larger input does not take production down.
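The gap between average and peak is easy to see with a few samples. The sketch below reproduces the 60 percent average and 98 percent peak case with made-up readings on a hypothetical 80 GiB card; the function name and return keys are illustrative.

```python
def memory_headroom(used_samples, total):
    """Summarize memory samples (same unit as total) by average, peak, and headroom."""
    peak = max(used_samples)
    average = sum(used_samples) / len(used_samples)
    return {
        "average_fraction": average / total,
        "peak_fraction": peak / total,
        "headroom_fraction": 1 - peak / total,
    }


# Made-up samples in GiB on an 80 GiB card: calm on average, one long-sequence spike.
stats = memory_headroom([40.0, 40.0, 33.6, 78.4], total=80.0)
print({name: round(value, 2) for name, value in stats.items()})
```

Alerting on the headroom fraction rather than the average is what catches the spike before it becomes a crash.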
Cost per Unit of Useful Work
The metric that ties everything to the business is cost per result: dollars per thousand inferences, per training run, or per million tokens. It absorbs utilization, idle time, and instance pricing into one comparable number. Two setups with identical throughput can differ by 3x on cost per result once idle time and instance choice are included.
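A small worked example shows how idle time alone opens that gap. The hourly rate, hours, and request counts below are invented for illustration; the only point is that billed hours, not busy hours, go in the numerator.

```python
def cost_per_thousand_results(hourly_rate_usd, billed_hours, results):
    """Dollars per thousand results, counting every billed hour whether busy or idle."""
    if results <= 0:
        raise ValueError("results must be positive")
    return hourly_rate_usd * billed_hours / results * 1000


# Two hypothetical setups serving the same 1,000,000 inferences in a day.
right_sized = cost_per_thousand_results(4.0, billed_hours=24, results=1_000_000)
over_provisioned = cost_per_thousand_results(4.0, billed_hours=72, results=1_000_000)
print(f"${right_sized:.3f} vs ${over_provisioned:.3f} per thousand inferences")
```

Both setups report the same throughput, yet the second pays for three instances where one would do, so its cost per result is three times higher.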
How to Instrument Without Drowning in Data
Good instrumentation is layered. Each layer answers a different question, and you should not start at the bottom.
- Layer one, the bill. Your cloud invoice or power draw is the ground truth for spend. Start here because it is the number leadership cares about and it needs no new tooling.
- Layer two, device telemetry. GPU vendor tooling exposes utilization, memory, temperature, and clock speed. Scrape it into your existing metrics system at a modest interval; 10 to 30 seconds is usually enough. This catches idle cards and thermal throttling.
- Layer three, framework counters. Your training or serving framework can report MFU, batch sizes, and queue depth. This is where you diagnose why a card is busy but unproductive.
- Layer four, request traces. For inference, trace individual requests end to end so you can attribute latency to queueing, prefill, or decode. Only add this when you have a latency problem to chase.
Resist the urge to wire up all four on day one. Most teams get 80 percent of the value from the bill plus device telemetry. For the broader setup, our Getting Started with AI Compute and GPU Requirements covers standing up a first measurable workload.
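For layer two, a minimal scraper might look like the sketch below. It assumes NVIDIA hardware with the nvidia-smi command on the PATH; other vendors ship their own tools, and the field list and function names here are choices made for this example. The parsing is kept separate from the command call so it can be checked without a GPU.

```python
import subprocess

# Fields passed to nvidia-smi --query-gpu; memory is reported in MiB with nounits.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used",
                "memory.total", "temperature.gpu", "clocks.sm"]


def _to_number(text):
    """Convert one CSV cell to float; unsupported fields such as [N/A] become None."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Turn 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.split(",")]
        rows.append(dict(zip(QUERY_FIELDS, map(_to_number, cells))))
    return rows


def read_gpu_telemetry():
    """Run nvidia-smi once and return the parsed readings."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Parsing a made-up two-GPU sample: one busy card, one idle card.
sample = "0, 91, 40536, 81920, 64, 1410\n1, 0, 12, 81920, [N/A], 210\n"
readings = parse_gpu_csv(sample)
idle_gpus = [r["index"] for r in readings if r["utilization.gpu"] == 0]
```

Calling read_gpu_telemetry on a timer and forwarding the dicts to your metrics system is enough to surface idle cards. Remember that the utilization field here is the default gauge, so it says nothing about MFU.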
Reading the Signal: What the Numbers Are Telling You
Numbers without interpretation lead to bad calls. A few patterns recur often enough to name.
High Utilization, Low MFU
The card looks busy but MFU is low. The GPU is spending its time on overhead, not math. The fix is almost never more hardware. Look at data loading, small batch sizes, or communication between cards. Buying more GPUs here multiplies the waste.
Low Utilization, Missing Latency Targets
The card is half-idle but requests are slow. This usually means a serving configuration problem such as insufficient batching or a single-threaded preprocessing step. The hardware has room; the software is not feeding it.
Climbing Cost per Result Over Time
Throughput is steady but cost per result is creeping up. Look for idle capacity from over-provisioning, a workload that drifted, or reserved capacity you stopped using. This is the metric that catches slow financial leaks the dashboards miss. The Hidden Risks of AI Compute and GPU Requirements covers how these leaks accumulate unnoticed.
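The three patterns above can be written down as simple rules, which is a useful first step toward automating the read. In the sketch below, the 30 percent MFU and 20 percent cost cutoffs come from this article's starting thresholds, while the 80 percent and 50 percent utilization cutoffs and all names are assumptions to adjust against your own baseline. It also simplifies the third pattern by looking only at the cost change.

```python
ADVICE = {
    "high_utilization_low_mfu":
        "Check data loading, batch size, and inter-GPU communication before buying hardware.",
    "low_utilization_slow_requests":
        "Check batching and single-threaded preprocessing in the serving stack.",
    "rising_cost_per_result":
        "Look for over-provisioning, workload drift, or unused reserved capacity.",
}


def read_signal(utilization=None, mfu=None, latency_ok=None, cost_change=None):
    """Return the names of the patterns that match; skip rules whose inputs are missing."""
    patterns = []
    if utilization is not None and mfu is not None:
        if utilization >= 0.80 and mfu < 0.30:
            patterns.append("high_utilization_low_mfu")
    if utilization is not None and latency_ok is False:
        if utilization <= 0.50:
            patterns.append("low_utilization_slow_requests")
    if cost_change is not None and cost_change > 0.20:
        patterns.append("rising_cost_per_result")
    return patterns


# A training job that looks busy on the dashboard but achieves little math.
for name in read_signal(utilization=0.92, mfu=0.22):
    print(name, "->", ADVICE[name])
```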
Setting Targets and Thresholds
A metric without a target is just a number on a screen. Set thresholds that trigger action, and tie them to alerts so you are not eyeballing graphs.
Reasonable starting thresholds: alert when memory headroom drops below 10 percent, when MFU on a training job falls under 30 percent, and when cost per result rises more than 20 percent month over month. These are starting points to calibrate against your own baseline, not universal truths. The point is to convert measurement into a decision before the problem becomes a budget surprise. For team-wide standards on what to track, see Rolling Out AI Compute and GPU Requirements Across a Team.
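Those starting thresholds can be sketched as a small check that returns the alerts to raise. The class and function names are hypothetical, and the defaults are only the starting points suggested above; in practice you would wire the returned names into whatever alerting system you already run.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    min_headroom: float = 0.10   # alert below 10 percent free memory at peak
    min_mfu: float = 0.30        # alert below 30 percent MFU on a training job
    max_cost_rise: float = 0.20  # alert above a 20 percent month-over-month rise


def check_thresholds(headroom, mfu, cost_now, cost_last_month, limits=None):
    """Return the names of the metrics that crossed their threshold."""
    limits = limits or Thresholds()
    alerts = []
    if headroom < limits.min_headroom:
        alerts.append("memory_headroom")
    if mfu < limits.min_mfu:
        alerts.append("mfu")
    if cost_last_month > 0:
        rise = (cost_now - cost_last_month) / cost_last_month
        if rise > limits.max_cost_rise:
            alerts.append("cost_per_result")
    return alerts


# Made-up readings: memory is fine, MFU is low, cost rose 30 percent.
print(check_thresholds(headroom=0.20, mfu=0.25, cost_now=1.30, cost_last_month=1.00))
```

Keeping the limits in one object makes recalibration a one-line change once you have a baseline of your own.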
Avoid the Vanity Metric Trap
The biggest measurement mistake is not missing a metric; it is tracking metrics that feel productive but drive no decision. GPU temperature on a card that is nowhere near throttling, total instance count, and raw FLOPs delivered all look like progress on a dashboard and almost never change a choice you make. They are vanity metrics, and a dashboard full of them creates a false sense of control.
The test for whether a metric earns its place is simple: name the decision it would change. If you cannot, drop it. Memory headroom changes whether you scale; MFU changes whether you tune software or buy hardware; cost per result changes your whole strategy. A green light on temperature changes nothing. Pruning vanity metrics is as important as adding real ones, because every number on a dashboard competes for the limited attention of the person reading it.
Connect Metrics to a Baseline
A metric in isolation is noise; a metric against a baseline is signal. Capture a baseline for each KPI when a workload first stabilizes, then measure drift against it rather than reading absolute values cold. A cost per result of a given number means nothing on its own, but a 20 percent rise against last month's baseline is an unambiguous prompt to investigate. The baseline is what turns a measurement into a question worth answering, and it is the discipline that separates teams who watch dashboards from teams who act on them.
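Measuring against a baseline is a small amount of arithmetic once the baseline is stored. The sketch below reports relative drift per KPI; the metric names and values are made up, and how you persist the baseline is left open.

```python
def drift_report(baseline, current):
    """Relative change of each KPI against its baseline (0.20 means a 20 percent rise)."""
    report = {}
    for name, base in baseline.items():
        if name in current and base != 0:
            report[name] = (current[name] - base) / base
    return report


# Hypothetical baseline captured when the workload stabilized, and this month's values.
baseline = {"cost_per_1k_usd": 0.10, "mfu": 0.40, "tokens_per_second": 120.0}
current = {"cost_per_1k_usd": 0.12, "mfu": 0.40, "tokens_per_second": 118.0}

for name, change in drift_report(baseline, current).items():
    print(f"{name}: {change:+.0%}")
```

Here throughput and MFU barely move while cost per result is up 20 percent, which points the investigation at spend rather than at the pipeline.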
Frequently Asked Questions
Why is the default GPU utilization metric misleading?
The standard utilization gauge counts any cycle where the GPU is active, including time spent waiting on memory or running tiny inefficient kernels. It can read high while the actual compute units sit mostly idle. Use Model FLOPs Utilization to see how much real work the hardware is doing.
What is a good MFU for a training job?
For large transformer training, 40 to 55 percent MFU is a healthy range on modern accelerators. Below 30 percent indicates a pipeline bottleneck such as slow data loading or communication overhead, which more hardware will not fix. The exact achievable ceiling depends on model architecture and interconnect.
Should I optimize for throughput or latency?
Optimize throughput subject to a fixed latency ceiling. Throughput alone can always be inflated by accepting worse latency, so the two must be measured together. Decide your latency budget from product requirements first, then maximize throughput within it.
How do I measure cost per result accurately?
Take total spend over a window, including idle time and reserved capacity, and divide by the useful work produced in that window, such as inferences served or training runs completed. This rolls utilization and pricing into one comparable figure that surfaces waste the device metrics hide.
How frequently should I sample compute metrics?
Device telemetry at a 10 to 30 second interval is usually enough for capacity decisions without flooding your metrics store. Request traces are captured per request, but only enable them when you are chasing a specific latency issue. Review the bill weekly or monthly.
Key Takeaways
- The default utilization number lies; use Model FLOPs Utilization to see real productivity.
- Always quote throughput together with the latency target it was measured at.
- Track peak memory and headroom, not averages, to avoid out-of-memory crashes.
- Cost per unit of useful work is the metric that ties compute to the business.
- Instrument in layers starting with the bill, and set thresholds that trigger action.