Most teams make compute decisions one at a time, from scratch, every time. The result is inconsistency: one project over-provisions, the next runs out of memory, and nobody can explain why either happened. A framework fixes this by turning sizing into a repeatable process whose output you can defend.
This guide introduces FRAME, a five-stage model for AI compute decisions. It is not magic; it is the underlying logic of good sizing, made explicit and reusable so that anyone on your team can apply it and reach the same answer. Each stage feeds the next, and the order matters. When you are unsure which GPU to use, run FRAME instead of going with your gut.
Here are the five stages.
F: Frame the Workload
Every decision starts by framing what you are actually doing.
- Is it training, fine-tuning, or inference? This sets the entire memory profile.
- What is the model size, in billions of parameters?
- Is it interactive or batch? Latency-bound or throughput-bound?
Framing prevents the most common category error: sizing inference like training, or vice versa. Skipping this stage means every later number rests on a guess. Our step-by-step guide covers the framing questions in detail.
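One way to make the Frame stage concrete is to record the answers as a typed record before any sizing math happens. The sketch below is a minimal Python illustration; `Workload`, `Mode`, and `Bound` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """Workload type: this choice sets the entire memory profile."""
    TRAINING = "training"
    FINE_TUNING = "fine-tuning"
    INFERENCE = "inference"


class Bound(Enum):
    """What the workload is constrained by."""
    LATENCY = "latency"        # interactive: each response must be fast
    THROUGHPUT = "throughput"  # batch: total work completed matters most


@dataclass(frozen=True)
class Workload:
    """Answers to the Frame questions (hypothetical structure)."""
    name: str
    mode: Mode
    params_billions: float
    bound: Bound

    def __post_init__(self) -> None:
        # Every later stage uses the model size, so reject a missing or bogus one.
        if self.params_billions <= 0:
            raise ValueError("params_billions must be positive")


# Example: the batch summarization service used later in this guide.
summarizer = Workload("summarization", Mode.INFERENCE, 13.0, Bound.THROUGHPUT)
print(summarizer)
```

Freezing the record is a deliberate choice in this sketch: later stages can read the framing but cannot quietly change it.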
R: Reckon the Memory
Reckon the memory footprint, because memory is the gate that decides feasibility.
- Base VRAM in GB: parameters (in billions) × 2 for FP16 inference, × 0.5 for 4-bit, or × 16–20 for full training.
- Add 25 percent overhead.
- Account for context length: the KV cache grows linearly with the number of tokens per sequence and with batch size.
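The Reckon arithmetic fits in a few lines. This is a sketch under stated assumptions: `full_training` uses 18 bytes per parameter as the midpoint of the 16–20 range, the 25 percent overhead applies to the base figure only, and the KV cache formula assumes FP16 keys and values, with `kv_dim` equal to the number of KV heads times the head dimension (the hidden size, for standard multi-head attention). The layer count and `kv_dim` in the example are illustrative values for a 13B-class model, not figures from this guide.

```python
# Bytes per parameter for the Reckon rules of thumb.
BYTES_PER_PARAM = {
    "fp16_inference": 2.0,
    "int4_inference": 0.5,
    "full_training": 18.0,  # assumption: midpoint of the 16-20 range
}
OVERHEAD = 0.25  # add 25 percent on top of the base figure


def base_vram_gb(params_billions: float, regime: str) -> float:
    """Base VRAM in GB (billions of params x bytes/param), plus overhead."""
    return params_billions * BYTES_PER_PARAM[regime] * (1 + OVERHEAD)


def kv_cache_gb(n_layers: int, kv_dim: int, context_len: int,
                batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: a key and a value vector per layer, per token, per sequence."""
    return 2 * n_layers * kv_dim * context_len * batch * bytes_per_elem / 1e9


def fits(required_gb: float, available_gb: float) -> bool:
    """The feasibility gate: if this is False, faster compute will not help."""
    return required_gb <= available_gb


# Illustrative 13B-class shape (assumed): 40 layers, kv_dim 5120, 4096-token context.
weights = base_vram_gb(13, "fp16_inference")
kv = kv_cache_gb(n_layers=40, kv_dim=5120, context_len=4096)
total = weights + kv
print(f"weights+overhead {weights:.1f} GB, KV cache {kv:.2f} GB, total {total:.1f} GB")
print("fits 24 GB:", fits(total, 24), "| fits 48 GB:", fits(total, 48))
```

Under these assumed numbers, the FP16 model needs about 35.9 GB at a 4096-token context, so it fails a 24 GB gate before any discussion of compute speed begins.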
If the workload does not fit in available memory, no amount of compute speed helps. This stage answers "is it even possible," a point our complete guide emphasizes.
A: Adjust With Optimization
Adjust the requirement downward before committing to hardware.
Quantization
Apply 8-bit quantization by default and evaluate 4-bit on the real task. Relative to FP16, 8-bit halves weight memory and 4-bit roughly quarters it.
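The halving and quartering follow directly from bits per weight. The sketch below is hypothetical: `pick_format` and the task scores passed to it stand in for whatever evaluation you run on the real task, and the weight figures ignore the per-group scales that practical 4-bit formats typically store, so real checkpoints run slightly larger.

```python
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}


def weight_gb(params_billions: float, fmt: str) -> float:
    """Weight memory in GB (ignores quantization metadata and overhead)."""
    return params_billions * BITS_PER_WEIGHT[fmt] / 8


def pick_format(task_scores: dict, tolerance: float = 0.01) -> str:
    """Default to int8; take int4 only if it stays within `tolerance` of FP16.

    `task_scores` maps format name to a higher-is-better score on the real task
    (hypothetical metric; the tolerance is an assumption to tune).
    """
    baseline = task_scores["fp16"]
    if "int4" in task_scores and baseline - task_scores["int4"] <= tolerance:
        return "int4"
    return "int8"


for fmt in ("fp16", "int8", "int4"):
    gb = weight_gb(13, fmt)
    print(f"{fmt}: {gb:.1f} GB ({gb / weight_gb(13, 'fp16'):.0%} of FP16)")

print(pick_format({"fp16": 0.80, "int4": 0.795}))  # within tolerance -> int4
print(pick_format({"fp16": 0.80, "int4": 0.70}))   # too lossy -> int8
```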
Model choice and batching
Consider a smaller model if quality holds, and plan batching for throughput workloads. These adjustments frequently move a workload down an entire GPU tier, the dynamic shown in our examples.
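Planning batching is largely a question of how many sequences' KV caches fit in the memory left after the weights. A rough sketch, assuming every sequence uses its full context and ignoring activation memory; the 13B-class shape and the 8.125 GB adjusted weight figure (13B × 0.5 bytes plus 25 percent) are illustrative, and `max_batch` is a hypothetical helper.

```python
import math


def kv_per_sequence_gb(n_layers: int, kv_dim: int, context_len: int,
                       bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence: key + value vectors for every layer and token."""
    return 2 * n_layers * kv_dim * context_len * bytes_per_elem / 1e9


def max_batch(card_gb: float, weights_gb: float, per_seq_gb: float) -> int:
    """Concurrent full-length sequences that fit next to the weights."""
    free_gb = card_gb - weights_gb
    if free_gb <= 0:
        return 0  # the weights alone do not fit; go back to Reckon/Adjust
    return math.floor(free_gb / per_seq_gb)


per_seq = kv_per_sequence_gb(n_layers=40, kv_dim=5120, context_len=2048)
print(f"{per_seq:.2f} GB per 2048-token sequence")
print("batch on 24 GB, 4-bit weights:", max_batch(24, 8.125, per_seq))
print("batch on 24 GB, FP16 weights:", max_batch(24, 32.5, per_seq))
```

Under these assumptions, quantizing to 4-bit turns a model that cannot load on a 24 GB card into one that holds nine full-length sequences at once, which is the tier-dropping effect described above.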
The key insight of FRAME is that A comes before hardware selection, not after.
M: Map to Hardware
Map your adjusted requirements onto an actual GPU tier and sourcing model.
- Match the tier to worst-case memory: consumer (<24 GB), workstation (24–48 GB), datacenter (>48 GB).
- For large-model inference, check memory bandwidth, not just FLOPS; token-by-token generation is usually limited by how fast the weights stream from memory.
- Choose to buy, rent, or use an API based on honest sustained utilization; own hardware only above roughly 50–60 percent.
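The mapping rules can be written as three small functions. This is a sketch, not a procurement tool: the tier thresholds come from the list above, the 0.55 ownership threshold is an assumed midpoint of the 50–60 percent range, and the decode ceiling uses the common rule of thumb that single-stream generation reads every weight once per token, so memory bandwidth divided by weight size bounds tokens per second. The 1,000 GB/s bandwidth in the example is an illustrative value, not a specific card.

```python
def gpu_tier(worst_case_gb: float) -> str:
    """Map worst-case memory to a tier using the thresholds in the text."""
    if worst_case_gb < 24:
        return "consumer"
    if worst_case_gb <= 48:
        return "workstation"
    return "datacenter"


def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound for batch-1 decoding: each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


def sourcing(sustained_utilization: float, interruptible: bool = False,
             own_threshold: float = 0.55) -> str:
    """Own only above the threshold; otherwise rent (spot if interruptible).

    The choice between renting and an API is left open in this sketch.
    """
    if sustained_utilization > own_threshold:
        return "own"
    return "rent spot" if interruptible else "rent or API"


print(gpu_tier(8.1), gpu_tier(33), gpu_tier(80))
print(f"{decode_ceiling_tokens_per_s(1000, 26):.0f} tok/s ceiling at FP16")
print(f"{decode_ceiling_tokens_per_s(1000, 6.5):.0f} tok/s ceiling at 4-bit")
print(sourcing(0.30, interruptible=True), sourcing(0.70))
```

The bandwidth function also shows why quantization helps speed as well as fit: smaller weights mean fewer bytes streamed per generated token.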
Mapping is mechanical once the earlier stages are done correctly, which is the whole point of doing them first.
E: Evaluate and Iterate
Evaluate against reality before scaling, then iterate.
- Run a small validation job and measure actual VRAM and throughput.
- Compare to your reckoned numbers and adjust.
- Instrument utilization so the decision stays honest over time.
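Comparing measured numbers with reckoned ones is easy to automate. A minimal sketch: the measured values would come from your validation run (on PyTorch, for example, `torch.cuda.max_memory_allocated()` reports peak tensor memory in bytes), while `Check`, `needs_rerun`, the 15 percent tolerance, and the sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """One reckoned-versus-measured comparison from a validation run."""
    metric: str
    reckoned: float
    measured: float

    @property
    def deviation(self) -> float:
        """Signed relative error of the plan against reality."""
        return (self.measured - self.reckoned) / self.reckoned


def needs_rerun(checks: list, tolerance: float = 0.15) -> list:
    """Return the metrics whose deviation exceeds the tolerance (assumed 15%)."""
    return [c.metric for c in checks if abs(c.deviation) > tolerance]


checks = [
    Check("peak_vram_gb", reckoned=9.0, measured=9.6),
    Check("tokens_per_s", reckoned=1200.0, measured=800.0),
    Check("sustained_utilization", reckoned=0.60, measured=0.35),
]
for c in checks:
    print(f"{c.metric}: {c.deviation:+.0%}")
print("re-run FRAME for:", needs_rerun(checks))
```

A large miss on utilization is the signal to revisit the buy-versus-rent decision in Map; a large miss on memory sends you back to Reckon.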
FRAME is a loop, not a line. As traffic, models, or budgets change, you re-run it. The evaluation stage is what keeps the framework honest, echoing the measurement discipline in our best practices guide and guarding against the errors in our common mistakes breakdown.
Walking a Workload Through FRAME
To make the model concrete, trace one workload through all five stages.
Imagine a production summarization service.
- Frame: batch inference, a 13B model, throughput-bound.
- Reckon: 13B at FP16 is 26 GB plus overhead, roughly 33 GB.
- Adjust: quantize to 4-bit, dropping to about 9 GB, and plan aggressive batching.
- Map: the adjusted 9 GB fits a 24 GB card with room for large batches; because the work is interruptible, rent spot capacity rather than buying.
- Evaluate: a small run confirms real VRAM and tokens per second, and a dashboard tracks utilization over time.
Notice how each stage constrained the next. Framing told you it was throughput-bound, which shaped the batching decision in Adjust. Adjust changed the memory number that Map consumed. By the time you reach hardware selection, the choice is nearly forced by the earlier work, which is exactly the point. The same end-to-end discipline appears in our step-by-step guide.
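The arithmetic in the walkthrough can be reproduced in a few lines. This sketch hard-codes the guide's rules of thumb (2 bytes per parameter at FP16, 0.5 at 4-bit, plus 25 percent overhead) and ignores the KV cache; `run_frame` is a hypothetical helper, not a library call.

```python
def run_frame(params_billions: float, interruptible: bool, card_gb: float = 24.0) -> dict:
    """Trace one inference workload through the FRAME arithmetic (sketch)."""
    overhead = 1.25
    # Reckon: FP16 weights plus 25 percent overhead.
    reckoned_gb = params_billions * 2.0 * overhead
    # Adjust: 4-bit weights plus the same overhead, before any KV cache.
    adjusted_gb = params_billions * 0.5 * overhead
    # Map: does the adjusted figure fit the card, and how much is left for batches?
    headroom_gb = card_gb - adjusted_gb
    return {
        "reckoned_gb": reckoned_gb,
        "adjusted_gb": adjusted_gb,
        "fits": headroom_gb > 0,
        "headroom_gb": headroom_gb,
        "sourcing": "rent spot" if interruptible else "decide by utilization",
    }


plan = run_frame(13, interruptible=True)
for key, value in plan.items():
    print(f"{key}: {value}")
```

The rule gives 32.5 GB at FP16, matching the roughly 33 GB above, and about 8.1 GB at 4-bit, which the walkthrough rounds up to a conservative 9 GB. The roughly 16 GB of headroom left on a 24 GB card is what makes large batches possible.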
Why a Named Framework Beats Ad Hoc Decisions
You might reasonably ask whether a named model adds anything over just thinking carefully. It does, for three reasons.
- Consistency. Different people running FRAME reach the same answer, so decisions become reviewable rather than personal.
- Completeness. The stages are a checklist against forgetting: nobody skips the KV cache or the utilization estimate when the framework forces the question.
- Communicability. "We ran FRAME and landed on a 24 GB rented card" is a sentence a team can audit, challenge, and learn from.
The cost of ad hoc decisions is not that any single one is wrong; it is that they are inconsistent and unauditable, so the team never learns. A shared framework turns compute sizing into institutional knowledge instead of individual intuition, guarding against the very errors in our common mistakes guide.
Common Ways FRAME Goes Wrong
A framework is only as good as the discipline applied to it. Three failure modes recur.
The first is stage-skipping, usually jumping straight to Map because the team already has a GPU in mind. This inverts the logic and produces hardware chosen by habit rather than requirement. The second is stale framing, where a workload is run through FRAME once and never revisited even as the model or traffic changes; this is why Evaluate exists as a loop rather than a finish line. The third is dishonest reckoning, where teams plug in optimistic numbers, especially for utilization and context length, and get an answer that flatters their preferred decision.
Each failure has the same root: treating FRAME as a formality rather than a forcing function. The framework works precisely because each stage forces a question the team would otherwise skip. Run it honestly, in order, and revisit it on change, and it delivers consistent decisions. Run it as theater and it delivers the same ad hoc choices dressed up in acronyms. The discipline, not the acronym, is what saves money.
When to Apply FRAME
Use the full framework for any decision with real cost or risk: a new production workload, a major model change, a training run. For trivial prototyping on an API, a lightweight pass through Frame and Reckon is enough. The framework scales to the stakes: apply more rigor where more money is on the line. Pair it with our checklist for the granular items within each stage.
Frequently Asked Questions
Why does FRAME put optimization before hardware selection?
Because optimization changes which hardware you need. Quantization and model choice can drop a workload a whole GPU tier. Choosing hardware first means sizing against requirements you are about to reduce, leading to waste.
Is FRAME overkill for a small prototype?
For trivial prototyping on an API, run only the Frame and Reckon stages. The full loop is meant for decisions with real cost or risk, where its rigor prevents expensive mistakes.
What makes the Reckon stage the "gate"?
Memory feasibility is binary: either the model fits in VRAM or it does not. Compute speed is irrelevant if the model cannot load. Reckon answers feasibility before you spend effort on anything else.
How often should I re-run the framework?
Whenever a major input changes: a new model, a traffic shift, a budget change, or a fresh workload. FRAME is a loop. The Evaluate stage exists precisely to trigger re-runs when reality diverges from the plan.
Does FRAME replace a checklist?
No, they complement each other. FRAME gives you the reasoning model; a checklist gives you the granular items to tick off within each stage. Use the framework to think and the checklist to execute.
Key Takeaways
- FRAME (Frame, Reckon, Adjust, Map, Evaluate) turns compute sizing into a repeatable process.
- Frame the workload type and model size before any numbers; it sets the whole profile.
- Reckon memory first; feasibility is the binary gate that precedes everything else.
- Adjust with quantization and model choice before selecting hardware, not after.
- Map adjusted requirements to a GPU tier and sourcing model by worst case and honest utilization.
- Evaluate against a real test run and re-run the loop whenever a major input changes.