Compute is the single biggest line item in most serious AI projects, and it is the one teams plan last. They pick a model, write the prompts, sketch the product, and only when the bill arrives do they discover that the GPU they assumed was "good enough" cannot hold the model in memory, or that a job they thought would cost a few hundred dollars is quietly running at ten times that rate.
This guide is the definitive map of how AI compute and GPU requirements actually work. It is written for people who need to make real budget and architecture decisions: which accelerator to rent, how much memory a workload needs, when to train versus fine-tune versus call an API, and how to avoid paying for capacity you never use. We will stay concrete and name specific trade-offs rather than offering vague reassurance.
By the end you should be able to look at a workload, estimate its compute footprint within a sensible range, and choose hardware deliberately instead of by accident.
What "Compute" Actually Means in AI
Compute is shorthand for three distinct resources that people often collapse into one word. Separating them is the first real skill.
- Processing throughput — measured in FLOPS (floating-point operations per second). This determines how fast a model runs once it fits.
- Memory capacity — measured in gigabytes of VRAM on the GPU. This determines whether a model fits at all.
- Memory bandwidth — measured in GB/s. For inference on large models, this is often the real bottleneck, not raw FLOPS.
The common mistake is to shop by FLOPS alone. A card with enormous theoretical throughput but only 24 GB of VRAM cannot serve a 70-billion-parameter model in full precision: at 4 bytes per parameter, the weights alone need about 280 GB. If you are starting from scratch, the beginner's guide walks through each of these terms slowly before you commit to numbers.
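To see why bandwidth rather than FLOPS often caps inference speed, here is a minimal sketch of a common rule of thumb. It assumes every weight is read from memory once per generated token, which ignores KV-cache reads, batching, and kernel overhead. The function name and the 1000 GB/s figure are hypothetical, chosen only for illustration.

```python
def decode_tokens_per_second_ceiling(params_billions: float,
                                     bytes_per_param: float,
                                     bandwidth_gb_per_s: float) -> float:
    """Rough upper bound on single-stream decode speed when bandwidth-bound.

    Assumption: each generated token requires reading all weights once.
    """
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gb_per_s / weight_gb


# Hypothetical card with 1000 GB/s of memory bandwidth.
fp16 = decode_tokens_per_second_ceiling(7, 2.0, 1000)   # 7B model at FP16
int4 = decode_tokens_per_second_ceiling(7, 0.5, 1000)   # same model at 4-bit
print(f"FP16 ceiling: {fp16:.0f} tokens/s, 4-bit ceiling: {int4:.0f} tokens/s")
```

Under this simplification, shrinking the weights raises the ceiling even though the card's FLOPS never changed. Real throughput will be lower than the ceiling.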
Training Versus Inference: Two Different Problems
These two workloads have almost nothing in common from a hardware perspective, and conflating them is how budgets blow up.
Training
Training updates a model's weights and requires holding the model, its gradients, and optimizer states in memory simultaneously. A rough rule: full fine-tuning needs roughly 16 to 20 bytes per parameter in memory. A 7-billion-parameter model can therefore demand well over 100 GB during full training (7 billion × 16 bytes is already 112 GB), which is why full training usually spans multiple GPUs.
Inference
Inference only runs the model forward. Memory needs drop to roughly 2 bytes per parameter at half precision, plus an allocation for the KV cache that is small at short context lengths but grows with them. A 7B model needs about 14 GB for its weights and fits comfortably on a single 24 GB card. This asymmetry is the most important thing to internalize: serving a model is far cheaper than building one.
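The asymmetry is easy to check with the two rules of thumb above. This is a minimal sketch that uses only the per-parameter figures quoted in this section; the function names are illustrative, and the inference figure covers weights only.

```python
TRAIN_BYTES_PER_PARAM = (16.0, 20.0)   # full fine-tuning range from this guide
INFER_BYTES_PER_PARAM_FP16 = 2.0       # half-precision weights only


def training_memory_gb(params_billions: float):
    """Return the (low, high) memory estimate in GB for full fine-tuning."""
    low, high = TRAIN_BYTES_PER_PARAM
    return params_billions * low, params_billions * high


def inference_weights_gb(params_billions: float) -> float:
    """Return the FP16 weight footprint in GB, before KV cache and overhead."""
    return params_billions * INFER_BYTES_PER_PARAM_FP16


train_low, train_high = training_memory_gb(7)
infer = inference_weights_gb(7)
print(f"7B training:  {train_low:.0f} to {train_high:.0f} GB")
print(f"7B inference: {infer:.0f} GB of weights")
print(f"ratio: {train_low / infer:.0f}x to {train_high / infer:.0f}x")
```

For a 7B model this gives 112 to 140 GB for training against 14 GB of inference weights, a gap of roughly 8x to 10x.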
Estimating VRAM for a Given Model
The practical question is always "will it fit." Use this back-of-envelope method.
- Take the parameter count in billions.
- Multiply by 2 for half precision (FP16/BF16), or by roughly 0.5 for 4-bit quantization. The result is the weight footprint in gigabytes.
- Add 20 to 40 percent overhead for the KV cache, activations, and framework bloat.
So a 13B model at FP16 needs about 26 GB before overhead and roughly 31 to 36 GB with it, meaning a 24 GB card will not hold it but a 48 GB card will, with room to spare. The same model quantized to 4-bit drops to roughly 6.5 GB of weights, under 10 GB with overhead, and runs on consumer hardware. Our step-by-step guide turns this math into a repeatable sizing procedure.
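The back-of-envelope method above can be sketched as a small function. The multipliers and the 20 to 40 percent overhead range come from this section; the function and dictionary names are illustrative, and the result is an estimate, not a measurement.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int4": 0.5}


def estimate_inference_vram_gb(params_billions: float,
                               precision: str = "fp16",
                               overhead=(0.20, 0.40)):
    """Return (weights_gb, low_total_gb, high_total_gb) for serving a model."""
    weights = params_billions * BYTES_PER_PARAM[precision]
    low = weights * (1 + overhead[0])
    high = weights * (1 + overhead[1])
    return weights, low, high


def fits(card_gb: float, params_billions: float, precision: str = "fp16") -> bool:
    """Conservative check: the high end of the estimate must fit on the card."""
    _, _, high = estimate_inference_vram_gb(params_billions, precision)
    return high <= card_gb


for precision in ("fp16", "int4"):
    weights, low, high = estimate_inference_vram_gb(13, precision)
    print(f"13B {precision}: {weights:.1f} GB weights, {low:.1f} to {high:.1f} GB total")

print("13B FP16 on 24 GB:", fits(24, 13))
print("13B FP16 on 48 GB:", fits(48, 13))
```

Checking against the high end of the range is a deliberate choice here: it is cheaper to be slightly pessimistic in a spreadsheet than to hit an out-of-memory error on rented hardware.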
Choosing a GPU Tier
GPUs fall into rough tiers, and most teams never need the top one.
- Consumer (8–24 GB) — fine for inference on small or quantized models, prototyping, and learning. Cheap to rent or own.
- Prosumer / workstation (24–48 GB) — handles mid-size inference and light fine-tuning.
- Datacenter (40–80+ GB) — required for training large models, high-throughput serving, and any job needing fast interconnect across multiple cards.
The honest advice: start one tier lower than you think you need. Quantization, batching, and smaller models close most gaps, and you can always scale up. We compare specific options in our tools roundup.
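As a rough illustration of how quantization moves a workload down a tier, the sketch below maps an estimated memory requirement onto the tiers listed above. The tier ceilings are taken from this section's ranges, treated as hard cutoffs for simplicity; the names and the "multi-GPU" fallback label are assumptions.

```python
# (tier name, largest single-card VRAM in GB assumed for that tier)
TIERS = [("consumer", 24), ("workstation", 48), ("datacenter", 80)]


def smallest_tier_that_fits(required_gb: float) -> str:
    """Return the lowest tier whose largest card can hold the requirement."""
    for name, max_gb in TIERS:
        if required_gb <= max_gb:
            return name
    return "multi-GPU"


# 13B model, high end of the 20 to 40 percent overhead range.
fp16_gb = 13 * 2.0 * 1.4
int4_gb = 13 * 0.5 * 1.4
print(f"13B FP16 ({fp16_gb:.1f} GB): {smallest_tier_that_fits(fp16_gb)}")
print(f"13B 4-bit ({int4_gb:.1f} GB): {smallest_tier_that_fits(int4_gb)}")
```

In this example the same model lands on a workstation card at FP16 and on a consumer card at 4-bit.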
Buy, Rent, or Call an API
This is the decision with the largest financial consequences.
- API (managed inference) — zero hardware ownership, pay per token. Best when usage is spiky or modest. The wrong choice once volume is steady and high.
- Rented cloud GPUs — flexible, hourly. Excellent for training runs and bursty workloads. Easy to leave idle and bleed money.
- Owned hardware — lowest cost per hour at high utilization, large upfront cost. Only sensible above roughly 50–60 percent sustained utilization.
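The crossover math matters: the effective price of a rented GPU is its hourly rate divided by utilization, so a cloud GPU left running at 10 percent utilization costs ten times its list price per useful hour, more than almost any alternative.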
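The crossover can be sketched in a few lines. Every dollar figure below is hypothetical and exists only to show the arithmetic. The model also assumes owned hardware costs the same per wall-clock hour whether or not it is busy, while a rented GPU that is shut down between jobs costs nothing when idle.

```python
def cost_per_useful_hour(hourly_rate: float, utilization: float) -> float:
    """Effective price of one hour of real work on a GPU billed continuously."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


def break_even_utilization(owned_cost_per_hour: float, rental_rate: float) -> float:
    """Utilization above which owning beats renting only for the hours used.

    owned_cost_per_hour: amortized purchase, power, and hosting per
    wall-clock hour (hypothetical input, paid whether busy or idle).
    """
    return owned_cost_per_hour / rental_rate


# Hypothetical prices: $2.00/hour rental, $1.10/hour amortized ownership.
print(f"Rental left on at 10% utilization: ${cost_per_useful_hour(2.00, 0.10):.2f} per useful hour")
print(f"Owning breaks even at {break_even_utilization(1.10, 2.00):.0%} utilization")
```

With these made-up prices the break-even lands at 55 percent, inside the range quoted above, but the result is only as good as the amortized cost you feed it.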
Multi-GPU and Interconnect
Once a workload outgrows a single card, a new variable appears: how fast the GPUs talk to each other.
When you split a model across multiple GPUs — whether for training or for serving something too large to fit on one card — the GPUs must constantly exchange data. The speed of the link between them, the interconnect, becomes a real bottleneck. Cards connected by a slow general-purpose bus will spend more time waiting on each other than computing.
- Single GPU — no interconnect concern; simplest and usually cheapest per unit of work.
- Multiple GPUs, one machine — fast direct links between cards make this far more efficient than spreading across machines.
- Multiple machines — network speed dominates, and efficiency drops unless the workload is designed for it.
The practical lesson: prefer fitting a workload on a single GPU or a single multi-GPU machine before spreading across a cluster. Each step outward adds communication overhead that eats into the compute you are paying for. This is also why quantization, which can keep a model on fewer cards, pays off twice — once in memory and once in avoided interconnect cost.
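A quick way to see the second payoff from quantization is to count how many cards a model needs at each precision. This sketch assumes memory splits evenly across cards with no per-card duplication, which is optimistic; real sharding schemes add overhead and often prefer particular GPU counts. The 80 GB card size and the 20 percent overhead are illustrative choices.

```python
import math


def gpus_needed(required_gb: float, vram_per_gpu_gb: float) -> int:
    """Minimum card count if memory could be divided perfectly across GPUs."""
    return math.ceil(required_gb / vram_per_gpu_gb)


OVERHEAD = 1.2  # low end of the 20 to 40 percent range used in this guide

fp16_gb = 70 * 2.0 * OVERHEAD   # 70B model at FP16
int4_gb = 70 * 0.5 * OVERHEAD   # same model at 4-bit

print(f"70B FP16:  {fp16_gb:.0f} GB -> {gpus_needed(fp16_gb, 80)} x 80 GB cards")
print(f"70B 4-bit: {int4_gb:.0f} GB -> {gpus_needed(int4_gb, 80)} x 80 GB card")
```

Under these assumptions the 4-bit model fits on one card, so it never touches the interconnect at all.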
Context Length and the KV Cache
One factor surprises teams more than any other: the cost of long context windows.
When a model processes a long prompt or a long conversation, it stores intermediate state called the KV cache: the attention keys and values for every token seen so far. That cache grows linearly with context length and with the number of sequences served at once. A model that fits comfortably with a short context can run out of memory entirely when fed a very long document, even though the model weights have not changed.
This matters because context windows have grown dramatically. Planning VRAM around the model weights alone, then feeding the model long inputs in production, is a reliable way to hit out-of-memory errors after launch. Size your memory budget for the longest realistic context you will use, not the average, and treat the KV cache as a first-class line item in the math from earlier sections.
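To make the KV cache a concrete line item, the sketch below uses the standard sizing formula for a transformer with ordinary attention: two tensors (keys and values) per layer, each holding one vector per KV head per token. The architecture numbers are an illustrative assumption, roughly the shape of a 7B-class model without grouped-query attention; substitute the values from your own model's configuration.

```python
def kv_cache_gb(num_layers: int,
                num_kv_heads: int,
                head_dim: int,
                context_tokens: int,
                batch_size: int = 1,
                bytes_per_value: float = 2.0) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes) for a given context and batch."""
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # keys + values
    total_bytes = values_per_token * context_tokens * batch_size * bytes_per_value
    return total_bytes / 1e9


# Hypothetical 7B-class configuration, FP16 cache.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128)

for tokens in (4_096, 32_768, 131_072):
    print(f"{tokens:>7} tokens: {kv_cache_gb(context_tokens=tokens, **config):.1f} GB")
```

With this configuration the cache is about 2 GB at 4,096 tokens and about 17 GB at 32,768 tokens for a single sequence, which is already more than the 14 GB of FP16 weights. Multiply by the batch size for concurrent requests.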
Common Failure Modes
A few patterns account for most wasted spend.
- Leaving rented GPUs running idle overnight or between jobs.
- Using full precision when quantization would be invisible to users.
- Over-provisioning VRAM "to be safe" instead of measuring.
- Training from scratch when fine-tuning or prompting would suffice.
Each of these has a fix, covered in depth in our common mistakes breakdown and codified in our best practices piece.
Frequently Asked Questions
How much GPU memory do I need to run a typical open model?
For a 7B model at FP16, plan on roughly 17–20 GB including overhead (14 GB of weights plus 20 to 40 percent), which fits on a 24 GB card. Quantize to 4-bit and the same model drops to about 5 GB, well under 8 GB. Always size for the largest model you intend to serve, not the smallest.
Is it cheaper to own a GPU or rent in the cloud?
Owning wins only at high, sustained utilization, typically above 50 to 60 percent. Below that, rented cloud GPUs or managed APIs are cheaper because you stop paying when the work stops. Idle owned hardware is pure loss.
Do I need a datacenter GPU to get started?
No. Most learning, prototyping, and inference on small or quantized models runs fine on consumer cards. Datacenter GPUs become necessary mainly for training large models or high-throughput production serving.
What is the difference between FLOPS and memory bandwidth?
FLOPS measures raw calculation speed; memory bandwidth measures how fast data moves to and from the GPU. For large-model inference, bandwidth is frequently the real limiter, so do not choose hardware on FLOPS alone.
Does quantization hurt quality?
Modern 8-bit quantization is usually indistinguishable in quality from FP16, and 4-bit is acceptable for many applications. Relative to FP16, 8-bit roughly halves memory needs and 4-bit roughly quarters them, making quantization the highest-leverage optimization available.
Key Takeaways
- Treat compute as three resources: throughput, memory capacity, and bandwidth — never just one.
- Training and inference have completely different hardware profiles; inference is far cheaper.
- Estimate VRAM in GB as parameters (in billions) times 2 for FP16, plus 20–40 percent overhead, before choosing a card.
- Start one GPU tier lower than instinct suggests and rely on quantization and batching.
- Own hardware only above roughly 50–60 percent sustained utilization; otherwise rent or use an API.
- Idle GPUs are the most common source of wasted AI budget.