Almost every question about AI compute and GPU requirements traces back to the same anxiety: someone is about to spend real money on hardware or cloud credits and they don't want to be wrong. The questions sound technical, but most of them are budgeting questions wearing a technical costume. How much VRAM? Which card? How many GPUs? What you're really asking is whether the bill will match the workload.
This piece answers the questions people actually type into a search bar, in plain language, with the trade-offs spelled out. We've grouped them by the order you tend to hit them: sizing, hardware choices, cost, and the operational gotchas nobody warns you about until you've already paid for them.
If you want the underlying mental model rather than discrete answers, start with The Complete Guide to AI Compute and GPU Requirements. This article assumes you want answers first and theory second.
How much GPU memory do I actually need?
This is the single most common question, and the answer is governed mainly by model size: the number of parameters multiplied by the bytes each parameter takes at your chosen precision.
A rough rule for inference: a model in 16-bit precision needs about 2 GB of VRAM per billion parameters, plus 20 to 40 percent overhead for the key-value cache, activations, and the framework itself. A 7-billion-parameter model wants roughly 17 to 20 GB. A 70-billion-parameter model needs 140 GB for the weights alone, which means multiple cards.
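The rule of thumb above can be written as a few lines of arithmetic. This is a minimal sketch, not a sizing tool: the function name `inference_vram_gb` and its defaults are illustrative, and the results are planning ranges rather than exact requirements.

```python
def inference_vram_gb(params_billion, bytes_per_param=2.0, overhead=(0.20, 0.40)):
    """Return a (low, high) VRAM estimate in GB for inference.

    Assumes 2 bytes per parameter (16-bit) and 20 to 40 percent overhead
    for the key-value cache, activations, and the framework.
    """
    weights_gb = params_billion * bytes_per_param
    low = weights_gb * (1 + overhead[0])
    high = weights_gb * (1 + overhead[1])
    return round(low, 1), round(high, 1)


for size in (7, 13, 70):
    low, high = inference_vram_gb(size)
    print(f"{size}B at 16-bit: {low} to {high} GB")
```

Real overhead depends heavily on context length and batch size, so treat the upper end as the safer number when the key-value cache will be large.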
Quantization changes the math
Running the same model at 8-bit roughly halves memory relative to 16-bit; 4-bit roughly quarters it. A 70B model that needs 140 GB at 16-bit can fit in around 40 GB at 4-bit, which suddenly makes a single high-memory card plausible. The trade-off is accuracy: 8-bit is usually indistinguishable from 16-bit in practice, 4-bit is acceptable for many tasks, and below that quality degrades fast.
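To see how bit width changes whether a model fits, here is a small sketch. The 48 GB card capacity and the flat 20 percent overhead are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def quantized_footprint_gb(params_billion, bits, overhead=0.20):
    """Estimate weights-plus-overhead memory in GB at a given bit width."""
    weights_gb = params_billion * bits / 8  # bits -> bytes per parameter
    return weights_gb * (1 + overhead)


def fits_on_card(params_billion, bits, card_gb, overhead=0.20):
    """True if the estimated footprint is within the card's capacity."""
    return quantized_footprint_gb(params_billion, bits, overhead) <= card_gb


for bits in (16, 8, 4):
    need = quantized_footprint_gb(70, bits)
    ok = fits_on_card(70, bits, card_gb=48)  # 48 GB is an assumed capacity
    print(f"70B at {bits}-bit: ~{need:.0f} GB, fits on 48 GB: {ok}")
```

Quantized formats also store scaling metadata, so real files run slightly larger than the pure bits-per-parameter figure suggests.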
Training needs far more
Fine-tuning or training multiplies the requirement. You need memory for the model, the gradients, the optimizer states, and the activations. A common planning number for full fine-tuning is 4x to 6x the inference footprint; treat it as a floor, since the optimizer, batch size, and sequence length can push it higher. Parameter-efficient methods like LoRA cut this dramatically, often letting you fine-tune a model on hardware that could only run inference before.
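Applying the 4x to 6x multiplier is simple enough to script. The sketch below assumes you already have an inference footprint in GB; the 17 GB input is an illustrative figure for a 7B model at 16-bit, and the function name is hypothetical.

```python
def full_finetune_vram_gb(inference_gb, multiplier=(4, 6)):
    """Return a (low, high) planning range in GB for full fine-tuning.

    Uses the 4x to 6x rule of thumb; it is not a per-tensor memory model.
    """
    return inference_gb * multiplier[0], inference_gb * multiplier[1]


low, high = full_finetune_vram_gb(17)  # assumed inference footprint, in GB
print(f"Full fine-tuning: plan for roughly {low} to {high} GB")
```

A range this wide is a signal to measure on a small run before committing to hardware.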
Do I need an H100, or will a consumer card do?
Most people overestimate what they need. A consumer card with 24 GB of VRAM handles a surprising amount: 7B and 13B models comfortably, quantized larger models, and plenty of fine-tuning experiments. Data-center cards like the H100 earn their price in three situations: very large models that need their memory bandwidth and capacity, high-concurrency serving where throughput matters, and training runs where time-to-result is money.
The honest framing is throughput per dollar, not raw speed. A data-center card may be three times faster but cost ten times more. If your workload is bursty or experimental, the consumer card often wins on economics.
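A quick way to make the throughput-per-dollar comparison concrete is shown below. All figures are hypothetical placeholders that mirror the "three times faster, ten times the cost" example, not measured benchmarks or real prices.

```python
def throughput_per_dollar(tokens_per_sec, price_usd):
    """Tokens per second delivered per dollar of purchase price."""
    return tokens_per_sec / price_usd


# Hypothetical numbers: the data-center card is 3x faster and 10x pricier.
consumer = throughput_per_dollar(tokens_per_sec=1000, price_usd=2000)
datacenter = throughput_per_dollar(tokens_per_sec=3000, price_usd=20000)

print(f"Consumer card:    {consumer:.2f} tokens/s per dollar")
print(f"Data-center card: {datacenter:.2f} tokens/s per dollar")
print(f"Consumer advantage: {consumer / datacenter:.1f}x")
```

The comparison only holds if the model fits on the cheaper card and the workload keeps it busy; substitute your own measured throughput before drawing conclusions.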
Should I rent cloud GPUs or buy hardware?
Rent when your usage is spiky, your needs are still changing, or you need to scale past a few cards occasionally. Buy when you have steady, predictable, around-the-clock utilization for many months.
The break-even reality
Cloud pricing looks cheap by the hour until you multiply it across continuous use. Heavy, sustained workloads frequently break even against owned hardware somewhere in the 6-to-18-month range, depending on the card and the discounts you can negotiate. The mistake is comparing the sticker price of a GPU against one month of cloud and concluding cloud is cheaper. The real comparison is total cost over the asset's useful life, including power, cooling, and the engineering time to run it.
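The break-even comparison can be sketched as follows. Every number here is a hypothetical input (hardware price, monthly running cost, cloud hourly rate); plug in your own quotes.

```python
def break_even_months(hardware_cost, monthly_owned_cost, cloud_hourly_rate, hours_per_month):
    """Months until owning becomes cheaper than renting.

    monthly_owned_cost covers power, cooling, and engineering time.
    Returns None when cloud is cheaper every month, so owning never pays off.
    """
    cloud_monthly = cloud_hourly_rate * hours_per_month
    monthly_saving = cloud_monthly - monthly_owned_cost
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


# Hypothetical: $25,000 card, $400/month to run, $3/hour cloud equivalent.
always_on = break_even_months(25000, 400, 3.0, hours_per_month=720)
light_use = break_even_months(25000, 400, 3.0, hours_per_month=100)

print(f"Around the clock: {always_on:.1f} months to break even")
print(f"100 hours/month:  {light_use}")
```

The same card that pays for itself under continuous load never breaks even at light usage, which is the utilization point the comparison turns on.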
For a structured way to make this call, A Framework for AI Compute and GPU Requirements lays out the decision tree.
Why is my expensive GPU sitting at 30 percent utilization?
This is the question that should be asked more often. People buy capacity and then waste it. The usual culprits:
- Data loading bottlenecks. Your GPU is waiting on disk or network I/O. The fix is faster storage, prefetching, and more data-loader workers.
- Small batch sizes. Underfilling the GPU leaves compute idle. Increase batch size until memory is the limit.
- CPU preprocessing. Tokenization or augmentation on a slow CPU starves the GPU.
- Synchronization overhead. In multi-GPU setups, poor interconnect or naive parallelism strategy can cap scaling.
Profiling before you buy more hardware is the highest-ROI move available. We cover the recurring versions of this in 7 Common Mistakes with AI Compute and GPU Requirements (and How to Avoid Them).
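A first-pass profile does not need special tooling. The sketch below times how long a loop waits for each batch versus how long the step itself takes. The loader and step are simulated with `time.sleep`, and the names `data_wait_fraction`, `slow_loader`, and `fake_train_step` are hypothetical. On a real GPU the step runs asynchronously, so you would need to synchronize the device (for example with `torch.cuda.synchronize()` in PyTorch) before reading the clock.

```python
import time


def data_wait_fraction(batches, train_step):
    """Fraction of wall-clock time spent waiting for data rather than computing."""
    data_wait = 0.0
    compute = 0.0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is the loader's fault
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)           # time spent here is compute
        t2 = time.perf_counter()
        data_wait += t1 - t0
        compute += t2 - t1
    total = data_wait + compute
    return data_wait / total if total else 0.0


def slow_loader(n):
    """Simulated loader that takes 20 ms to produce each batch."""
    for i in range(n):
        time.sleep(0.02)
        yield i


def fake_train_step(batch):
    """Simulated training step that takes 5 ms."""
    time.sleep(0.005)


fraction = data_wait_fraction(slow_loader(5), fake_train_step)
print(f"Time spent waiting on data: {fraction:.0%}")
```

If the wait fraction dominates, faster storage, prefetching, or more loader workers will help more than a bigger GPU.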
What about inference at scale versus a single user?
A demo for one user and a production service for thousands are different planning problems. For single-user or low-concurrency work, you size for the model to fit and run at acceptable latency. For high concurrency, you size for throughput, which means batching requests, managing the key-value cache aggressively, and often using a serving framework built for the job.
Tokens-per-second is the wrong sole metric
Latency per request and total throughput pull in opposite directions. Large batches maximize tokens per second but make each individual user wait. The right target depends on your product: a chat interface tolerates a few hundred milliseconds; a batch summarization job tolerates seconds. Decide which you're optimizing before you pick hardware.
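The tension can be illustrated with a toy model. It assumes each decoding step has a fixed cost plus a small per-sequence cost; both constants are invented for illustration and do not describe any particular GPU or serving framework.

```python
def decode_step(batch_size, fixed_ms=20.0, per_seq_ms=0.5):
    """Toy model of one decoding step.

    Returns (step time in ms, aggregate tokens per second).
    Each sequence in the batch receives one token per step.
    """
    step_ms = fixed_ms + per_seq_ms * batch_size
    tokens_per_sec = batch_size * 1000.0 / step_ms
    return step_ms, tokens_per_sec


for batch_size in (1, 8, 64):
    step_ms, tps = decode_step(batch_size)
    print(f"batch {batch_size:>2}: {step_ms:.1f} ms per token per user, "
          f"{tps:.0f} tokens/s total")
```

Under these assumptions, aggregate throughput climbs steeply with batch size while every user waits longer for each token, which is why the target has to be chosen before the hardware.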
How do I plan for multi-GPU and networking?
Once a model or workload exceeds a single card, the interconnect between GPUs becomes a first-class concern. High-bandwidth links between cards inside a node, and fast networking between nodes, determine whether your second GPU gives you 1.9x or 1.3x the performance.
Two practical implications:
- For training large models, GPUs in the same node with a fast internal link dramatically outperform the same GPUs spread across slow networking.
- For inference, you often don't need exotic interconnect; many serving setups shard a model across cards in one box and call it done.
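One way to compare the two outcomes mentioned above is scaling efficiency: the measured speedup divided by the number of GPUs. A minimal sketch, with `scaling_efficiency` as an illustrative name:

```python
def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / num_gpus


for speedup in (1.9, 1.3):
    eff = scaling_efficiency(speedup, num_gpus=2)
    print(f"{speedup}x on 2 GPUs: {eff:.0%} efficient, "
          f"{1 - eff:.0%} of the added spend lost to overhead")
```

Measuring this on two cards before buying eight is a cheap way to find out whether the interconnect or the parallelism strategy is the limit.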
If multi-GPU is in your future, Real-World Examples and Use Cases shows how teams actually configured their clusters.
Frequently Asked Questions
How much VRAM do I need to run a 13B model?
In 16-bit precision, plan for roughly 31 to 36 GB including overhead (26 GB of weights plus 20 to 40 percent), which means it won't fit on a single 24 GB card without quantization. At 8-bit it fits comfortably in 24 GB, and at 4-bit it fits with room to spare. For most practical uses, an 8-bit quantized 13B model on a 24 GB card is a sweet spot.
Is more VRAM or faster VRAM more important?
It depends on whether you're memory-bound or bandwidth-bound. If the model doesn't fit, capacity is everything and speed is irrelevant. Once it fits, memory bandwidth becomes the main driver of generation speed for large language models, which is why high-bandwidth data-center cards feel so much faster on big models.
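A common back-of-the-envelope estimate for the bandwidth-bound case assumes every generated token requires reading all model weights from memory once, so single-stream speed is capped at roughly bandwidth divided by model size. The sketch below uses that simplification; the bandwidth figures are illustrative round numbers, not the specifications of any particular card.

```python
def max_decode_tokens_per_sec(model_gb, bandwidth_gb_per_s):
    """Rough upper bound on single-stream generation speed.

    Assumes each token requires one full pass over the weights and ignores
    compute time, the key-value cache, and batching.
    """
    return bandwidth_gb_per_s / model_gb


model_gb = 14  # e.g. a 7B model at 16-bit, weights only
for label, bandwidth in (("~1,000 GB/s card", 1000), ("~3,000 GB/s card", 3000)):
    ceiling = max_decode_tokens_per_sec(model_gb, bandwidth)
    print(f"{label}: at most ~{ceiling:.0f} tokens/s")
```

Real throughput lands below this ceiling, but the ratio between cards tends to track the ratio of their memory bandwidth once the model fits.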
Can I use multiple cheap GPUs instead of one expensive one?
Sometimes. For inference where a model is sharded across cards, multiple cheaper GPUs can work and save money. For training, the interconnect penalty and added complexity often erase the savings. The cheap-multi-GPU path rewards people who are comfortable with the operational overhead and punishes those who aren't.
How long will a GPU stay relevant before I need to upgrade?
A capable card typically remains useful for three to four years, but model sizes and efficiency techniques move fast. The hedge is to buy for the workload you have, not the workload you imagine, and to favor renting when your needs are still in flux. Hardware bought speculatively is hardware that depreciates while idle.
Do I need a GPU at all for inference?
Not always. Small models and quantized mid-size models run acceptably on modern CPUs for low-volume use, and specialized accelerators exist for edge deployment. The GPU becomes necessary when latency, concurrency, or model size pushes past what general-purpose compute handles. Don't buy one until you've confirmed you actually need it.
Key Takeaways
- VRAM sizing follows a simple rule: about 2 GB per billion parameters in 16-bit, plus overhead, with quantization cutting that to roughly half (8-bit) or a quarter (4-bit).
- Most teams overbuy hardware; consumer cards cover more workloads than people assume, and data-center cards earn their cost only at scale.
- Rent-versus-buy is a utilization question; sustained, predictable load favors owning, while spiky or evolving needs favor cloud.
- Low GPU utilization is usually a data-pipeline or batch-size problem, not a hardware shortage, so profile before you purchase more.
- Inference at scale is a throughput-and-batching problem, distinct from single-user latency, and the two goals trade off against each other.