Compute Is the Biggest Line Item Teams Plan Dead Last

Agency Script Editorial

Editorial Team

July 6, 2025 · 8 min read

Compute is the single biggest line item in most serious AI projects, and it is the one teams plan last. They pick a model, write the prompts, sketch the product, and only when the bill arrives do they discover that the GPU they assumed was "good enough" cannot hold the model in memory, or that a job they thought would cost a few hundred dollars is quietly running at ten times that rate.

This guide maps how AI compute and GPU requirements actually work. It is written for people who need to make real budget and architecture decisions: which accelerator to rent, how much memory a workload needs, when to train versus fine-tune versus call an API, and how to avoid paying for capacity you never use. We will stay concrete and name specific trade-offs rather than offering vague reassurance.

By the end you should be able to look at a workload, estimate its compute footprint within a sensible range, and choose hardware deliberately instead of by accident.

What "Compute" Actually Means in AI

Compute is shorthand for three distinct resources that people often collapse into one word. Separating them is the first real skill.

  • Processing throughput — measured in FLOPS (floating-point operations per second). This determines how fast a model runs once it fits.
  • Memory capacity — measured in gigabytes of VRAM on the GPU. This determines whether a model fits at all.
  • Memory bandwidth — measured in GB/s. For inference on large models, this is often the real bottleneck, not raw FLOPS.

The common mistake is to shop by FLOPS alone. A card with enormous theoretical throughput but only 24 GB of VRAM cannot serve a 70-billion-parameter model at FP16, because the weights alone need about 140 GB. If you are starting from scratch, the beginner's guide walks through each of these terms slowly before you commit to numbers.
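To make the bandwidth point concrete, here is a minimal sketch of a common rule of thumb: when single-stream text generation is memory-bound, each new token requires reading every weight once, so memory bandwidth divided by the size of the weights gives a ceiling on tokens per second. The function name and the 1,000 GB/s card are hypothetical, chosen only for illustration, and real throughput will land below this ceiling.

```python
def max_decode_tokens_per_second(params_billions: float,
                                 bytes_per_param: float,
                                 bandwidth_gb_per_s: float) -> float:
    """Upper bound on single-stream decode speed when memory-bound.

    Assumption: generating one token reads every weight once, so the
    ceiling is memory bandwidth divided by the size of the weights.
    """
    weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_per_s / weights_gb


# Hypothetical card with 1,000 GB/s of memory bandwidth (not a real product).
fp16_ceiling = max_decode_tokens_per_second(7, 2.0, 1000)  # 7B at FP16
int4_ceiling = max_decode_tokens_per_second(7, 0.5, 1000)  # 7B at 4-bit

print(f"7B FP16 ceiling:  {fp16_ceiling:.0f} tokens/s")
print(f"7B 4-bit ceiling: {int4_ceiling:.0f} tokens/s")
```

Note that FLOPS never appears in the calculation. Under this assumption, shrinking the weights raises the ceiling as much as buying a card with proportionally more bandwidth would.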

Training Versus Inference: Two Different Problems

These two workloads have almost nothing in common from a hardware perspective, and conflating them is how budgets blow up.

Training

Training updates a model's weights and requires holding the model, its gradients, and optimizer states in memory simultaneously. As a rule of thumb, full fine-tuning needs 16 to 20 bytes per parameter. A 7-billion-parameter model can therefore demand 112 to 140 GB during full training, which is why training typically spans multiple GPUs.
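The arithmetic behind that rule of thumb can be sketched as follows. The function names and the 80 GB card size are assumptions made for illustration. The sketch applies only the bytes-per-parameter rule from the text and does not model activation memory, which depends on batch size and sequence length.

```python
import math


def full_finetune_memory_gb(params_billions: float,
                            bytes_per_param: float) -> float:
    """Memory for weights, gradients, and optimizer states, in GB.

    1e9 parameters times N bytes each is N GB, so the billions cancel.
    """
    return params_billions * bytes_per_param


def min_gpus_needed(required_gb: float, vram_per_gpu_gb: float) -> int:
    """Smallest number of cards whose combined VRAM covers the requirement."""
    return math.ceil(required_gb / vram_per_gpu_gb)


low = full_finetune_memory_gb(7, 16)   # lower end of the rule of thumb
high = full_finetune_memory_gb(7, 20)  # upper end of the rule of thumb

# Hypothetical 80 GB cards; the count ignores activations and sharding overhead.
print(f"7B full fine-tune: {low:.0f} to {high:.0f} GB")
print(f"80 GB cards needed: at least {min_gpus_needed(high, 80)}")
```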

Inference

Inference only runs the model forward. Memory needs drop to roughly 2 bytes per parameter at half precision, plus an allocation for the context window that is small for short prompts but grows with context length. A 7B model needs about 14 GB for its weights at FP16 and fits comfortably on a single 24 GB card. This asymmetry is the most important thing to internalize: serving a model is far cheaper than building one.

Estimating VRAM for a Given Model

The practical question is always "will it fit." Use this back-of-envelope method.

  1. Take the parameter count in billions.
  2. Multiply by 2 for half precision (FP16/BF16), or by roughly 0.5 for 4-bit quantization.
  3. Add 20 to 40 percent overhead for the KV cache, activations, and framework overhead.

So a 13B model at FP16 needs about 26 GB before overhead and roughly 31 to 36 GB with it, meaning a 24 GB card will not hold it but a 48 GB card will, with room to spare. The same model quantized to 4-bit drops to roughly 6.5 GB before overhead, or about 8 to 9 GB with it, and runs on consumer hardware. Our step-by-step guide turns this math into a repeatable sizing procedure.
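The three steps above can be written as a short function. This is a sketch of the back-of-envelope method only. The function and parameter names are illustrative, and the 20 to 40 percent overhead range comes from the text rather than from measurement.

```python
def estimate_inference_vram_gb(params_billions: float,
                               bits_per_param: int = 16,
                               overhead: tuple = (0.20, 0.40)):
    """Return (weights_gb, low_gb, high_gb) for serving a model.

    weights_gb is the size of the weights alone; low_gb and high_gb add
    the lower and upper overhead fractions for KV cache and activations.
    """
    weights_gb = params_billions * bits_per_param / 8  # bits to bytes
    low_gb = weights_gb * (1 + overhead[0])
    high_gb = weights_gb * (1 + overhead[1])
    return weights_gb, low_gb, high_gb


def fits_on_card(params_billions: float, card_gb: float,
                 bits_per_param: int = 16) -> bool:
    """Conservative check: the high-end estimate must fit on the card."""
    _, _, high_gb = estimate_inference_vram_gb(params_billions, bits_per_param)
    return high_gb <= card_gb


for bits in (16, 4):
    weights, low, high = estimate_inference_vram_gb(13, bits)
    print(f"13B at {bits}-bit: {weights:.1f} GB weights, "
          f"{low:.1f} to {high:.1f} GB with overhead")

print("13B FP16 on 24 GB:", fits_on_card(13, 24))
print("13B FP16 on 48 GB:", fits_on_card(13, 48))
```

Checking against the high end of the overhead range is a deliberate choice here: it is better to be told a model will not fit and then measure than to discover an out-of-memory error in production.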

Choosing a GPU Tier

GPUs fall into rough tiers, and most teams never need the top one.

  • Consumer (8–24 GB) — fine for inference on small or quantized models, prototyping, and learning. Cheap to rent or own.
  • Prosumer / workstation (24–48 GB) — handles mid-size inference and light fine-tuning.
  • Datacenter (40–80+ GB) — required for training large models, high-throughput serving, and any job needing fast interconnect across multiple cards.

The honest advice: start one tier lower than you think you need. Quantization, batching, and smaller models close most gaps, and you can always scale up. We compare specific options in our tools roundup.
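As a rough illustration of choosing the lowest tier that works, the sketch below maps a VRAM estimate to the smallest tier that can hold it. The tier ceilings are taken from the upper bounds of the ranges above, which is a simplification because the tiers overlap in practice. The names are hypothetical.

```python
# Upper VRAM bound per tier in GB, taken from the ranges in the list above.
# Treating each tier as a single ceiling is a simplifying assumption.
TIERS = [
    ("consumer", 24),
    ("workstation", 48),
    ("datacenter", 80),
]


def smallest_tier_that_fits(required_gb: float) -> str:
    """Return the lowest tier whose ceiling covers the requirement."""
    for name, max_gb in TIERS:
        if required_gb <= max_gb:
            return name
    # Nothing in a single card's range: the workload must be split.
    return "multi-GPU"


# Example requirements in GB (illustrative values, not measurements).
for required in (9, 36, 70, 140):
    print(f"{required} GB -> {smallest_tier_that_fits(required)}")
```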

Buy, Rent, or Call an API

This is the decision with the largest financial consequences.

  • API (managed inference) — zero hardware ownership, pay per token. Best when usage is spiky or modest. The wrong choice once volume is steady and high.
  • Rented cloud GPUs — flexible, hourly. Excellent for training runs and bursty workloads. Easy to leave idle and bleed money.
  • Owned hardware — lowest cost per hour at high utilization, large upfront cost. Only sensible above roughly 50–60 percent sustained utilization.

The crossover math matters: a cloud GPU that is busy only 10 percent of the time effectively costs ten times its hourly rate per useful hour, which is more than almost any alternative.
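The crossover can be sketched with two small formulas. Every price below is invented for illustration and is not market data. The break-even calculation assumes a rented GPU is shut down when idle, so you pay only for useful hours, while owned hardware accrues cost around the clock.

```python
def cost_per_useful_hour(hourly_rate: float, utilization: float) -> float:
    """Effective price of one busy hour on a GPU that is billed continuously."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate / utilization


def owned_hourly_cost(purchase_price: float, lifetime_years: float,
                      running_cost_per_hour: float) -> float:
    """Amortized cost per wall-clock hour of owning a card."""
    lifetime_hours = lifetime_years * 365 * 24
    return purchase_price / lifetime_hours + running_cost_per_hour


def break_even_utilization(owned_cost_per_hour: float,
                           rental_rate: float) -> float:
    """Utilization above which owning beats renting only the hours used."""
    return owned_cost_per_hour / rental_rate


RENTAL_RATE = 2.00  # hypothetical dollars per hour

idle_heavy = cost_per_useful_hour(RENTAL_RATE, 0.10)
# Hypothetical card: $26,280 over 3 years plus $0.10/hour power and hosting.
owned = owned_hourly_cost(26_280, 3, 0.10)
threshold = break_even_utilization(owned, RENTAL_RATE)

print(f"Rented at 10% utilization: ${idle_heavy:.2f} per useful hour")
print(f"Owned: ${owned:.2f} per hour, break-even at {threshold:.0%} utilization")
```

With these made-up inputs the break-even lands inside the 50 to 60 percent band quoted above, but the result is sensitive to every input, so it is worth rerunning with your own quotes.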

Multi-GPU and Interconnect

Once a workload outgrows a single card, a new variable appears: how fast the GPUs talk to each other.

When you split a model across multiple GPUs — whether for training or for serving something too large to fit on one card — the GPUs must constantly exchange data. The speed of the link between them, the interconnect, becomes a real bottleneck. Cards connected by a slow general-purpose bus will spend more time waiting on each other than computing.

  • Single GPU — no interconnect concern; simplest and usually cheapest per unit of work.
  • Multiple GPUs, one machine — fast direct links between cards make this far more efficient than spreading across machines.
  • Multiple machines — network speed dominates, and efficiency drops unless the workload is designed for it.

The practical lesson: prefer fitting a workload on a single GPU or a single multi-GPU machine before spreading across a cluster. Each step outward adds communication overhead that eats into the compute you are paying for. This is also why quantization, which can keep a model on fewer cards, pays off twice — once in memory and once in avoided interconnect cost.
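A toy model shows why the link speed matters. It assumes each step does a fixed amount of computation and then exchanges a fixed amount of data, with no overlap between the two, which is pessimistic because real frameworks overlap communication with compute. The link speeds and step sizes are hypothetical.

```python
def parallel_efficiency(compute_seconds: float,
                        gb_exchanged: float,
                        link_gb_per_s: float) -> float:
    """Fraction of each step spent computing rather than waiting on the link.

    Assumes communication and compute do not overlap (a worst case).
    """
    comm_seconds = gb_exchanged / link_gb_per_s
    return compute_seconds / (compute_seconds + comm_seconds)


# Hypothetical step: 0.1 s of compute, then 10 GB exchanged between cards.
fast_link = parallel_efficiency(0.1, 10, 500)  # direct GPU-to-GPU link
slow_link = parallel_efficiency(0.1, 10, 25)   # general-purpose bus or network

print(f"Fast link: {fast_link:.0%} of time spent computing")
print(f"Slow link: {slow_link:.0%} of time spent computing")
```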

Context Length and the KV Cache

One factor surprises teams more than any other: the cost of long context windows.

When a model processes a long prompt or a long conversation, it stores intermediate state called the KV cache, and that cache grows with context length. A model that fits comfortably with a short context can run out of memory entirely when fed a very long document, even though the model weights have not changed.

This matters because context windows have grown dramatically. Planning VRAM around the model weights alone, then feeding the model long inputs in production, is a reliable way to hit out-of-memory errors after launch. Size your memory budget for the longest realistic context you will use, not the average, and treat the KV cache as a first-class line item in the math from earlier sections.
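For a standard transformer, the size of the KV cache can be estimated directly: each token stores one key and one value vector per layer per KV head. The sketch below uses that formula. The model dimensions are illustrative values typical of a 7B-class model, not the specification of any particular release, and models that use grouped-query attention have fewer KV heads and a smaller cache.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed for the KV cache of a standard transformer.

    The leading 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to FP16 or BF16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * batch_size * bytes_per_value)


GIB = 1024 ** 3

# Illustrative 7B-class shape: 32 layers, 32 KV heads, head dimension 128.
for tokens in (4_096, 32_768):
    size = kv_cache_bytes(32, 32, 128, tokens)
    print(f"{tokens:>6} tokens: {size / GIB:.0f} GiB of KV cache")
```

With this shape, a 4,096-token context needs 2 GiB of cache and a 32,768-token context needs 16 GiB, more than the model's FP16 weights. The cache also scales linearly with batch size, so concurrent requests multiply it further.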

Common Failure Modes

A few patterns account for most wasted spend.

  • Leaving rented GPUs running idle overnight or between jobs.
  • Using full precision when quantization would be invisible to users.
  • Over-provisioning VRAM "to be safe" instead of measuring.
  • Training from scratch when fine-tuning or prompting would suffice.

Each of these has a fix, covered in depth in our common mistakes breakdown and codified in our best practices piece.

Frequently Asked Questions

How much GPU memory do I need to run a typical open model?

For a 7B model at FP16, the weights take about 14 GB, so plan on roughly 17–20 GB including overhead, which fits on a 24 GB card. Quantize to 4-bit and the same model needs roughly 4–5 GB. Always size for the largest model you intend to serve, not the smallest.

Is it cheaper to own a GPU or rent in the cloud?

Owning wins only at high, sustained utilization, typically above 50 to 60 percent. Below that, rented cloud GPUs or managed APIs are cheaper because you stop paying when the work stops. Idle owned hardware is pure loss.

Do I need a datacenter GPU to get started?

No. Most learning, prototyping, and inference on small or quantized models runs fine on consumer cards. Datacenter GPUs become necessary mainly for training large models or high-throughput production serving.

What is the difference between FLOPS and memory bandwidth?

FLOPS measures raw calculation speed; memory bandwidth measures how fast data moves to and from the GPU. For large-model inference, bandwidth is frequently the real limiter, so do not choose hardware on FLOPS alone.

Does quantization hurt quality?

Modern 8-bit quantization is usually indistinguishable in quality from FP16, and 4-bit is acceptable for many applications. Relative to FP16, 8-bit roughly halves the memory needed for weights and 4-bit roughly quarters it, making quantization the highest-leverage optimization available.

Key Takeaways

  • Treat compute as three resources: throughput, memory capacity, and bandwidth — never just one.
  • Training and inference have completely different hardware profiles; inference is far cheaper.
  • Estimate VRAM in GB as parameters in billions times 2 (FP16), plus 20–40 percent overhead, before choosing a card.
  • Start one GPU tier lower than instinct suggests and rely on quantization and batching.
  • Own hardware only above roughly 50–60 percent sustained utilization; otherwise rent or use an API.
  • Idle GPUs are the most common source of wasted AI budget.
