Sizing GPUs the Fifth Time Should Be Routine, Not Research

The first time a team sizes GPUs, it's research. The fifth time, it should be a workflow. If every new model still triggers a panicked round of VRAM math and vendor comparisons, you haven't built a process; you've just survived a problem five times. The difference between those two states is documentation and repeatability.

This article is about converting compute planning into a workflow that survives handoffs. The test is simple: could a new engineer, handed your documentation, size and provision a workload correctly without tapping the person who did it last time? If not, you have tribal knowledge, and tribal knowledge evaporates when people leave.

We'll build the workflow in stages: capture the inputs, standardize the decisions, document the handoff, and close the loop with review. For the underlying concepts each stage relies on, A Step-by-Step Approach to AI Compute and GPU Requirements is the companion reference.

Stage 1: Standardize the intake

A repeatable workflow starts with a repeatable input. Every compute request should arrive in the same shape, captured in the same template, so no one is guessing what's missing.

The intake template

Require these fields for every request:

Model name and parameter count.
Target precision (16-bit, 8-bit, 4-bit).
Workload type: inference, fine-tuning, or training.
Expected concurrency and latency target.
Duration: one-off experiment or ongoing service.
Budget ceiling.

When intake is standardized, the person sizing the workload never has to chase down basics. The template does the chasing. This single change eliminates most of the back-and-forth that makes ad hoc planning slow.
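
One way to make the template enforce itself is to express it as a typed record that rejects incomplete or malformed requests. The sketch below is illustrative only: the class name, field names, and validation rules are assumptions layered on the field list above, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum


class Workload(Enum):
    INFERENCE = "inference"
    FINE_TUNING = "fine-tuning"
    TRAINING = "training"


class Duration(Enum):
    ONE_OFF = "one-off experiment"
    ONGOING = "ongoing service"


@dataclass(frozen=True)
class ComputeRequest:
    """One intake record. Every field is required, so gaps fail loudly."""
    model_name: str
    params_billions: float
    precision_bits: int        # 16, 8, or 4
    workload: Workload
    concurrency: int           # expected simultaneous requests
    latency_target_s: float    # target seconds per response
    duration: Duration
    budget_ceiling: float      # currency and period are left to the team

    def __post_init__(self):
        # Hypothetical validation rules; adjust to your own intake policy.
        if not self.model_name.strip():
            raise ValueError("model_name is required")
        if self.params_billions <= 0:
            raise ValueError("params_billions must be positive")
        if self.precision_bits not in (16, 8, 4):
            raise ValueError("precision_bits must be 16, 8, or 4")
        if self.concurrency < 1:
            raise ValueError("concurrency must be at least 1")
        if self.latency_target_s <= 0:
            raise ValueError("latency_target_s must be positive")
        if self.budget_ceiling <= 0:
            raise ValueError("budget_ceiling must be positive")
```

Because the record has no default values, a request with a missing field cannot be constructed at all, which is the code equivalent of the template doing the chasing.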

Stage 2: Encode the sizing logic

The math for memory and throughput shouldn't live in someone's head. Encode it as a documented procedure or, better, a small calculator that takes the intake fields and outputs a memory floor and a throughput target.

Memory floor: roughly 2 GB per billion parameters at 16-bit (2 bytes per parameter), scaled by precision to about 1 GB at 8-bit and 0.5 GB at 4-bit, plus 25 to 40 percent overhead.
Throughput target: derived from concurrency and latency requirements.
Training multiplier: 4x to 6x the memory floor for full tuning, because gradients and optimizer state sit alongside the weights; far less for parameter-efficient methods.

The point isn't precision to the gigabyte. The point is that two different engineers, given the same intake, produce the same sizing. That consistency is what makes the workflow trustworthy. The common errors this stage prevents are catalogued in 7 Common Mistakes with AI Compute and GPU Requirements (and How to Avoid Them).
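
The rules of thumb above can be sketched as a small calculator. Treat this as a minimal illustration under stated assumptions: the function names are invented here, the training multiplier is assumed to apply to the inference floor including overhead, and the throughput formula (total tokens per second from concurrency, response length, and latency) is one plausible reading of "derived from concurrency and latency", with tokens_per_response as a hypothetical extra input.

```python
GB_PER_BILLION_AT_16_BIT = 2.0        # rule of thumb from the text
OVERHEAD_RANGE = (0.25, 0.40)         # 25 to 40 percent, from the text
FULL_TUNING_MULTIPLIER = (4.0, 6.0)   # 4x to 6x, from the text


def memory_floor_gb(params_billions, precision_bits=16, full_tuning=False):
    """Return a (low, high) memory estimate in GB for the given model."""
    if precision_bits not in (16, 8, 4):
        raise ValueError("precision_bits must be 16, 8, or 4")
    # Scale the 16-bit figure linearly with precision.
    weights_gb = params_billions * GB_PER_BILLION_AT_16_BIT * precision_bits / 16
    low = weights_gb * (1 + OVERHEAD_RANGE[0])
    high = weights_gb * (1 + OVERHEAD_RANGE[1])
    if full_tuning:
        # Assumption: the multiplier applies to the floor including overhead.
        low *= FULL_TUNING_MULTIPLIER[0]
        high *= FULL_TUNING_MULTIPLIER[1]
    return round(low, 1), round(high, 1)


def throughput_target_tokens_per_s(concurrency, tokens_per_response, latency_target_s):
    """Aggregate tokens per second needed to serve all concurrent requests
    within the latency target (an assumed formula, not from the text)."""
    if latency_target_s <= 0:
        raise ValueError("latency_target_s must be positive")
    return concurrency * tokens_per_response / latency_target_s


if __name__ == "__main__":
    print(memory_floor_gb(7, 16))                      # (17.5, 19.6)
    print(throughput_target_tokens_per_s(8, 256, 4))   # 512.0
```

For a hypothetical 7-billion-parameter model at 16-bit, the weights come to 14 GB and the floor lands between 17.5 and 19.6 GB. Returning a range rather than a single number keeps the estimate honest about the overhead spread while still giving every engineer the same answer.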

Stage 3: Make the hardware decision a lookup

Once you have a memory floor and throughput target, choosing hardware should be a lookup, not a debate. Maintain a current table mapping requirement ranges to recommended options, with the rent-or-buy guidance attached.

The decision table

Small models, low concurrency: consumer-class card or CPU, rent first.
Mid-size models, moderate load: high-memory consumer or entry data-center card.
Large models or high concurrency: data-center card, evaluate owning if utilization is sustained.
Training at scale: multi-GPU node with fast interconnect.

Update this table when prices or cards shift, ideally during the monthly cost review. The table turns a recurring research task into a five-minute reference check.
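
To show what "a lookup, not a debate" can look like in practice, here is a minimal sketch of the table as data. The numeric thresholds are entirely hypothetical placeholders, since the table above is qualitative; the recommendation strings are taken from it. A team would replace the thresholds with its own current ranges.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    max_memory_gb: float
    max_concurrency: float
    recommendation: str


# HYPOTHETICAL thresholds for illustration; review them on the same
# schedule as the table itself.
DECISION_TABLE = [
    Tier(16, 4, "consumer-class card or CPU, rent first"),
    Tier(48, 32, "high-memory consumer or entry data-center card"),
    Tier(math.inf, math.inf,
         "data-center card, evaluate owning if utilization is sustained"),
]
TRAINING_AT_SCALE = "multi-GPU node with fast interconnect"


def recommend(memory_floor_gb, concurrency, training_at_scale=False):
    """Return the first tier whose memory AND concurrency limits both fit.

    Exceeding either limit pushes the request to the next tier, which
    mirrors the "large models or high concurrency" row.
    """
    if training_at_scale:
        return TRAINING_AT_SCALE
    for tier in DECISION_TABLE:
        if memory_floor_gb <= tier.max_memory_gb and concurrency <= tier.max_concurrency:
            return tier.recommendation
    raise AssertionError("unreachable: the last tier has no upper limit")
```

Keeping the table as plain data means the monthly update is a one-line change, and the lookup logic never needs to be touched.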

Stage 4: Document the provisioning steps

Provisioning is where undocumented workflows leak. The person who set up the last environment knows the quirks; nobody else does. Write the steps down as an executable runbook.

The exact provisioning commands or console steps.
Monitoring setup for utilization, memory, temperature, and cost.
Budget alerts and quotas.
Validation: a quick test confirming the GPU is reachable, the model loads, and utilization climbs under load.

Better still, capture as much of this as code. Infrastructure defined in configuration files is self-documenting and reproducible in a way that a wiki page never is.
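
As one example of turning the validation step into code, the sketch below reads GPU state through nvidia-smi and checks that utilization rose under load. It assumes NVIDIA hardware with nvidia-smi on the PATH; the helper names and the 20-point rise threshold are hypothetical, and confirming that the model loads is left to whatever serving stack the team uses.

```python
import subprocess

# Query fields supported by nvidia-smi; the parsing and threshold are illustrative.
QUERY_FIELDS = ("name", "utilization.gpu", "memory.used", "memory.total", "temperature.gpu")


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        name, util, used, total, temp = [part.strip() for part in line.split(",")]
        gpus.append({
            "name": name,
            "utilization_pct": int(util),
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
            "temperature_c": int(temp),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi; raises if the tool is missing or exits non-zero."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def utilization_climbed(idle, loaded, min_rise_pct=20):
    """True if every GPU's utilization rose by at least min_rise_pct
    between the idle snapshot and the under-load snapshot."""
    return bool(idle) and len(idle) == len(loaded) and all(
        after["utilization_pct"] - before["utilization_pct"] >= min_rise_pct
        for before, after in zip(idle, loaded)
    )
```

A runbook step might call query_gpus() once before starting a test workload and once during it, then pass both snapshots to utilization_climbed. An empty result from the first call is itself a signal that the GPU is not reachable.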

Stage 5: Build the optimization loop

A workflow that stops at provisioning is incomplete. The most expensive recurring failure is idle hardware, and catching it requires a standing loop, not a one-time check.

The recurring loop

Weekly: review utilization dashboards; flag anything chronically below 70 percent.
On each flag: profile to find the bottleneck (data loading, batch size, preprocessing, synchronization) and tune.
Monthly: reconcile cost against forecast; downsize or shut down underused resources.

Document who runs this loop and on what cadence. An optimization step with no owner is an optimization step that doesn't happen.
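
The weekly and monthly checks are mechanical enough to script. The following is a rough sketch, not a monitoring product: the 70 percent floor comes from the loop above, while the definition of "chronically" (at least 80 percent of samples under the floor), the function names, and the sample data shape are assumptions made for illustration.

```python
UTILIZATION_FLOOR_PCT = 70   # from the weekly review rule
CHRONIC_SHARE = 0.8          # hypothetical: share of samples that must be low


def chronically_underused(samples_pct, floor_pct=UTILIZATION_FLOOR_PCT,
                          chronic_share=CHRONIC_SHARE):
    """True if enough of the week's utilization samples fall below the floor."""
    if not samples_pct:
        raise ValueError("no utilization samples to review")
    below = sum(1 for sample in samples_pct if sample < floor_pct)
    return below / len(samples_pct) >= chronic_share


def weekly_flags(utilization_by_resource):
    """Return the sorted names of resources to profile this week."""
    return sorted(
        name for name, samples in utilization_by_resource.items()
        if chronically_underused(samples)
    )


def cost_variance(actual, forecast):
    """Fractional overrun (positive) or underrun (negative) against forecast."""
    if forecast <= 0:
        raise ValueError("forecast must be positive")
    return (actual - forecast) / forecast
```

Requiring a sustained share of low samples, rather than a single low reading, keeps one quiet afternoon from generating a flag. Whoever owns the loop can run weekly_flags against the dashboard export and cost_variance against the month's bill.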

Stage 6: Make it hand-off-able

The final stage is the test of whether you actually have a workflow. Bundle the intake template, the sizing calculator, the decision table, the provisioning runbook, and the optimization loop into one place a new person can find.

Store everything in a single, discoverable location, not scattered across chats and notebooks.
Include a worked example: one real request taken end to end.
Note the failure modes and how the workflow caught them.

A worked example does more teaching than any abstract description. When you can hand someone the bundle plus one example and they can run the next request unaided, the workflow is real. For inspiration on what good documentation of real cases looks like, see Case Study: AI Compute and GPU Requirements in Practice.

Common failure modes the workflow prevents

It helps to be explicit about what goes wrong without a workflow, because those failures are exactly what the structure protects you against.

The repeated research tax. Every new model triggers the same VRAM math from scratch because nobody wrote it down. The encoded sizing logic in Stage 2 eliminates this entirely.
The handoff cliff. The one engineer who knows how to provision leaves, and provisioning becomes a multi-day archaeology project. The runbook in Stage 4 turns that into a checklist anyone can follow.
The silent idle GPU. Hardware runs at a fraction of capacity for months because no one is watching utilization. The optimization loop in Stage 5 catches it within a week.
The inconsistent estimate. Two engineers size the same workload differently and one is badly wrong. Standardized intake plus encoded math forces convergence.

Naming these failures in your documentation does double duty: it justifies the workflow to skeptics and tells the next person what each stage is actually for.

Frequently Asked Questions

How much should I document versus automate?

Automate the parts that are mechanical and repetitive: the sizing math, the provisioning steps, the monitoring setup. Document the parts that require judgment: the rent-or-buy reasoning, the trade-offs behind the decision table. The goal is that mechanical steps run themselves and judgment steps are at least guided by written reasoning rather than improvised each time.

What's the minimum viable version of this workflow?

An intake template and a sizing procedure. Those two artifacts alone eliminate most of the chaos, because they standardize what goes in and how it's evaluated. You can add the decision table, provisioning runbook, and optimization loop as the workflow matures, but the intake-plus-sizing pair is the irreducible core.

How do I keep the workflow from going stale?

Tie its maintenance to a recurring event you already run, like the monthly cost review. During that review, update the decision table with current prices and cards, and check whether any provisioning steps have changed. A workflow maintained on a schedule stays current; one maintained only when it breaks is always slightly wrong.

Who owns the workflow?

Someone has to, or it decays. Usually a platform or infrastructure lead owns the documentation and the decision table, while individual engineers own running the workflow for their own requests. The owner's job is keeping the shared artifacts accurate, not personally sizing every workload.

Does a small team really need this?

A small team needs it more, because a small team can't afford the time lost re-researching the same questions. The workflow doesn't have to be elaborate; even a single shared document with the intake template, sizing math, and a decision table saves hours and prevents the worst sizing mistakes.

Key Takeaways

A repeatable workflow starts with a standardized intake template so no request arrives missing the basics.
Encode the sizing math so any engineer produces the same memory floor and throughput target from the same inputs.
Turn hardware selection into a lookup table maintained on a schedule, not a recurring debate.
Document provisioning as an executable runbook, ideally as infrastructure-as-code that is self-documenting.
Close the loop with a weekly utilization review and monthly cost reconciliation, each with a named owner.

Agency Script Editorial
