Resist the Bigger GPU; Measure First Instead

Agency Script Editorial

Editorial Team

October 24, 2025 · 7 min read

You have a working AI feature. It is slow, you suspect you are overpaying, and you are not sure where to start. The instinct is to reach for a heavyweight serving framework or a bigger GPU. Resist it. The fastest credible path from a slow prototype to a measurably faster setup runs through measurement, a few cheap wins, and only then infrastructure. Most teams capture the majority of their possible latency improvement before touching their serving stack at all.

This guide is the zero-to-first-result path. It assumes you can call a model and get a response, and nothing more. By the end you will have a baseline measurement, two or three concrete improvements, and a clear sense of whether your remaining bottleneck is the prompt, the model, or the infrastructure. For deeper grounding on the concepts, AI Inference and Latency: A Beginner's Guide is a useful companion.

Prerequisites: What You Need First

Before optimizing anything, confirm three things are in place.

  • A representative test set. Twenty to fifty real requests that mirror your actual traffic in prompt length and complexity. Optimizing against toy inputs produces results that evaporate in production.
  • The ability to time requests. A way to record request start, first-token arrival, and completion. Without timing you are guessing.
  • A definition of "good enough." What latency would make the feature feel right? Write the target down before you start so you know when to stop.

That is the entire setup. You do not need new infrastructure yet.
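
As one way to wire up that timing, the sketch below wraps any iterable of streamed chunks and records time to first token and total time. The names `Timing`, `time_stream`, and `fake_stream` are illustrative and not from any particular SDK; `fake_stream` is a hypothetical stand-in for a real streaming model call, and counting chunks is only a rough proxy for output tokens.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class Timing:
    ttft_s: Optional[float]  # time to first token; None if nothing was produced
    total_s: float           # request start to completion
    output_chunks: int       # chunks received (rough proxy for output tokens)


def time_stream(stream: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> Timing:
    """Consume a stream and record first-chunk and completion times."""
    start = clock()
    ttft = None
    count = 0
    for _chunk in stream:
        if ttft is None:
            ttft = clock() - start  # first chunk has just arrived
        count += 1
    return Timing(ttft_s=ttft, total_s=clock() - start, output_chunks=count)


def fake_stream(n: int, first_delay: float = 0.05,
                per_token: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model call."""
    time.sleep(first_delay)  # simulated prefill before the first token
    for i in range(n):
        if i:
            time.sleep(per_token)  # simulated decode time per token
        yield f"tok{i}"
```

Calling `time_stream(fake_stream(5))` returns a `Timing` with all three fields filled in. The sketch assumes the stream does no work until it is iterated; if your client sends the request as soon as you call it, record the start time before that call instead.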

Step One: Establish a Baseline

Run your test set and record the numbers. For each request, capture time to first token, total time, and the input and output token counts. Report the p50 and p95, not the average, because an average hides the slow tail that users actually notice.

This baseline is the most valuable artifact in the whole process. It tells you where you stand, and every later change gets measured against it. If you skip it, you will not be able to tell whether your "optimizations" actually helped. The full instrumentation method is in How to Measure AI Inference and Latency.
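
A minimal sketch of the p50 and p95 summary follows, using the nearest-rank definition of a percentile. The function names and the sample latencies are illustrative assumptions; other percentile definitions interpolate between samples and can give slightly different values on small test sets.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_s: Sequence[float]) -> Dict[str, float]:
    """Summarize a list of per-request latencies in seconds."""
    return {
        "p50": percentile(latencies_s, 50),
        "p95": percentile(latencies_s, 95),
        "mean": sum(latencies_s) / len(latencies_s),
    }


# Illustrative data: 18 fast requests and 2 slow ones.
samples = [1.0] * 18 + [9.0] * 2
summary = summarize(samples)
```

With these made-up samples the p50 is 1.0 s and the p95 is 9.0 s, while the mean lands near 1.8 s, a number that describes none of the actual requests.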

Step Two: Win the Cheap, Quality-Neutral Improvements

Before changing models or hardware, harvest the free wins. These cost almost nothing and rarely hurt quality.

Trim the Prompt

Long system prompts inflate prefill time and cost on every single request. Cut redundant instructions, remove examples the model no longer needs, and shorten retrieved context to the chunks that actually matter. Shorter input means faster first token and lower cost simultaneously.
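
Shortening retrieved context can be sketched as a budgeted selection: keep the highest-scoring chunks until a token budget is used up. Everything here is an assumption for illustration, including the relevance scores (which would come from your retriever) and `rough_token_count`, a crude word count that should be replaced with your model's real tokenizer.

```python
from typing import List, Tuple


def rough_token_count(text: str) -> int:
    """Crude estimate: whitespace-separated words. Swap in a real tokenizer."""
    return len(text.split())


def select_chunks(scored_chunks: List[Tuple[float, str]],
                  budget_tokens: int) -> List[str]:
    """Keep the highest-scoring chunks that fit within budget_tokens."""
    kept: List[str] = []
    used = 0
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    for _score, chunk in ranked:
        cost = rough_token_count(chunk)
        if used + cost > budget_tokens:
            continue  # does not fit; a smaller, lower-ranked chunk still might
        kept.append(chunk)
        used += cost
    return kept


# Illustrative chunks with made-up relevance scores.
chunks = [
    (0.9, "refund policy allows returns within thirty days"),
    (0.2, "our office dog is named Biscuit"),
    (0.7, "refunds are issued to the original payment method"),
]
context = select_chunks(chunks, budget_tokens=15)
```

Here the two refund chunks fit the budget and the irrelevant one is dropped. Re-run your test set after tightening the budget, since an overly small budget can remove context the model needs.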

Cap the Output

Uncapped generation is a common silent latency killer. Set a sensible max output length and instruct the model to be concise. Many slow responses are slow only because the model rambled. Fewer output tokens directly shorten decode time.
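
The effect of an output cap can be estimated with a simple linear approximation: total time is roughly time to first token plus output tokens multiplied by per-token decode time. The function name and the numbers below are illustrative assumptions, not measurements; substitute the values from your own baseline.

```python
def estimated_total_s(ttft_s: float, output_tokens: int,
                      s_per_output_token: float) -> float:
    """Rough model: prefill (time to first token) plus linear decode time."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return ttft_s + output_tokens * s_per_output_token


# Assumed figures for illustration only.
TTFT_S = 0.4
S_PER_TOKEN = 0.02

uncapped = estimated_total_s(TTFT_S, 800, S_PER_TOKEN)  # a rambling answer
capped = estimated_total_s(TTFT_S, 200, S_PER_TOKEN)    # the same call with a cap
saved = uncapped - capped
```

Under these assumed figures the uncapped response takes about 16.4 s and the capped one about 4.4 s. Real decode time is not perfectly linear, so treat this as a planning estimate and confirm it against measured numbers.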

Turn On Streaming

If your interface waits for the full response before showing anything, switch to streaming so tokens appear as they generate. This does not reduce total latency, but it slashes perceived latency, which is what users actually judge. For interactive features this is often the single highest-impact change.

Step Three: Right-Size the Model

Once the prompt and output are tight, ask whether you are using more model than the task needs.

Most teams default to the largest available model out of caution. Run your test set against a smaller or distilled model and compare quality on your real tasks, not on benchmarks. If the smaller model passes, you have just cut latency and cost together with one config change. Keep the large model as an escalation path for the hard cases you can detect. This is the foundation of the cascade pattern explored in Advanced AI Inference and Latency: Going Beyond the Basics.
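
The escalation path can be sketched as a small cascade: call the small model first and fall back to the large one only when a check rejects the answer. The model functions and the `non_empty` check below are hypothetical stand-ins; in practice the check is task-specific, such as a schema validation or a confidence signal.

```python
from typing import Callable, Tuple

ModelFn = Callable[[str], str]


def cascade(prompt: str, small: ModelFn, large: ModelFn,
            is_acceptable: Callable[[str, str], bool]) -> Tuple[str, str]:
    """Try the small model first; escalate to the large one if the check fails."""
    answer = small(prompt)
    if is_acceptable(prompt, answer):
        return answer, "small"
    return large(prompt), "large"


# Hypothetical stand-ins for real model calls.
def small_model(prompt: str) -> str:
    return "short answer" if "easy" in prompt else ""


def large_model(prompt: str) -> str:
    return "detailed answer"


def non_empty(prompt: str, answer: str) -> bool:
    """Placeholder check; replace with a task-specific validation."""
    return bool(answer.strip())


easy = cascade("easy question", small_model, large_model, non_empty)
hard = cascade("hard question", small_model, large_model, non_empty)
```

Note that an escalated request pays for both calls, so a cascade only helps when the small model handles most of the traffic. Track the escalation rate alongside your latency numbers.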

Step Four: Add Caching Where It Pays

Some requests repeat. Some prompt prefixes are shared across every request. Both are caching opportunities.

  • Response caching returns a stored answer for identical queries, or near-identical ones once they are normalized, in milliseconds and skips generation entirely.
  • Prompt prefix caching, supported by many serving setups, reuses the processed form of a shared system prompt so prefill is not repeated on every call.

Caching is the highest-leverage infrastructure move available to a beginner, because it removes work entirely rather than speeding it up.
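
A minimal in-memory sketch of response caching is shown below. The class and function names are illustrative, and the normalization (lowercasing and collapsing whitespace) is an assumption that only suits tasks where case and spacing do not change the right answer. It has no eviction or expiry, which a production cache would need.

```python
import hashlib
from typing import Callable, Dict


def cache_key(model: str, prompt: str) -> str:
    """Key on the model plus a normalized prompt so trivial variants collide."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()


class ResponseCache:
    """Wraps a generate function and reuses answers for repeated prompts."""

    def __init__(self, generate: Callable[[str], str], model: str) -> None:
        self._generate = generate
        self._model = model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, prompt: str) -> str:
        key = cache_key(self._model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]  # no model call at all
        self.misses += 1
        answer = self._generate(prompt)
        self._store[key] = answer
        return answer
```

Wrap your existing model call in `ResponseCache(generate_fn, model="...")` and watch the `hits` and `misses` counters on real traffic; a low hit rate means this cache is not where your time goes. Including the model name in the key keeps answers from one model from being served after you switch to another.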

Step Five: Only Now, Consider Infrastructure

If you have done the above and still miss your target, the bottleneck is likely capacity or serving efficiency. This is when a purpose-built serving framework, batching, or better hardware earns its complexity. Choose tools using The Best Tools for AI Inference and Latency, and avoid the early errors documented in 7 Common Mistakes with AI Inference and Latency.

The discipline here is sequencing. Teams that start with infrastructure spend weeks configuring batching to speed up requests that were slow because of a bloated prompt. Measure, harvest cheap wins, right-size the model, cache, then scale — in that order.

Your First Week: A Realistic Plan

To turn this into action, here is what a realistic first week looks like. It is deliberately modest, because the goal is a real result, not a grand re-architecture.

Days One and Two: Baseline

Assemble your test set of real requests and wire up timing for first-token, completion, and token counts. Run it, record p50 and p95, and write down your target. You now have the single artifact every later decision references. Resist the urge to change anything yet — measuring honestly before touching the system is what makes the rest credible.

Days Three and Four: Harvest

Trim the system prompt, set an output cap, and turn on streaming. Re-run the baseline after each change so you can attribute the improvement to the specific edit. Most teams are surprised how much of their gap closes here, with no infrastructure and no quality loss. Document each delta; these numbers are also the start of the business case in The ROI of AI Inference and Latency.
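
Documenting each delta can be as simple as a percent change per metric between two runs. The metric names and values below are made up for illustration; use the summaries from your own baseline and re-runs.

```python
from typing import Dict


def percent_change(before: float, after: float) -> float:
    """Negative means the metric went down, which for latency means faster."""
    if before == 0:
        raise ValueError("baseline value must be non-zero")
    return (after - before) / before * 100


def compare(baseline: Dict[str, float],
            current: Dict[str, float]) -> Dict[str, float]:
    """Percent change for every metric present in both runs."""
    return {
        name: round(percent_change(baseline[name], current[name]), 1)
        for name in baseline
        if name in current
    }


# Illustrative numbers, in seconds.
baseline_run = {"ttft_p50": 1.2, "total_p95": 8.0}
after_prompt_trim = {"ttft_p50": 0.9, "total_p95": 6.0}
delta = compare(baseline_run, after_prompt_trim)
```

With these example values both metrics drop by 25 percent. Comparing after every single change, rather than once at the end, is what lets you attribute each improvement to a specific edit.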

Day Five: Right-Size and Cache

Run your test set against a smaller model and compare quality on your real tasks. If it passes, switch and keep the large model as escalation. Then add response caching for repeated queries and enable prompt prefix caching if your setup supports it. Re-measure one final time.
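
The quality comparison can be sketched as a pass rate over your own test cases, computed once per model. The stub models and checks below are hypothetical; real checks would encode whatever "passes" means for your task, and many tasks need human review rather than an automated check.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, Callable[[str], bool]]  # (prompt, check on the answer)


def pass_rate(model: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of test cases whose answer satisfies its check."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


# Hypothetical stand-ins for a small and a large model.
def small_stub(prompt: str) -> str:
    return "Paris" if "France" in prompt else "unsure"


def large_stub(prompt: str) -> str:
    return "Paris" if "France" in prompt else "Canberra"


cases: List[Case] = [
    ("Capital of France?", lambda answer: "Paris" in answer),
    ("Capital of Australia?", lambda answer: "Canberra" in answer),
]
small_rate = pass_rate(small_stub, cases)
large_rate = pass_rate(large_stub, cases)
```

In this toy example the small stub passes half the cases and the large stub passes all of them, so the small model would not clear the bar on its own. Decide your required pass rate before running the comparison, the same way you wrote down the latency target.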

At the end of the week you will have a documented before-and-after, a faster system, and a clear answer to whether your remaining bottleneck justifies infrastructure work. That is a genuine first result, and it is the foundation for everything more advanced. If you want the conceptual grounding underneath these steps, read this guide alongside AI Inference and Latency: A Beginner's Guide.

Frequently Asked Questions

Do I need a serving framework to get started?

No. Most beginners capture the majority of their possible improvement through prompt trimming, output caps, streaming, model right-sizing, and caching — none of which require a serving framework. Reach for one only after those are exhausted and you still miss your target.

What should I measure on my very first day?

Time to first token, total time, and input and output token counts across a representative test set of twenty to fifty real requests. Report p50 and p95. This baseline becomes the reference for every later change.

How do I know if I can use a smaller model?

Run your real test set against the smaller model and compare quality on your actual tasks, not on public benchmarks. If it passes your bar, you have cut latency and cost in one change, with the large model kept as an escalation path for hard cases.

Why turn on streaming if it does not reduce total latency?

Because users judge perceived latency, not total latency. Streaming makes the first tokens appear almost immediately, which makes the feature feel responsive even when the full answer takes the same time. For interactive use it is often the highest-impact single change.

What is the right order of operations?

Measure a baseline, harvest cheap quality-neutral wins, right-size the model, add caching, and only then invest in infrastructure. Teams that reverse this order waste effort optimizing problems that a prompt edit would have solved.

Key Takeaways

  • Start with measurement; a baseline at p50 and p95 is your most valuable artifact.
  • Trim prompts and cap output for fast, quality-neutral wins before anything else.
  • Streaming cuts perceived latency dramatically without reducing total time.
  • Right-size the model against your real tasks, not benchmarks, and keep the large one as escalation.
  • Caching removes work entirely and is the best infrastructure move for beginners.
  • Sequence matters: measure, harvest, right-size, cache, then scale.
