© 2026 Agency Script, Inc.
Plays You Run So You Stop Improvising Every Time

Agency Script Editorial

Editorial Team

March 28, 2026 · 8 min read
Tags: multimodal AI, multimodal AI playbook, multimodal AI guide, ai fundamentals

Most multimodal AI advice is a list of capabilities. A playbook is different. A playbook tells you what to do, when the moment to do it arrives, who runs it, and what comes next. It assumes you've already decided multimodal AI belongs in your work and now need to operate it without improvising every time.

This piece is structured as plays. Each play has a trigger that tells you when to run it, an owner who is accountable, and a place in the sequence that connects it to the plays before and after. You don't run all of them at once. You run the one whose trigger just fired. The sequencing is the part teams get wrong: they jump to deployment before they've built an evaluation set, then spend months guessing why production accuracy doesn't match the demo.

If you want the conceptual grounding underneath these moves, A Framework for Multimodal AI covers the why. This piece is the what and when.

Play 1: Qualify the Use Case

Trigger: Someone proposes a multimodal feature, or a manual process involving images, documents, or audio becomes a bottleneck.

Owner: Product lead.

The first play is saying no to bad fits before they cost you anything. A use case qualifies when three things are true: the input genuinely needs visual or audio understanding, the cost of an occasional wrong answer is survivable or catchable, and there's a measurable definition of "correct."

If a task is really just text with extra steps, route it to a text-only solution and move on. If a wrong answer is catastrophic and uncatchable, the play is to redesign for a human checkpoint or abandon it. Write the qualification down in two or three sentences. That note becomes the contract everyone references later when scope drifts.

Play 2: Build the Evaluation Set First

Trigger: A use case has passed qualification.

Owner: Whoever will be accountable for accuracy, usually an engineer or analyst.

Before you write a single prompt, assemble 100 to 300 real examples with known correct answers. Real means actual documents, photos, and recordings from your domain, including the ugly ones: blurry scans, rotated pages, accented speech, edge cases. The evaluation set is the most important artifact you'll build, and it must exist before any prompting or model work begins, or you'll have no way to tell whether changes help.

What goes in the set

  • A representative spread of normal cases.
  • A deliberate slice of hard and adversarial cases.
  • The correct answer for each, labeled by a human.
  • Notes on which errors are expensive versus cosmetic.

This play feeds every later one. Skip it and the entire sequence collapses into guesswork.
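One way to keep the set honest is to store each example as a small record that carries all four ingredients above. The sketch below is a minimal illustration, not a prescribed schema: the field names, the file paths, and the three sample rows are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalExample:
    example_id: str
    input_path: str   # hypothetical path to a real document, photo, or recording
    expected: str     # the correct answer, labeled by a human
    difficulty: str   # "normal", "hard", or "adversarial"
    error_cost: str   # "expensive" or "cosmetic"


def summarize(examples):
    """Count examples per difficulty slice so coverage gaps are visible."""
    return Counter(ex.difficulty for ex in examples)


# Illustrative rows only; a real set would hold 100 to 300 of these.
eval_examples = [
    EvalExample("inv-001", "scans/invoice_clean.png", "1250.00", "normal", "expensive"),
    EvalExample("inv-002", "scans/invoice_rotated.png", "89.90", "hard", "expensive"),
    EvalExample("inv-003", "scans/invoice_blurry.png", "402.15", "adversarial", "cosmetic"),
]

print(summarize(eval_examples))
```

Checking the slice counts before you start prompting tells you whether the hard and adversarial cases are actually represented or only intended.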

Play 3: Prototype With Prompting

Trigger: Evaluation set exists.

Owner: Engineer.

Now you build the simplest thing that could work: a hosted multimodal model, a clear prompt, and your evaluation set. Run it, score it, read the failures. The goal of this play is not a great result. It's a baseline number and a list of failure patterns.

Resist the urge to optimize here. You're learning what the model does naturally so you know what's left to fix. Many teams discover the baseline is already good enough for an assistive feature, which means several later plays become unnecessary. For prototyping technique, A Step-by-Step Approach to Multimodal AI walks the mechanics.
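The run, score, and read loop can be as small as the sketch below. It assumes exact-match scoring after trimming and lowercasing, which will not suit every task, and `fake_predict` is a hypothetical stand-in for whatever hosted model call you actually make. The file names and answers are invented for illustration.

```python
def score(eval_pairs, predict):
    """Run predict on every example; return accuracy and the list of failures."""
    failures = []
    for input_ref, expected in eval_pairs:
        got = predict(input_ref)
        # Assumption: exact match after trimming and lowercasing counts as correct.
        if got.strip().lower() != expected.strip().lower():
            failures.append({"input": input_ref, "expected": expected, "got": got})
    accuracy = 1 - len(failures) / len(eval_pairs)
    return accuracy, failures


def fake_predict(input_ref):
    # Hypothetical stand-in for a call to a hosted multimodal model.
    canned = {"a.png": "42", "b.png": "7", "c.png": "oops", "d.png": "13"}
    return canned[input_ref]


eval_pairs = [("a.png", "42"), ("b.png", "7"), ("c.png", "19"), ("d.png", "13")]
baseline, failures = score(eval_pairs, fake_predict)

print(f"baseline accuracy: {baseline:.2f}")
for failure in failures:
    print(failure)
```

The failure list matters as much as the number: reading it is how you find the patterns that Play 4 will target.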

Play 4: Close the Accuracy Gap

Trigger: Baseline exists but falls short of the bar you set in qualification.

Owner: Engineer, with product reviewing trade-offs.

Now you have a specific gap, so you can attack it in order of cheapness:

  1. Improve the prompt — clearer instructions, examples, structured output formats. Free and fast.
  2. Fix the input — downscale or crop images, deskew scans, isolate the region of interest.
  3. Add retrieval or tools — give the model reference data, or let it call a calculator for the counting it's bad at.
  4. Fine-tune — only if the first three plateau and you have the data volume to justify it.

Run them in that sequence and re-score after each. Most gaps close at step one or two. Teams that start at step four spend the most and learn the least. The patterns to avoid are catalogued in 7 Common Mistakes with Multimodal AI (and How to Avoid Them).
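The cheapest-first discipline can be written down as a loop that stops as soon as the bar is met. This is a sketch under assumptions: each `evaluate` callable is hypothetical and would, in practice, apply the fix and re-run your evaluation set. The accuracy numbers are placeholders chosen to show the control flow, not measured results.

```python
def close_gap(steps, baseline, target):
    """Try fixes in the given order, re-scoring after each; stop once target is met.

    steps: list of (name, evaluate) pairs, where evaluate() returns the
    accuracy measured on the evaluation set after applying that fix.
    """
    history = [("baseline", baseline)]
    current = baseline
    for name, evaluate in steps:
        if current >= target:
            break  # bar reached; the remaining, costlier steps are skipped
        current = evaluate()
        history.append((name, current))
    return history


# Placeholder numbers for illustration only.
steps = [
    ("prompt", lambda: 0.86),
    ("input", lambda: 0.93),
    ("retrieval_or_tools", lambda: 0.95),
    ("fine_tune", lambda: 0.96),
]

history = close_gap(steps, baseline=0.78, target=0.92)
for name, accuracy in history:
    print(f"{name}: {accuracy:.2f}")
```

In this run the loop never reaches retrieval or fine-tuning, because the input fix already clears the target.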

Play 5: Design the Human Checkpoint

Trigger: Accuracy is good but not perfect, and errors carry real cost.

Owner: Product lead, with operations.

Almost no multimodal system should run fully autonomously on high-stakes work. This play designs where the human sits. The good pattern is confidence-aware routing: the model handles high-confidence cases automatically and escalates low-confidence ones to a person.

The checkpoint design questions

  • What confidence threshold sends a case to review?
  • How does a reviewer see the model's reasoning and the original input together?
  • How do reviewer corrections flow back into the evaluation set?

That last point matters. A checkpoint that doesn't feed corrections back is a cost center. One that does is a continuously improving system.
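Here is a minimal sketch of confidence-aware routing with the feedback path included. It assumes your system can produce a confidence score between 0 and 1 for each case, which many hosted models do not return directly, so you may need to derive one yourself. The threshold value, the record fields, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    case_id: str
    answer: str
    confidence: float  # assumed to be a score in [0, 1]


REVIEW_THRESHOLD = 0.90  # hypothetical value; tune it against your evaluation set


def route(pred, threshold=REVIEW_THRESHOLD):
    """Send high-confidence cases through automatically, the rest to a person."""
    return "auto" if pred.confidence >= threshold else "review"


def record_correction(eval_records, pred, corrected_answer):
    """Append a reviewer's correction to the evaluation set as a labeled example."""
    eval_records.append(
        {"case_id": pred.case_id, "expected": corrected_answer, "model_answer": pred.answer}
    )


eval_records = []
confident = Prediction("case-1", "approved", 0.97)
unsure = Prediction("case-2", "approved", 0.61)

print(route(confident))
print(route(unsure))

# The reviewer disagrees with the model on the escalated case.
record_correction(eval_records, unsure, "rejected")
print(eval_records)
```

Keeping the model's original answer next to the correction lets you see later which failure patterns the checkpoint is catching.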

Play 6: Control Cost Before Launch

Trigger: The system works and you're projecting production volume.

Owner: Engineer.

Multimodal cost surprises arrive at scale, driven mostly by image resolution. The play is a pre-launch cost pass:

  • Confirm images are downscaled to the minimum resolution the task tolerates.
  • Cache anything referenced repeatedly.
  • Route easy cases to a cheaper model and reserve the expensive one for hard cases.
  • Estimate monthly cost at projected volume and pressure-test it against the budget.

Do this before launch, not after the first invoice. The Best Tools for Multimodal AI comparison helps you pick the right model tiers for routing.
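The cost pass is mostly arithmetic, so it is worth scripting. The sketch below makes several simplifying assumptions: per-request prices are flat, cache hits cost nothing, and every number shown (volume, prices, routing share, budget) is a hypothetical placeholder rather than any vendor's real pricing.

```python
def downscaled_size(width, height, max_side):
    """Return (width, height) scaled so the longer side is at most max_side."""
    scale = min(1.0, max_side / max(width, height))  # never upscale
    return round(width * scale), round(height * scale)


def monthly_cost(requests, cheap_share, cheap_cost, expensive_cost, cache_hit_rate=0.0):
    """Estimate monthly spend with two model tiers and a cache.

    Assumption: cache hits are free and prices are flat per request.
    """
    billable = requests * (1 - cache_hit_rate)
    per_request = cheap_share * cheap_cost + (1 - cheap_share) * expensive_cost
    return billable * per_request


print(downscaled_size(4000, 3000, max_side=1024))

# All figures below are hypothetical placeholders.
estimate = monthly_cost(
    requests=200_000,
    cheap_share=0.7,
    cheap_cost=0.002,
    expensive_cost=0.02,
    cache_hit_rate=0.1,
)
BUDGET = 1_500.00
print(f"estimated monthly cost: {estimate:.2f}, within budget: {estimate <= BUDGET}")
```

Rerun the estimate with pessimistic inputs as well, such as a lower cache hit rate or more traffic landing on the expensive tier, since those are the assumptions most likely to break at scale.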

Play 7: Ship, Monitor, and Iterate

Trigger: Cost is controlled and the checkpoint is in place.

Owner: Engineer for monitoring, product for iteration cadence.

Launch to a slice of traffic first. Monitor the metrics that matter: accuracy on a rolling sample, escalation rate, cost per request, and latency. Set alerts on drift, because input distributions shift when real users arrive with inputs your evaluation set never imagined.

The iteration loop is the same loop you've been running: new failures feed the evaluation set, the evaluation set drives the next round of Play 4. The playbook doesn't end at launch. It cycles.
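A rolling monitor for those four metrics can start very small. The sketch below assumes you can label a sample of production requests as correct or incorrect after the fact. The class name, window size, and drift rule (accuracy falling more than a fixed amount below the launch baseline) are hypothetical choices, not a standard.

```python
from collections import deque


class RollingMonitor:
    """Track accuracy, escalation rate, cost, and latency over a sliding window."""

    def __init__(self, window, baseline_accuracy, max_drop):
        self.results = deque(maxlen=window)  # oldest entries fall off automatically
        self.baseline_accuracy = baseline_accuracy
        self.max_drop = max_drop

    def record(self, correct, escalated, cost, latency_s):
        self.results.append((correct, escalated, cost, latency_s))

    def snapshot(self):
        n = len(self.results)
        if n == 0:
            return None  # nothing recorded yet
        return {
            "accuracy": sum(r[0] for r in self.results) / n,
            "escalation_rate": sum(r[1] for r in self.results) / n,
            "cost_per_request": sum(r[2] for r in self.results) / n,
            "avg_latency_s": sum(r[3] for r in self.results) / n,
        }

    def drifted(self):
        """True when rolling accuracy has fallen more than max_drop below baseline."""
        snap = self.snapshot()
        if snap is None:
            return False
        return self.baseline_accuracy - snap["accuracy"] > self.max_drop


# Illustrative values only.
monitor = RollingMonitor(window=4, baseline_accuracy=0.9, max_drop=0.1)
for correct in (True, True, False, False):
    monitor.record(correct=correct, escalated=not correct, cost=0.01, latency_s=1.5)

print(monitor.snapshot())
print("drift alert:", monitor.drifted())
```

When the alert fires, the failing cases in the window are the next additions to the evaluation set.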

Frequently Asked Questions

Can I skip the evaluation set to move faster?

No. It feels faster, but it's the move that makes every later step a guess. Without a labeled set you can't tell whether a prompt change helped, whether you're ready to launch, or whether production has drifted. Build it first; it pays back immediately.

Who should own the whole playbook?

A single product lead should own the sequence, with an engineer owning the technical plays. The failure mode is diffuse ownership where everyone assumes someone else is tracking accuracy. Name one accountable person per play and one owner for the whole arc.

How long does running the full playbook take?

For a focused use case, a qualified team can reach a monitored launch in a few weeks, not months. Most of the time goes into the evaluation set and the checkpoint design, not the model work. Trying to compress those two is exactly what causes the slow, painful version.

What if the baseline is already good enough?

Then you're lucky, and you skip Play 4 entirely. Go straight to the human checkpoint and cost-control plays. The sequence is designed to let you exit early when the data says you can, which is more common than people expect for assistive features.

How is this different from a normal software rollout?

The evaluation set and the feedback loop. Traditional software is largely deterministic: once a test passes, the same input keeps producing the same output. Multimodal systems degrade as inputs shift, so monitoring and continuous re-evaluation are core plays, not afterthoughts.

Key Takeaways

  • Run the play whose trigger just fired, in sequence; jumping to deployment before evaluation is the most common and most expensive mistake.
  • The evaluation set is the load-bearing artifact and must exist before any prompting or model work.
  • Close accuracy gaps cheapest-first: prompt, then input, then tools, then fine-tuning as a last resort.
  • Design a confidence-aware human checkpoint that feeds corrections back into the evaluation set.
  • Control image resolution and routing before launch, and treat monitoring plus iteration as permanent, not a one-time event.
