Twenty Checks Before You Trust a Prompt in Production

Agency Script Editorial

Editorial Team

April 19, 2020 · 8 min read

A checklist is only useful if you understand why each item is on it. A list of bare commands invites cargo-cult compliance, where you tick boxes without grasping what they protect against. This checklist pairs every item with its reasoning, so you can adapt it to your situation and skip items honestly when they do not apply.

Use it as a working tool. Before a prompt goes to production, or after a model update, walk the list and confirm each item. The checklist is organized into the natural phases of a robustness effort: setup, variation, execution, scoring, and maintenance.

It distills the practices argued at length in Opinions Earned the Hard Way on Prompt Robustness into something you can act on in one pass. For the full procedure behind it, see Build a Repeatable Robustness Test in One Afternoon.

Setup: Before You Test Anything

The setup phase determines whether the rest of the test means anything.

Foundational Items

  • Written success criterion exists. Without a definition of correct, your robustness rate is opinion. Write it before viewing outputs to keep your standard from drifting.
  • Criterion is machine-checkable where possible. Objective checks scale and stay consistent; reserve human judgment for genuinely qualitative parts.
  • Benchmark covers typical, edge, and adversarial inputs. A prompt that only sees clean inputs will surprise you on the first messy one.
  • Past production failures are in the benchmark. Real failures are your highest-value inputs and must never silently regress.
  • Stakes are assessed. Decide the consequence of failure first, so you can size your rigor to match rather than over- or under-testing.
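
As a minimal sketch of what a machine-checkable criterion might look like, the function below checks one output for valid JSON, required fields, and values that appear in the source document. The field names, the reason labels, and the "value must appear verbatim in the source" rule are assumptions chosen for illustration, not part of the checklist itself.

```python
import json

# Hypothetical field names for illustration; substitute your own schema.
REQUIRED_FIELDS = ("invoice_id", "total", "due_date")


def check_output(raw_output: str, source_text: str) -> tuple[bool, str]:
    """Return (passed, reason) for a single model output.

    Assumed criterion: the output is a JSON object, every required field
    is present, and every required value appears verbatim in the source.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "wrong_format"
    if not isinstance(data, dict):
        return False, "wrong_format"
    if any(field not in data for field in REQUIRED_FIELDS):
        return False, "missing_field"
    for field in REQUIRED_FIELDS:
        # A crude stand-in for "no invented values": substring match only.
        if str(data[field]) not in source_text:
            return False, "hallucination"
    return True, "pass"
```

A verbatim-substring rule is deliberately strict and will reject legitimate reformatting (for example, a normalized date), so a real criterion would likely need per-field comparison logic.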

Variation: Defining What You Change

This phase is where tests quietly go wrong if you are careless.

Variation Items

  • Variations preserve meaning. A variation that changes the request produces a "failure" that is actually correct behavior, so verify intent, ideally with a second reviewer.
  • One dimension changes per variation. Isolating changes lets you attribute every failure to a specific cause instead of guessing.
  • An unmodified baseline is retained. You measure variations against a fixed control, so the original must stay untouched.
  • Variation types cover wording, order, and format. Fragility hides in different dimensions; cover paraphrase, reordering, and formatting at minimum.
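
One way to keep these items honest is to record each variation alongside the single dimension it changes. The sketch below assumes a hypothetical extraction prompt with a `{document}` placeholder; the prompt texts and names are illustrative only. Whether a variation preserves meaning still needs a human reviewer, since code can only check coverage and the presence of a baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen, so the baseline cannot be edited by accident
class Variation:
    name: str
    dimension: str  # "baseline", "wording", "order", or "format"
    prompt: str


# Hypothetical prompts for illustration.
BASELINE = Variation(
    "baseline", "baseline",
    "Extract the fields as JSON.\n\nDocument:\n{document}",
)

VARIATIONS = [
    BASELINE,
    Variation("paraphrase", "wording",
              "Return the fields in JSON form.\n\nDocument:\n{document}"),
    Variation("instruction_last", "order",
              "Document:\n{document}\n\nExtract the fields as JSON."),
    Variation("xml_delimiters", "format",
              "Extract the fields as JSON.\n\n<document>\n{document}\n</document>"),
]


def uncovered_dimensions(variations: list[Variation]) -> set[str]:
    """Return the required dimensions that no variation exercises."""
    required = {"wording", "order", "format"}
    return required - {v.dimension for v in variations}
```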

Execution: Running the Test

Execution is mechanical, but two settings change what you learn: how many times you run each pair, and the temperature you run at.

Execution Items

  • Each prompt-input pair runs multiple times. Single runs cannot distinguish a real failure from sampling noise.
  • The randomness floor is measured. Run the exact prompt repeatedly first, so you know how much variation is noise before attributing any to sensitivity.
  • Low-temperature runs isolate sensitivity. When diagnosing how the prompt responds to edits, minimize sampling variability.
  • Production-temperature runs reflect reality. Test at the temperature you actually deploy, because that is the variability users will hit.
  • Raw outputs are captured. Saving outputs lets you re-examine failures without rerunning the whole suite.
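
The execution items can be sketched as a single loop. This is an illustration under assumptions: `call_model` is a hypothetical callable you supply that wraps whatever model client you use, prompts are templates with a `{document}` placeholder, and 0.7 stands in for whatever temperature you actually deploy.

```python
import json
from typing import Callable


def run_suite(
    prompts: dict[str, str],
    documents: dict[str, str],
    call_model: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> output
    temperatures: tuple[float, ...] = (0.0, 0.7),  # low, then an assumed production value
    runs: int = 5,
) -> list[dict]:
    """Run every prompt against every document several times per temperature."""
    records = []
    for temperature in temperatures:
        for prompt_name, template in prompts.items():
            for doc_id, document in documents.items():
                prompt = template.replace("{document}", document)
                for run_index in range(runs):
                    records.append({
                        "prompt": prompt_name,
                        "doc": doc_id,
                        "temperature": temperature,
                        "run": run_index,
                        "output": call_model(prompt, temperature),  # raw, unscored
                    })
    return records


def save_records(records: list[dict], path: str) -> None:
    """Write one JSON object per line so failures can be re-read later."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
```

Including the unmodified baseline in `prompts` means its repeated runs double as the randomness-floor measurement.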

Scoring: Turning Outputs Into Findings

Scoring converts a pile of outputs into something you can act on.

Scoring Items

  • Each output is scored pass or fail against the criterion. A clean robustness rate beats fuzzy grading and resists standard drift.
  • Failures are categorized by type. Missing field, wrong format, hallucination, ignored constraint: the category points to the fix.
  • Failure patterns are identified. A cluster of failures around long inputs or paraphrases is a real finding; a lone odd output is noise.
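
A small aggregation is usually enough to turn scored rows into a robustness rate and a failure breakdown per prompt. The row shape below (`prompt`, `passed`, `category`) is an assumption for illustration.

```python
from collections import Counter, defaultdict


def summarize(scored: list[dict]) -> dict[str, dict]:
    """Compute the pass rate and failure-category counts for each prompt.

    Each row is assumed to look like:
    {"prompt": "baseline", "passed": True, "category": "pass"}
    """
    totals = Counter()
    passes = Counter()
    failures = defaultdict(Counter)
    for row in scored:
        totals[row["prompt"]] += 1
        if row["passed"]:
            passes[row["prompt"]] += 1
        else:
            failures[row["prompt"]][row["category"]] += 1
    return {
        name: {
            "robustness_rate": passes[name] / totals[name],
            "failures": dict(failures[name]),
        }
        for name in totals
    }
```

Grouping by additional keys such as document length or variation dimension would be the natural next step for spotting clusters rather than lone failures.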

Maintenance: Keeping the Test Alive

The maintenance items are what separate a snapshot from an instrument.

Maintenance Items

  • Fixes trigger a full re-run. A fix in one area often breaks another, so re-test the whole suite to catch regressions.
  • The suite is saved as a reusable package. Benchmark, variations, criterion, and scoring together make re-running cost minutes, not hours.
  • Re-runs are scheduled and event-triggered. Hosted models drift silently, so re-test on prompt edits, model updates, new input classes, and on a schedule.
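
The re-run rule combines two triggers, which can be sketched as one predicate. The event names and the 30-day interval are assumptions for illustration; in practice this check would sit inside whatever scheduler or CI system you already use.

```python
from datetime import date, timedelta

# Hypothetical event labels; map them to whatever your tooling emits.
RERUN_EVENTS = {"prompt_edit", "model_update", "new_input_class", "fix_applied"}


def rerun_due(
    last_run: date,
    today: date,
    events: set[str],
    interval_days: int = 30,  # assumed default; size it to your stakes
) -> bool:
    """Return True if any trigger event occurred or the schedule has elapsed."""
    if events & RERUN_EVENTS:
        return True
    return today - last_run >= timedelta(days=interval_days)
```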

A Worked Pass Through the List

To make the checklist concrete, consider walking it for a new extraction prompt headed to production. In Setup, you write the success criterion (valid JSON, three required fields, no invented values) and confirm it is machine-checkable. You assemble a benchmark of a dozen real documents, deliberately including three that failed in an earlier prototype, and you note that the stakes are high because the output feeds an automated downstream system.

In Variation, you draft four variations that each change one thing (instruction position, delimiter style, a paraphrased instruction, and a reordered field list), and a colleague confirms each preserves the original request. You keep the original prompt as your baseline. In Execution, you first measure the randomness floor by running the unchanged prompt repeatedly, then run every prompt against every document five times at both low and production temperature, saving all outputs.

In Scoring, you mark each output pass or fail, discover that the instruction-position variation drops a field on long documents, and categorize that as a missing-field pattern tied to instruction position. In Maintenance, you reposition the instruction, re-run the entire suite to confirm no regression, save the suite, and schedule a monthly re-run. That single pass exercises every item, and the next time the model updates you re-enter only at Execution.

How to Use This Checklist in Practice

Do not treat the list as a one-time gate. The setup and variation items are mostly a one-time build; the execution, scoring, and maintenance items recur. For a new prompt, walk the whole list. For an established prompt after a model update, the maintenance items plus a re-run are usually enough. When an item does not apply (a throwaway prompt needs no adversarial benchmark, for example), skip it deliberately and note why, rather than skipping it out of habit. The reasoning attached to each item is what lets you make that call. The competing approaches behind some of these choices are weighed in Prompt Sensitivity and Robustness Testing: Trade-offs, Options, and How to Decide, and the named structure that organizes them appears in The SCORE Model for Prompt Robustness Testing.

Frequently Asked Questions

Do I need to complete every item for every prompt?

No. Walk the full list for new or high-stakes prompts, but for an established prompt after a routine model update, the maintenance items and a re-run usually suffice. The reasoning attached to each item lets you decide what genuinely applies. Skipping an item should be a deliberate judgment about stakes, not an unexamined habit.

Why measure the randomness floor as a separate checklist item?

Because without it, you cannot tell whether output differences come from your prompt edits or from the model's built-in sampling variability. Running the exact prompt repeatedly first establishes how much variation is just noise. Only differences exceeding that floor count as real sensitivity, which keeps you from chasing problems that are not actually fragility.
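
The comparison can be sketched as simple arithmetic. The 10-point margin below is an assumed threshold for illustration, not a statistical test; with only a handful of runs per prompt, a proper significance test or more runs would give a more trustworthy answer.

```python
def failure_rate(passes: list[bool]) -> float:
    """Fraction of runs that failed."""
    return 1 - sum(passes) / len(passes)


def exceeds_floor(
    baseline_passes: list[bool],
    variation_passes: list[bool],
    margin: float = 0.10,  # assumed tolerance above the noise floor
) -> bool:
    """True if the variation fails noticeably more often than the unchanged prompt."""
    floor = failure_rate(baseline_passes)
    return failure_rate(variation_passes) - floor > margin


# Unchanged prompt: 1 failure in 20 runs, so the floor is about 5%.
baseline = [True] * 19 + [False]
# Variation: 6 failures in 20 runs, about 30%, well above the floor.
variation = [True] * 14 + [False] * 6
print(exceeds_floor(baseline, variation))
```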

How is this checklist different from the best-practices article?

The best-practices article argues the reasoning at length and explains when to bend each rule. This checklist compresses those conclusions into an actionable pass you can run before shipping. Use the best-practices piece to understand the why deeply, and this checklist as the working tool you actually walk through under time pressure.

What does it mean to schedule re-runs, and how often?

It means running the saved suite automatically on a recurring basis, not only when you remember. Frequency depends on stakes and how often your model provider updates, but monthly is a reasonable default for many teams, with event-triggered runs on every prompt edit or model change. Scheduling removes the discipline problem because the test happens without anyone initiating it.

Can I add my own items to this checklist?

You should. This list covers the general failure modes, but your domain may have specific risks (regulatory constraints, language coverage, latency bounds) worth adding as items. The structure of pairing each item with its reasoning is the part to preserve, because that is what keeps the checklist from becoming mechanical box-ticking.

What is the minimum viable version of this checklist?

If you can do only a few things: write a success criterion, build a benchmark with adversarial inputs, run at production temperature, score pass or fail, and save the suite for re-running. Those five items capture most of the protective value. The remaining items refine accuracy and attribution, which matter more as stakes rise.

Key Takeaways

  • Each checklist item carries its reasoning, so you can adapt and skip honestly rather than ticking boxes by rote.
  • Setup items (a written criterion and an adversarial benchmark including past failures) determine whether the rest of the test means anything.
  • Variation items prevent the silent errors: preserve meaning, change one dimension at a time, and keep an unmodified baseline.
  • Execution and scoring separate sensitivity from randomness, test at both temperatures, and convert outputs into categorized, pattern-level findings.
  • Maintenance items turn the test into an instrument: re-run fully after fixes and schedule recurring runs to catch silent model drift.
