
Turning Prompt Robustness Checks Into a Documented Process

Agency Script Editorial

Editorial Team

December 30, 2019 · 8 min read
Tags: prompt sensitivity and robustness testing, prompt sensitivity and robustness testing workflow, prompt sensitivity and robustness testing guide, prompt engineering

Most prompt testing lives in one person's head. They know which inputs tend to break a given prompt, which model behaves oddly with long context, and which edge cases bit them last quarter. That knowledge is real and valuable, and it walks out the door the moment that person takes a vacation or changes teams.

A workflow fixes this. Where a playbook tells you which plays to run, a workflow tells you how to run them the same way every time, who records what, and where the results live so the next person can pick up exactly where the last one stopped. It is the difference between a skill and a system.

This article walks through building that system for prompt sensitivity and robustness testing: the artifacts you maintain, the steps in order, the hand-off points, and the small disciplines that keep the process from rotting. The aim is a documented routine so clear that a new hire could run it on their first day.

What a Workflow Adds Over Improvisation

Repeatability

An improvised test gives you a different result depending on who runs it and what they remember to check. A workflow gives you the same coverage regardless of who is at the keyboard. That consistency is what lets you trust the green checkmark at the end.

Hand-Off Without Loss

When testing is a documented process with stored artifacts, transferring it is a matter of pointing someone at the documents. When it lives in someone's intuition, the transfer requires shadowing, and most of the knowledge still evaporates. For the underlying plays this workflow orchestrates, see Stress-Testing Prompts Before They Reach a Client.

An Audit Trail

A workflow produces a record. When a client asks how you validated a prompt, or when something breaks and you need to know what changed, the artifacts answer the question. Improvisation leaves no trail.

The Core Artifacts

The Test Set

The center of the workflow is a versioned test set: a structured file of inputs paired with the properties their outputs must satisfy. Each entry names the case, supplies the input, and states what counts as a pass. This file is committed alongside the prompt and travels with it.

  • Store inputs and pass criteria together, not in separate places
  • Version the test set so you can see how coverage grew over time
  • Treat the file as the prompt's specification, not an afterthought
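One way to picture such a file is sketched below. This is a minimal illustration, not a prescribed format: the JSON layout, the `PromptCase` class, the `load_test_set` helper, and the substring-based pass criteria are all assumptions made for the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptCase:
    """One test-set entry: a named input plus what counts as a pass."""
    name: str
    input_text: str
    # Hypothetical pass criteria: substrings the output must / must not contain.
    must_contain: tuple = ()
    must_not_contain: tuple = ()

    def passes(self, output: str) -> bool:
        return (all(s in output for s in self.must_contain)
                and not any(s in output for s in self.must_not_contain))


def load_test_set(raw_json: str) -> list:
    """Parse a versioned test-set document into PromptCase objects."""
    doc = json.loads(raw_json)
    return [
        PromptCase(
            name=c["name"],
            input_text=c["input"],
            must_contain=tuple(c.get("must_contain", [])),
            must_not_contain=tuple(c.get("must_not_contain", [])),
        )
        for c in doc["cases"]
    ]


# Inputs and pass criteria live in the same entry, and the file carries a version.
EXAMPLE = """
{"version": 3,
 "cases": [
   {"name": "happy_path_summary",
    "input": "Summarize: the quarterly report shows growth.",
    "must_contain": ["summary"],
    "must_not_contain": ["system prompt"]}
 ]}
"""
cases = load_test_set(EXAMPLE)
```

Real pass criteria are usually richer than substring checks (schema validation, length limits, a rubric), but the point of the sketch is that each criterion sits next to the input it judges.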

The Run Log

Every test run produces a log: which prompt version, which model version, which cases passed, and the raw outputs for any that failed. The log is dated and kept. Over time these logs become the history of the prompt's reliability.
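A run log with those fields could look roughly like the sketch below. The `RunLog` class and its field names are illustrative assumptions; any structure that captures the same four things would do.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class RunLog:
    """Record of one test run (hypothetical structure)."""
    prompt_version: str
    model_version: str
    run_date: str                                        # ISO date, e.g. "2024-01-15"
    passed: list = field(default_factory=list)           # names of passing cases
    failed_outputs: dict = field(default_factory=dict)   # case name -> raw output

    def record(self, case_name: str, ok: bool, raw_output: str) -> None:
        """Keep only the name for a pass, but the raw output for a failure."""
        if ok:
            self.passed.append(case_name)
        else:
            self.failed_outputs[case_name] = raw_output

    def to_json(self) -> str:
        """Serialize the log so it can be dated and stored with the prompt."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Storing raw output only for failures keeps the logs small while still preserving what a reviewer needs to triage.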

The Failure Registry

When a prompt fails in production, the incident gets a registry entry: what input triggered it, what the prompt did wrong, and the new test case added to prevent recurrence. The registry is how production teaches the test set.
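As a rough sketch, a registry entry can be a small record that points at the test case it produced, which also makes it easy to check that no incident was left without one. The `RegistryEntry` fields and the `uncovered_incidents` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    """One production incident (hypothetical structure)."""
    incident_id: str
    triggering_input: str
    wrong_behavior: str
    new_case_name: str  # the test case added to prevent recurrence


def uncovered_incidents(registry, test_case_names):
    """Return ids of incidents whose regression case is missing from the test set."""
    names = set(test_case_names)
    return [e.incident_id for e in registry if e.new_case_name not in names]
```

Running a check like this against the current test set is one way to notice when a regression case was deleted or never committed.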

Step One: Define the Contract

State What the Prompt Must Do

Before testing anything, write down the prompt's contract in plain language: what inputs it accepts, what output structure it must always produce, and what it must never do. This contract is the source of truth every test case derives from. Without it, you are testing against a moving target.

  • Specify the required output fields and their types
  • Specify the forbidden behaviors, such as leaking instructions
  • Specify how the prompt should behave on invalid input
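The contract is written in plain language first, but it can help to mirror it in a machine-checkable form. The sketch below assumes the prompt returns a structured (dictionary-like) output; the `Contract` class, its fields, and the `violations` helper are illustrative, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """Machine-readable mirror of a prompt's written contract (hypothetical)."""
    required_fields: dict        # field name -> expected Python type
    forbidden_phrases: tuple     # text that must never appear in any field
    invalid_input_response: str  # documented behavior on invalid input


def violations(contract: Contract, output: dict) -> list:
    """List every way a parsed output breaks the contract."""
    problems = []
    for name, expected_type in contract.required_fields.items():
        if name not in output:
            problems.append(f"missing field: {name}")
        elif not isinstance(output[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    # Forbidden behaviors are approximated here as case-insensitive phrase matches.
    text = " ".join(str(v) for v in output.values()).lower()
    for phrase in contract.forbidden_phrases:
        if phrase.lower() in text:
            problems.append(f"forbidden phrase: {phrase}")
    return problems
```

Phrase matching is a crude stand-in for "forbidden behavior"; the design point is only that each clause of the written contract maps to something a test can evaluate.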

Make It Reviewable

The contract gets reviewed by someone other than the author. A second reader catches unstated assumptions: the requirements the author considered too obvious to write down, which are exactly the ones that cause disputes later.


Step Two: Build the Test Set From the Contract

Derive Cases Mechanically

Each clause in the contract generates test cases. A required field generates a case that checks the field is present. A forbidden behavior generates an adversarial case that tries to provoke it. This mechanical derivation ensures coverage tracks the contract instead of the author's mood.

  • One pass case and one adversarial case per contract clause, minimum
  • Add paraphrase variants for any user-facing input
  • Add boundary cases for empty, oversized, and malformed inputs
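The derivation rules above can be sketched as a small generator. Everything here is an assumption for illustration: the clause identifiers, the `derive_cases` function, and the specific boundary inputs (including the arbitrary 100,000-character "oversized" input). Paraphrase variants are left out because they need human or model-written text.

```python
# Hypothetical boundary inputs; the sizes and contents are arbitrary examples.
BOUNDARY_INPUTS = {
    "empty": "",
    "oversized": "x" * 100_000,
    "malformed": '{"unclosed": ',
}


def derive_cases(clauses):
    """Emit one pass case and one adversarial case per clause, plus boundary cases.

    `clauses` is a list of clause identifiers such as "field:title" or
    "forbid:leak_instructions". The returned dicts are skeletons; a person
    still writes the concrete input and pass criteria for each.
    """
    cases = []
    for clause in clauses:
        cases.append({"name": f"{clause}/pass", "kind": "pass", "clause": clause})
        cases.append({"name": f"{clause}/adversarial", "kind": "adversarial", "clause": clause})
    for label, text in BOUNDARY_INPUTS.items():
        cases.append({"name": f"boundary/{label}", "kind": "boundary", "input": text})
    return cases
```

Generating skeletons this way makes a missing case visible: if a clause has no entries in the test set, the gap shows up as a diff rather than as an oversight.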

Cover Model Variation

If the prompt might run on more than one model, the test set needs to assert behavior on each. Different architectures fail differently, which is why model selection deserves its own deliberate treatment in A Step-by-Step Approach to Prompting Across Different Model Architectures.
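A simple way to express "assert behavior on each" is to pair every case with every model and then look for cases whose outcome differs between models. The function names and the model labels in the usage note are placeholders, not real model identifiers.

```python
from itertools import product


def expand_across_models(case_names, models):
    """Pair every case with every model so each combination gets its own assertion."""
    return list(product(case_names, models))


def divergent_cases(results):
    """Return cases that pass on some models and fail on others.

    `results` maps (case_name, model) -> bool.
    """
    outcomes_by_case = {}
    for (case, _model), ok in results.items():
        outcomes_by_case.setdefault(case, set()).add(ok)
    return sorted(case for case, outcomes in outcomes_by_case.items() if len(outcomes) > 1)
```

For example, expanding two cases across `"model-x"` and `"model-y"` yields four assertions, and a case that passes on one but fails on the other is flagged as divergent, which is usually the most informative kind of failure to triage first.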

Step Three: Run and Record

Execute the Full Set

A test run executes every case against the current prompt and model, then writes a run log. Partial runs are not runs; the value comes from full coverage every time, so a regression in an untested case cannot slip through.
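A minimal sketch of a full run follows, assuming each case is a `(name, input_text, check)` tuple and that `call_model` is whatever function sends the prompt plus input to the model and returns text. Both of those shapes are assumptions for the example; the model call is passed in so the sketch makes no claims about any particular provider's API.

```python
def run_full_set(cases, call_model):
    """Run every case and keep the verdict and raw output for each."""
    results = {}
    for name, input_text, check in cases:
        output = call_model(input_text)
        results[name] = {"passed": bool(check(output)), "output": output}
    return results


def assert_full_coverage(cases, results):
    """Refuse to treat a partial run as a run."""
    missing = {name for name, _, _ in cases} - set(results)
    if missing:
        raise ValueError(f"partial run, missing cases: {sorted(missing)}")
```

Injecting `call_model` also means the runner itself can be tested with a stub, without spending tokens.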

Triage Failures

Every failure gets a verdict: real defect, acceptable variation, or test that needs fixing. The verdict is recorded next to the failure. Untriaged failures are how a test suite loses credibility, because once people start ignoring red they ignore all of it.
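The three verdicts map naturally onto an enumeration, and a small check can block a run from being closed while any failure lacks one. The `Verdict` enum and `untriaged` helper below are illustrative names, not part of any existing tool.

```python
from enum import Enum


class Verdict(Enum):
    REAL_DEFECT = "real defect"
    ACCEPTABLE_VARIATION = "acceptable variation"
    TEST_NEEDS_FIXING = "test needs fixing"


def untriaged(failed_case_names, verdicts):
    """Return failed cases that have no recorded verdict.

    `verdicts` maps case name -> Verdict.
    """
    return sorted(name for name in failed_case_names if name not in verdicts)
```

A run whose `untriaged` list is non-empty is not finished, which keeps red results from piling up unexamined.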

Step Four: Close the Loop From Production

Feed Incidents Back

When a prompt misbehaves in the wild, the workflow requires that the incident become a registry entry and a new test case before the fix is considered complete. This is the loop that makes the test set smarter over time instead of staying frozen at launch-day knowledge.

  • Reproduce the failure as a test case before fixing it
  • Confirm the new case fails on the old prompt and passes on the fix
  • Keep the case forever so the bug cannot return unnoticed
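The second bullet is the one most often skipped, so it is worth making mechanical. In the sketch below, `old_prompt` and `fixed_prompt` are stand-ins for "call the model with this prompt version"; they, the `check` callable, and the function name are assumptions for illustration.

```python
def validate_regression_case(incident_input, check, old_prompt, fixed_prompt):
    """Confirm a new case fails on the old prompt and passes on the fix."""
    old_ok = check(old_prompt(incident_input))
    new_ok = check(fixed_prompt(incident_input))
    if old_ok:
        # The case never failed, so it does not actually capture the incident.
        return "case does not reproduce the failure"
    if not new_ok:
        return "fix does not resolve the failure"
    return "ok"
```

A case that passes on the old prompt proves nothing about the bug, which is why the old-prompt check comes first.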

Schedule Recurring Runs

Because a model's behavior can drift without any edit on your side, the workflow includes a recurring run even when nothing has changed in the prompt. A scheduled monthly pass catches the silent shifts that your own change history would never reveal.
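However the schedule is enforced (a calendar reminder, a CI timer), the underlying check is just a date comparison against the last run log. The sketch below treats "monthly" as 30 days, which is an assumption made for the example.

```python
from datetime import date, timedelta


def run_is_due(last_run: date, today: date, interval_days: int = 30) -> bool:
    """True when at least `interval_days` have passed since the last full run."""
    return (today - last_run) >= timedelta(days=interval_days)
```

Reading `last_run` from the most recent run log keeps the schedule honest: a run that was never logged does not count.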

Making the Workflow Stick

Lower the Friction

A workflow people skip is worthless. Wrap the test run in a single command, keep the artifacts in the same repository as the prompt, and make the run part of the definition of done. The less ceremony, the more it actually happens.
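A "single command" can be as small as the entry point sketched here. The program name `prompt-test`, the `prompt_dir` argument, and the injected `run_suite` callable are all hypothetical; `run_suite` is assumed to return the names of failing cases.

```python
import argparse


def main(argv, run_suite):
    """Run the full suite for one prompt directory and return an exit code."""
    parser = argparse.ArgumentParser(prog="prompt-test")
    parser.add_argument("prompt_dir", help="directory holding the prompt and its test set")
    args = parser.parse_args(argv)

    failures = run_suite(args.prompt_dir)
    for name in failures:
        print(f"FAIL {name}")
    # A non-zero exit code lets CI or a pre-merge hook enforce the definition of done.
    return 1 if failures else 0
```

A thin wrapper script would pass `sys.argv[1:]` and the real suite runner to `main` and hand the result to `sys.exit`, so that contributors only ever type one command.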

Review the Workflow Itself

Once a quarter, review the workflow the way you review the prompts. Are cases stale? Is the contract still accurate? Has a class of failure emerged that the test set does not cover? The process needs maintenance just like the artifacts do.

Assign Clear Ownership

Name an owner for the workflow as a whole, separate from the owners of individual prompts. That person keeps the artifacts healthy, ensures runs happen on schedule, and onboards new contributors to the process.

Frequently Asked Questions

How is a testing workflow different from a testing playbook?

A playbook lists the plays and when to run them. A workflow specifies how to run them identically every time, what artifacts to produce, and how to hand the process off. The playbook is the strategy; the workflow is the documented, repeatable operation that executes it.

What should the test set actually contain?

Inputs paired with pass criteria, derived from the prompt's written contract. Include happy-path cases, paraphrase variants, boundary cases, adversarial cases, and cross-model assertions. Store the inputs and the criteria together and version the whole file alongside the prompt.

How do we keep the workflow from being skipped under deadline?

Reduce friction and make it part of the definition of done. If running the full suite is a single command and the artifacts live next to the prompt, the cost of compliance drops below the cost of skipping. Deadlines erode any process that requires heroics.

Who owns the workflow when prompts have different authors?

Assign a single workflow owner distinct from the individual prompt authors. Authors maintain their own test cases, but one person keeps the overall process healthy, ensures scheduled runs happen, and onboards new contributors so the system survives turnover.

How do production incidents fit into the workflow?

Every production failure becomes a failure-registry entry and a new permanent test case before the fix counts as complete. You reproduce the bug as a case that fails on the old prompt, confirm it passes on the fix, and keep it forever so the same bug cannot quietly return.

Does this workflow scale to dozens of prompts?

It does, because the artifacts are uniform. Each prompt has the same contract, test set, and run log structure, so a teammate who learns the workflow on one prompt can run it on any of them. The uniformity is what makes scale and hand-off possible.

Key Takeaways

  • A workflow turns one person's testing intuition into a documented system anyone can run and hand off.
  • Three artifacts anchor it: a versioned test set, a dated run log, and a failure registry fed by production.
  • Derive test cases mechanically from a written, reviewed contract so coverage tracks requirements.
  • Close the loop by converting every production incident into a permanent new test case.
  • Lower friction and assign a clear workflow owner, or the process will quietly rot under deadline pressure.


Agency Script Editorial

Editorial Team

The Agency Script editorial team delivers operational insights on AI delivery, certification, and governance for modern agency operators.
