A checklist is only useful if you understand why each item is on it. A list of bare commands invites cargo-cult compliance, where you tick boxes without grasping what they protect against. This checklist pairs every item with its reasoning, so you can adapt it to your situation and skip items honestly when they do not apply.
Use it as a working tool. Before a prompt goes to production, or after a model update, walk the list and confirm each item. The checklist is organized into the natural phases of a robustness effort: setup, variation, execution, scoring, and maintenance.
It distills the practices argued at length in Opinions Earned the Hard Way on Prompt Robustness into something you can act on in one pass. For the full procedure behind it, see Build a Repeatable Robustness Test in One Afternoon.
Setup: Before You Test Anything
The setup phase determines whether the rest of the test means anything.
Foundational Items
- Written success criterion exists. Without a definition of correct, your robustness rate is opinion. Write it before viewing outputs to keep your standard from drifting.
- Criterion is machine-checkable where possible. Objective checks scale and stay consistent; reserve human judgment for genuinely qualitative parts.
- Benchmark covers typical, edge, and adversarial inputs. A prompt that only sees clean inputs will surprise you on the first messy one.
- Past production failures are in the benchmark. Real failures are your highest-value inputs and must never silently regress.
- Stakes are assessed. Decide the consequence of failure first, so you can size your rigor to match rather than over- or under-testing.
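A machine-checkable criterion can be as small as one function that returns pass or fail. The sketch below assumes the prompt is supposed to return a JSON object; the field names in `REQUIRED_FIELDS` are hypothetical placeholders, not taken from any real schema.

```python
import json

# Hypothetical field names, used only for illustration.
REQUIRED_FIELDS = ("name", "date", "amount")


def meets_criterion(raw_output: str) -> bool:
    """Return True if the output is a JSON object with every required field filled in."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    if not isinstance(parsed, dict):
        return False  # valid JSON, but not an object
    # A field counts as present only if it exists and is neither null nor empty.
    return all(parsed.get(field) not in (None, "") for field in REQUIRED_FIELDS)
```

A check like this covers the objective part of the criterion. Anything qualitative, such as whether a value was invented, would still need a separate check or a human reviewer.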
Variation: Defining What You Change
This phase is where tests quietly go wrong if you are careless.
Variation Items
- Variations preserve meaning. A variation that changes the request produces a "failure" that is actually correct; verify intent, ideally with a second reviewer.
- One dimension changes per variation. Isolating changes lets you attribute every failure to a specific cause instead of guessing.
- An unmodified baseline is retained. You measure variations against a fixed control, so the original must stay untouched.
- Variation types cover wording, order, and format. Fragility hides in different dimensions; cover paraphrase, reordering, and formatting at minimum.
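One way to keep the variation items honest is to store each variation alongside the single dimension it changes, and to derive every variation from the untouched baseline. The sketch below is illustrative only: the prompt text, the `Variation` class, and the dimension labels are all assumptions made for the example.

```python
from dataclasses import dataclass

# The baseline is defined once and never edited; variations are derived from it.
BASELINE = (
    "Extract the fields below from the document.\n"
    "Fields: name, date, amount\n"
    "Return valid JSON."
)


@dataclass(frozen=True)
class Variation:
    name: str
    dimension: str  # the one thing this variation changes
    prompt: str


VARIATIONS = [
    Variation(
        "paraphrase", "wording",
        BASELINE.replace("Extract the fields below from", "Pull the following fields out of"),
    ),
    Variation(
        "reordered_fields", "order",
        BASELINE.replace("name, date, amount", "amount, name, date"),
    ),
    Variation(
        "bulleted_fields", "format",
        BASELINE.replace("Fields: name, date, amount", "Fields:\n- name\n- date\n- amount"),
    ),
]


def uncovered_dimensions(variations, required=("wording", "order", "format")):
    """List the required dimensions that no variation exercises yet."""
    covered = {v.dimension for v in variations}
    return [d for d in required if d not in covered]
```

Code can confirm coverage and that the baseline is unchanged, but it cannot confirm that a paraphrase preserves meaning. That judgment still belongs to a reviewer.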
Execution: Running the Test
Execution is mechanical but has two settings that change what you learn.
Execution Items
- Each prompt-input pair runs multiple times. Single runs cannot distinguish a real failure from sampling noise.
- The randomness floor is measured. Run the exact prompt repeatedly first, so you know how much variation is noise before attributing any to sensitivity.
- Low-temperature runs isolate sensitivity. When diagnosing how the prompt responds to edits, minimize sampling variability.
- Production-temperature runs reflect reality. Test at the temperature you actually deploy, because that is the variability users will hit.
- Raw outputs are captured. Saving outputs lets you re-examine failures without rerunning the whole suite.
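The execution items amount to a nested loop that records everything it sees. The sketch below assumes a hypothetical `call_model(prompt, text, temperature)` function that you supply to wrap whatever model API you use; the default temperatures of 0.0 and 0.7 are placeholders for your own low and production settings.

```python
import json
from typing import Callable, Iterable


def run_suite(
    call_model: Callable[[str, str, float], str],  # hypothetical wrapper you provide
    prompts: dict[str, str],   # prompt id -> prompt text (baseline plus variations)
    inputs: dict[str, str],    # input id -> benchmark document
    temperatures: Iterable[float] = (0.0, 0.7),  # assumed low and production values
    repeats: int = 5,
) -> list[dict]:
    """Run every prompt-input pair several times per temperature, keeping raw outputs."""
    records = []
    for temperature in temperatures:
        for prompt_id, prompt in prompts.items():
            for input_id, text in inputs.items():
                for run in range(repeats):
                    records.append({
                        "prompt_id": prompt_id,
                        "input_id": input_id,
                        "temperature": temperature,
                        "run": run,
                        "output": call_model(prompt, text, temperature),
                    })
    return records


def save_records(records: list[dict], path: str) -> None:
    """Write one JSON record per line so failures can be re-read without rerunning."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
```

Including the unmodified baseline in `prompts` means the same run also produces the data needed for the randomness floor.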
Scoring: Turning Outputs Into Findings
Scoring converts a pile of outputs into something you can act on.
Scoring Items
- Each output is scored pass or fail against the criterion. A clean robustness rate beats fuzzy grading and resists standard drift.
- Failures are categorized by type. Missing field, wrong format, hallucination, ignored constraint: the category points to the fix.
- Failure patterns are identified. A cluster of failures around long inputs or paraphrases is a real finding; a lone odd output is noise.
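Once each output carries a pass-or-fail mark and a failure type, the summary is a matter of counting. The sketch below assumes each scored row is a dictionary with `variation`, `passed`, and `failure_type` keys; that row shape is an assumption made for this example.

```python
from collections import Counter


def summarize(scored: list[dict]) -> dict:
    """Compute the robustness rate and count failures by type and by variation."""
    total = len(scored)
    failures = [row for row in scored if not row["passed"]]
    return {
        "robustness_rate": (total - len(failures)) / total if total else 0.0,
        "by_type": Counter(row["failure_type"] for row in failures),
        "by_variation": Counter(row["variation"] for row in failures),
    }
```

A count that piles up under one variation or one failure type is a candidate pattern. A single stray failure spread across otherwise clean counts is more likely noise.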
Maintenance: Keeping the Test Alive
The maintenance items are what separate a snapshot from an instrument.
Maintenance Items
- Fixes trigger a full re-run. A fix in one area often breaks another, so re-test the whole suite to catch regressions.
- The suite is saved as a reusable package. Benchmark, variations, criterion, and scoring together make re-running cost minutes, not hours.
- Re-runs are scheduled and event-triggered. Hosted models drift silently, so re-test on prompt edits, model updates, new input classes, and on a schedule.
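The re-run triggers can be written down as an explicit check, which makes it harder to skip one by accident. The sketch below is a minimal illustration; the function name, its flags, and the 30-day default interval are assumptions, and in practice the flags would come from your own change tracking.

```python
from datetime import date, timedelta


def rerun_reasons(
    last_run: date,
    today: date,
    prompt_changed: bool,
    model_changed: bool,
    new_input_class: bool,
    max_age_days: int = 30,  # assumed default interval
) -> list[str]:
    """Return every reason the suite should be re-run; an empty list means none."""
    reasons = []
    if prompt_changed:
        reasons.append("prompt edit")
    if model_changed:
        reasons.append("model update")
    if new_input_class:
        reasons.append("new input class")
    if today - last_run >= timedelta(days=max_age_days):
        reasons.append("scheduled interval elapsed")
    return reasons
```

For example, `rerun_reasons(date(2024, 1, 1), date(2024, 2, 15), False, True, False)` reports both the model update and the elapsed interval.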
A Worked Pass Through the List
To make the checklist concrete, consider walking it for a new extraction prompt headed to production. In Setup, you write the success criterion (valid JSON, three required fields, no invented values) and confirm it is machine-checkable. You assemble a benchmark of a dozen real documents, deliberately including three that failed in an earlier prototype, and you note that the stakes are high because the output feeds an automated downstream system.
In Variation, you draft four variations that each change one thing (instruction position, delimiter style, a paraphrased instruction, and a reordered field list), and a colleague confirms each preserves the original request. You keep the original prompt as your baseline. In Execution, you first run the unchanged prompt repeatedly to measure the randomness floor, then run every prompt against every document five times at both low and production temperature, saving all outputs.
In Scoring, you mark each output pass or fail, discover that the instruction-position variation drops a field on long documents, and categorize that as a missing-field pattern tied to instruction position. In Maintenance, you reposition the instruction, re-run the entire suite to confirm no regression, save the suite, and schedule a monthly re-run. That single pass exercises every item, and the next time the model updates you re-enter only at Execution.
How to Use This Checklist in Practice
Do not treat the list as a one-time gate. The setup and variation items are mostly a one-time build; the execution, scoring, and maintenance items recur. For a new prompt, walk the whole list. For an established prompt after a model update, the maintenance items plus a re-run are usually enough. When an item does not apply (a throwaway prompt needs no adversarial benchmark, for example), skip it deliberately and note why, rather than skipping it out of habit. The reasoning attached to each item is what lets you make that call. The competing approaches behind some of these choices are weighed in Prompt Sensitivity and Robustness Testing: Trade-offs, Options, and How to Decide, and the named structure that organizes them appears in The SCORE Model for Prompt Robustness Testing.
Frequently Asked Questions
Do I need to complete every item for every prompt?
No. Walk the full list for new or high-stakes prompts, but for an established prompt after a routine model update, the maintenance items and a re-run usually suffice. The reasoning attached to each item lets you decide what genuinely applies. Skipping an item should be a deliberate judgment about stakes, not an unexamined habit.
Why measure the randomness floor as a separate checklist item?
Because without it, you cannot tell whether output differences come from your prompt edits or from the model's built-in sampling variability. Running the exact prompt repeatedly first establishes how much variation is just noise. Only differences exceeding that floor count as real sensitivity, which keeps you from chasing problems that are not actually fragility.
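The comparison against the floor can be sketched in a few lines. This example assumes you already have pass-or-fail results for repeated runs of the unchanged prompt and for one variation; the `margin` of 0.1 is an arbitrary illustrative threshold, not a statistical test, and small run counts deserve extra caution.

```python
def failure_rate(passes: list[bool]) -> float:
    """Fraction of runs that failed the criterion."""
    if not passes:
        raise ValueError("need at least one run")
    return passes.count(False) / len(passes)


def exceeds_floor(
    baseline_passes: list[bool],
    variation_passes: list[bool],
    margin: float = 0.1,  # assumed threshold for illustration
) -> bool:
    """True if the variation fails noticeably more often than the unchanged prompt."""
    floor = failure_rate(baseline_passes)
    return failure_rate(variation_passes) - floor > margin
```

If the unchanged prompt already fails one run in ten, a variation that fails at the same rate is showing noise, not sensitivity.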
How is this checklist different from the best-practices article?
The best-practices article argues the reasoning at length and explains when to bend each rule. This checklist compresses those conclusions into an actionable pass you can run before shipping. Use the best-practices piece to understand the why deeply, and this checklist as the working tool you actually walk through under time pressure.
What does it mean to schedule re-runs, and how often?
It means running the saved suite automatically on a recurring basis, not only when you remember. Frequency depends on stakes and how often your model provider updates, but monthly is a reasonable default for many teams, with event-triggered runs on every prompt edit or model change. Scheduling removes the discipline problem because the test happens without anyone initiating it.
Can I add my own items to this checklist?
You should. This list covers the general failure modes, but your domain may have specific risks, such as regulatory constraints, language coverage, or latency bounds, that are worth adding as items. The structure of pairing each item with its reasoning is the part to preserve, because that is what keeps the checklist from becoming mechanical box-ticking.
What is the minimum viable version of this checklist?
If you can do only a few things: write a success criterion, build a benchmark with adversarial inputs, run at production temperature, score pass or fail, and save the suite for re-running. Those five items capture most of the protective value. The remaining items refine accuracy and attribution, which matter more as stakes rise.
Key Takeaways
- Each checklist item carries its reasoning, so you can adapt and skip honestly rather than ticking boxes by rote.
- Setup items (a written criterion and an adversarial benchmark including past failures) determine whether the rest of the test means anything.
- Variation items prevent the silent errors: preserve meaning, change one dimension at a time, and keep an unmodified baseline.
- Execution and scoring separate sensitivity from randomness, test at both temperatures, and convert outputs into categorized, pattern-level findings.
- Maintenance items turn the test into an instrument: re-run fully after fixes and schedule recurring runs to catch silent model drift.