A Working Checklist for Vetting Prompts Before They Ship

A checklist is only useful if every item earns its place. A list of vague reminders gets ignored; a list of concrete checks with a reason behind each becomes a tool you actually reach for before shipping. This is the second kind. Each item below is something you can verify with a yes or no, and each comes with a short justification so you understand why it matters rather than following it on faith.

Use this as a working document. Copy the items into your own notes, run a prompt against them before it goes live, and treat any unchecked box as a reason to pause. The checklist is organized by phase — preparation, scoring, robustness, and decision — because the order in which you verify things affects how much rework you avoid.

The goal is not bureaucracy. It is to make sure the prompt you are about to trust has been tested against the failure modes that actually sink prompts in production.

Before You Evaluate

These items set up an evaluation that can produce trustworthy results. Skip them and everything downstream is built on sand.

Success criteria are written and testable. A stranger should be able to score an output without asking you what good means. Vague criteria make every later score meaningless.
A test set of at least 15 inputs, and preferably 30 or more, exists. One example measures luck. A set measures behavior across the range you will actually face.
The test set covers common, edge, and adversarial inputs. Real traffic includes empty fields, long inputs, and attempts to break the prompt. Testing only the happy path certifies a fantasy.
A held-out set is reserved for final measurement. Tuning and measuring on the same examples inflates your score and hides poor generalization.
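As a minimal sketch of the held-out item, the split below shuffles a test set once with a fixed seed and reserves a slice that is never used for tuning. The function name `split_test_set`, the 30 percent fraction, and the example cases are illustrative assumptions, not a prescribed method.

```python
import random


def split_test_set(cases, held_out_fraction=0.3, seed=0):
    """Shuffle once with a fixed seed, then reserve a slice for final measurement."""
    if not 0 < held_out_fraction < 1:
        raise ValueError("held_out_fraction must be between 0 and 1")
    shuffled = list(cases)
    # A fixed seed makes the split reproducible, so the held-out set stays the same
    # across tuning sessions instead of leaking into the tuning examples.
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, round(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


# Hypothetical test set: 20 cases tagged by the kind of input they represent.
kinds = ["common"] * 12 + ["edge"] * 5 + ["adversarial"] * 3
cases = [{"id": i, "kind": kind} for i, kind in enumerate(kinds)]

tuning_set, held_out_set = split_test_set(cases)
print(len(tuning_set), len(held_out_set))
```

Storing the held-out cases in a separate file from the tuning cases makes it harder to tune against them by accident.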

During Scoring

These items keep the scoring itself honest and efficient.

Each criterion is matched to a scoring method. Structured requirements get programmatic checks; subjective ones get a rubric. Using human judgment for things code could check wastes your scarcest resource.
Programmatic checks validate structure where applicable. Confirm JSON parses, required fields exist, and values fall in range. These failures silently break downstream systems if unchecked.
Any model grader has been validated against humans. An unvalidated judge can certify a bad prompt with full confidence. Check agreement on a sample first.
Scores follow the rubric, not the output's charm. A fluent answer can talk you into accepting violations. Committing to criteria in advance is your defense.
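The programmatic-check item can be sketched as a small function that confirms the output parses as JSON, that required fields exist, and that numeric values fall in range. The field names `label` and `confidence`, the range, and the function name `check_structure` are hypothetical; substitute whatever your own output schema requires.

```python
import json


def check_structure(raw_output, required_fields, ranges):
    """Return a list of failure messages; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    failures = []
    for field in required_fields:
        if field not in data:
            failures.append(f"missing field: {field}")
    for field, (low, high) in ranges.items():
        if field not in data:
            continue  # already reported as missing, or optional
        value = data[field]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not low <= value <= high:
            failures.append(f"out of range: {field}={value!r}")
    return failures


# Hypothetical schema for a classification prompt.
REQUIRED = ["label", "confidence"]
RANGES = {"confidence": (0.0, 1.0)}

print(check_structure('{"label": "refund", "confidence": 0.92}', REQUIRED, RANGES))
print(check_structure('{"label": "refund", "confidence": 1.7}', REQUIRED, RANGES))
print(check_structure("not json", REQUIRED, RANGES))
```

Returning a list of messages rather than a single boolean keeps every failure visible, which helps when one output breaks several rules at once.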

For the reasoning behind these scoring choices, see What Separates a Reliable Prompt From a Lucky One.
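One simple way to check a model grader against humans is raw agreement on a labeled sample, sketched below. The labels are invented for illustration, and plain agreement is only one possible measure; what counts as acceptable agreement is a judgment for your own task.

```python
def agreement_rate(human_labels, grader_labels):
    """Fraction of sampled outputs where the model grader matches the human label."""
    if len(human_labels) != len(grader_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled example")
    matches = sum(h == g for h, g in zip(human_labels, grader_labels))
    return matches / len(human_labels)


# Hypothetical sample: ten outputs scored by a person and by the model grader.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
grader = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "pass"]

rate = agreement_rate(human, grader)
# Indices where the two disagree are the cases worth reading by hand.
disagreements = [i for i, (h, g) in enumerate(zip(human, grader)) if h != g]
print(rate, disagreements)
```

Reading the disagreements matters as much as the rate: a grader that only errs by passing bad outputs is riskier than one that errs in both directions.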

Robustness and Consistency

These items catch the failures that only appear under repetition and stress.

Important inputs were run multiple times. Models are probabilistic. A single pass can hide a prompt that fails intermittently in production.
Variance, not just the average, was recorded. A high mean with wide spread still produces painful, hard-to-trace failures. The worst case is what users hit.
Adversarial and injection inputs were tested. Prompts in live systems face hostile inputs. Confirming graceful degradation prevents security and safety surprises.
The most dangerous failure mode is named and tracked. Whether it is fabrication or an unsafe promise, naming it ensures every evaluation watches for it directly.
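To make the variance item concrete, the sketch below summarizes repeated runs of the same input and reports the spread and worst case alongside the mean. It assumes you already have per-run scores from your own harness; the input names and scores here are made up.

```python
import statistics


def summarize_runs(scores):
    """Summarize repeated runs of one input: mean, spread, and worst case."""
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),  # population standard deviation
        "worst": min(scores),
    }


# Hypothetical scores (1.0 = pass, 0.0 = fail) from five runs per input.
runs_by_input = {
    "refund_request": [1.0, 1.0, 1.0, 1.0, 1.0],
    "empty_message": [1.0, 1.0, 0.0, 1.0, 1.0],
}

summaries = {name: summarize_runs(scores) for name, scores in runs_by_input.items()}
for name, summary in summaries.items():
    print(name, summary)
```

In this example the second input still averages 0.8, but its worst case is a failure, which a single run would have had a good chance of missing.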

For more on these failure modes, see 7 Common Mistakes with Evaluating Prompt Quality.

Before You Ship

These items turn a pile of scores into a defensible decision.

Quality is weighed against cost and latency. A higher score that triples cost may be the wrong call for a high-volume feature. Optimize against the binding constraint.
A quality floor was set before evaluating. Deciding the bar in advance keeps the launch decision grounded in the cost of mistakes rather than enthusiasm.
The test set, pass rate, and decision are documented. Reproducibility lets the next person trust and rebuild your reasoning.
A plan exists to refresh the test set with production failures. Traffic drifts, and a frozen test set slowly stops describing reality.
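The ship decision can be sketched as a filter followed by a choice: discard candidates that miss the quality floor or exceed the cost and latency limits, then pick among the rest. Every number below is hypothetical, and picking the cheapest survivor is one assumed policy; a latency-bound feature would sort by latency instead.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    pass_rate: float          # measured on the held-out set
    cost_per_call_usd: float
    p95_latency_s: float


def choose(candidates, floor, max_cost, max_latency):
    """Return the cheapest candidate that clears the floor and both limits, or None."""
    eligible = [
        c for c in candidates
        if c.pass_rate >= floor
        and c.cost_per_call_usd <= max_cost
        and c.p95_latency_s <= max_latency
    ]
    return min(eligible, key=lambda c: c.cost_per_call_usd) if eligible else None


# Hypothetical candidates and limits, fixed before looking at any scores.
candidates = [
    Candidate("short_prompt", pass_rate=0.91, cost_per_call_usd=0.002, p95_latency_s=1.1),
    Candidate("long_prompt", pass_rate=0.96, cost_per_call_usd=0.006, p95_latency_s=2.4),
    Candidate("draft", pass_rate=0.84, cost_per_call_usd=0.001, p95_latency_s=0.9),
]

picked = choose(candidates, floor=0.90, max_cost=0.005, max_latency=3.0)
print(picked.name if picked else "nothing ships")
```

Raising the floor to 0.95 with the same cost limit leaves no eligible candidate, which is the signal to revisit the prompt or the budget rather than lower the bar after the fact.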

To build these checks into a full routine, see A Step-by-Step Approach to Evaluating Prompt Quality.

How to Use the Checklist Without It Becoming Theater

A checklist fails when people check boxes mechanically to satisfy a process rather than to catch problems. To keep it honest, pair each check with the artifact that proves it. The criteria box is only checked when the written criteria actually exist in a document. The held-out set box is only checked when that set is named and stored separately. The variance box is only checked when the multiple-run results are recorded somewhere you could show a skeptic.

Tying each item to evidence turns the checklist from a ritual into an audit. Anyone can review the artifacts and confirm the work happened. This matters most under deadline pressure, exactly when the temptation to wave items through is strongest. If a box cannot be backed by an artifact, treat it as unchecked, because an unverified claim of rigor is worse than an honest gap — it gives false confidence that no one will revisit.
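One way to enforce the artifact rule is to let a script, not a person, decide whether a box is checked. The sketch below marks an item as checked only when its artifact file exists and is non-empty. The item names and file paths are hypothetical, and the temporary directory stands in for wherever your evaluation records actually live.

```python
import tempfile
from pathlib import Path

# Hypothetical mapping from checklist item to the artifact that proves it.
CHECKLIST = {
    "success criteria written": "criteria.md",
    "held-out set stored separately": "held_out.jsonl",
    "multi-run results recorded": "runs.csv",
}


def audit(checklist, root):
    """Mark an item as checked only if its artifact exists and is non-empty."""
    results = {}
    for item, relative_path in checklist.items():
        artifact = Path(root) / relative_path
        results[item] = artifact.is_file() and artifact.stat().st_size > 0
    return results


# Demonstration in a temporary directory where only the criteria file exists.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "criteria.md").write_text("Outputs must cite a source.\n")
    results = audit(CHECKLIST, root)

for item, checked in results.items():
    print("[x]" if checked else "[ ]", item)
```

A file existing does not prove its contents are good, so this only replaces the claim that the work happened, not the review of the work itself.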

It also helps to have a second person, not the author of the prompt, run the checklist. The person who wrote a prompt is the worst placed to judge it, because they know what it was supposed to do and unconsciously fill gaps the prompt itself leaves open. A reviewer with fresh eyes reads outputs the way a stranger would, which is closer to how real users will encounter them in production. Even a lightweight peer sign-off catches assumptions the author cannot see, and it spreads knowledge of the evaluation across more than one person, which matters when the original author moves on.

Adapting the Checklist to Your Stakes

Not every prompt warrants every item, and pretending otherwise breeds cynicism. Build two versions: a lightweight pass for low-stakes internal prompts covering only preparation and basic scoring, and a full pass for anything customer-facing or high-consequence covering robustness and the ship-decision items as well. Deciding which version applies before you start keeps the process proportionate, so the checklist earns its keep instead of becoming busywork people resent.
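The two versions can be kept as one list of sections with two named passes over it, as in this sketch. The item wording is condensed from the checklist above, and treating the lightweight pass as the whole scoring section is an assumption; trim it to whatever "basic scoring" means for your team.

```python
# Checklist items grouped by phase, condensed from the sections above.
SECTIONS = {
    "preparation": [
        "success criteria written and testable",
        "test set exists",
        "test set covers common, edge, and adversarial inputs",
        "held-out set reserved",
    ],
    "scoring": [
        "each criterion matched to a scoring method",
        "programmatic checks validate structure",
        "model grader validated against humans",
        "scores follow the rubric",
    ],
    "robustness": [
        "important inputs run multiple times",
        "variance recorded",
        "adversarial and injection inputs tested",
        "most dangerous failure mode named and tracked",
    ],
    "ship": [
        "quality weighed against cost and latency",
        "quality floor set before evaluating",
        "test set, pass rate, and decision documented",
        "plan to refresh the test set",
    ],
}

# Which sections each pass covers; decide the pass before starting the evaluation.
PASSES = {
    "lightweight": ["preparation", "scoring"],
    "full": ["preparation", "scoring", "robustness", "ship"],
}


def items_for(stakes):
    """Return the checklist items that apply to the chosen pass."""
    return [item for section in PASSES[stakes] for item in SECTIONS[section]]


print(len(items_for("lightweight")), len(items_for("full")))
```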

Frequently Asked Questions

How is a checklist different from just knowing best practices?

Knowing best practices is necessary but easy to skip under deadline pressure. A checklist forces explicit verification of each one before shipping, turning intentions into actions. Each unchecked box becomes a concrete reason to pause, which is far more reliable than trusting yourself to remember everything in the moment.

Do I really need every item for a low-stakes prompt?

Scale it to the stakes. A throwaway internal prompt can skip adversarial testing and variance measurement. A customer-facing or high-consequence prompt should clear every item. The preparation and scoring sections, though, are worth doing almost always, because they are cheap and prevent the most common self-deceptions.

What does it mean to set a quality floor before evaluating?

It means deciding the minimum pass rate the task requires before you see any scores, based on the cost of a mistake. A high-stakes task might demand 98 percent plus human review; a low-risk one might accept 90 percent. Setting it in advance prevents you from rationalizing a weak result after the fact.

How often should I rerun this checklist on an existing prompt?

Whenever the prompt changes, the underlying model version changes, or production traffic shifts meaningfully. At minimum, revisit active prompts on a regular cadence and after any incident. Each of those events can invalidate a previously passing evaluation, so a fresh pass through the checklist keeps your confidence accurate.

Key Takeaways

Treat the checklist as a working tool: every unchecked item is a reason to pause before shipping.
Preparation items (written criteria, a representative test set, and a reserved held-out set) make everything downstream trustworthy.
Match scoring methods to criteria and validate any judge, human or model, before relying on it.
Measure variance and test adversarial inputs to catch failures that single runs hide.
Weigh quality against cost and latency, set a quality floor in advance, and document the decision.

Agency Script Editorial
