A framework tells you what good looks like. A playbook tells you what to do, who does it, and when. The gap between knowing how to evaluate a prompt and actually evaluating every prompt that matters is an operational gap, and it is where most teams quietly fall down. They have a standard nobody triggers and a method nobody owns.
This playbook closes that gap. It lays out the discrete plays that make up prompt evaluation, the triggers that should fire each one, the owner accountable for each, and the sequence that ties them together. Treat it as an operating model you can adapt rather than a script to follow line by line. The aim is that evaluation happens reliably without depending on any one person remembering to do it.
The Plays
Evaluation is not a single action. It is a small set of distinct plays, each with its own purpose, that you run at different moments.
Play One: Define the Bar
Before any output exists, define what good means for the task and the failure rate you can tolerate. This play runs once per new prompt and is revisited when requirements change. Skipping it guarantees you will grade on vibes later. The structure for this play comes from A Framework for Evaluating Prompt Quality.
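One way to make the bar concrete is to write it down as data before any output exists. The sketch below is illustrative only: the `QualityBar` name, its fields, and the example values are assumptions, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityBar:
    """Hypothetical record of what 'good' means for one prompt."""
    task: str
    criteria: tuple[str, ...]   # rubric items each output is checked against
    max_failure_rate: float     # tolerated fraction of failing outputs (0.0 to 1.0)

    def __post_init__(self) -> None:
        if not self.criteria:
            raise ValueError("define at least one criterion")
        if not 0.0 <= self.max_failure_rate <= 1.0:
            raise ValueError("max_failure_rate must be between 0 and 1")

    def is_met(self, failures: int, total: int) -> bool:
        """True when the observed failure rate is within tolerance."""
        if total <= 0:
            raise ValueError("total must be positive")
        return failures / total <= self.max_failure_rate


# Example values are made up for illustration.
bar = QualityBar(
    task="summarize support tickets",
    criteria=("mentions the customer's issue", "stays under 80 words"),
    max_failure_rate=0.05,
)
```

Freezing the record is a deliberate choice here: the bar should change only when requirements change, not drift during scoring.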
Play Two: Build the Test Set
Assemble inputs that represent real traffic, including messy and adversarial cases. This play produces the asset every later play depends on, so it deserves real effort. For a starting checklist of what to include, see The Evaluating Prompt Quality Checklist for 2026.
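A simple coverage check can flag a test set that is missing a whole category of input. This is a minimal sketch; the `EvalCase` type and the three category labels are assumptions chosen to mirror the typical, messy, and adversarial cases mentioned above.

```python
from dataclasses import dataclass

# Hypothetical category labels; adapt them to your own traffic.
REQUIRED_KINDS = frozenset({"typical", "messy", "adversarial"})


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    kind: str
    input_text: str


def missing_kinds(cases: list[EvalCase]) -> set[str]:
    """Return the required categories that no case in the set covers."""
    return set(REQUIRED_KINDS) - {case.kind for case in cases}


# Made-up example inputs.
test_set = [
    EvalCase("t1", "typical", "My order arrived late."),
    EvalCase("t2", "messy", "ordr late??? pls fix asap!!1"),
]
gaps = missing_kinds(test_set)  # this set has no adversarial case yet
```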
Play Three: Run and Score
Execute the prompt across the test set, sampling each input multiple times, and score the results against the rubric. This is the core measurement play and the one you repeat most often.
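The measurement loop itself can be small. The sketch below assumes two callables you would supply: `generate`, which stands in for a call to your model, and `passes`, which stands in for your rubric check. Both names, and the default of three samples per input, are illustrative assumptions.

```python
from typing import Callable, Sequence


def run_and_score(
    prompt: str,
    inputs: Sequence[str],
    generate: Callable[[str, str], str],
    passes: Callable[[str], bool],
    samples: int = 3,
) -> float:
    """Return the fraction of sampled outputs that fail the rubric check."""
    if not inputs or samples < 1:
        raise ValueError("need at least one input and one sample per input")
    failures = 0
    for text in inputs:
        # Sample each input several times because model output can vary per call.
        for _ in range(samples):
            if not passes(generate(prompt, text)):
                failures += 1
    return failures / (len(inputs) * samples)


# Stand-in model for illustration; a real `generate` would call your model.
def fake_generate(prompt: str, text: str) -> str:
    return f"{prompt} {text}".strip()


rate = run_and_score(
    prompt="Summary:",
    inputs=["short ticket", "a much longer and messier ticket body"],
    generate=fake_generate,
    passes=lambda output: len(output) <= 25,  # toy rubric: a length limit
)
```

The resulting failure rate is the number you compare against the tolerance set when defining the bar.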
Play Four: Triage Failures
When the run surfaces failures, decide which are blocking, which are acceptable, and which need a prompt revision. Triage is a judgment play, and it is where domain knowledge earns its keep.
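Triage stays a human judgment, but the verdicts can be recorded in a structured way. In this sketch the `Verdict` labels mirror the three outcomes above; the rule in `can_ship`, that any blocking or needs-revision verdict holds the release, is an assumption you may want to relax.

```python
from enum import Enum


class Verdict(Enum):
    BLOCKING = "blocking"
    ACCEPTABLE = "acceptable"
    NEEDS_REVISION = "needs_revision"


def summarize_triage(verdicts: dict[str, Verdict]) -> dict[str, list[str]]:
    """Group failing case ids by the verdict a reviewer assigned to each."""
    grouped: dict[str, list[str]] = {v.value: [] for v in Verdict}
    for case_id, verdict in verdicts.items():
        grouped[verdict.value].append(case_id)
    return grouped


def can_ship(verdicts: dict[str, Verdict]) -> bool:
    """Assumed rule: ship only when every failure was judged acceptable."""
    return all(v is Verdict.ACCEPTABLE for v in verdicts.values())
```

For example, a run where case `t1` is acceptable and case `t2` is blocking groups into one entry per verdict and does not pass `can_ship`.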
Play Five: Decide and Record
Make the ship-or-revise call and log it with its evidence. This play creates the audit trail that lets you learn from outcomes and defend decisions later.
The Triggers
Plays that run only when someone remembers them do not run. Each play needs a trigger that fires it automatically or reliably.
Event Triggers
- A new prompt enters development, firing Play One and Play Two
- A prompt is revised, firing Play Three against the existing test set
- The underlying model is updated, firing a full rerun to catch regressions
- A user reports a bad output, firing Play Four and adding a new case to the test set
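The event triggers above can be expressed as a lookup table, so that the mapping from event to play lives in one place. The event and play names below are hypothetical identifiers invented for this sketch.

```python
# Hypothetical event and play identifiers mirroring the list above.
TRIGGER_PLAYS: dict[str, list[str]] = {
    "new_prompt": ["define_the_bar", "build_the_test_set"],
    "prompt_revised": ["run_and_score"],
    "model_updated": ["run_and_score"],  # intended as a full rerun of every prompt
    "user_report": ["triage_failures", "add_test_case"],
}


def plays_for(event: str) -> list[str]:
    """Return the plays an event should fire, or fail loudly if it is unknown."""
    try:
        return list(TRIGGER_PLAYS[event])  # copy so callers cannot mutate the table
    except KeyError:
        raise ValueError(f"no plays registered for event {event!r}") from None
```

Failing on an unknown event, rather than silently returning nothing, keeps a mistyped trigger from becoming a play that never runs.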
Scheduled Triggers
Some plays run on a clock rather than an event. Schedule periodic full reruns of production prompts, because output quality can drift even when the prompt itself has not changed, for example after a model update or a shift in the inputs users send. The reasoning behind recurring evaluation is detailed in The Hidden Risks of Evaluating Prompt Quality.
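A scheduled trigger can be as simple as a daily job that lists which prompts are overdue for a rerun. This sketch assumes you keep a last-run date per prompt; the 30-day default interval is an arbitrary placeholder, not a recommendation.

```python
from datetime import date, timedelta


def due_for_rerun(
    last_runs: dict[str, date],
    today: date,
    interval_days: int = 30,  # assumed default; pick an interval that fits your risk
) -> list[str]:
    """Return names of prompts whose last full run is at least interval_days old."""
    cutoff = today - timedelta(days=interval_days)
    return sorted(name for name, last in last_runs.items() if last <= cutoff)
```

Passing `today` in as an argument, instead of reading the clock inside the function, keeps the check easy to test.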
The Owners
Every play needs a name attached, or it becomes everyone's job and therefore no one's. Ownership does not mean doing all the work; it means being accountable for the play happening.
Assign Accountable Owners
The prompt author typically owns defining the bar and building the initial test set. A designated quality owner owns the standard, the calibration of reviewers, and the integrity of the test set over time. The person shipping the work owns the final decide-and-record play. Spreading these roles prevents the single-bottleneck failure described in Rolling Out Evaluating Prompt Quality Across a Team.
Separate Authorship From Approval
Where stakes are high, the person who wrote the prompt should not be the only one who approves it. A second reviewer counters the natural bias to pass your own work. This separation is cheap insurance against rubber-stamp evaluation.
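The separation rule is easy to enforce mechanically. A minimal sketch, assuming approvals are tracked as a set of reviewer names; the function name and parameters are hypothetical.

```python
def approval_is_valid(author: str, approvers: set[str], high_stakes: bool) -> bool:
    """Check that a prompt has the approval its stakes require.

    For high-stakes prompts, at least one approver must be someone other
    than the author. Low-stakes prompts only need some approval.
    """
    if not approvers:
        return False
    if high_stakes:
        return any(approver != author for approver in approvers)
    return True
```

Under this rule an author may still sign off on their own high-stakes prompt, as long as a second person has approved it too.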
The Sequence
The plays only work in order. Running them out of sequence produces the illusion of evaluation without its substance.
The Standard Flow
Define the bar, build the test set, run and score, triage failures, then decide and record. Each play feeds the next: you cannot score without a test set, cannot triage without scores, and cannot decide responsibly without triage. When a trigger fires mid-flow, such as a user report, you re-enter at the relevant play rather than starting over. For turning this sequence into a documented, hand-off-able process, see Building a Repeatable Workflow for Evaluating Prompt Quality.
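Re-entering the flow at the relevant play can be modeled as slicing an ordered list. The play identifiers below are hypothetical names for the five plays in this playbook.

```python
# Hypothetical identifiers for the five plays, in their required order.
PLAYS = [
    "define_the_bar",
    "build_the_test_set",
    "run_and_score",
    "triage_failures",
    "decide_and_record",
]


def remaining_plays(entry_point: str) -> list[str]:
    """Return the plays to run, in order, when re-entering at entry_point."""
    start = PLAYS.index(entry_point)  # raises ValueError for an unknown play
    return PLAYS[start:]
```

A user report that re-enters at triage, for instance, still has to pass through decide-and-record before the flow is complete.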
Tier the Plays by Stakes
Not every prompt deserves the full sequence. A low-stakes prompt that drafts internal notes might run only define-the-bar and a quick scored check, while a client-facing or automated prompt runs every play with a second reviewer. Decide the tier when the prompt is created so the team applies effort in proportion to risk rather than treating every prompt the same.
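Recording the tier as configuration makes the effort-to-risk decision explicit at creation time. The two tier names and the exact plays assigned to each are assumptions drawn from the example above; a real team might define more tiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPlan:
    plays: tuple[str, ...]
    second_reviewer: bool


# Hypothetical tiers; "run_and_score" in the low tier stands for a quick scored check.
TIERS = {
    "low": TierPlan(plays=("define_the_bar", "run_and_score"), second_reviewer=False),
    "high": TierPlan(
        plays=(
            "define_the_bar",
            "build_the_test_set",
            "run_and_score",
            "triage_failures",
            "decide_and_record",
        ),
        second_reviewer=True,
    ),
}


def plan_for(tier: str) -> TierPlan:
    """Look up the plays and review requirement for a tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {sorted(TIERS)}")
    return TIERS[tier]
```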
Wiring the Playbook Into Tools
A playbook that lives only in a document depends on memory, and memory fails under deadline. The plays stick when the triggers are wired into the tools the team already uses.
Attach Triggers to Existing Events
Hook the run-and-score play to the moment a prompt changes in version control, and the full-rerun play to a model-update notification. When the trigger fires automatically, the play runs whether or not anyone remembers it, which is the entire point of an operating model. Manual triggers should be the exception reserved for genuinely irregular events.
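One tool-agnostic way to detect that a prompt has changed is to compare a content fingerprint against the one recorded at the last evaluation. The sketch below is the kind of check a commit hook or CI job could call; it assumes prompts are available as plain strings keyed by name and does not depend on any particular version-control system.

```python
import hashlib


def fingerprint(prompt_text: str) -> str:
    """Stable SHA-256 fingerprint of a prompt's text."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def changed_prompts(current: dict[str, str], recorded: dict[str, str]) -> list[str]:
    """Return names of prompts that are new or differ from the recorded fingerprint.

    `current` maps prompt name to prompt text; `recorded` maps prompt name
    to the fingerprint stored at the last evaluation (both hypothetical stores).
    """
    return sorted(
        name for name, text in current.items()
        if recorded.get(name) != fingerprint(text)
    )
```

Every name this returns is a prompt that should go through run-and-score before the change ships.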
Make the Record a Byproduct
The decide-and-record play should produce its audit trail as a natural byproduct of the decision, not a separate chore. When logging the verdict and its evidence is built into the same step that makes the call, the audit trail stays complete instead of decaying the moment people are busy.
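One way to make the record a byproduct is to route the decision through a function that cannot return without logging. In this sketch the verdict still comes from a person; the function name, the entry fields, and the use of a list of JSON lines as the log are all assumptions standing in for whatever store your team uses.

```python
import json
from datetime import datetime, timezone

ALLOWED_VERDICTS = {"ship", "revise"}


def decide_and_record(
    prompt_name: str,
    verdict: str,
    evidence: dict,
    decided_by: str,
    log: list[str],
) -> dict:
    """Record a human ship-or-revise call together with its evidence."""
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"verdict must be one of {sorted(ALLOWED_VERDICTS)}")
    if not evidence:
        raise ValueError("a decision must be logged with its evidence")
    entry = {
        "prompt": prompt_name,
        "verdict": verdict,
        "evidence": evidence,
        "decided_by": decided_by,
        "decided_at": datetime.now(timezone.utc).isoformat(),
    }
    log.append(json.dumps(entry, sort_keys=True))  # one JSON line per decision
    return entry
```

Rejecting a decision that arrives without evidence is the design choice that keeps the audit trail complete when people are busy.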
Frequently Asked Questions
How is a playbook different from a framework or a workflow?
A framework defines what quality means and how to score it. A workflow describes the repeatable steps to produce a judgment. A playbook sits above both: it specifies which plays to run, the triggers that fire them, the owners accountable for each, and the sequence that connects them. The framework is the standard, the workflow is the procedure, and the playbook is the operating model that ensures the procedure actually runs.
Do small teams need this much structure?
Small teams need the structure, not the bureaucracy. You can run every play in this playbook informally, with one person wearing several owner hats, as long as the plays still happen and someone is accountable for each. The risk for small teams is not over-process; it is that evaluation depends entirely on one person's memory and disappears the moment they are busy. Lightweight triggers solve that.
What triggers are most often forgotten?
The model-update trigger and the scheduled rerun. Teams reliably re-evaluate when they change a prompt but forget that an update to the underlying model can change behavior without any prompt change at all. Scheduled reruns are skipped because nothing appears to have changed. These are exactly the moments when silent regressions slip in, which is why both belong on automatic triggers rather than relying on human memory.
Who should make the final ship decision?
The person accountable for the deliverable, informed by the evaluation evidence, should make the call, ideally with a second reviewer for high-stakes prompts. The key is that the decision is explicit and recorded, not implied by the absence of objections. An evaluation that produces evidence but no documented decision leaves no one accountable when something later goes wrong.
Key Takeaways
- Evaluation is a set of distinct plays: define the bar, build the test set, run and score, triage, and decide.
- Each play needs a trigger, including event triggers and scheduled reruns, so it does not depend on memory.
- Assign an accountable owner to every play and separate authorship from approval on high-stakes prompts.
- Run the plays in sequence, re-entering at the right point when a trigger fires mid-flow.
- The playbook is the operating model that makes a framework and workflow actually run reliably.