Repeatable Plays for Classifiers Without Labeled Data

A playbook is not a tutorial. A tutorial teaches you how something works once. A playbook tells you which move to make in a given situation, who makes it, and what has to happen before and after. This is the operating playbook for zero-shot classification prompting: a set of named plays you can call by situation rather than reasoning from first principles every time.

Each play below has a trigger (the situation that calls for it), an owner (who runs it), and a sequence (what comes before and after). Treat them as plays you select, not steps you march through in order. The point of naming them is that a team can say "run the disambiguation play on the refund category" and everyone knows what that means.

For the linear version aimed at a single builder, Building a Repeatable Workflow for Zero-shot Classification Prompting walks the same ground as a process.

Play 1: The Cold Start

Trigger: a new classification need with no existing classifier and no labeled data.

How it runs

Owner: the requester, with a prompt-fluent partner if they are not one.
Write the label set as one-sentence definitions, not just names.
Add an "ambiguous" class up front.
Build the smallest possible prompt with a strict enumerated output.
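
As a rough illustration of what a cold-start prompt can look like, here is a minimal sketch. The label names, their definitions, and the helper names build_prompt and parse_label are all hypothetical; the sketch only assembles and checks text, and assumes you send the prompt to whatever model you already use.

```python
# Hypothetical label set; names and definitions are illustrative only.
LABELS = {
    "refund": "The customer asks for money back on a completed purchase.",
    "shipping": "The customer asks about delivery status, timing, or address.",
    "ambiguous": "The message fits no label clearly, or fits more than one.",
}


def build_prompt(text: str, labels: dict[str, str]) -> str:
    """Assemble a minimal zero-shot prompt with an enumerated output."""
    definitions = "\n".join(f"- {name}: {desc}" for name, desc in labels.items())
    allowed = ", ".join(labels)
    return (
        "Classify the message into exactly one label.\n\n"
        f"Labels:\n{definitions}\n\n"
        f"Message:\n{text}\n\n"
        f"Answer with one of: {allowed}. Output the label only."
    )


def parse_label(raw: str, labels: dict[str, str]) -> str:
    """Enforce the enumeration: anything off-list becomes 'ambiguous'."""
    candidate = raw.strip().lower()
    return candidate if candidate in labels else "ambiguous"


prompt = build_prompt("Where is my package?", LABELS)
```

Routing off-list answers to the ambiguous class is one design choice among several; rejecting and retrying is another, at the cost of an extra call.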

What comes next

Hand directly to the Evaluation play. Never ship a cold-start classifier on vibes; the first version always looks better than it is.

Play 2: Evaluation

Trigger: any classifier, new or changed, before it touches production.

How it runs

Owner: the classifier owner.
Pull a sample of real inputs, not curated examples.
Label them by hand once, then score the classifier per category.
Report accuracy per label, especially the rare-but-important ones.

This play is the backbone of the whole program. A classifier without it is unmeasured guesswork, a point hammered in Five Beliefs About Zero-shot Classifiers That Cost Teams Accuracy.
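
Per-label scoring needs very little machinery. The sketch below assumes you have two parallel lists, the hand labels and the classifier's outputs; the function name and the sample data are made up for illustration. It measures, for each hand-assigned label, the share of its items the classifier got right.

```python
from collections import defaultdict


def per_label_accuracy(gold: list[str], predicted: list[str]) -> dict[str, float]:
    """For each hand-assigned label, the share of its items classified correctly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be the same length")
    total = defaultdict(int)
    correct = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return {label: correct[label] / total[label] for label in total}


# Illustrative data only: "fraud" is the rare-but-important label here.
gold = ["refund", "refund", "shipping", "shipping", "shipping", "fraud"]
pred = ["refund", "shipping", "shipping", "shipping", "shipping", "refund"]

scores = per_label_accuracy(gold, pred)
overall = sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

In this toy sample the overall figure is about two in three, while the rare label scores zero, which is exactly the gap a single headline number hides.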

Play 3: Disambiguation

Trigger: the Evaluation play shows two categories getting confused.

How it runs

Owner: the classifier owner.
Identify the specific pair of labels that blur.
Write an explicit rule in the prompt distinguishing them.
Re-run Evaluation on just those categories to confirm the fix.

Resist the urge to fix confusion by adding more model power; fix it by sharpening the boundary, the central lesson of Where Zero-shot Classifiers Quietly Break at Scale.
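
One way to find the pair that blurs is to count errors by label pair on the evaluation set. This is a sketch under the same assumption of parallel hand-label and prediction lists; the function names and sample labels are hypothetical.

```python
from collections import Counter


def confused_pairs(gold, predicted):
    """Count errors per unordered label pair, most frequent first."""
    pairs = Counter()
    for g, p in zip(gold, predicted):
        if g != p:
            pairs[tuple(sorted((g, p)))] += 1
    return pairs.most_common()


def rows_for_pair(gold, predicted, pair):
    """Keep only evaluation rows whose hand label is one of the contested labels."""
    return [(g, p) for g, p in zip(gold, predicted) if g in pair]


# Illustrative data only.
gold = ["refund", "refund", "chargeback", "chargeback", "shipping"]
pred = ["chargeback", "refund", "refund", "refund", "shipping"]

worst_pair, errors = confused_pairs(gold, pred)[0]
focused = rows_for_pair(gold, pred, worst_pair)
```

The focused rows are the subset to re-score after the new rule goes into the prompt.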

Play 4: The Split

Trigger: more than roughly eight to ten categories with persistent confusion.

How it runs

Owner: the classifier owner.
Group categories under coarse parents.
Build a coarse classifier first, then a second prompt per contested parent.
Evaluate each stage independently.

The Split usually recovers accuracy that a single flat prompt cannot reach. The trade is added latency and cost from running two passes, so reserve it for taxonomies where a flat prompt has genuinely plateaued rather than reaching for it by default.
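
Structurally, the Split is a router. The sketch below shows the shape only: the keyword stubs stand in for real prompt calls, and every name in it (make_split_classifier, the parent and child labels) is an assumption for illustration.

```python
from typing import Callable

Classifier = Callable[[str], str]


def make_split_classifier(
    coarse: Classifier, fine_by_parent: dict[str, Classifier]
) -> Classifier:
    """Compose a coarse classifier with per-parent second-stage classifiers."""

    def classify(text: str) -> str:
        parent = coarse(text)
        fine = fine_by_parent.get(parent)
        # Only contested parents get a second pass; others keep the coarse label.
        return fine(text) if fine else parent

    return classify


# Keyword stubs standing in for prompt calls, for illustration only.
def coarse_stub(text: str) -> str:
    return "billing" if "charge" in text or "refund" in text else "shipping"


def billing_stub(text: str) -> str:
    return "billing/refund" if "refund" in text else "billing/dispute"


classify = make_split_classifier(coarse_stub, {"billing": billing_stub})
```

Because each stage is a plain function, each can be scored against its own evaluation set, and only contested parents pay for the second pass.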

A note on sequencing the Split

Always run the Split after Disambiguation, not before. If a two-category confusion can be fixed with a sharper boundary, fixing it is cheaper than restructuring the whole taxonomy. The Split is the move you make when sharpening boundaries one pair at a time stops being enough.

Play 5: The Drift Watch

Trigger: any classifier running in production.

How it runs

Owner: the classifier owner, on a fixed cadence.
Sample production classifications for human review weekly.
Track per-label volumes and the size of the ambiguous bucket.
Re-run Evaluation on fresh data periodically.

Because drift produces no error, this play is the only thing standing between a healthy classifier and silent decay, as detailed in What Confidently Wrong Classifiers Cost You.
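
Tracking volumes can be as simple as comparing each label's share of traffic against a baseline window. The sketch below assumes you can export the predicted labels for two periods; the 0.10 threshold and the sample counts are arbitrary placeholders, not recommendations.

```python
from collections import Counter


def label_shares(labels):
    """Fraction of total volume that each label accounts for."""
    if not labels:
        raise ValueError("need at least one classification")
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def drift_report(baseline, current, threshold=0.10):
    """Labels whose share of volume moved by more than `threshold` (absolute)."""
    base, cur = label_shares(baseline), label_shares(current)
    report = {}
    for label in set(base) | set(cur):
        delta = cur.get(label, 0.0) - base.get(label, 0.0)
        if abs(delta) > threshold:
            report[label] = round(delta, 3)
    return report


# Illustrative data only: the ambiguous bucket grows from 5% to 18%.
baseline = ["refund"] * 50 + ["shipping"] * 45 + ["ambiguous"] * 5
current = ["refund"] * 46 + ["shipping"] * 36 + ["ambiguous"] * 18

flagged = drift_report(baseline, current)
```

A shift in shares is a prompt to look, not proof of decay; the human review sample and a fresh Evaluation run are what confirm it.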

Play 6: The Handoff

Trigger: a classifier needs a new owner, or you are scaling across a team.

How it runs

Outgoing owner: package the label definitions, the evaluation set, the latest per-label accuracy, and one documented failure that was fixed.
Register the classifier in the central list with owner, purpose, and last-evaluated date.
Incoming owner: re-run Evaluation before accepting.

The Handoff is what keeps classifiers from becoming unowned and unaccountable, and it underpins Getting an Entire Team to Classify the Same Way Without Training Data.
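
If the central list lives in code rather than a spreadsheet, a registry entry might look like the sketch below. The field names follow the handoff items above, but the class itself, the completeness check, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ClassifierRecord:
    name: str
    owner: str
    purpose: str
    last_evaluated: date
    label_definitions: dict[str, str] = field(default_factory=dict)
    eval_set_path: str = ""
    per_label_accuracy: dict[str, float] = field(default_factory=dict)
    documented_failure: str = ""

    def missing_for_handoff(self) -> list[str]:
        """Names of handoff items that are still empty."""
        required = {
            "label_definitions": self.label_definitions,
            "eval_set_path": self.eval_set_path,
            "per_label_accuracy": self.per_label_accuracy,
            "documented_failure": self.documented_failure,
        }
        return [name for name, value in required.items() if not value]


# Hypothetical entry with two handoff items still missing.
record = ClassifierRecord(
    name="support-intent",
    owner="J. Doe",
    purpose="Route support tickets",
    last_evaluated=date(2024, 1, 15),
    label_definitions={"refund": "The customer asks for money back."},
    eval_set_path="evals/support_intent.jsonl",
)
missing = record.missing_for_handoff()
```

An incoming owner could decline the handoff until the missing list is empty and their own Evaluation run matches the recorded numbers.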

Play 7: The Regression Gate

Trigger: any proposed change to a live classifier's prompt or labels.

How it runs

Owner: whoever is making the change.
Re-run the existing evaluation set before merging the change.
Compare per-label accuracy against the prior version, not just the overall number.
Refuse the change if it drops accuracy on any category that matters, even if the overall figure improves.

This play exists because prompt edits regress silently: a tweak that fixes one category routinely degrades another with no visible signal. Treating evaluation as a gate, the way software treats tests, is the only reliable defense, and it complements the risk framing in What Confidently Wrong Classifiers Cost You.
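
The gate itself can be a few lines once per-label accuracy exists for both versions. This sketch assumes two dictionaries of per-label accuracy and a list of labels you have decided matter; the function name, the tolerance parameter, and the numbers are illustrative.

```python
def regression_gate(before, after, protected, tolerance=0.0):
    """Return protected labels whose accuracy dropped by more than `tolerance`."""
    failures = {}
    for label in protected:
        old = before[label]
        new = after.get(label, 0.0)  # a label missing after the change counts as 0
        if old - new > tolerance:
            failures[label] = (old, new)
    return failures


# Illustrative numbers only: the average improves while "fraud" gets worse.
before = {"refund": 0.90, "fraud": 0.80, "shipping": 0.70}
after = {"refund": 0.92, "fraud": 0.70, "shipping": 0.95}

failures = regression_gate(before, after, protected=["fraud", "refund"])
change_allowed = not failures
```

Wired into a merge check, a non-empty failures dictionary blocks the change regardless of what the overall figure did.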

Sequencing the Plays

A typical lifecycle runs Cold Start, Evaluation, then Disambiguation or Split as the evaluation reveals problems, then Drift Watch continuously once live, with Handoff invoked whenever ownership moves and the Regression Gate firing on every change. The plays are not strictly linear; Evaluation and Disambiguation cycle until the numbers hold, and Drift Watch never ends.

Calling the Right Play Under Pressure

The value of named plays shows most when something is going wrong and people are tempted to flail. A quick decision guide keeps the response disciplined.

Symptom to play

A brand-new need with no data: run Cold Start, then Evaluation. Do not skip to production.
Two categories getting confused: run Disambiguation before anything heavier.
Many categories still confused after disambiguation: run the Split.
Accuracy was fine and is now slipping: this is the Drift Watch play surfacing a problem; pull the recent errors and diagnose before editing.
About to change a live prompt: run the Regression Gate first, no exceptions.
Ownership is moving or unclear: run the Handoff and update the registry.
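
For teams that keep runbooks in code, the guide above reduces to a lookup table. The symptom keys below are invented shorthand for the six situations listed; nothing about them is standard.

```python
# Hypothetical symptom keys mapped to the plays named in this playbook.
PLAYBOOK = {
    "new_need_no_data": ["Cold Start", "Evaluation"],
    "two_categories_confused": ["Disambiguation"],
    "many_categories_confused_after_disambiguation": ["The Split"],
    "accuracy_slipping": ["Drift Watch"],
    "changing_live_prompt": ["Regression Gate"],
    "ownership_moving_or_unclear": ["Handoff"],
}


def plays_for(symptom: str) -> list[str]:
    """Look up the plays for a symptom, failing loudly on unknown ones."""
    try:
        return PLAYBOOK[symptom]
    except KeyError:
        known = ", ".join(sorted(PLAYBOOK))
        raise ValueError(f"Unknown symptom {symptom!r}. Known: {known}") from None
```

Failing loudly on an unknown symptom is deliberate in this sketch: a situation the table does not cover is a reason to stop and think, not to guess.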

The discipline the plays enforce

The plays exist precisely to stop the two most common panic moves: rewriting the prompt blindly when accuracy drops, and reaching for a bigger model to paper over a taxonomy problem. By naming the correct response to each situation, the playbook replaces improvisation with a reflex, which is what you want when something is on fire. The deeper reasoning behind each reflex lives in Where Zero-shot Classifiers Quietly Break at Scale.

Keep a one-page version visible

A playbook only works if people can recall it when they need it, which is rarely when they have time to read a long document. Distill the symptom-to-play mapping onto a single page and keep it where the team works. The point of naming plays is fast recall under pressure, and that benefit evaporates if invoking a play requires hunting through documentation. The lightest possible reference, six symptoms each mapped to the play that answers it, is enough to change behavior.

Frequently Asked Questions

Do I have to run every play for every classifier?

No. Cold Start, Evaluation, and Drift Watch apply to essentially all classifiers. Disambiguation and the Split are conditional; run them only when evaluation reveals the triggering problem.

Who should own a classifier in production?

A single named person, recorded in the registry. Shared or absent ownership is how classifiers drift unaccountably. The Handoff play exists specifically to keep ownership explicit.

How often should the Drift Watch play run?

Weekly sampling for review is a reasonable default, with a fuller re-evaluation on a longer cadence. Higher-stakes classifiers warrant more frequent attention.

What is the most commonly skipped play?

Evaluation, because the first version always looks good enough. Skipping it is the single most reliable way to ship a classifier that fails quietly in production.

Key Takeaways

Treat zero-shot classification as a set of named plays selected by situation, not a fixed sequence.
Cold Start, Evaluation, and Drift Watch apply to nearly every classifier.
Fix category confusion with the Disambiguation and Split plays, not by reaching for a bigger model.
The Drift Watch play is the only defense against silent production decay.
The Handoff play and a central registry keep classifiers owned and accountable.
Evaluation is the most-skipped and most-important play; never ship without it.

Agency Script Editorial
