A playbook is different from a tutorial. A tutorial teaches you how a thing works; a playbook tells you what to do, when to do it, and who owns it. This is the operating manual for building a sentiment and emotion detection capability — a sequence of named plays you can run, each with a clear trigger and an owner, so the work moves from idea to reliable production system without stalling in the middle.
The plays are sequenced deliberately. Skipping ahead — building a fancy multi-label classifier before you have agreed on what the labels mean — is the most common way these projects collapse. Run them in order the first time, then return to specific plays as triggers fire.
Each play below names what sets it off, the steps, and who should hold it.
Play 1: Scope the Decision
Trigger: someone proposes using emotion detection for something.
Steps
Before any prompting, define what decision the output will drive and how wrong the model is allowed to be. An aggregate trend dashboard tolerates far more error than a system that escalates distressed customers. The stakes determine every later choice.
Owner
The person accountable for the decision the output feeds: usually a product or CX lead, not the prompt author. Putting ownership here prevents building a precise classifier for a problem that did not need one, or a sloppy one for a problem that did.
Play 2: Define the Taxonomy
Trigger: scope is agreed and the project is greenlit.
Steps
Write the exact set of labels with one or two example messages per label that define the boundary. Keep it small. Decide whether you need polarity, discrete emotions, dimensional scores, or aspect-level output based on the decision from Play 1.
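One way to make the taxonomy a maintainable artifact is to store it as data instead of prose. The sketch below is illustrative only: the label names, example messages, the `Label` class, the `validate_taxonomy` helper, and the limit of eight labels are all assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Label:
    """One taxonomy entry: a name, a definition, and boundary examples."""
    name: str
    definition: str
    examples: tuple[str, ...] = field(default_factory=tuple)


# Hypothetical labels and example messages, for illustration only.
TAXONOMY = (
    Label("anger", "Hostility directed at the company or agent.",
          ("This is unacceptable, I want a refund now.",)),
    Label("frustration", "Annoyance at a problem, without hostility.",
          ("I've tried resetting it three times and it still fails.",)),
    Label("neutral", "No clear emotional signal.",
          ("What are your opening hours?",)),
    Label("uncertain", "The annotator or model cannot decide.",
          ("Great, just great.",)),
)


def validate_taxonomy(labels, max_labels=8, min_examples=1, max_examples=2):
    """Return a list of problems; an empty list means the taxonomy passes."""
    problems = []
    names = [label.name for label in labels]
    if len(names) != len(set(names)):
        problems.append("duplicate label names")
    if len(labels) > max_labels:  # max_labels is an assumed cap for "keep it small"
        problems.append(f"too many labels: {len(labels)} > {max_labels}")
    for label in labels:
        if not min_examples <= len(label.examples) <= max_examples:
            problems.append(
                f"{label.name}: expected {min_examples}-{max_examples} examples"
            )
    return problems
```

Running `validate_taxonomy(TAXONOMY)` whenever the taxonomy changes gives the owner a cheap check that the contract stays small and that every label still has its boundary examples.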
Owner
A single taxonomy owner who will maintain it as edge cases surface. This artifact becomes the contract for everything downstream, exactly as described in Rolling Out Prompting for Sentiment and Emotion Detection Across a Team.
Play 3: Build the Gold Set
Trigger: taxonomy is defined.
Steps
Hand-label a few hundred representative examples against the taxonomy, deliberately including hard cases — sarcasm, mixed sentiment, domain idioms. Record where annotators disagree. This set is how you will measure everything; build it before the prompt, not after.
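Recording disagreement can be as simple as having two annotators label the same items and computing Cohen's kappa, a standard chance-corrected agreement statistic. The sketch below assumes exactly two annotators and plain string labels; the function names are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(items, labels_a, labels_b):
    """Items where the two annotators differ, kept for taxonomy review."""
    return [(item, a, b) for item, a, b in zip(items, labels_a, labels_b) if a != b]
```

The list returned by `disagreements` is often more useful than the kappa value itself, because those items are the boundary cases the taxonomy owner needs to resolve.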
Owner
Whoever owns quality. The gold set is the project's source of truth and should not be owned by the same person racing to ship the prompt.
Play 4: Draft and Iterate the Prompt
Trigger: gold set exists.
Steps
Write a constrained prompt with a fixed output format, two or three domain-specific few-shot examples, and an explicit "uncertain" label the model can return when no other label clearly fits. Measure against the gold set, read the errors, and iterate. For hard cases such as sarcasm, add a reasoning step so the model considers tone before it commits to a label. The advanced techniques you reach for here are in When Sarcasm Breaks Your Emotion Classifier, Try This.
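A minimal sketch of such a prompt, together with a defensive parser for the reply, might look like the following. The label set, the few-shot examples, the JSON shape, and the function names are assumptions for illustration; the call to the model itself is deliberately left out because it depends on your provider.

```python
import json

ALLOWED_LABELS = ("anger", "frustration", "neutral", "uncertain")  # hypothetical

FEW_SHOT = (  # hypothetical domain examples
    ("This is unacceptable, I want a refund now.", "anger"),
    ("I've tried resetting it three times and it still fails.", "frustration"),
)


def build_prompt(message):
    """Assemble a constrained prompt with a fixed JSON output format."""
    lines = [
        "Classify the customer message into exactly one label.",
        f"Allowed labels: {', '.join(ALLOWED_LABELS)}.",
        'If the message does not clearly fit a label, answer "uncertain".',
        'Respond with JSON only: {"label": "<label>", "evidence": "<quoted span>"}',
        "",
    ]
    for text, label in FEW_SHOT:
        lines.append(f"Message: {text}")
        lines.append(json.dumps({"label": label, "evidence": text}))
        lines.append("")
    lines.append(f"Message: {message}")
    return "\n".join(lines)


def parse_response(raw):
    """Parse the model's reply; anything malformed falls back to 'uncertain'."""
    fallback = {"label": "uncertain", "evidence": ""}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict) or data.get("label") not in ALLOWED_LABELS:
        return fallback
    return {"label": data["label"], "evidence": str(data.get("evidence", ""))}
```

Treating unparseable or out-of-taxonomy replies as "uncertain" keeps format failures on the same path as genuine ambiguity, so they get reviewed and not silently counted as a label.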
Owner
The prompt author, working against the gold set rather than vibes.
Play 5: Validate Per Class and Per Group
Trigger: the prompt produces stable output.
Steps
Compute precision and recall per emotion class, not just overall accuracy, and where possible disaggregate by the populations your text represents. Fix systematic weaknesses before shipping. This guards against the fairness failures detailed in The Hidden Risks of Prompting for Sentiment and Emotion Detection (and How to Manage Them).
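Per-class precision and recall need nothing more than the gold labels and the predictions. The sketch below uses only the standard library; the function names and the idea of a `groups` list (one group key per example, such as language or channel) are assumptions about how your data is organized.

```python
from collections import defaultdict


def per_class_metrics(gold, predicted):
    """Precision and recall for each label seen in gold or predicted."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be the same length")
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    metrics = {}
    for label in set(gold) | set(predicted):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        metrics[label] = {
            # None marks an undefined ratio (no predictions or no gold examples).
            "precision": tp[label] / predicted_count if predicted_count else None,
            "recall": tp[label] / gold_count if gold_count else None,
            "support": gold_count,
        }
    return metrics


def metrics_by_group(gold, predicted, groups):
    """Disaggregate per-class metrics by a group key for each example."""
    buckets = defaultdict(lambda: ([], []))
    for g, p, group in zip(gold, predicted, groups):
        buckets[group][0].append(g)
        buckets[group][1].append(p)
    return {group: per_class_metrics(gs, ps) for group, (gs, ps) in buckets.items()}
```

Keep an eye on `support` when reading the per-group numbers: a recall figure computed from a handful of examples is too noisy to justify a sign-off either way.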
Owner
The quality owner, who signs off that the classifier meets the bar set in Play 1.
Play 6: Ship With the Right Automation Level
Trigger: validation passes.
Steps
Match automation to stakes. Automate aggregate analytics and the clearest individual cases; route uncertain and high-stakes calls to humans. Wrap the classifier in the team's existing workflow rather than a new ceremony. The repeatable process scaffolding is in Building a Repeatable Workflow for Prompting for Sentiment and Emotion Detection.
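The routing rule can be written down as a small function so it is reviewable, not buried in workflow configuration. This is a sketch under stated assumptions: that each prediction arrives with a confidence score between 0 and 1 (how you obtain that score is up to you), that the use case carries a `high_stakes` flag set in Play 1, and that 0.9 is a placeholder threshold to be tuned against the gold set.

```python
def route(label, confidence, high_stakes=False, auto_threshold=0.9):
    """Return 'automate' or 'human_review' for one classified message.

    confidence is assumed to be a score in [0, 1] supplied upstream;
    auto_threshold is a hypothetical default, not a recommended value.
    """
    if high_stakes or label == "uncertain" or confidence < auto_threshold:
        return "human_review"
    return "automate"
```

For example, `route("anger", 0.95)` returns `"automate"`, while the same prediction with `high_stakes=True` goes to a human regardless of confidence.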
Owner
The product or operations owner who controls the workflow it plugs into.
Play 7: Monitor and Re-Run
Trigger: the system goes live, and thereafter on a recurring schedule.
Steps
Re-run validation against a freshly labeled sample on a fixed cadence, watch label distributions for sudden shifts, and run a calibration session if the team grows or definitions drift. Feed resolved edge cases back into the taxonomy and gold set.
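Watching label distributions can start with a single number: the total variation distance between a baseline batch and the current one. The sketch below is one possible check, not the only one; the function names and the 0.15 alert threshold are assumptions to be tuned for your traffic.

```python
from collections import Counter


def label_distribution(labels):
    """Share of each label in a batch."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(baseline, current):
    """Half the L1 distance between two distributions (0 = identical, 1 = disjoint)."""
    labels = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) for k in labels)


def distribution_shifted(baseline_labels, current_labels, threshold=0.15):
    """True when the shift exceeds a hypothetical alert threshold."""
    distance = total_variation_distance(
        label_distribution(baseline_labels),
        label_distribution(current_labels),
    )
    return distance > threshold
```

A shift alert does not say whether the inputs changed or the classifier did; it only tells you it is time to pull a fresh labeled sample and find out.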
Owner
The taxonomy and quality owners jointly. Without a scheduled trigger, monitoring quietly stops happening.
Play 8: Handle the Failure Drills
Trigger: the quality gate fails, accuracy decays, or a stakeholder disputes a result.
Steps
Have a predefined response rather than improvising under pressure. When the gate fails, pause the batch, pull the misclassified examples, and determine whether the cause is input drift, a model change, or a prompt regression. Roll back to the last known-good prompt version while you diagnose. When a stakeholder disputes a label, trace it through the structured output and evidence span rather than relitigating from memory.
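Two pieces of that drill lend themselves to code: deciding whether the gate failed, and finding the prompt version to roll back to. The sketch below assumes the gate is expressed as a minimum recall per label and that prompt versions are kept as an ordered list of records with an `id` and a `passed_gate` flag; both structures are hypothetical.

```python
def run_quality_gate(recall_by_label, minimum_recall):
    """Return the labels whose recall is below the bar set for them."""
    return sorted(
        label
        for label, bar in minimum_recall.items()
        # A label with no measured recall is treated as failing.
        if recall_by_label.get(label, 0.0) < bar
    )


def last_known_good(versions):
    """Pick the most recent prompt version that passed the gate, or None.

    versions is assumed to be ordered oldest to newest.
    """
    for version in reversed(versions):
        if version["passed_gate"]:
            return version["id"]
    return None
```

Used together, a non-empty result from `run_quality_gate` pauses the batch, and `last_known_good` names the version to restore while the misclassified examples are diagnosed.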
Owner
The quality owner runs the drill; the decision owner is informed if the failure affects live decisions. A rehearsed failure path is what separates a mature capability from one that panics when something breaks.
Sequencing the Plays Together
The plays are not independent — they form a chain where each one's output is the next one's input.
The first-build path
For an initial build, run plays one through six in order, then stand up seven and eight as standing capabilities. The most common shortcut, jumping to play four before plays two and three exist, is the most reliable way to end up with a classifier nobody can measure or trust.
Returning to plays as triggers fire
Once live, you re-enter specific plays when their triggers fire: a new use case sends you back to play one, a model upgrade sends you to play five, and a disputed result sends you to play eight. Treating the playbook as a set of triggered routines rather than a one-time checklist is what keeps the capability healthy as it ages.
A Worked Example of the Sequence
To make the sequence concrete, consider a support team that wants to flag angry tickets for faster handling.
How the plays unfold
Play one scopes the decision: angry tickets jump the queue, so a false negative — missing real anger — is worse than a false positive. That stakes assessment says favor recall on the anger class. Play two defines a small taxonomy with anger sharply distinguished from mere frustration. Play three builds a gold set heavy on the boundary cases between the two. Play four writes a prompt that reasons about tone before labeling, and play five validates recall on anger specifically rather than overall accuracy.
Where it pays off
By play six, the team automates escalation only for high-confidence anger and routes uncertain cases to a human, matching automation to the stakes set in play one. Play seven catches the day a product launch floods the queue with a new vocabulary the prompt has not seen, and play eight rolls back cleanly when the gate flags the drop. The sequence is what made each of those steps a deliberate choice rather than a scramble.
Frequently Asked Questions
What is the most common play teams skip?
Defining the taxonomy and building the gold set before writing the prompt. Teams rush to a clever prompt and then have no way to measure whether it works, so they ship on intuition and discover the problems in production.
Who should own the overall capability?
Ownership is split deliberately: the decision owner sets stakes, the taxonomy owner maintains definitions, and the quality owner guards the gold set and validation. One person wearing all three hats tends to cut corners on whichever responsibility conflicts with shipping.
How small should the taxonomy be?
Small enough that the team and the model apply it consistently, which usually means a handful of well-defined categories. Finer granularity erodes agreement, both among annotators and between annotators and the model, and inconsistent labels defeat the purpose of classifying at all.
When do I re-run the validation plays?
On a fixed schedule, and whenever you change the prompt, swap the model, or notice a shift in label distributions. Treat any of those as a trigger to re-run Play 5 against a fresh sample.
Can I run these plays out of order?
For a first build, no — each play depends on the previous one's output. Once the capability is live, you revisit individual plays as their triggers fire, but the initial sequence should run in order.
Key Takeaways
- A playbook assigns triggers and owners to each step so the build does not stall midway.
- Scope the decision and its stakes first; they determine every later choice.
- Define the taxonomy and build the gold set before writing the prompt, not after.
- Validate per class and per group, then match automation level to the stakes.
- Monitoring and re-running need a scheduled trigger and a named owner or they silently lapse.