Checklists exist because smart people forget steps under pressure. Sentiment and emotion detection is full of small decisions that feel optional until one of them quietly wrecks your accuracy — an undefined label, a missing escape hatch for ambiguity, a test set that does not match production. The cost of skipping a step rarely shows up at launch. It shows up three weeks later when a stakeholder stops trusting the output.
This is a working checklist, organized in the order you should actually do things: scope, define, prompt, test, ship and monitor, handle edge cases, and govern. Each item includes a one-line justification so you can decide whether it applies to your situation rather than following it blindly. Copy it into your project doc and check items off as you go.
Treat the items as defaults, not laws. If you skip one, skip it on purpose.
Phase 1: Scope the Problem
Before writing a single prompt, decide what you are actually measuring and why.
Scoping items
- Name the decision the output feeds. If no decision changes based on the label, you are doing analysis theater.
- Choose sentiment, emotion, or both. They are different tasks; emotion is harder and needs richer labels.
- Pick your label set and freeze it. Shifting labels mid-project invalidates every test you have run.
- Define the unit of analysis. Scoring a whole review, a single sentence, or a speaker turn produces very different results.
Phase 2: Define Every Label
This is the step teams skip and then regret. Definitions are where accuracy is won.
Definition items
- Define each label as observable behavior, not topic. "Negative" means an explicit complaint, not the presence of a problem word.
- Write at least one counter-example per label. A calm bug report labeled neutral, for instance, guards against the most common error: treating problem vocabulary as negative sentiment.
- Decide the target of sentiment. Sentiment toward the product, toward the company, and toward the writer's own situation are three different things.
- Specify how to handle resolved past issues. Without this, glowing reviews mentioning old problems get mislabeled.
The reasoning behind these definitions is shown in action in Concrete Sentiment Prompts That Worked (and the Ones That Backfired).
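One way to keep the definitions above frozen and reviewable is to store them as data and generate the prompt section from them. The following is a minimal sketch, not a prescribed format: the `LabelDefinition` class, the `render_definitions` helper, and the example wording for each label are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelDefinition:
    name: str
    definition: str       # observable behavior, not topic
    counter_example: str  # text that resembles the label but is not it


# Hypothetical label set; the wording is illustrative only.
LABELS = [
    LabelDefinition(
        name="negative",
        definition="The writer explicitly complains about the product.",
        counter_example="A calm bug report with no complaint is neutral.",
    ),
    LabelDefinition(
        name="neutral",
        definition="The writer states facts without praise or complaint.",
        counter_example="A terse reply that ends a long rant is still a complaint.",
    ),
    LabelDefinition(
        name="positive",
        definition="The writer explicitly praises the product.",
        counter_example="Praise for a competitor is about another target.",
    ),
]


def render_definitions(labels, target="the product"):
    """Turn label definitions into a plain-text prompt section."""
    lines = [
        f"Label sentiment toward {target} only.",
        "A past problem the writer describes as resolved is not a complaint.",
        "",
    ]
    for label in labels:
        lines.append(f"{label.name}: {label.definition}")
        lines.append(f"  Not {label.name}: {label.counter_example}")
    return "\n".join(lines)


print(render_definitions(LABELS))
```

Keeping the definitions in one structure means a change to a label shows up as a visible diff, which makes it harder to shift labels mid-project without noticing.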
Phase 3: Build the Prompt
Now translate definitions into instructions the model can follow.
Prompting items
- Allow multiple labels with intensity when text is mixed. Forcing a single label on mixed text manufactures errors.
- Add an explicit "uncertain" or "ambiguous" option. A flagged unknown is worth more than a confident guess.
- Require a supporting quote for each label. Grounding improves accuracy and enables auditing.
- Specify output format precisely (JSON or fixed schema). Downstream systems break on free-form responses.
A structured version of this lives in A Reusable Model for Reading Tone in Text at Scale.
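The prompting items above imply a response that can be checked mechanically. Here is a minimal sketch of such a check, assuming a response shape of `{"labels": [{"label", "intensity", "quote"}]}` with intensity on a 1 to 3 scale. The schema, the `ALLOWED_LABELS` set, and the `validate_output` function are illustrative assumptions, not a format the checklist mandates.

```python
import json

# Assumed label set, including the explicit escape hatch for ambiguity.
ALLOWED_LABELS = {"positive", "negative", "neutral", "uncertain"}


def validate_output(raw, source_text):
    """Parse a model response and reject anything that is off-schema.

    Assumed schema:
    {"labels": [{"label": str, "intensity": 1|2|3, "quote": str}, ...]}
    """
    data = json.loads(raw)  # raises ValueError on free-form responses
    labels = data.get("labels") if isinstance(data, dict) else None
    if not isinstance(labels, list) or not labels:
        raise ValueError("expected a non-empty 'labels' list")
    for item in labels:
        if not isinstance(item, dict):
            raise ValueError("each label entry must be an object")
        if item.get("label") not in ALLOWED_LABELS:
            raise ValueError(f"unknown label: {item.get('label')!r}")
        if item.get("intensity") not in (1, 2, 3):
            raise ValueError("intensity must be 1, 2, or 3")
        quote = item.get("quote")
        # Grounding check: the quote must appear verbatim in the input text.
        if not isinstance(quote, str) or not quote or quote not in source_text:
            raise ValueError("quote is missing or not found in the source text")
    return labels
```

A mixed review can then carry two labels, each tied to its own quote:

```python
review = "Setup was painful, but support fixed it fast and I love it now."
response = json.dumps({"labels": [
    {"label": "negative", "intensity": 1, "quote": "Setup was painful"},
    {"label": "positive", "intensity": 3, "quote": "I love it now"},
]})
print(validate_output(response, review))
```

The verbatim-substring check is deliberately strict: a paraphrased quote fails, which is the behavior you want if quotes are meant to be auditable.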
Phase 4: Test Against Ground Truth
A prompt you have not tested against labeled data is a guess.
Testing items
- Hand-label 100-200 representative examples. Include hard and ambiguous cases, not just easy ones.
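- Measure agreement, not just accuracy. On imbalanced label sets, raw accuracy hides systematic errors, so also report a chance-corrected statistic such as Cohen's kappa and per-label precision and recall.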
- Run error analysis and cluster failures. Patterns in the misses tell you what to fix next.
- Re-test after every prompt or model change. Improvements in one area often regress another.
The metrics to track are detailed in Reading the Signal: Scoring Sentiment Systems You Can Trust.
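To see why raw accuracy misleads on imbalanced labels, it helps to compute a chance-corrected agreement score next to it. The sketch below implements Cohen's kappa and per-label recall in plain Python. The function names and the toy data (90 neutral items, 10 negative items, of which the model finds 2) are made up for illustration.

```python
from collections import Counter


def cohen_kappa(gold, predicted):
    """Chance-corrected agreement between two label sequences."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    # Agreement expected if both sides labeled at random with these frequencies.
    expected = sum(gold_counts[k] * pred_counts[k] for k in gold_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use the same single label throughout
    return (observed - expected) / (1 - expected)


def per_label_recall(gold, predicted):
    """Fraction of each gold label that the predictions recovered."""
    hits, totals = Counter(), Counter(gold)
    for g, p in zip(gold, predicted):
        if g == p:
            hits[g] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Toy data: the model labels almost everything neutral.
gold = ["neutral"] * 90 + ["negative"] * 10
pred = ["neutral"] * 90 + ["negative"] * 2 + ["neutral"] * 8

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy={accuracy:.2f}")
print(f"kappa={cohen_kappa(gold, pred):.2f}")
print(per_label_recall(gold, pred))
```

On this toy set accuracy is 0.92, yet kappa is about 0.31 and recall on the negative label is 0.2. The headline number looks healthy while the model misses four out of five complaints.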
Phase 5: Ship and Monitor
Launch is the start of the work, not the end.
Launch items
- Route "uncertain" items to human review. Automated labels then cover only the cases the model was willing to commit to, which keeps their accuracy high.
- Log inputs, outputs, and quotes. You cannot debug what you did not record.
- Set a drift alarm on label distribution. A sudden shift in negative rate usually means input or model drift, not customer mood.
- Schedule a quarterly re-validation against fresh labels. Language and products change; your test set should too.
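A drift alarm on label distribution can start as a simple comparison of label shares between a baseline window and the current window. This is a rough sketch under stated assumptions: the `max_shift` threshold of 10 percentage points is a placeholder to tune against your normal variation, and the function names are hypothetical.

```python
from collections import Counter


def label_shares(labels):
    """Share of each label in a batch of predictions."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def drift_alerts(baseline, current, max_shift=0.10):
    """Return labels whose share moved more than max_shift (absolute).

    max_shift is a hypothetical threshold, not a recommended value.
    """
    base, cur = label_shares(baseline), label_shares(current)
    alerts = {}
    for label in sorted(set(base) | set(cur)):
        shift = cur.get(label, 0.0) - base.get(label, 0.0)
        if abs(shift) > max_shift:
            alerts[label] = round(shift, 3)
    return alerts


def needs_human_review(result):
    """Route any item that carries an 'uncertain' label to a person."""
    return any(item["label"] == "uncertain" for item in result)


last_month = ["negative"] * 20 + ["neutral"] * 50 + ["positive"] * 30
this_week = ["negative"] * 45 + ["neutral"] * 30 + ["positive"] * 25
print(drift_alerts(last_month, this_week))
```

An alert here is a prompt to investigate, not a verdict. Check for a new data source, a changed input format, or a model update before concluding that customers are unhappier.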
Phase 6: Handle the Edge Cases on Purpose
The long tail is where untested systems quietly fail. Decide your policy for each edge case before it appears in production, not after.
Edge-case items
- Decide your sarcasm policy. You will not detect it perfectly; when the literal and intended meanings conflict, route the item to "uncertain" rather than guessing.
- Specify handling for non-English or mixed-language text. A model may silently degrade; flag or segment by language so quality stays measurable.
- Set a minimum length threshold. Two-word reviews carry too little signal; label them low-confidence rather than forcing a confident call.
- Define behavior for empty or junk input. Bot spam and blank fields should return a "no signal" label, not a fabricated emotion.
These cases mirror the failures dissected in Concrete Sentiment Prompts That Worked (and the Ones That Backfired), where unhandled edge cases were the difference between a demo and a shippable system.
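Two of the edge-case policies above, minimum length and junk input, can be enforced before any text reaches the model. The following is a minimal sketch: the `MIN_WORDS` threshold, the junk heuristics, and the three return values are illustrative assumptions. It does not attempt sarcasm or language detection, which need their own policies.

```python
import re

MIN_WORDS = 3  # hypothetical threshold; two-word inputs fall below it


def pre_check(text):
    """Decide how to route an input before calling the model.

    Returns one of: "no_signal", "low_confidence", "classify".
    """
    stripped = (text or "").strip()
    # Blank fields and inputs with no letters or digits carry no signal.
    if not stripped or not re.search(r"[^\W_]", stripped):
        return "no_signal"
    # One character repeated four or more times (e.g. "zzzzzz") is treated as junk.
    if len(stripped) >= 4 and len(set(stripped.lower())) == 1:
        return "no_signal"
    # Very short inputs are kept but flagged rather than labeled confidently.
    if len(stripped.split()) < MIN_WORDS:
        return "low_confidence"
    return "classify"


for sample in ["", "!!!???", "Too slow", "The app crashes every time I open it"]:
    print(repr(sample), "->", pre_check(sample))
```

Running this gate first means the "no signal" outcome is decided by a rule you can read, not by whatever the model invents when handed an empty string.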
Phase 7: Govern and Document
A sentiment system that infers people's emotional states carries obligations beyond accuracy.
Governance items
- Record what you infer and why. If a stakeholder or regulator asks, you need a clear purpose for inferring emotion.
- Keep the supporting quotes auditable. Grounded labels let you defend any individual decision after the fact.
- Note consent and data-source constraints. Inferring emotion from customers raises questions you should answer before launch, not during an incident.
- Assign an owner. A system without a named owner drifts, decays, and eventually misleads. Make maintenance someone's job.
The reasoning behind these governance items, and where the field is heading on them, sits in Granular Emotion and Honest Uncertainty Are Reshaping Tone Detection. For the deeper structural logic behind the whole list, see A Reusable Model for Reading Tone in Text at Scale.
How to Use This Checklist
A checklist only works if it changes behavior, so treat it as a gate rather than a reference you skim once and forget.
Working it into your process
- Run it in order. The phases build on each other; you cannot test a prompt whose labels you never defined.
- Check items off in writing. A mental pass through the list is how steps get silently skipped under deadline pressure.
- Record deliberate skips. If an item does not apply, note why. An undocumented skip is indistinguishable from an oversight three weeks later.
- Re-run it on major changes. A new model, a new data source, or a new label set re-opens earlier phases, especially definition and testing.
The biggest mistakes this list prevents are the quiet ones — the undefined label, the missing uncertainty path, the test set that never matched production. None of them announce themselves at launch. They surface later as a stakeholder who stopped trusting the output and cannot quite say why. Working the list honestly is how you keep that conversation from happening. The fastest route to a first pass through these phases is in Your Fastest Credible Path to a First Working Tone Classifier.
Frequently Asked Questions
Which checklist item matters most if I only have time for one?
Defining each label as observable behavior with a counter-example. It prevents the single most common failure — confusing negative vocabulary with negative emotion — and costs almost nothing to do.
How many examples do I really need to hand-label?
At least 100-200 examples that reflect your real distribution and deliberately include hard cases. Below that, your accuracy estimates are too noisy to trust, and you risk shipping a worse prompt that scored well by luck.
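To put a number on "too noisy", you can estimate the margin of error on a measured accuracy. The sketch below uses the normal approximation to the binomial, which is rough for small samples or accuracies near 0 or 1. The 85% accuracy figure is an arbitrary illustration, and `accuracy_margin` is a hypothetical helper.

```python
import math


def accuracy_margin(accuracy, n, z=1.96):
    """Approximate 95% margin of error for accuracy measured on n examples.

    Normal approximation to the binomial; treat the result as a rough guide.
    """
    return z * math.sqrt(accuracy * (1 - accuracy) / n)


for n in (50, 100, 200, 1000):
    print(n, round(accuracy_margin(0.85, n), 3))
```

At a measured 85% accuracy, the margin is roughly 10 points with 50 examples, 7 points with 100, and 5 points with 200. A prompt that scores three points higher on a 100-example set has not yet shown that it is better.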
Do I need both sentiment and emotion labels?
Only if a downstream decision uses both. Sentiment (positive/negative/neutral) is simpler and more reliable. Emotion detection is harder and should be added only when the extra granularity changes what someone does.
Why log the supporting quotes in production?
Quotes let you audit any label after the fact, debug systematic errors, and prove to skeptical stakeholders that decisions are grounded. Without them, every dispute becomes an unwinnable argument about a black box.
What is a good signal that I skipped the definition phase?
Your negative rate is much higher than manual review suggests, or reviews mentioning resolved problems get tagged negative. Both point to a model matching vocabulary because no one told it what the labels actually mean.
How often should I re-validate after launch?
Quarterly at minimum, plus immediately after any model upgrade. Products, slang, and customer expectations drift, and a test set that reflected last year's reviews can quietly stop representing today's.
Key Takeaways
- Scope the decision the labels feed before writing any prompt
- Define every label as observable behavior with at least one counter-example
- Allow multiple labels, intensity, and an explicit "uncertain" option
- Test against 100-200 hand-labeled examples and cluster the failures
- Route uncertain items to humans and log every input, output, and quote
- Set drift alarms and re-validate quarterly to prevent silent decay