A mid-sized direct-to-consumer brand came to us with a familiar complaint: their automated review-tagging system was technically running, technically producing labels, and quietly useless. The marketing team had stopped trusting it within two weeks of launch and reverted to spot-checking reviews by hand. They wanted to know whether the idea was flawed or just the execution.
This is the story of that engagement — what the system did wrong, how we diagnosed it, the single conceptual change that fixed most of the problem, and the numbers that finally earned the team's trust. Names and exact figures are generalized, but the arc is real and repeats across most sentiment projects we see.
The lesson, if you want it up front: their model was never reading emotion. It was reading vocabulary. Once we forced it to define the construct before classifying, everything downstream improved.
The Situation: A System Nobody Trusted
The brand received roughly 4,000 reviews a month across their store and three marketplaces. They wanted each review tagged with sentiment (positive, neutral, negative) and a primary emotion (delight, frustration, regret, indifference) to feed a weekly product-quality report.
Symptoms of the broken system
- Negative tags on glowing reviews that happened to mention a past problem
- Positive tags on lukewarm reviews because they contained the word "love" in passing
- A negative rate nearly double what manual review showed
Why it mattered
The weekly report drove decisions about which products to pull and which to promote. Bad labels meant bad decisions, so the team rationally abandoned the tool. A system that is wrong and confident is worse than no system at all.
The Decision: Diagnose Before Rebuilding
We resisted the urge to rewrite the prompt immediately. Instead, we hand-labeled 250 reviews, sampled to include the hard cases, to create ground truth, then ran the existing prompt against them to see exactly where it failed.
What the error analysis revealed
The errors clustered. The model was reliably correct on unambiguous reviews and reliably wrong on three categories: reviews mentioning a resolved problem, reviews with mixed emotion, and short reviews with strong individual words. Every failure traced back to the same root: the prompt asked "what is the sentiment?" without ever defining what sentiment meant for this brand's context.
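An error analysis of this kind can be sketched in a few lines. The snippet below is a minimal illustration, not the tooling used in the engagement: the category names, the tuple layout, and the sample rows are all hypothetical stand-ins for a hand-labeled set.

```python
from collections import defaultdict

# Hypothetical hand-labeled rows: (category, human_label, model_label).
# The rows are invented for illustration and are not the brand's data.
labeled = [
    ("unambiguous", "positive", "positive"),
    ("unambiguous", "negative", "negative"),
    ("resolved_problem", "positive", "negative"),
    ("resolved_problem", "positive", "negative"),
    ("mixed_emotion", "neutral", "negative"),
    ("short_strong_word", "neutral", "positive"),
    ("short_strong_word", "positive", "positive"),
]


def error_rate_by_category(rows):
    """Return {category: fraction of rows where the model disagrees with the human}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for category, human, model in rows:
        totals[category] += 1
        if human != model:
            errors[category] += 1
    return {category: errors[category] / totals[category] for category in totals}


rates = error_rate_by_category(labeled)

# Worst categories first, so error clusters are visible at a glance.
for category, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{category:20s} {rate:.0%}")
```

Grouping errors by category rather than reporting one overall number is what makes clusters like "resolved problem" stand out.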
The Execution: Naming the Construct
The redesign centered on one move — defining each label as observable behavior before asking for classification. The lessons here closely mirror Concrete Sentiment Prompts That Worked (and the Ones That Backfired).
The core changes
- Defined sentiment as the customer's overall stance toward the product now, explicitly ignoring resolved past issues
- Allowed up to two emotions per review, each with an intensity rating, which ended the errors caused by forcing a single label
- Added an "uncertain" route for genuinely mixed reviews
- Required a supporting quote for every emotion tag
We also introduced a brief reasoning step before the final label, which we kept hidden from the report but used during validation. The structure drew directly on A Reusable Model for Reading Tone in Text at Scale.
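To make these changes concrete, here is a rough sketch of how the definitions and the output checks might fit together. The prompt wording, the JSON shape, the 1 to 3 intensity scale, and the `validate` helper are all assumptions made for illustration; the engagement's actual prompt is not reproduced here.

```python
import json

SENTIMENTS = {"positive", "neutral", "negative", "uncertain"}
EMOTIONS = {"delight", "frustration", "regret", "indifference"}

# Illustrative wording only. The 1-3 intensity scale is an assumption.
PROMPT_TEMPLATE = """\
Definitions:
- Sentiment is the customer's overall stance toward the product NOW.
  Ignore past problems that the review says were resolved.
- Tag at most two emotions from: delight, frustration, regret, indifference.
  Give each an intensity from 1 (faint) to 3 (strong) and a verbatim quote.
- If the review is genuinely mixed, set sentiment to "uncertain".

Reason briefly first, then answer with JSON only:
{{"reasoning": "...", "sentiment": "...",
  "emotions": [{{"name": "...", "intensity": 1, "quote": "..."}}]}}

Review:
{review}
"""


def validate(raw_json, review):
    """Return a list of problems; an empty list means the output is usable."""
    try:
        out = json.loads(raw_json)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(out, dict):
        return ["not a JSON object"]

    problems = []
    if out.get("sentiment") not in SENTIMENTS:
        problems.append("unknown sentiment")

    emotions = out.get("emotions", [])
    if not isinstance(emotions, list):
        return problems + ["emotions is not a list"]
    if len(emotions) > 2:
        problems.append("more than two emotions")

    for emotion in emotions:
        if not isinstance(emotion, dict):
            problems.append("emotion is not an object")
            continue
        if emotion.get("name") not in EMOTIONS:
            problems.append(f"unknown emotion: {emotion.get('name')}")
        if emotion.get("intensity") not in (1, 2, 3):
            problems.append("intensity out of range")
        # The supporting quote must appear verbatim in the review text.
        quote = emotion.get("quote")
        if not isinstance(quote, str) or not quote or quote not in review:
            problems.append("quote not found in review")
    return problems


review = "The zipper broke in week one, but the replacement is perfect. Love it now."
good = json.dumps({
    "reasoning": "The past issue was resolved; the current stance is positive.",
    "sentiment": "positive",
    "emotions": [{"name": "delight", "intensity": 2, "quote": "Love it now"}],
})
print(validate(good, review))  # []
```

Checking that the quote appears verbatim in the review is a cheap guard against emotion tags the model cannot ground in the text. The `reasoning` field can be logged for validation and dropped before the label reaches the report.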
The Outcome: Numbers That Rebuilt Trust
Against the 250-review ground-truth set, agreement with human labels rose substantially, and the inflated negative rate fell back in line with manual review. More importantly, the marketing team ran their own blind spot-check and could not distinguish machine labels from human ones on the clear cases.
What changed operationally
- The weekly report became trusted enough to drive a product pull decision again
- Human review shrank to only the "uncertain" queue, cutting manual hours sharply
- The team adopted the labeled set as a permanent regression test for future prompt changes
To put a number on benefits like these, the approach in Quantifying the Payoff of Automated Tone Tagging maps directly to this engagement.
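The regression test mentioned in the list above could be as simple as the following sketch. The function names, the toy labels, and the baseline value are hypothetical; in practice the baseline would be whatever agreement rate the currently shipped prompt achieves on the labeled set.

```python
def agreement_rate(human_labels, model_labels):
    """Fraction of reviews where the model label matches the human label."""
    if not human_labels or len(human_labels) != len(model_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == m for h, m in zip(human_labels, model_labels))
    return matches / len(human_labels)


def passes_regression(human_labels, candidate_labels, baseline_rate):
    """A candidate prompt ships only if it maintains or beats the baseline."""
    return agreement_rate(human_labels, candidate_labels) >= baseline_rate


# Toy data for illustration; a real check would run over the full labeled set.
human = ["positive", "negative", "neutral", "positive"]
candidate = ["positive", "negative", "neutral", "negative"]
worse = ["positive", "positive", "positive", "negative"]

BASELINE = 0.75  # hypothetical agreement rate of the current prompt

print(passes_regression(human, candidate, BASELINE))  # True
print(passes_regression(human, worse, BASELINE))      # False
```

A gate like this is deliberately blunt: it does not say why a prompt change hurt, only that it should not ship until someone looks.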
The Rollout: Easing the Team Back In
Rebuilding the prompt was only half the work. The harder half was earning back a team that had been burned once and had every reason to stay skeptical.
Running the system in shadow mode first
Rather than flip the new prompt straight into the live report, we ran it silently alongside the team's manual tagging for two weeks. Every day they tagged reviews by hand as usual, and the system tagged the same reviews in the background. Nobody acted on the machine output yet.
What shadow mode revealed
The parallel run did two things. It built a fresh, real-world accuracy record the team could see accumulating in their own data, and it surfaced a handful of edge cases the initial 250-review set had missed, namely reviews of a new product line with its own vocabulary. We folded those into the evaluation set before going live, so the launch reflected current reality rather than a snapshot from week one.
- Run the new system in parallel before anyone depends on it
- Compare machine and human labels daily, not just at the end
- Use the parallel period to catch distribution gaps in your test set
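A daily comparison like the one described above might look like this. It is a minimal sketch: the log format, the day and review identifiers, and the labels are made up for illustration.

```python
from collections import defaultdict

# Hypothetical shadow-mode log: (day, review_id, human_label, machine_label).
shadow_log = [
    ("day-01", "r1", "positive", "positive"),
    ("day-01", "r2", "negative", "negative"),
    ("day-01", "r3", "neutral", "positive"),
    ("day-02", "r4", "positive", "positive"),
    ("day-02", "r5", "negative", "negative"),
]


def daily_agreement(log):
    """Return per-day agreement rates and the list of disagreements to review."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    disagreements = []
    for day, review_id, human, machine in log:
        totals[day] += 1
        if human == machine:
            matches[day] += 1
        else:
            disagreements.append((day, review_id, human, machine))
    rates = {day: matches[day] / totals[day] for day in totals}
    return rates, disagreements


rates, disagreements = daily_agreement(shadow_log)
for day in sorted(rates):
    print(f"{day}: {rates[day]:.0%} agreement")

# Disagreements are candidates for the evaluation set, not automatic model errors.
for row in disagreements:
    print("review:", row)
```

Reading the disagreement list each day is where gaps such as an unfamiliar product vocabulary tend to show up first.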
What Almost Went Wrong
The engagement was not frictionless, and the near-misses are as instructive as the wins.
The temptation to chase every edge case
After the redesign worked, the team wanted to keep adding rules for every odd review they spotted. Each rule helped one case and risked confusing five others. We held the line: anything genuinely ambiguous went to the "uncertain" queue rather than spawning a new rule. Restraint kept the prompt legible and stable.
The metric that could have misled
Early on, overall accuracy looked great because most reviews are positive and the model nailed those. Per-class recall on negatives — the reviews that actually drive product decisions — told a more sobering story until we tightened the negative definition. Watching the right metric, not the flattering one, kept the project honest, a discipline detailed in Reading the Signal: Scoring Sentiment Systems You Can Trust.
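The gap between the two metrics is easy to reproduce on toy data. The labels below are invented to show the effect and are not the engagement's numbers: with mostly positive reviews, a model can miss most negatives and still post a high overall accuracy.

```python
def accuracy(truth, pred):
    """Fraction of all reviews labeled correctly."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def recall(truth, pred, label):
    """Of the reviews that truly carry `label`, the fraction the model caught."""
    relevant = [p for t, p in zip(truth, pred) if t == label]
    if not relevant:
        return None  # recall is undefined when the class never appears
    return sum(p == label for p in relevant) / len(relevant)


# Illustrative imbalance: 16 positive reviews and 4 negative ones.
truth = ["positive"] * 16 + ["negative"] * 4
# The model gets every positive right but catches only 1 of the 4 negatives.
pred = ["positive"] * 16 + ["negative"] * 1 + ["positive"] * 3

print(f"accuracy:        {accuracy(truth, pred):.2f}")
print(f"negative recall: {recall(truth, pred, 'negative'):.2f}")
```

In this toy case accuracy comes out at 0.85 while recall on negatives is 0.25, which is the pattern worth checking for before trusting a headline number.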
Lessons Worth Stealing
The most expensive mistake was launching without a labeled evaluation set, which meant nobody could see why the system was failing — only that it was. The cheapest, highest-leverage fix was definitional, not technical. We did not change models, add training data, or build a pipeline. We told the model what the words meant.
Portable takeaways
- Build ground truth before you build the prompt
- Error analysis tells you what to fix; vibes do not
- Most "model" problems are actually definition problems
- Trust is rebuilt with a blind test, not a demo
Frequently Asked Questions
How many reviews did you need to label to find the problem?
About 250, sampled to include the hard cases — resolved complaints, mixed emotion, and short strong-word reviews. That was enough to make the error clusters obvious. You do not need thousands; you need a sample that reflects where the system actually struggles.
Was the original system using a bad model?
No. The model was fine. The prompt never defined what sentiment meant for this brand, so the model defaulted to surface vocabulary matching. Swapping models would not have fixed a definitional gap.
Why allow an "uncertain" label instead of forcing a decision?
Because forcing a label on genuinely ambiguous reviews creates confident errors that poison downstream reports. Routing uncertain items to a small human queue preserved accuracy on everything else and kept the manual workload tiny.
How did you convince the team to trust the rebuilt system?
A blind test. We mixed machine and human labels on clear-case reviews and asked the team to tell them apart. They could not. That result did more than any accuracy chart, because it spoke in their terms.
What stopped the same problem from coming back?
The labeled evaluation set became a permanent regression test. Any future prompt change had to maintain or beat the established agreement rate before shipping, which prevented silent quality drift.
Could this have been solved with fine-tuning instead?
Possibly, but it would have been slower, costlier, and harder to adjust. The definitional fix took days, cost almost nothing, and remained editable in plain language. Reach for fine-tuning only after careful prompting plateaus.
Key Takeaways
- The system was reading vocabulary, not emotion, until the construct was defined
- A hand-labeled evaluation set was the tool that revealed the real failure
- The highest-leverage fix was definitional, not technical or model-based
- An "uncertain" route preserved accuracy and shrank human review
- A blind test, not a demo, rebuilt the team's trust
- The labeled set became a permanent regression guard against drift