Zero-shot classification looks forgiving. You describe some categories, the model sorts text, and the first few outputs look right. That early success hides the problem: most zero-shot classifiers fail quietly. They do not crash or throw errors. They just misfile a steady fraction of inputs in ways nobody notices until a downstream report is wrong or a customer gets routed to the wrong team.
This article names eight failure modes that come up again and again. For each, it explains why the failure happens, what it actually costs, and the specific corrective practice that removes it. These are not exotic edge cases; they are the ordinary mistakes that separate a classifier you can trust from one that silently degrades your data.
Read these as a checklist. Most struggling classifiers are failing in two or three of these ways at once, and fixing them is usually a matter of tightening definitions and constraints rather than anything elaborate.
Mistake 1: Overlapping Category Definitions
The most common failure. Two categories mean nearly the same thing, so the model has to guess between them.
Why It Happens
People list category names without defining boundaries. "Complaint" and "negative feedback," or "question" and "request," blur together. The model splits inconsistently across them, and the same input might land in either depending on phrasing.
The Cost and the Fix
Classifications become unstable and irreproducible: each one looks fine on its own, but the results scatter in aggregate. Fix it by defining each category so that it explicitly excludes the others, and merge categories that genuinely overlap. Distinct boundaries are the foundation laid out in the step-by-step procedure for sorting text by description.
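One way to make boundaries explicit is to store an "includes" and a "does not include" statement for every category and build the prompt from both. The sketch below is only an illustration: the category names, the wording of the definitions, and the `build_prompt` helper are all hypothetical, and the resulting string would still need to be sent to whatever model you use.

```python
# Hypothetical category definitions; names and wording are illustrative only.
CATEGORIES = {
    "complaint": {
        "include": "The writer reports that something went wrong and expresses dissatisfaction.",
        "exclude": "Neutral requests for information (use 'question').",
    },
    "question": {
        "include": "The writer asks for information and does not report a problem.",
        "exclude": "Messages that describe a failure or express dissatisfaction (use 'complaint').",
    },
}


def build_prompt(text: str, categories: dict) -> str:
    """Assemble a classification prompt that states each boundary explicitly."""
    lines = ["Classify the message into exactly one category.", ""]
    for name, spec in categories.items():
        lines.append(f"{name}: {spec['include']}")
        lines.append(f"  Does NOT include: {spec['exclude']}")
    lines += ["", f"Message: {text}", "Category:"]
    return "\n".join(lines)
```

Writing the exclusion next to each definition also makes overlaps easier to spot: if you cannot state what a category excludes, it probably needs to be merged with a neighbor.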
Mistake 2: No "Other" Category
Without an escape hatch, the model forces every input into some label, even when none fits.
Why It Happens
People list only the categories they expect and forget that real inputs include junk, off-topic text, and edge cases. The model, told to pick a category, picks the least-bad one rather than admitting nothing fits.
The Cost and the Fix
Misfiled outliers inflate the error rate invisibly, because the wrong answers look like normal answers. Add an explicit "other" or "none" label and watch its size — a large "other" bucket tells you your categories are incomplete, which is useful information rather than a failure.
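Watching the size of the "other" bucket can be as simple as computing its share of each batch of labels. The sketch below assumes the classifier's outputs are already available as a list of strings. The helper names and the 15% review threshold are arbitrary illustrations, not recommended values.

```python
from collections import Counter


def other_share(labels, other_label="other"):
    """Return the fraction of labels that fell into the escape-hatch category."""
    if not labels:
        return 0.0
    return Counter(labels)[other_label] / len(labels)


def needs_review(labels, threshold=0.15, other_label="other"):
    """Flag a batch whose 'other' share exceeds an (assumed) review threshold."""
    return other_share(labels, other_label) > threshold
```

For example, `other_share(["billing", "other", "technical", "other"])` returns `0.5`, which would be a strong hint that the category list is missing something.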
Mistake 3: Unconstrained Output
Letting the model respond freely produces text you cannot reliably parse.
Why It Happens
The prompt asks "which category does this belong to?" without specifying the answer format. The model replies with explanations, hedges, or rephrasings — "This appears to be a billing question, though it could also be technical."
The Cost and the Fix
Parsing breaks, automation fails, and a parser that grabs the first label it finds turns a hedged reply into what looks like a clean answer. Instruct the model to respond with only the exact label, or use structured output such as JSON. Constrained output is the difference between a usable classifier and a text blob, as stressed in the end-to-end walkthrough of classifying with no labeled data.
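Even with a strict instruction, it helps to validate the reply on your side and reject anything that is not exactly one known label. The following is a minimal sketch under that assumption. The label set and the `parse_label` helper are hypothetical, and a rejected reply would need its own handling, such as a retry or a manual review queue.

```python
ALLOWED = {"billing", "technical", "other"}  # hypothetical label set


def parse_label(raw: str, allowed=ALLOWED):
    """Return the label if the reply is exactly one allowed label, else None."""
    # Tolerate surrounding whitespace, quotes, and a trailing period only.
    candidate = raw.strip().strip("\"'.").lower()
    return candidate if candidate in allowed else None
```

With this check, a reply of `" Billing.\n"` is accepted as `"billing"`, while the hedged sentence quoted above is rejected with `None` instead of being silently mapped to whichever label appears first.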
Mistake 4: Too Many Categories at Once
Long label lists degrade accuracy because the model cannot hold all the boundaries distinct.
Why It Happens
Teams try to capture every nuance with a flat list of fifteen or twenty categories. The boundaries blur under their own weight, and definitions start to overlap simply because there are so many.
The Cost and the Fix
Accuracy drops across the board, especially for adjacent categories. Reduce the list, group categories hierarchically, or classify in stages — first into broad buckets, then into sub-categories within each. A small, sharp set beats a sprawling one.
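Staged classification can be sketched as two calls to the same classifier with different, shorter label lists. In the example below, `classify` stands in for whatever function sends a text and a label list to your model and returns one label. That callable, the taxonomy, and the helper name are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed interface: takes the text and the candidate labels, returns one label.
Classifier = Callable[[str, List[str]], str]

# Hypothetical two-level taxonomy.
TAXONOMY: Dict[str, List[str]] = {
    "billing": ["refund", "invoice", "payment_failure"],
    "technical": ["login", "performance", "bug_report"],
    "other": [],
}


def classify_in_stages(
    text: str, classify: Classifier, taxonomy: Dict[str, List[str]]
) -> Tuple[str, Optional[str]]:
    """Pick a broad bucket first, then a sub-category within that bucket only."""
    broad = classify(text, list(taxonomy))
    subcategories = taxonomy.get(broad, [])
    if not subcategories:
        # Either the bucket has no sub-categories or the label was unexpected.
        return broad, None
    return broad, classify(text, subcategories)
```

Each call sees at most three labels here instead of seven, so the model only has to keep a few boundaries distinct at a time.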
Mistake 5: Never Measuring Accuracy
Trusting the classifier because the first outputs looked right.
Why It Happens
The early examples seem correct, so people assume the whole thing works and skip building a validation set. There is no feedback telling them about the quiet misfiles happening at scale.
The Cost and the Fix
You ship a classifier with an unknown error rate and find out about problems from downstream damage. Build a hand-labeled validation set of a few hundred inputs, measure per-category accuracy, and inspect the confusions. Measuring before trusting is the spine of What Reliable Zero-Shot Classifiers Have in Common.
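Per-category accuracy and a confusion count need nothing more than the hand-assigned labels and the classifier's labels for the same inputs, in the same order. The function names below are hypothetical, and the sketch assumes both lists have equal length.

```python
from collections import Counter


def per_category_accuracy(gold, predicted):
    """Map each hand-assigned label to the fraction of its items predicted correctly."""
    totals = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: correct[label] / totals[label] for label in totals}


def top_confusions(gold, predicted, n=5):
    """Return the most frequent (gold label, predicted label) mistakes."""
    pairs = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return pairs.most_common(n)
```

A single overall accuracy figure can hide a category that is almost always wrong, so the per-category breakdown and the confusion pairs are the parts worth reading first.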
Mistake 6: Ignoring Output Variability
Assuming the same input always produces the same label.
Why It Happens
People run the classifier with default sampling settings, such as temperature, which typically introduce random variation. The same input can get different labels on different runs, especially on borderline cases.
The Cost and the Fix
Reports become irreproducible and the same item gets sorted differently over time. Use low-randomness settings for classification tasks (for example, a sampling temperature at or near zero, where the model exposes one) and pin the model and prompt versions so results stay stable. Determinism is something you configure, not something you assume.
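A quick way to see how much variability you actually have is to classify the same input several times and measure agreement. In this sketch, `classify` is an assumed callable that wraps your model call and returns one label. The helper name and the default of five runs are illustrative.

```python
from collections import Counter


def label_stability(text, classify, runs=5):
    """Classify the same text repeatedly; return the top label and its share of runs."""
    labels = Counter(classify(text) for _ in range(runs))
    top_label, top_count = labels.most_common(1)[0]
    return top_label, top_count / runs
```

An agreement share below 1.0 on a given input suggests either that sampling settings are still too loose or that the input sits on a blurry category boundary.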
Mistake 7: Forcing Single Labels on Multi-Category Text
Demanding exactly one label for text that genuinely belongs to several.
Why It Happens
The prompt insists on one category, but a real message might be both a billing question and a complaint. The model picks one and silently drops the other.
The Cost and the Fix
You lose real information and route or count the item incorrectly. Decide up front whether your task is single-label or multi-label. If inputs can belong to several categories, design and instruct for multiple labels rather than forcing a false choice. This decision point is flagged early in the from-scratch introduction to zero-shot classification.
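If the task is multi-label, one option is to ask the model for a JSON array of labels and validate that array before using it. The sketch below assumes that output format. The label set and the `parse_labels` helper are hypothetical, and replies that fail validation return `None` so they can be retried or reviewed.

```python
import json

ALLOWED = {"billing", "complaint", "technical", "other"}  # hypothetical label set


def parse_labels(raw: str, allowed=ALLOWED):
    """Parse a JSON array of labels; return None unless every entry is allowed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list) or not data:
        return None
    if not all(isinstance(item, str) for item in data):
        return None
    labels = []
    for item in data:
        if item not in allowed:
            return None
        if item not in labels:  # drop duplicates, keep the original order
            labels.append(item)
    return labels
```

A message that is both a billing question and a complaint can then come back as `["billing", "complaint"]` and be routed or counted under both.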
Mistake 8: Treating the Classifier as Set-and-Forget
Assuming that a classifier accurate at launch stays accurate forever.
Why It Happens
People test a classifier, see good numbers, deploy it, and move on. There is no scheduled re-check, so nobody notices when the inputs gradually change and accuracy erodes underneath. The classifier keeps producing labels with the same confidence even as more of them become wrong.
The Cost and the Fix
You discover the degradation only when downstream damage forces an investigation, by which point months of data may be misfiled. Schedule periodic re-measurement against a fresh hand-labeled sample, and watch the "other" bucket for signs that new kinds of input are arriving. A classifier is a living system, not a one-time build — the same maintenance mindset stressed in the step-by-step procedure for sorting text by description.
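Periodic re-measurement can be reduced to comparing accuracy on a fresh hand-labeled sample against the figure recorded at launch. The sketch below assumes you stored that baseline. The function names and the five-point alert threshold are illustrative choices, not recommendations.

```python
def accuracy(gold, predicted):
    """Fraction of items where the predicted label matches the hand-assigned one."""
    if not gold:
        raise ValueError("validation sample is empty")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def check_drift(baseline_accuracy, gold, predicted, max_drop=0.05):
    """Compare accuracy on a fresh sample with the accuracy recorded at launch."""
    current = accuracy(gold, predicted)
    drop = baseline_accuracy - current
    return {
        "current_accuracy": current,
        "drop": drop,
        "alert": drop > max_drop,  # max_drop is an assumed tolerance
    }
```

Running this on a schedule, with a newly labeled sample each time, turns gradual erosion into a visible alert instead of a surprise months later.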
How These Mistakes Compound
The failure modes above rarely appear alone. Overlapping categories and a missing "other" bucket together guarantee a stream of confident misfiles. Add unconstrained output and no measurement, and you have a classifier that looks like it works, cannot be parsed reliably, and is never checked against truth. The damage multiplies because each mistake hides the others: without measurement you never see the misfiles, and without an "other" bucket the misfiles look like ordinary answers. Fixing them as a set — sharp definitions, an escape hatch, constrained output, and regular measurement — is what turns a fragile demo into something you can trust on real volume.
Frequently Asked Questions
Which of these mistakes is the most damaging?
Skipping measurement, because it hides all the others. A classifier with overlapping categories and no "other" bucket can run for months looking fine if nobody ever checks its accuracy against ground truth. Measurement is what surfaces every other problem on this list.
How do I know if my categories overlap?
Take a handful of real inputs and try to classify them yourself using only your written definitions. If you hesitate between two categories, or could justify either, those categories overlap. If the person who wrote the definitions is unsure, the model has little chance of being consistent.
Is a large "other" bucket always a problem?
Not necessarily — it can mean your categories simply do not cover a chunk of real input, which is information. If "other" is large and the items in it share a theme, that theme is probably a missing category. Inspect the bucket rather than ignoring it.
Can I fix unstable output without changing the prompt?
Partly, by lowering randomness settings, which makes classification more deterministic. But instability often also comes from genuinely ambiguous inputs hitting blurry category boundaries, which only sharper definitions fix. Address both the settings and the definitions.
Key Takeaways
- Most zero-shot classifiers fail quietly by misfiling a steady fraction of inputs, not by crashing
- Overlapping definitions and a missing "other" category are two of the most common mistakes, and together they produce a stream of confident misfiles
- Unconstrained output and high randomness make results unparseable and irreproducible; fix both deliberately
- Too many flat categories degrade accuracy; group hierarchically or classify in stages instead
- Never trust a classifier you have not measured against a hand-labeled validation set with per-category accuracy