Choosing between zero-shot and few-shot prompting looks like a small decision. It is not. The choice silently sets your token bill, your latency floor, your bias exposure, and how reliably the model handles inputs you never anticipated. Most teams get it wrong in the same predictable ways, and the failures rarely announce themselves — accuracy looks fine on the five examples someone tested by hand, then drifts in production.
This article names seven concrete failure modes we see repeatedly, why each happens, what it costs, and the corrective practice. None of these are exotic. They are the everyday mistakes that separate a prompt that demos well from one that holds up under load.
Mistake 1: Adding Few-Shot Examples When the Task Was Already Zero-Shot Solvable
The reflex to paste in three examples is strong, and it is often wasted effort. Modern instruction-tuned models handle classification, extraction, and reformatting tasks zero-shot when the instruction is specific. People add examples anyway, out of habit.
The cost is real: every example you prepend is tokens you pay for on every single call, plus added latency. On a high-volume endpoint, three 200-token examples add 600 tokens to every request. If your instruction and input together run about 600 tokens, that doubles your prompt cost for zero accuracy gain.
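To make the arithmetic concrete, here is a rough sketch. The base prompt size, request volume, and price are made-up numbers for illustration, not figures from any provider; plug in your own.

```python
def monthly_prompt_cost(prompt_tokens: int, calls_per_month: int,
                        usd_per_million_tokens: float) -> float:
    """Input-token cost only; output tokens are ignored in this sketch."""
    return prompt_tokens * calls_per_month * usd_per_million_tokens / 1_000_000

# Hypothetical numbers, for illustration only.
BASE_PROMPT_TOKENS = 600       # instruction + average user input (assumed)
EXAMPLE_TOKENS = 3 * 200       # three 200-token examples
CALLS_PER_MONTH = 10_000_000   # assumed volume
PRICE = 1.00                   # assumed USD per million input tokens

zero_shot = monthly_prompt_cost(BASE_PROMPT_TOKENS, CALLS_PER_MONTH, PRICE)
few_shot = monthly_prompt_cost(BASE_PROMPT_TOKENS + EXAMPLE_TOKENS,
                               CALLS_PER_MONTH, PRICE)
print(f"zero-shot: ${zero_shot:,.0f}/month, few-shot: ${few_shot:,.0f}/month")
```

Under these assumptions the examples alone account for half the input bill, which is why they need to show up as measured accuracy.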
The fix: Always benchmark zero-shot first with a sharp instruction. Only add examples if a measured eval shows zero-shot failing. If you want the structured discipline behind this, see A Step-by-Step Approach to Zero Shot vs Few Shot Learning.
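One way to sketch the "zero-shot first" rule in code is below. The two predict functions are toy stand-ins for real model calls, and `min_gain` is a hypothetical threshold you would set to match your own accuracy bar.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, expected label)

def accuracy(predict: Callable[[str], str], eval_set: Sequence[Example]) -> float:
    """Fraction of eval inputs where the prediction matches the label."""
    correct = sum(1 for text, label in eval_set if predict(text) == label)
    return correct / len(eval_set)

def examples_earn_their_cost(zero_shot_acc: float, few_shot_acc: float,
                             min_gain: float = 0.02) -> bool:
    """Keep examples only if they beat zero-shot by at least min_gain (assumed bar)."""
    return few_shot_acc - zero_shot_acc >= min_gain

# Toy stand-ins: replace each with a call to your model using the matching prompt.
def zero_shot_predict(text: str) -> str:
    return "positive" if "good" in text else "negative"

def few_shot_predict(text: str) -> str:
    return zero_shot_predict(text)  # pretend the examples changed nothing

eval_set = [
    ("good value", "positive"),
    ("broke in a week", "negative"),
    ("really good", "positive"),
    ("not worth it", "negative"),
]

zero_acc = accuracy(zero_shot_predict, eval_set)
few_acc = accuracy(few_shot_predict, eval_set)
keep_examples = examples_earn_their_cost(zero_acc, few_acc)
print(zero_acc, few_acc, keep_examples)
```

A real eval set would be far larger than four inputs; the point is only that the decision to add examples comes out of a comparison, not a habit.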
Mistake 2: Letting Example Order Bias the Output
Few-shot prompts are sensitive to the order of their examples, and the most recent examples tend to weigh the most. If your last two examples are both labeled "positive," the model leans positive on ambiguous inputs. This is recency bias, and it is invisible until you audit the distribution of outputs.
The cost shows up as skewed predictions — a sentiment classifier that over-reports the majority label, or an extractor that copies the format of whichever example came last.
The fix: Shuffle example order across calls when feasible, balance labels within the example set, and never sort examples by class. Test with the same input under different example orderings; if the answer changes, your prompt is unstable.
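The ordering test can be sketched as follows. The prompt layout and the `recency_biased_model` stub are assumptions made for illustration; the stub simply echoes the label of the last example so that the instability is visible without calling a real model.

```python
import itertools
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, label)

def build_prompt(instruction: str, examples: Sequence[Example], query: str) -> str:
    """Assumed prompt layout: instruction, labeled examples, then the query."""
    shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nLabel:"

def answers_across_orderings(call_model: Callable[[str], str], instruction: str,
                             examples: Sequence[Example], query: str,
                             max_orderings: int = 24) -> set[str]:
    """Run the same query under different example orderings and collect the answers."""
    orderings = itertools.islice(itertools.permutations(examples), max_orderings)
    return {call_model(build_prompt(instruction, order, query)) for order in orderings}

def recency_biased_model(prompt: str) -> str:
    """Toy stand-in for a model call: copies the label of the last example."""
    labels = [line.split(": ", 1)[1] for line in prompt.splitlines()
              if line.startswith("Label: ")]
    return labels[-1]

examples = [("great", "positive"), ("awful", "negative"), ("fine", "positive")]
answers = answers_across_orderings(
    recency_biased_model, "Classify the sentiment.", examples, "it was okay")
is_stable = len(answers) == 1
print(answers, is_stable)
```

More than one distinct answer for the same input means the prompt is order-sensitive. With a real model, run the check over many ambiguous inputs rather than one.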
Mistake 3: Choosing Unrepresentative Examples
Teams grab the cleanest, easiest examples for their few-shot set because those are the ones they understand. The model then learns the easy distribution and fails on the messy real inputs — typos, mixed languages, edge formats.
Why it happens: Curating examples is manual, and humans gravitate to tidy cases. The few-shot set ends up being a flattering mirror, not a representative sample.
The fix: Pull examples from real production data, deliberately including hard and ambiguous cases. Two well-chosen hard examples beat six pristine ones.
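A minimal sketch of that selection step, assuming you have production records that a reviewer has already labeled and flagged as hard. The `Record` type and its `hard` field are hypothetical; adapt them to however your data marks difficult cases.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    text: str
    label: str
    hard: bool  # hypothetical flag set during manual review

def pick_examples(records: list[Record], per_label: int = 1,
                  seed: int = 0) -> list[Record]:
    """For each label, prefer hard cases, then fill with random others."""
    rng = random.Random(seed)
    by_label: dict[str, list[Record]] = {}
    for record in records:
        by_label.setdefault(record.label, []).append(record)

    chosen: list[Record] = []
    for label in sorted(by_label):
        pool = by_label[label][:]
        rng.shuffle(pool)                    # randomize within the label
        pool.sort(key=lambda r: not r.hard)  # stable sort puts hard cases first
        chosen.extend(pool[:per_label])
    rng.shuffle(chosen)                      # avoid grouping examples by class
    return chosen

records = [
    Record("Great product, works as described.", "positive", hard=False),
    Record("gr8 prodct, tho shipping was meh", "positive", hard=True),
    Record("Terrible. Do not buy.", "negative", hard=False),
    Record("no me gusta, returned it next day", "negative", hard=True),
]
chosen = pick_examples(records, per_label=1)
print([r.text for r in chosen])
```

The result is balanced across labels and biased toward the messy inputs, which is the opposite of what hand-picking tends to produce.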
Mistake 4: Using Few-Shot to Paper Over a Bad Instruction
Few-shot examples can rescue a vague instruction, which is exactly why they are dangerous. The examples carry the actual specification, so the written instruction stays sloppy. When you later change models or trim examples to save tokens, the implicit rules vanish and behavior collapses.
The fix: Write the instruction as if you had zero examples. The examples should reinforce a clear spec, not substitute for one. Our best practices guide covers how to make instructions carry their own weight.
Mistake 5: Ignoring the Token and Latency Budget
A five-shot prompt with long examples can run 2,000+ tokens before the user's input arrives. At scale that is a meaningful cost line and a latency penalty on every request. Teams optimize the model choice and never look at prompt length.
The cost is twofold: dollars per million tokens, and a slower time-to-first-token that users feel.
The fix: Measure prompt token count explicitly. If few-shot adds accuracy, find the minimum number of examples that captures the gain — often two, rarely more than five. Treat each added example as a cost that must justify itself.
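Finding the minimum number of examples can be reduced to a small search once you have eval scores for each candidate count. The accuracy figures below are invented for illustration, and `tolerance` is a hypothetical knob for how much accuracy you are willing to trade for a shorter prompt.

```python
def smallest_sufficient_k(accuracy_by_k: dict[int, float],
                          tolerance: float = 0.02) -> int:
    """Smallest example count whose accuracy is within tolerance of the best."""
    best = max(accuracy_by_k.values())
    return min(k for k, acc in accuracy_by_k.items() if best - acc <= tolerance)

# Hypothetical eval results: number of examples -> accuracy on a labeled set.
measured = {0: 0.81, 1: 0.86, 2: 0.90, 3: 0.90, 5: 0.91}

k = smallest_sufficient_k(measured)
print(f"use {k} examples")
```

Pair the chosen count with an actual token count from your provider's tokenizer, since the per-example cost depends on how long each example is.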
Mistake 6: Confusing Few-Shot Prompting with Fine-Tuning
Few-shot prompting teaches nothing permanent. Every call re-sends the examples; the model does not retain them. Teams sometimes treat a few-shot prompt as if it has "learned" the task, then are surprised when behavior is inconsistent or when a new edge case ignores the pattern.
The fix: Understand the boundary. If you need durable, high-volume, consistent behavior on a narrow task, fine-tuning amortizes better than ever-growing prompts. If you need flexibility and low setup cost, prompting wins. The trade-offs guide maps this decision cleanly.
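A back-of-the-envelope break-even calculation shows where that boundary sits. This sketch assumes the fine-tuned model is billed at the same per-token rate as the base model, which is often not true, and all dollar figures are hypothetical.

```python
import math

def break_even_calls(fine_tune_cost_usd: float, example_tokens_per_call: int,
                     usd_per_million_tokens: float) -> int:
    """Calls needed before dropping the examples pays back the fine-tuning cost.

    Assumes the same per-token price before and after fine-tuning (a simplification).
    """
    saved_per_call = example_tokens_per_call * usd_per_million_tokens / 1_000_000
    return math.ceil(fine_tune_cost_usd / saved_per_call)

# Hypothetical numbers, for illustration only.
calls = break_even_calls(fine_tune_cost_usd=500.0,
                         example_tokens_per_call=1_500,
                         usd_per_million_tokens=1.00)
print(f"fine-tuning pays for itself after about {calls:,} calls")
```

If your monthly volume is far below the break-even point, or the task changes often enough that you would retrain repeatedly, prompting stays the cheaper option.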
Mistake 7: Never Re-Testing After a Model Upgrade
A prompt tuned with five examples on last year's model may be over-engineered for this year's. Newer models often solve zero-shot what older ones needed few-shot for. Teams carry forward stale prompts and keep paying the example tax indefinitely.
The fix: Re-run your zero-shot baseline every time you change models. Strip examples and see if accuracy holds. You will frequently find you can delete half your prompt.
The Meta-Mistake: Testing on Five Inputs by Hand
Underneath all seven mistakes sits a single root cause: deciding whether a prompt works by eyeballing a handful of inputs instead of measuring against a labeled set. Hand-testing five inputs feels like diligence, but it systematically misses the failure modes above. Order bias does not show up on five clean examples. An unrepresentative example set looks perfect when you test it on the same kind of clean inputs you drew the examples from. Token bloat is invisible if you never count tokens.
The corrective practice is non-negotiable: build a labeled eval set of real inputs — including the messy and ambiguous ones — and score every prompt change against it. Even fifty inputs turn "this looks good" into a number you can defend. Every mistake in this article is cheap to avoid once you measure and nearly impossible to catch when you do not. If you fix only one thing, fix this, because it is the mistake that hides all the others.
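To see what a fifty-input eval actually buys you, it helps to put an interval around the score. The sketch below uses the Wilson score interval, a standard choice for proportions on small samples; the 45-of-50 result is a made-up example.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

# Hypothetical result: 45 correct out of 50 labeled inputs.
low, high = wilson_interval(45, 50)
print(f"accuracy 0.90, interval {low:.2f} to {high:.2f}")
```

The interval is wide at this sample size, so small differences between two prompts may be noise. It is still a number you can compare across prompt changes, and it tightens as the eval set grows.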
How These Mistakes Compound
These failures rarely appear alone; they reinforce each other. A team adds examples reflexively (Mistake 1), curates only clean ones (Mistake 3), and leans on them instead of writing a clear instruction (Mistake 4). The result is a long, expensive prompt (Mistake 5) with order bias (Mistake 2) that nobody re-tests after upgrades (Mistake 7) because they have confused it with permanent learning (Mistake 6). Each mistake makes the next more likely and the whole prompt more fragile.
Breaking the chain at the top — measuring before adding examples — prevents most of the downstream failures automatically. That is why the corrective practices all trace back to the same discipline: treat the zero-shot-versus-few-shot choice as a measured engineering decision, not a reflex.
Frequently Asked Questions
Is few-shot always more accurate than zero-shot?
No. On tasks that modern instruction-tuned models handle natively, examples add cost without improving accuracy. Few-shot helps most on tasks with implicit formatting rules, niche domains, or unusual output structures that are hard to describe in words.
How many examples should a few-shot prompt have?
Start with two and add only if an eval shows improvement. Most of the gain appears within the first two or three examples; beyond five you usually pay tokens for diminishing returns.
Can example order really change the answer?
Yes, measurably. Models exhibit recency and majority-label bias from the example set. If reordering your examples changes outputs on the same input, your prompt is unstable and needs balancing.
Should I use few-shot or fine-tuning for a high-volume task?
For narrow, stable, high-volume tasks, fine-tuning often costs less per call than re-sending long example prompts and gives more consistent behavior. For evolving or low-volume tasks, prompting is faster to ship and cheaper to maintain.
How do I know if my zero-shot prompt is good enough?
Run it against a labeled eval set of real inputs, including hard cases. If accuracy meets your bar, do not add examples. See Real-World Examples and Use Cases for benchmarks worth modeling.
Key Takeaways
- Benchmark zero-shot first; only add examples when a measured eval proves they help.
- Example order and label balance bias outputs — shuffle and balance deliberately.
- Pull few-shot examples from real, messy production data, not pristine cases.
- Keep instructions strong enough to stand without examples.
- Count your prompt tokens; every example must earn its cost in accuracy.
- Re-test the zero-shot baseline after every model upgrade — you can often delete examples.