The clearest way to understand when to use zero-shot versus few-shot prompting is to look at concrete tasks and ask what actually made each approach win or fail. The principle behind every case is the same: examples help when the task carries implicit rules that are hard to state in words, and they waste tokens when a sharp instruction already covers the job.
Below are eight real scenarios across common workloads. For each, we describe the task, which approach won, and the specific reason. The goal is pattern recognition you can transfer to your own work.
Use Case 1: Sentiment Classification — Zero-Shot Wins
Classifying customer reviews as positive, negative, or neutral is the task most often used to demonstrate few-shot prompting, and it is usually the wrong place to spend examples. Modern instruction-tuned models handle three-way sentiment zero-shot with a single clear instruction.
Why zero-shot won
Sentiment is a task the model has seen exhaustively in pretraining. The labels are intuitive and the instruction is easy to write precisely. Adding examples here typically lifts accuracy by a point or two at most, while doubling or tripling prompt tokens.
The exception: if "neutral" means something domain-specific (say, a feature request that is neither praise nor complaint), two examples that pin down your definition of neutral are worth it.
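To make the zero-shot case concrete, here is a minimal sketch of a prompt builder for three-way sentiment. The function names, the prompt wording, and the optional `neutral_examples` parameter are illustrative assumptions, not a prescribed format; the point is that the default path carries no examples at all, and examples appear only when "neutral" needs pinning down.

```python
from typing import Optional, Sequence

LABELS = ("positive", "negative", "neutral")


def build_sentiment_prompt(review: str, neutral_examples: Sequence[str] = ()) -> str:
    """Build a zero-shot prompt; optionally pin down 'neutral' with examples."""
    parts = [
        "Classify the customer review as exactly one of: "
        + ", ".join(LABELS)
        + ". Reply with the label only."
    ]
    # Only spend tokens on examples when 'neutral' has a domain-specific meaning.
    for example in neutral_examples:
        parts.append(f"Review: {example}\nLabel: neutral")
    parts.append(f"Review: {review}\nLabel:")
    return "\n\n".join(parts)


def parse_label(model_output: str) -> Optional[str]:
    """Normalize the model's reply; return None if it is not a known label."""
    cleaned = model_output.strip().lower().rstrip(".")
    return cleaned if cleaned in LABELS else None
```

Calling `build_sentiment_prompt(review)` with no second argument gives the plain zero-shot prompt; passing one or two feature-request samples as `neutral_examples` covers the exception described above.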
Use Case 2: Structured Extraction with a Custom Schema — Few-Shot Wins
Pulling fields out of unstructured text into a fixed JSON schema — invoice line items, contract clauses, resume fields — is where few-shot earns its keep. The challenge is not understanding the text; it is matching your exact output structure and edge-case conventions.
Why few-shot won
You can describe a schema in words, but the model fills ambiguous fields inconsistently. Two or three examples showing how to handle missing values, multiple matches, and formatting (dates, currency) lock the behavior down. We have seen extraction consistency jump dramatically going from zero-shot to two well-chosen examples.
The trap, covered in the common mistakes guide, is choosing only clean documents as examples. Include a messy one.
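A few-shot extraction prompt along these lines might be assembled as follows. This is a sketch under stated assumptions: the three-field invoice schema, the sample texts, and the function name are all hypothetical. Note that the second example is deliberately messy and shows the missing-value convention (`null`).

```python
import json

# Hypothetical schema and examples, for illustration only.
EXAMPLES = [
    {
        "text": "Invoice from Acme Corp dated 2024-03-05. Total due: $1,250.00",
        "fields": {"vendor": "Acme Corp", "invoice_date": "2024-03-05", "total": 1250.00},
    },
    {
        # Messy example: OCR-style noise and no date, so invoice_date is null.
        "text": "lNV0ICE acme c0rp tota1 due 980",
        "fields": {"vendor": "Acme Corp", "invoice_date": None, "total": 980.00},
    },
]


def build_extraction_prompt(document: str, examples=EXAMPLES) -> str:
    """Assemble instruction, worked examples, and the new document."""
    parts = [
        "Extract vendor, invoice_date (YYYY-MM-DD), and total (number) as JSON.",
        "Use null for any field that is missing.",
    ]
    for ex in examples:
        # json.dumps renders Python None as JSON null, matching the instruction.
        parts.append(f"Text: {ex['text']}\nJSON: {json.dumps(ex['fields'])}")
    parts.append(f"Text: {document}\nJSON:")
    return "\n\n".join(parts)
```

The examples carry the conventions that are tedious to spell out in prose: the date format, a numeric total without a currency symbol, and `null` rather than an empty string for a missing field.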
Use Case 3: Tone-Matched Copywriting — Few-Shot Wins
A specific brand voice is hard to specify in words. "Friendly but professional" means different things to different brands. Examples of real on-brand copy transfer the voice far better than adjectives.
Three short samples of approved copy give the model a target to imitate. This is one of the few cases where the examples genuinely encode information the instruction cannot — voice is shown, not told.
Use Case 4: Math and Multi-Step Reasoning — It Depends
For arithmetic word problems and logic, the relevant comparison is not just zero-shot versus few-shot but whether the examples demonstrate reasoning steps. A few-shot prompt with worked-out chains of reasoning often beats both bare zero-shot and few-shot with answers-only examples.
The nuance
The examples are not teaching the answer; they are teaching the process of showing work. On capable reasoning models, a simple zero-shot instruction to reason step by step now closes much of this gap, so re-test before committing to long worked examples.
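The three prompt variants being compared can be sketched side by side. The mode names, the egg problem, and the prompt wording are made up for illustration; which variant wins on a given model is something to measure, not something this code decides.

```python
# Hypothetical worked example used by both few-shot variants.
WORKED_EXAMPLE = {
    "question": "A box holds 4 rows of 6 eggs. 5 eggs break. How many are intact?",
    "steps": "4 rows x 6 eggs = 24 eggs. 24 - 5 = 19.",
    "answer": "19",
}


def build_reasoning_prompt(question: str, mode: str) -> str:
    """Return one of three prompt variants for the same question."""
    if mode == "zero_shot_steps":
        # No examples: just ask the model to show its work.
        return f"Reason step by step, then give the final answer.\n\nQ: {question}\nA:"
    ex = WORKED_EXAMPLE
    if mode == "few_shot_answer_only":
        # Example shows the answer but not how it was reached.
        demo = f"Q: {ex['question']}\nA: {ex['answer']}"
    elif mode == "few_shot_worked":
        # Example demonstrates the process, then the answer.
        demo = f"Q: {ex['question']}\nA: {ex['steps']} The answer is {ex['answer']}."
    else:
        raise ValueError(f"unknown mode: {mode}")
    return f"{demo}\n\nQ: {question}\nA:"
```

Running the same labeled problems through all three modes is the re-test suggested above: if `zero_shot_steps` matches `few_shot_worked` on your model, the worked examples are no longer paying for their tokens.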
Use Case 5: Code Generation in a House Style — Few-Shot Wins
Generating code that matches your team's conventions — naming, error handling, library choices — is a strong few-shot case. The model knows how to write code zero-shot; it does not know your conventions.
Two examples of functions written your way teach the style efficiently. This mirrors the extraction case: the model has the capability, and examples supply the local convention it cannot infer.
Use Case 6: Open-Ended Summarization — Zero-Shot Wins
Summarizing an article into three bullets is something models do well zero-shot. Examples here mostly anchor length and format, which you can specify directly in the instruction instead ("exactly three bullets, under 15 words each").
Reserve examples for when your summary needs an unusual structure or a domain-specific emphasis that is genuinely hard to describe.
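When length and format are stated in the instruction, they can also be checked mechanically instead of being anchored with examples. The validator below is a minimal sketch that assumes bullets are written as lines starting with "- "; the function name and defaults are illustrative.

```python
def check_summary_format(summary: str, bullets: int = 3, max_words: int = 15) -> bool:
    """Return True if the summary has exactly `bullets` bullets, each under `max_words` words."""
    lines = [ln.strip() for ln in summary.splitlines() if ln.strip()]
    if len(lines) != bullets:
        return False
    for ln in lines:
        if not ln.startswith("- "):
            return False
        # "Under 15 words" means 14 or fewer, so reject 15 and above.
        if len(ln[2:].split()) >= max_words:
            return False
    return True
```

A check like this lets you retry or flag off-format outputs, which is usually cheaper than adding examples to every request.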
Use Case 7: Multilingual Classification — It Depends on the Language
Classifying text in widely-spoken languages usually works zero-shot, because the model has seen abundant pretraining data. The picture changes for low-resource languages, where the model's grasp is thinner and a few examples in the target language can meaningfully anchor behavior.
Why the language matters
For a high-resource language, the model already understands sentiment, intent, and category boundaries — examples add little. For a low-resource language, examples do double duty: they demonstrate the task and prime the model into the right linguistic register, reducing the chance it drifts into the wrong language or misreads idiom. The practical move is to benchmark per language rather than assuming one global answer. A prompt that is comfortably zero-shot in one language may need two examples in another.
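Benchmarking per language can be as simple as grouping zero-shot results by language and flagging the ones that fall short. In this sketch the result tuple layout, the function names, and the 0.9 threshold are assumptions chosen for illustration; pick a threshold that matches your own accuracy target.

```python
from collections import defaultdict


def accuracy_by_language(results):
    """results: iterable of (language, predicted_label, expected_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, predicted, expected in results:
        total[language] += 1
        if predicted == expected:
            correct[language] += 1
    return {language: correct[language] / total[language] for language in total}


def languages_needing_examples(results, threshold: float = 0.9):
    """Return languages whose zero-shot accuracy falls below the threshold."""
    scores = accuracy_by_language(results)
    return sorted(language for language, acc in scores.items() if acc < threshold)
```

Only the languages this returns would get target-language examples; the rest stay zero-shot.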
Use Case 8: Edge-Case Routing in a Support System — Few-Shot Wins Narrowly
A support classifier that is excellent on common categories often stumbles on a rare one — say, a security disclosure that must route to a special queue. Zero-shot handles the eight obvious categories; the ninth, rare, high-stakes category is where it gets lost, because the model has little signal that the category even exists.
This is the textbook case for targeted few-shot: one or two examples aimed at exactly the failing category, not a blanket example set across all categories. You spend tokens precisely where zero-shot fails and nowhere else. This surgical approach, rather than the reflex of adding examples everywhere, is the difference between an efficient prompt and a bloated one, as our common mistakes guide describes.
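A targeted example set might look like the sketch below: every category is listed in the instruction, but only the failing one gets examples. The category names (shortened to three here for brevity), the sample ticket, and the function name are hypothetical.

```python
# Hypothetical categories; a real system would list all of its queues.
CATEGORIES = ["billing", "shipping", "security_disclosure"]

# Examples exist only for the category zero-shot was observed to miss.
TARGETED_EXAMPLES = {
    "security_disclosure": [
        "I found a way to view other customers' invoices by changing the ID in the URL.",
    ],
}


def build_routing_prompt(ticket: str, categories, targeted_examples) -> str:
    """List all categories, but include examples only for the targeted ones."""
    parts = [
        "Route the support ticket to exactly one category: "
        + ", ".join(categories)
        + "."
    ]
    for category, samples in targeted_examples.items():
        for sample in samples:
            parts.append(f"Ticket: {sample}\nCategory: {category}")
    parts.append(f"Ticket: {ticket}\nCategory:")
    return "\n\n".join(parts)
```

The prompt grows by one example rather than by one or two per category, which is the token saving the surgical approach is after.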
Reading the Pattern
Across all eight cases, one rule predicts the outcome. Examples help when they encode something the instruction cannot easily state: a schema convention, a brand voice, a code style, a reasoning process. They waste tokens when the instruction can fully describe the task, as in sentiment and summarization. For the structured decision behind this, see A Framework for Zero Shot vs Few Shot Learning and the trade-offs guide.
A Failure Worth Studying: The Over-Curated Example Set
One scenario is instructive precisely because it failed. A team building a document classifier hand-picked six pristine, unambiguous examples for their few-shot prompt — the cleanest documents they could find. In testing on similar clean documents, accuracy looked excellent. In production, accuracy cratered on the messy real inputs: scanned PDFs with OCR errors, documents in mixed formats, half-complete submissions.
The examples had taught the model the easy distribution and nothing about the hard one. The fix was counterintuitive to the team: replace three clean examples with three deliberately messy ones — a document with OCR noise, a partial submission, an oddly formatted edge case. Accuracy on real inputs recovered immediately, with no change to the example count. The lesson generalizes: an example set is a sample of your input distribution, and a sample drawn only from the easy cases will fail on the hard ones. This is the single most common reason a few-shot prompt that demos well collapses in production.
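One way to keep an example set from drifting toward the easy cases is to build it from two explicit pools with a fixed quota for messy inputs. The helper below is a sketch; the function name, the pool labels, and the default split of three clean and three messy examples (mirroring the fix described above) are assumptions.

```python
def build_example_set(clean, messy, total: int = 6, messy_count: int = 3):
    """Return `total` examples, with `messy_count` of them taken from the messy pool."""
    clean_count = total - messy_count
    if messy_count > len(messy) or clean_count > len(clean):
        raise ValueError("not enough examples in one of the pools")
    # The example count stays fixed; only the mix of easy and hard cases changes.
    return list(clean[:clean_count]) + list(messy[:messy_count])
```

Making the quota an explicit parameter means a reviewer can see at a glance whether the hard distribution is represented at all.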
When Two Approaches Are Better Than One
The cases above frame zero-shot and few-shot as alternatives, but the strongest real systems often blend them. A common pattern: zero-shot for the common path, with a targeted few-shot fallback for the inputs a confidence check flags as uncertain. The classifier runs cheaply zero-shot on the inputs it handles confidently (say, 90% of traffic), and only the ambiguous remainder pays the token cost of examples.
This hybrid captures most of the accuracy of few-shot while keeping most of the cost savings of zero-shot. It works because the failure cases are usually a small, identifiable slice (the rare category, the ambiguous phrasing) rather than spread evenly across all inputs. Routing only those to a few-shot path spends tokens exactly where they buy accuracy and nowhere else, the same surgical principle that makes targeted examples beat blanket ones.
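The routing logic itself is small. In the sketch below, each classifier is assumed to be a callable that returns a `(label, confidence)` pair; how confidence is obtained (token probabilities, a self-reported score, agreement across samples) is left open, and the 0.8 cutoff is an arbitrary illustrative value.

```python
from typing import Callable, Tuple

# Assumed interface: a classifier maps text to (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def classify_hybrid(
    text: str,
    zero_shot: Classifier,
    few_shot: Classifier,
    min_confidence: float = 0.8,
) -> Tuple[str, str]:
    """Return (label, path), where path records which prompt produced the label."""
    label, confidence = zero_shot(text)
    if confidence >= min_confidence:
        return label, "zero_shot"
    # Only uncertain inputs pay the token cost of the few-shot prompt.
    label, _ = few_shot(text)
    return label, "few_shot"
```

Logging the returned path makes it easy to see what share of traffic actually takes the expensive route, and to tune `min_confidence` against a labeled set.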
Frequently Asked Questions
Why does few-shot win for extraction but not sentiment?
Extraction depends on matching your exact schema and edge-case conventions, which are hard to state fully in words — examples encode them efficiently. Sentiment is a well-understood task the model handles from a clear instruction alone.
Can I use examples to teach reasoning steps?
Yes, and it often helps on multi-step problems — but show the process, not just the answer. On strong reasoning models, a zero-shot "reason step by step" instruction now closes much of the gap, so benchmark both.
How many examples do brand-voice tasks need?
Usually two or three short samples of approved copy. Voice is shown rather than described, so a few representative examples transfer it far better than adjectives in the instruction.
Do these patterns change with newer models?
Yes. Each model generation handles more tasks zero-shot, so tasks that needed few-shot last year may not need it now. Re-test your baseline on every upgrade rather than assuming examples are still required.
What is the fastest way to decide for my own task?
Write a sharp zero-shot instruction, run it on a labeled set, and add examples only for the inputs it fails on. The how-to guide lays out this loop step by step.
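The first pass of that loop can be sketched in a few lines. Here `predict` stands in for whatever function sends your zero-shot prompt to a model and returns a label; it and the function name are placeholders, not part of any particular library.

```python
def find_failures(predict, labeled_set):
    """Run predict on each (text, expected) pair and return the misses.

    Each miss is returned as (text, expected_label, predicted_label).
    """
    failures = []
    for text, expected in labeled_set:
        predicted = predict(text)
        if predicted != expected:
            failures.append((text, expected, predicted))
    return failures
```

The returned failures are the candidate pool for examples: if they cluster in one category or one kind of input, that is where one or two targeted examples are likely to help.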
Key Takeaways
- Zero-shot wins on well-understood tasks like sentiment and summarization.
- Few-shot wins when examples encode something instructions cannot — schemas, brand voice, code style.
- For reasoning, demonstrate the process, not the answer, and re-test against zero-shot.
- Always include at least one messy example in extraction sets.
- The deciding question is whether the task carries implicit rules that a clear instruction cannot capture.