This is a working checklist, not a reading list. Run it before you ship any prompt that decides between zero-shot and few-shot, and again whenever you change models. Each item includes a short justification, because a checklist you do not understand is one you will skip under pressure.
The structure follows the order you should actually work in: baseline first, then decide whether examples are warranted, then validate cost and stability, then set up the maintenance loop. Skipping ahead is how teams end up paying for examples they never needed.
Phase 1: Establish the Baseline
Before you decide anything, you need a measured starting point.
- Write an explicit zero-shot instruction. Why: if the instruction can fully specify the task, you may not need examples at all, and a strong instruction transfers across models.
- Build a labeled eval set from real inputs. Why: without measured accuracy, "few-shot is better" is an opinion, not a fact. Include messy and ambiguous cases.
- Run the zero-shot baseline and record accuracy per category. Why: this is the number every later decision is measured against.
- Record prompt token count and latency. Why: these are the costs examples will add, and you need the before-figure to compare.
If the zero-shot baseline meets your accuracy bar, stop here: you are done, and you have the cheapest possible prompt. The discipline behind this phase is in our best practices guide.
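The per-category record from this phase can be sketched in a few lines. This is a minimal illustration, not a prescribed harness: the function names, the tuple layout, and the sample rows are all hypothetical, and requiring every category to clear the bar is one possible reading of "meets your accuracy bar".

```python
from collections import defaultdict

def accuracy_by_category(rows):
    """rows: iterable of (category, expected_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, expected, predicted in rows:
        total[category] += 1
        if expected == predicted:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}

def meets_bar(per_category, bar):
    """True only if every category clears the accuracy bar (an assumed, strict gate)."""
    return all(score >= bar for score in per_category.values())

# Hypothetical zero-shot results: (category, expected, predicted)
results = [
    ("billing", "refund", "refund"),
    ("billing", "refund", "dispute"),
    ("shipping", "delay", "delay"),
    ("shipping", "lost", "lost"),
]
baseline = accuracy_by_category(results)
print(baseline)                    # per-category accuracy to record as the baseline
print(meets_bar(baseline, 0.9))    # gate for moving on to Phase 2
```

Keeping the per-category numbers, rather than a single average, is what lets Phase 2 point at the specific inputs that failed.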
Phase 2: Decide Whether Examples Are Warranted
Only proceed if zero-shot fell short on specific inputs.
- Identify exactly which inputs failed. Why: examples should target real failure modes, not be added blanket-style across the whole task.
- Ask whether the failure is a missing definition or a missing example. Why: a vague instruction is fixed by rewriting the instruction, not by adding examples, a distinction our common mistakes guide covers in depth.
- Confirm the task carries implicit rules. Why: examples earn their tokens when they encode schemas, brand voice, or code style that words struggle to specify. For well-described tasks, they waste tokens.
Phase 3: Build the Example Set Properly
If examples are warranted, build the set with discipline.
- Pull examples from real production data. Why: curated-clean examples teach the easy distribution and fail on messy inputs.
- Include at least one hard or ambiguous case. Why: two hard examples teach more than six pristine ones.
- Balance the labels. Why: an imbalanced set biases the model toward the majority class on ambiguous inputs.
- Start with two examples; add more only on measured gains. Why: most of the accuracy gain usually comes from the first two or three examples; beyond five you are usually paying tokens for nothing.
See Real-World Examples and Use Cases for what good example sets look like per task type.
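The label-balance item above is easy to check mechanically. The sketch below assumes examples are stored as (input, label) pairs and that you know the full label set; `label_shares`, `is_balanced`, and the 0.2 tolerance are illustrative choices, not values from this checklist.

```python
from collections import Counter

def label_shares(examples, labels):
    """Share of the example set held by each label, including labels with no examples."""
    if not examples:
        raise ValueError("example set is empty")
    counts = Counter(label for _, label in examples)
    return {label: counts[label] / len(examples) for label in labels}

def is_balanced(examples, labels, tolerance=0.2):
    """True when every label's share is within `tolerance` of an even split."""
    even = 1 / len(labels)
    shares = label_shares(examples, labels)
    return all(abs(share - even) <= tolerance for share in shares.values())

# Hypothetical example set: three positives and one negative
skewed = [
    ("Love it", "positive"),
    ("Works great", "positive"),
    ("Best purchase this year", "positive"),
    ("Arrived broken", "negative"),
]
print(label_shares(skewed, ["positive", "negative"]))
print(is_balanced(skewed, ["positive", "negative"]))
```

Passing the full label list matters: a label with no examples at all shows up as a share of zero instead of silently disappearing from the check.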
Phase 4: Validate Cost and Stability
A more accurate prompt that is unstable or expensive is not a win.
- Re-measure token count and latency with examples added. Why: confirm the accuracy gain justifies the cost, especially at high volume.
- Test the same input under different example orderings. Why: if the answer changes, you have order or recency bias to fix before shipping.
- Audit the output label distribution against a balanced set. Why: a skewed distribution reveals hidden bias the headline accuracy number can mask.
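The ordering test can be sketched as follows. Everything here is an assumption for illustration: `predict` stands in for whatever function calls your model with a list of examples and an input, and the two stub predictors simulate a stable model and a recency-biased one so the sketch runs without any API.

```python
from itertools import islice, permutations

def predictions_across_orderings(predict, examples, text, max_orderings=6):
    """Call `predict` on the same input under up to `max_orderings` example orderings."""
    orderings = islice(permutations(examples), max_orderings)
    return [predict(list(ordering), text) for ordering in orderings]

def is_order_stable(predict, examples, text, max_orderings=6):
    """True when every tested ordering yields the same prediction."""
    outputs = predictions_across_orderings(predict, examples, text, max_orderings)
    return len(set(outputs)) == 1

# Stub predictors (hypothetical): replace with a real model call.
def stable_stub(examples, text):
    return "positive"

def recency_biased_stub(examples, text):
    # Simulates recency bias by copying the label of the last example.
    return examples[-1][1]

examples = [("great product", "positive"), ("awful service", "negative")]
print(is_order_stable(stable_stub, examples, "it was fine, I guess"))
print(is_order_stable(recency_biased_stub, examples, "it was fine, I guess"))
```

`permutations` yields orderings in a fixed sequence, so with larger example sets the first few orderings differ only near the end of the list; random shuffles with a fixed seed would be a reasonable substitute there.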
Phase 5: Set Up the Maintenance Loop
Prompts rot; the loop keeps them honest.
- Schedule a zero-shot re-test on every model upgrade. Why: newer models often solve zero-shot what older ones needed examples for, so you can frequently delete half the prompt.
- Refresh the eval set when the input distribution shifts. Why: a stale eval set hides regressions on new segments or formats.
- Track monthly token spend on examples. Why: when a stable, high-volume task's example cost grows large, it is time to model a fine-tune. The trade-offs guide covers that threshold.
How to Run This Checklist Under Time Pressure
The full checklist is the ideal, but real teams ship under deadlines. Here is the compressed version when you have an hour, not a day.
Build a minimal eval set of 50 labeled real inputs; even a spreadsheet works. Run zero-shot with the best instruction you can write in fifteen minutes. If accuracy clears your bar, ship it; you are done and you have the cheapest prompt. If it falls short, look at the failing rows and ask the single most important question on the whole list: is the failure a vague instruction or a genuinely implicit rule? Patch the instruction first, re-run, and only then add one or two targeted examples.
Even this stripped-down pass catches the two most expensive mistakes: adding examples a task did not need, and papering over a fixable instruction. The full Phase 4 and 5 work can follow once the prompt is in production. What you must never skip, even under pressure, is the labeled eval set; without it every decision is a guess, and guesses are what put stale, over-engineered prompts into production in the first place.
A Note on Sequencing
The phases are ordered deliberately, and the order is load-bearing. Teams that jump to Phase 3 (building examples) before completing Phase 1 (baseline) almost always over-build, because they never learned what zero-shot already handled. Teams that skip Phase 4 (validation) ship prompts whose accuracy depends on an arbitrary example ordering. And teams that skip Phase 5 (maintenance) watch their prompts rot across model upgrades.
If you take one structural lesson from this checklist, take this: work the phases in order, and treat each phase's exit criterion as a gate. You do not earn the right to add examples until the baseline proves you need them, and you do not earn the right to ship until validation proves the result is stable. This gating discipline is what separates a prompt that demos well from one that survives production.
Red Flags That Mean You Skipped a Phase
Use these symptoms as a diagnostic. If any describe your current prompt, an earlier phase was skipped and is now costing you.
- Nobody can explain why a given example is in the prompt. The example was never tied to a specific measured failure when the set was built in Phase 3, and the set is now untouchable bloat.
- Accuracy was "checked" on a handful of inputs, not a labeled set. You skipped the Phase 1 eval set, and order bias or unrepresentative behavior is hiding from you.
- The prompt has not been re-tested since the last model upgrade. You skipped Phase 5, and you are likely paying for examples a newer model has made unnecessary.
- The prompt grew over time and never shrank. You have been adding examples without the Phase 4 discipline of measuring whether each earns its tokens.
- Outputs skew toward one label on ambiguous inputs. You skipped the Phase 4 ordering test and label-distribution audit, and recency or majority bias is shaping your results.
Each red flag maps to a specific phase. The remedy is not to patch the symptom but to run the skipped phase properly, usually starting by building the eval set you never made.
Turning the Checklist Into a Habit
A checklist used once is a formality; used every time, it is a quality system. The teams that get the most from this embed it into their workflow rather than treating it as a launch-day ritual. Concretely: make the Phase 1 baseline a required step in code review for any prompt change, and make the Phase 5 re-test a standing item in your model-upgrade runbook.
The goal is for "did you baseline zero-shot first?" and "have you re-tested since the upgrade?" to become questions your team asks automatically, the way they ask whether code has tests. Once the checklist is muscle memory, the expensive mistakes it guards against stop happening, not because anyone is being careful, but because the process makes carelessness hard.
Frequently Asked Questions
Can I skip the eval set if I'm short on time?
No. It is the one item that makes every other decision objective. Even a small set of 50 to 100 labeled real inputs turns guesses into measurements and pays for itself the first time it catches a regression.
What if zero-shot is close but not quite good enough?
Identify the specific failing inputs first. Often a sharper instruction closes the gap with no examples; if not, add one or two examples targeted at exactly those failures rather than across the board.
How do I test for order bias quickly?
Take a handful of ambiguous inputs and run each under two or three different example orderings. If the predicted label changes, your example set is biased and needs label balancing or shuffling.
When should the maintenance loop trigger a fine-tune?
When the task is narrow and stable, volume is high, and the monthly token cost of carrying examples grows large enough to exceed a fine-tune's amortized cost. At that point fine-tuning is cheaper and more consistent.
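The comparison in that answer is simple arithmetic, sketched below. All figures are hypothetical placeholders, not real prices, and the model is deliberately crude: it assumes the fine-tuned model costs the same per token as the base model and ignores the cost of retraining when the task changes.

```python
def monthly_example_cost(example_tokens, requests_per_month, usd_per_million_input_tokens):
    """Monthly spend attributable to the few-shot examples alone."""
    return example_tokens * requests_per_month * usd_per_million_input_tokens / 1_000_000

def fine_tune_is_cheaper(example_cost_per_month, fine_tune_cost_usd, amortization_months):
    """Compare monthly example spend against the fine-tune cost spread over its expected life."""
    return example_cost_per_month > fine_tune_cost_usd / amortization_months

# Hypothetical inputs: 600 example tokens per request, $1.00 per million input tokens,
# and a $6,000 fine-tune amortized over 12 months ($500 per month).
high_volume = monthly_example_cost(600, 2_000_000, 1.0)
low_volume = monthly_example_cost(600, 50_000, 1.0)

print(high_volume, fine_tune_is_cheaper(high_volume, 6_000, 12))
print(low_volume, fine_tune_is_cheaper(low_volume, 6_000, 12))
```

The same prompt flips from "keep the examples" to "model a fine-tune" purely on volume, which is why the maintenance loop tracks monthly spend and not per-request cost.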
Does this checklist change for reasoning tasks?
The structure holds, but in Phase 3 your examples should demonstrate the reasoning process rather than just answers. Always re-test against a zero-shot "reason step by step" instruction, which now closes much of the gap on capable models.
Key Takeaways
- Baseline zero-shot with a labeled eval set before deciding anything.
- Add examples only to target measured failures, and only when the task carries implicit rules.
- Build example sets from real data, balanced, with hard cases, starting at two.
- Validate cost, latency, and order-bias before shipping.
- Re-test zero-shot on every model upgrade and track example token spend over time.