This case study follows one team through a single decision: whether their support ticket classifier should keep relying on a six-example few-shot prompt or move toward something leaner. The details are composited from real engagements, but the arc (inherited prompt, untested assumptions, measured rework, durable outcome) is exactly how these decisions play out. The lesson is not "zero-shot is better" or "few-shot is better." It is that the choice must be re-earned with evidence whenever the model or the data changes.
The Situation
A SaaS company routed incoming support tickets into eight categories (billing, bug report, feature request, account access, and so on) using an LLM classifier. The prompt had been written 14 months earlier and carried six hand-picked example tickets, each with its label. It worked well enough that nobody touched it.
Two pressures forced a review. First, ticket volume had grown to roughly 40,000 a month, and the six examples added about 1,400 tokens to every single call, a line item finance had started asking about. Second, accuracy had quietly slipped on a newer category, "integration issue," which had not existed when the examples were chosen.
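To put the finance question in concrete terms, the example overhead follows directly from the two figures above. The sketch below is only an illustration: the per-token price is a hypothetical placeholder, not a number from this engagement, so substitute your provider's actual rate.

```python
# Figures taken from the case study.
TICKETS_PER_MONTH = 40_000
EXAMPLE_TOKENS_PER_CALL = 1_400  # overhead added by the six examples

# Hypothetical input price (an assumption, not from the case study).
ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 1.00


def monthly_example_tokens(tickets: int, tokens_per_call: int) -> int:
    """Total tokens spent on few-shot examples alone in one month."""
    return tickets * tokens_per_call


def monthly_example_cost(tokens: int, usd_per_million: float) -> float:
    """Dollar cost of those tokens at a given input price."""
    return tokens / 1_000_000 * usd_per_million


overhead_tokens = monthly_example_tokens(TICKETS_PER_MONTH, EXAMPLE_TOKENS_PER_CALL)
overhead_cost = monthly_example_cost(overhead_tokens, ASSUMED_USD_PER_MILLION_INPUT_TOKENS)

print(f"{overhead_tokens:,} example tokens per month")  # 56,000,000 example tokens per month
print(f"${overhead_cost:,.2f} per month at the assumed price")
```

The token total (56 million a month) is the part that comes from the case study. The dollar figure scales linearly with whatever price you plug in.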
The Decision
The team faced three options:
- Leave it alone. Cheapest in effort, but the cost and the accuracy gap would persist.
- Add more examples to cover the new category. The intuitive move, but it would push prompt length higher and deepen the dependency on a curated set nobody understood.
- Re-baseline from scratch. Strip the examples, test zero-shot on the current model, and add examples back only where evidence demanded.
They chose to re-baseline. The reasoning mirrored the discipline in our best practices guide: the model had been upgraded twice since the prompt was written, so the assumption that six examples were necessary was a year out of date and untested.
The Execution
The team built what they should have had from the start: a labeled eval set of 600 real tickets, sampled to include the messy and ambiguous ones, not just clean cases. They measured three things on every variant: accuracy per category, prompt token count, and time-to-first-token.
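Per-category accuracy is the measurement that does most of the work in this story, so here is a minimal sketch of how it could be computed. The function name, the `Ticket` shape, and the keyword-based `toy_classifier` are all assumptions for illustration. In practice `classify` would wrap the LLM call for one prompt variant. Token count and time-to-first-token are left out because they depend on the model provider.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Ticket = Tuple[str, str]  # (ticket_text, gold_label); hypothetical shape


def per_category_accuracy(
    eval_set: List[Ticket], classify: Callable[[str], str]
) -> Dict[str, float]:
    """Fraction of tickets labeled correctly, broken out by gold category."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for text, gold in eval_set:
        total[gold] += 1
        if classify(text) == gold:
            correct[gold] += 1
    return {label: correct[label] / total[label] for label in total}


def toy_classifier(text: str) -> str:
    """Stand-in for an LLM call: a crude keyword rule, for demonstration only."""
    return "billing" if "invoice" in text.lower() else "bug report"


toy_eval = [
    ("My invoice is wrong", "billing"),
    ("Charged twice this month", "billing"),
    ("App crashes on login", "bug report"),
]
scores = per_category_accuracy(toy_eval, toy_classifier)
print(scores)  # {'billing': 0.5, 'bug report': 1.0}
```

Breaking accuracy out by gold label is what lets a weak category such as "integration issue" show up instead of being averaged away.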
They tested four configurations:
- Zero-shot with a rewritten, explicit instruction naming all eight categories and their boundaries.
- Two-shot with one easy and one ambiguous example.
- The original six-shot prompt, as a control.
- Two-shot plus one example specifically for the new "integration issue" category.
The key move was rewriting the instruction first. The original instruction was vague: it leaned on the examples to define the categories. The new one defined each category explicitly, including how to distinguish "bug report" from "integration issue," exactly the failure mode described in our common mistakes guide.
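As a rough picture of what "defining each category explicitly" can look like, the sketch below assembles a zero-shot prompt from a dictionary of category definitions. The definitions, the prompt wording, and the function name are all hypothetical, because the team's actual instruction is not reproduced here. Only the five categories named in this case study appear, since the remaining three are not listed.

```python
from typing import Dict

# Hypothetical definitions for illustration only; not the team's wording.
CATEGORY_DEFINITIONS: Dict[str, str] = {
    "billing": "Charges, invoices, refunds, or payment methods.",
    "bug report": "The product itself behaves incorrectly, with no third-party system involved.",
    "feature request": "The customer asks for behavior the product does not currently offer.",
    "account access": "Login, password, or permission problems.",
    "integration issue": "A failure in the connection between the product and a third-party system.",
}


def build_zero_shot_prompt(ticket_text: str, definitions: Dict[str, str]) -> str:
    """Build an instruction that names every category and its boundary."""
    lines = [
        "Classify the support ticket into exactly one category.",
        "",
        "Categories:",
    ]
    for name, definition in definitions.items():
        lines.append(f"- {name}: {definition}")
    lines += [
        "",
        "Respond with the category name only.",
        "",
        f"Ticket: {ticket_text}",
    ]
    return "\n".join(lines)


prompt = build_zero_shot_prompt("Webhook to our CRM stopped firing", CATEGORY_DEFINITIONS)
print(prompt)
```

In this sketch the "bug report" and "integration issue" definitions each state what the other excludes. That is one way to encode the boundary in the instruction rather than in the examples.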
The Outcome
The results were decisive and a little humbling:
- Zero-shot with the rewritten instruction matched the original six-shot accuracy on the seven established categories and beat it on the new category, because the instruction now defined that category clearly while the old examples did not cover it.
- The two-shot-plus variant edged out zero-shot by a small margin specifically on "integration issue," where one well-chosen example resolved the remaining ambiguity.
- Prompt tokens dropped by roughly two-thirds moving from six examples to two, with a corresponding cut in per-call cost across 40,000 monthly tickets.
The team shipped the two-shot-plus variant: near-best accuracy at a fraction of the original token cost. The single most valuable artifact was not the prompt; it was the eval set, which they now re-run on every model upgrade.
What Almost Went Wrong
It is worth dwelling on the option the team nearly chose: adding more examples. It was the intuitive move and it would have worked in the narrow sense: a seventh and eighth example covering the new category would have lifted accuracy. But it would have pushed the prompt past 1,800 tokens, deepened the dependency on a curated set nobody understood, and left the vague instruction in place to cause the next gap.
The trap is that adding examples almost always produces a local improvement, which makes it feel correct. The team only avoided it because they had committed to re-baselining before measuring. Had they started by patching the existing prompt, they would have shipped a more expensive, more fragile classifier and called it a win. This is the quiet danger of few-shot: it is so reliably helpful in the small that it discourages you from asking whether you needed it at all.
The Maintenance Change That Stuck
The durable outcome was not the prompt; it was a process change. The team added a single recurring item to their model-upgrade runbook: re-run the zero-shot baseline against the eval set and report whether examples can be dropped. The first time they applied it, on the next model upgrade three months later, they were able to remove the remaining two examples entirely on five of the eight categories, because the newer model had absorbed those distinctions zero-shot.
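The runbook item amounts to a per-category comparison between the zero-shot baseline and the current few-shot prompt. A minimal sketch of that check follows. The function name, the one-point tolerance, and the accuracy numbers are invented for illustration and are not the team's results.

```python
from typing import Dict, List


def droppable_categories(
    zero_shot_acc: Dict[str, float],
    few_shot_acc: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Categories where zero-shot is within `tolerance` of few-shot accuracy.

    A category missing from `zero_shot_acc` is treated as accuracy 0.0,
    so it is never reported as droppable by accident.
    """
    return sorted(
        category
        for category, few_shot in few_shot_acc.items()
        if zero_shot_acc.get(category, 0.0) >= few_shot - tolerance
    )


# Invented placeholder numbers, not measurements from the case study.
zero_shot = {"billing": 0.95, "bug report": 0.80, "integration issue": 0.90}
few_shot = {"billing": 0.95, "bug report": 0.88, "integration issue": 0.905}

print(droppable_categories(zero_shot, few_shot))  # ['billing', 'integration issue']
```

The tolerance is a judgment call. It should reflect how much noise the eval set has at its size, not a fixed rule.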
That runbook item is the real deliverable. The prompt will keep changing; the habit of re-earning every example with evidence is what compounds. For the structured version of this decision process, see A Framework for Zero Shot vs Few Shot Learning and the metrics guide.
The Lessons
The honest takeaway is that the original prompt was not bad when it was written; it was just never revisited. The examples had silently become the specification, and the specification had gone stale. Re-baselining surfaced two facts at once: most of the examples were no longer needed, and the real accuracy gap was a missing category definition, not a missing example.
The second lesson is about process over artifacts. The team's most valuable outputs were the eval set and the runbook item, not the winning prompt. Prompts are disposable; the capacity to measure and re-decide is the asset. Any team running LLM classifiers at volume should expect their prompts to be over-engineered within a year and build the re-baseline habit accordingly.
The Numbers Behind the Decision
It helps to see how the eval made the choice obvious rather than a matter of taste. Across the 600-ticket evaluation set, the team tracked accuracy by category alongside prompt length for each of the four configurations.
The original six-shot prompt and the rewritten zero-shot prompt landed within a point of each other on the seven established categories, close enough that token cost became the tiebreaker, and zero-shot won that decisively at roughly a third of the prompt length. On the new "integration issue" category, zero-shot actually pulled ahead of the old six-shot prompt, because the rewritten instruction defined that category while the year-old examples predated it entirely. The two-shot-plus variant then recovered the last sliver of accuracy on that one category with a single targeted example.
Without the eval set, none of this would have been visible. The team's prior belief ("six examples are doing important work") was simply false, and only measurement exposed it. The decision stopped being a debate the moment the per-category numbers were on the table.
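One way to make "accuracy first, token cost as the tiebreaker" mechanical is sketched below. Among the configurations that are within a margin of the best accuracy on every category, it picks the one with the shortest prompt. The `ConfigResult` type, the margin, and every number in the example are invented placeholders, not the team's measurements.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ConfigResult:
    name: str
    prompt_tokens: int
    accuracy: Dict[str, float]  # per-category accuracy on the eval set


def pick_config(results: List[ConfigResult], margin: float = 0.01) -> ConfigResult:
    """Cheapest config that is within `margin` of the best on every category."""
    categories = list(results[0].accuracy)
    best = {c: max(r.accuracy[c] for r in results) for c in categories}
    eligible = [
        r for r in results
        if all(r.accuracy[c] >= best[c] - margin for c in categories)
    ]
    if not eligible:
        raise ValueError("No config is within the margin on every category.")
    return min(eligible, key=lambda r: r.prompt_tokens)


# Invented placeholder results for illustration only.
toy_results = [
    ConfigResult("zero-shot", 500, {"established": 0.90, "integration issue": 0.85}),
    ConfigResult("six-shot", 1900, {"established": 0.90, "integration issue": 0.70}),
    ConfigResult("two-shot-plus", 1200, {"established": 0.90, "integration issue": 0.88}),
]

print(pick_config(toy_results, margin=0.01).name)  # two-shot-plus
print(pick_config(toy_results, margin=0.05).name)  # zero-shot
```

The two calls show why the margin matters. A tight margin rewards the last sliver of accuracy, while a looser one lets token cost decide.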
Why the Team Almost Didn't Run the Eval
The honest part of this story is that building the 600-ticket eval set took real effort (sampling, labeling, validating) and there was pressure to skip it and just "try adding a couple of examples." The argument that won was simple: at 40,000 tickets a month, even small per-call savings compound, and you cannot justify removing examples to stakeholders without numbers. The eval set was not overhead; it was the only way to make a defensible decision.
That framing, eval set as decision-justification rather than academic rigor, is what gets these projects funded in practice. The team did not build it to be thorough; they built it because they could not otherwise prove the change was safe.
Frequently Asked Questions
Why did zero-shot match a six-shot prompt here?
Because the model had been upgraded twice since the examples were chosen, and a rewritten, explicit instruction now carried the specification the examples used to supply. The examples had become redundant without anyone noticing.
What was the single biggest driver of the improvement?
Rewriting the instruction to define each category explicitly, especially the new one. The accuracy gap was a missing definition, not a missing example, which no amount of added examples would have fixed cleanly.
Was building the eval set worth the effort?
Entirely. It turned every "I think this is better" into a measured number and became a reusable asset for future model upgrades. Without it, the team could not have justified removing examples.
Should every team strip their prompts back to zero-shot?
Not blindly, but every team should re-baseline after a model change. You may keep some examples, as this team kept two, but you should prove each one earns its tokens rather than inheriting them.
How did they decide two examples were enough?
The eval showed two examples captured the accuracy gain, with the third-through-sixth adding nothing measurable. They kept the minimum set that held accuracy, a practice detailed in the trade-offs guide.
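The case study does not say how the team searched for the minimum set, so the sketch below shows one plausible approach rather than their method: backward elimination, which drops an example whenever the eval score holds without it. The function name, the tolerance, and the `toy_evaluate` scoring rule are assumptions. In real use, `evaluate` would run the eval set with the given examples in the prompt.

```python
from typing import Callable, List


def minimal_example_set(
    examples: List[str],
    evaluate: Callable[[List[str]], float],
    tolerance: float = 0.005,
) -> List[str]:
    """Greedily remove examples while accuracy stays within `tolerance` of the full set."""
    baseline = evaluate(examples)
    kept = list(examples)
    changed = True
    while changed:
        changed = False
        for i in range(len(kept)):
            trial = kept[:i] + kept[i + 1:]
            if evaluate(trial) >= baseline - tolerance:
                kept = trial  # this example was not earning its tokens
                changed = True
                break
    return kept


def toy_evaluate(examples: List[str]) -> float:
    """Invented scoring rule: only examples 'A' and 'B' contribute accuracy."""
    score = 0.80
    if "A" in examples:
        score += 0.05
    if "B" in examples:
        score += 0.03
    return score


print(minimal_example_set(["A", "B", "C", "D", "E", "F"], toy_evaluate))  # ['A', 'B']
```

Greedy elimination is cheap but not guaranteed to find the smallest possible set when examples interact. With only six candidates, an exhaustive search over subsets would also be feasible.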
Key Takeaways
- Inherited prompts go stale; examples silently become an untested specification.
- Re-baselining after model upgrades often reveals you can delete most examples.
- A vague instruction propped up by examples hides accuracy gaps in new edge cases.
- The eval set is the durable asset; the prompt is disposable.
- The winning config here cut token cost by two-thirds while improving accuracy on the hardest category.