Abstract advice about prompt evaluation only goes so far. The clearest way to learn the craft is to watch specific prompts get evaluated and see what the process actually reveals. This guide walks through four scenarios across common task types, each showing the inputs, the failure that surfaced, and the change that fixed it.
The scenarios are deliberately varied (a classifier, a data extractor, an open-ended summarizer, and a customer-facing email writer) because the right evaluation method changes with the task. A technique that works well for classification can tell you almost nothing about tone, and vice versa. Seeing them side by side builds the judgment to pick the right approach for whatever lands on your desk.
None of these scenarios cites invented statistics. They illustrate the kinds of failures that appear when you evaluate honestly and the reasoning that turns a vague disappointment into a targeted fix.
Example 1: A Support Ticket Classifier
The task: label incoming support tickets as billing, technical, or account. The prompt worked beautifully in the demo, classifying three sample tickets correctly.
What the Evaluation Revealed
When the prompt ran against a 30-ticket test set with known correct labels, the picture changed. Accuracy was strong on tickets that clearly belonged to one category but collapsed on tickets that touched two, like a billing question about a technical feature. The prompt forced a single label and picked inconsistently on ambiguous cases.
The fix was twofold: the success criteria were updated to allow a primary and secondary label, and the prompt was given explicit guidance on how to break ties. Re-running the full set showed the ambiguous cases stabilizing. The lesson is that a reference-based test set with real labels exposes failure modes a demo never will.
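A reference-based check like this one can be sketched in a few lines of Python. The sketch assumes a hypothetical `classify` function standing in for the prompt call, and a hand-labeled test set in which each ticket is also flagged as ambiguous or not. The function names, field names, and sample tickets are illustrative assumptions, not part of the original evaluation.

```python
from collections import defaultdict


def accuracy_by_group(test_set, classify):
    """Return accuracy split by whether a ticket was marked ambiguous.

    test_set: list of dicts with keys "text", "label", "ambiguous".
    classify: hypothetical stand-in for the prompt call; returns a label string.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ticket in test_set:
        group = "ambiguous" if ticket["ambiguous"] else "clear"
        total[group] += 1
        if classify(ticket["text"]) == ticket["label"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def toy_classify(text):
    """Keyword rule used only so the example runs without a model."""
    lowered = text.lower()
    if "invoice" in lowered or "charge" in lowered:
        return "billing"
    if "password" in lowered or "login" in lowered:
        return "account"
    return "technical"


# Made-up tickets for illustration; a real set would use real labeled tickets.
sample = [
    {"text": "I was charged twice", "label": "billing", "ambiguous": False},
    {"text": "Cannot reset my password", "label": "account", "ambiguous": False},
    {"text": "App crashes on export", "label": "technical", "ambiguous": False},
    {"text": "Charge for the export feature that crashes",
     "label": "technical", "ambiguous": True},
]

report = accuracy_by_group(sample, toy_classify)
print(report)
```

Reporting accuracy per group rather than as one overall number is the design choice that matters here: a single average can look healthy while the ambiguous tickets fail.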
Example 2: Extracting Dates From Contracts
The task: pull the effective date and termination date out of contract text. Structured output, so structured scoring applied.
What the Evaluation Revealed
Programmatic checks did the work here. The evaluation parsed each output as JSON and compared the extracted dates against a known answer per document. Most passed, but a cluster failed on contracts that wrote dates in words rather than digits, and another cluster failed when two dates appeared close together and the prompt swapped them.
Because the scoring was automated, running all 40 documents took seconds and the failure clusters were obvious. The fix added explicit handling for written-out dates and an instruction to anchor each date to its surrounding label. This example shows the power of programmatic scoring: cheap, deterministic, and brutally clear about exactly which inputs break.
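A minimal sketch of that kind of programmatic scoring follows. It assumes the prompt returns a JSON object with `effective_date` and `termination_date` fields holding ISO-style date strings. Those field names, the failure labels, and the sample outputs are assumptions made for illustration.

```python
import json

FIELDS = ("effective_date", "termination_date")  # assumed output schema


def score_extraction(raw_output, expected):
    """Compare one model output (a JSON string) against the known answer.

    Returns a list of failure reasons; an empty list means the document passed.
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(parsed, dict):
        return ["not_an_object"]

    # A swap gets its own label so that cluster stands out in the results.
    if (expected["effective_date"] != expected["termination_date"]
            and parsed.get("effective_date") == expected["termination_date"]
            and parsed.get("termination_date") == expected["effective_date"]):
        return ["swapped_dates"]

    return [f"wrong_{field}" for field in FIELDS
            if parsed.get(field) != expected[field]]


# Made-up outputs for one document, for illustration only.
expected = {"effective_date": "2024-01-01", "termination_date": "2025-01-01"}
outputs = {
    "clean": '{"effective_date": "2024-01-01", "termination_date": "2025-01-01"}',
    "swapped": '{"effective_date": "2025-01-01", "termination_date": "2024-01-01"}',
    "words": '{"effective_date": "January 1, 2024", "termination_date": "2025-01-01"}',
    "broken": "Effective date: 2024-01-01",
}
results = {name: score_extraction(raw, expected) for name, raw in outputs.items()}
for name, failures in results.items():
    print(name, failures or "pass")
```

Returning a reason per failure, rather than a bare pass or fail, is what lets failures be grouped into clusters afterward.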
For the broader scoring methods at play, see What Separates a Reliable Prompt From a Lucky One.
Example 3: Summarizing Meeting Notes
The task: produce a three-bullet summary of a meeting transcript. Open-ended, so there is no single correct answer to compare against.
What the Evaluation Revealed
Here a rubric did the work. Each summary was scored on three criteria: does it capture the decision, does it name the owner, and does it respect the three-bullet format. Human raters scored a sample, and their disagreements revealed that the rubric itself was ambiguous about what counted as a decision.
Tightening the rubric language brought the raters into agreement, and only then did the prompt's true weakness appear: it reliably captured decisions but often omitted owners. The fix was a targeted instruction. The lesson is that for subjective tasks, calibrating the rubric comes before judging the prompt.
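One simple way to spot an ambiguous criterion is to compute how often two raters agree on each one. The sketch below uses raw percent agreement on pass/fail scores. The criterion names and the scores are made up for illustration. Chance-corrected measures such as Cohen's kappa are a common alternative when more rigor is needed.

```python
def agreement_by_criterion(rater_a, rater_b, criteria):
    """Fraction of summaries on which two raters gave the same score, per criterion.

    rater_a, rater_b: lists of dicts mapping criterion name -> score,
    aligned so that index i refers to the same summary in both lists.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty sample")
    return {
        c: sum(a[c] == b[c] for a, b in zip(rater_a, rater_b)) / len(rater_a)
        for c in criteria
    }


CRITERIA = ("captures_decision", "names_owner", "three_bullets")

# Illustrative scores only (1 = pass, 0 = fail) for four summaries.
rater_a = [
    {"captures_decision": 1, "names_owner": 0, "three_bullets": 1},
    {"captures_decision": 1, "names_owner": 1, "three_bullets": 1},
    {"captures_decision": 0, "names_owner": 0, "three_bullets": 1},
    {"captures_decision": 1, "names_owner": 0, "three_bullets": 1},
]
rater_b = [
    {"captures_decision": 0, "names_owner": 0, "three_bullets": 1},
    {"captures_decision": 1, "names_owner": 1, "three_bullets": 1},
    {"captures_decision": 1, "names_owner": 0, "three_bullets": 1},
    {"captures_decision": 1, "names_owner": 0, "three_bullets": 1},
]

agreement = agreement_by_criterion(rater_a, rater_b, CRITERIA)
print(agreement)
```

In this toy data the raters agree fully on owner and format but only half the time on the decision criterion, which is the signal that the rubric wording, not the prompt, needs attention first.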
Example 4: A Customer-Facing Apology Email
The task: draft an apology email for a delayed order. Tone is everything, and tone is hard to score.
What the Evaluation Revealed
The evaluation combined two methods. Programmatic checks confirmed the email mentioned the order number and stayed under a length limit. A rubric scored warmth, accountability, and absence of over-promising. Running the same input several times exposed inconsistency: some drafts promised compensation the business had not authorized.
That variance was the real finding. A prompt that occasionally promises refunds is a liability no matter how good its average draft is. The fix added an explicit constraint against offering compensation, and re-running the input multiple times confirmed the unsafe outputs disappeared. This shows why measuring variance, not just one draft, is essential for customer-facing prompts.
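Measuring variance can be as simple as calling the prompt repeatedly on one input and counting how often a draft crosses a line. The sketch below assumes a hypothetical `generate` callable that wraps the prompt, and it uses a keyword pattern as a coarse stand-in for "offers compensation". Both are assumptions; a keyword check will miss paraphrases, so it complements a rubric rather than replacing one.

```python
import re
from itertools import cycle

# Illustrative pattern only; a real list would come from the business's policy.
COMPENSATION_PATTERN = re.compile(
    r"\b(refunds?|compensat\w*|credits?|discounts?|vouchers?)\b",
    re.IGNORECASE,
)


def unsafe_rate(generate, prompt_input, runs=10):
    """Call the prompt `runs` times on one input.

    Returns (fraction of drafts flagged, list of flagged drafts).
    """
    flagged = []
    for _ in range(runs):
        draft = generate(prompt_input)
        if COMPENSATION_PATTERN.search(draft):
            flagged.append(draft)
    return len(flagged) / runs, flagged


# Canned drafts stand in for a model so the example runs offline.
_canned = cycle([
    "We are sorry your order 1234 is late. It will ship tomorrow.",
    "We apologize for the delay on order 1234 and will keep you updated.",
    "Sorry about order 1234! We will issue a full refund right away.",
    "Our apologies for the delay with order 1234.",
])


def fake_generate(_prompt_input):
    return next(_canned)


rate, flagged = unsafe_rate(fake_generate, "order 1234 delayed", runs=8)
print(f"{rate:.0%} of drafts flagged")
```

For a constraint like this one, the useful target is a flagged rate of zero across many runs, not a good average score.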
To turn these scenarios into a repeatable routine, read A Step-by-Step Approach to Evaluating Prompt Quality. To see a single problem followed end to end, read Case Study: Evaluating Prompt Quality in Practice.
What the Four Examples Have in Common
Across all four scenarios, the same lesson recurs in different costumes. In each case, the demo or first impression suggested the prompt was finished, and in each case a structured evaluation against a representative set of inputs revealed a failure mode the demo had no way to surface. The classifier looked perfect until ambiguous tickets appeared. The extractor looked perfect until dates were spelled out. The summarizer looked perfect until the rubric was calibrated. The email looked perfect until variance exposed an unsafe promise.
The second shared lesson is that the right scoring method follows the task, not the other way around. Forcing rubric grading onto the date extractor would have wasted effort that automated comparison handled instantly. Forcing exact-match scoring onto the summarizer would have penalized perfectly good paraphrases. Choosing the method that fits each criterion is what made each evaluation both efficient and honest.
Choosing a Method by Task Type
A quick heuristic that emerges from these examples:
- Structured output with a known answer favors programmatic comparison.
- Classification or extraction favors reference-based accuracy metrics.
- Open-ended generation favors a calibrated rubric, human or model graded.
- Anything customer-facing or safety-sensitive demands variance measurement on top of whatever else you use.
Internalize that mapping and you can pick the right approach for a new prompt in seconds rather than defaulting to whatever you used last time. The mistake to avoid is reaching for the method you are most comfortable with regardless of the task, since a method that does not fit will either miss real failures or flag harmless variation as a problem.
One more thread runs through every example: the failure that surfaced was rarely the one the team expected. The summarizer evaluation began as a test of the prompt and first exposed an ambiguous rubric; the email evaluation set out to score tone and instead found drafts promising unauthorized compensation. This is the ordinary experience of honest evaluation. You set out to confirm a prompt is good and instead learn something specific and surprising about how it breaks. Treating that surprise as the point, rather than an inconvenience, is what turns evaluation from a rubber stamp into a source of real improvement.
Frequently Asked Questions
How do I evaluate a prompt when there is no single correct answer?
Use a rubric with defined criteria scored on a short scale. For a summary, that might be whether it captures the decision, names the owner, and respects the format. Before trusting the scores, have a couple of people rate a sample and resolve disagreements, because rubric ambiguity, not the prompt, is often the first problem you uncover.
Why did automated scoring work so well for the date extraction example?
Because the output was structured and had a known correct answer. That let the evaluation parse each result and compare it programmatically, which is fast, deterministic, and cheap. Automation also made it trivial to run dozens of documents and see exactly which input types clustered into failures, pointing straight at the fix.
What made the apology email the hardest to evaluate?
Two things: tone is subjective, and the dangerous failure appeared only intermittently. The prompt's average draft was fine, but occasional drafts promised unauthorized compensation. That risk only surfaced by running the same input multiple times and watching the variance, which is why one good draft would have hidden a real liability.
Can I mix scoring methods in one evaluation?
Yes, and you often should. The apology email used programmatic checks for the order number and length plus a rubric for tone. Combining methods lets you score each requirement with the approach that fits it, automating what can be automated and reserving human or rubric judgment for the genuinely subjective parts.
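A mixed evaluation of this kind might be sketched as follows. The automated checks run directly on the draft, while the rubric scores are passed in from a human or model grader. The 150-word limit, the 1 to 5 rubric scale, the passing threshold of 4, and all names are assumptions chosen for illustration.

```python
def evaluate_email(draft, order_number, rubric_scores, max_words=150, min_rubric=4):
    """Combine automated checks with rubric scores supplied by a grader.

    rubric_scores: dict of criterion -> score on an assumed 1-5 scale.
    Returns a dict of check name -> bool; the draft passes only if all are True.
    """
    results = {
        "mentions_order_number": order_number in draft,
        "within_length_limit": len(draft.split()) <= max_words,
    }
    for criterion, score in rubric_scores.items():
        results[f"rubric_{criterion}"] = score >= min_rubric
    return results


# Made-up draft and grader scores for illustration.
draft = (
    "We are sorry that order 1234 is late. "
    "We take responsibility and will update you by Friday."
)
scores = {"warmth": 4, "accountability": 5, "no_over_promising": 3}

checks = evaluate_email(draft, "1234", scores)
passed = all(checks.values())
for name, ok in checks.items():
    print(f"{name}: {'pass' if ok else 'FAIL'}")
print("overall:", "pass" if passed else "FAIL")
```

Keeping each requirement as its own named check means a failed draft reports which requirement failed, whichever method scored it.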
Key Takeaways
- A reference-based test set with real labels exposes classifier failures that demos hide, especially on ambiguous inputs.
- Programmatic scoring of structured outputs is cheap and makes failure clusters obvious.
- For open-ended tasks, calibrate the rubric until raters agree before judging the prompt.
- Customer-facing prompts demand variance measurement, because rare unsafe outputs are real liabilities.
- Mixing scoring methods within one evaluation lets you match each requirement to the right approach.