A prompt you know is bad is cheap to deal with: you fix it. The expensive failures are the prompts you believe are good because your evaluation said so, which then fall apart in production. Most evaluation disasters are not random. They come from a short list of recurring mistakes that inflate confidence without improving the prompt.
This guide names seven of those mistakes. For each one, you will see why it happens, what it costs, and the specific practice that corrects it. The pattern across all of them is the same: an evaluation that feels rigorous while quietly measuring the wrong thing.
If you have ever shipped a prompt that tested well and then disappointed you, one of these is probably why.
Mistake 1: Testing on a Single Example
The most common error is judging a prompt by one output. It feels efficient and the result is concrete, so it is tempting to stop there.
The cost is that a single output reveals nothing about reliability. Models are probabilistic and inputs vary, so one good answer can be pure luck. The corrective practice is to always test against a set of inputs, at least 15 and preferably closer to 30, and to run important inputs several times to observe consistency.
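A minimal sketch of that practice follows. It assumes a hypothetical `generate(prompt, text)` function that calls your model and a hypothetical `passes(output)` check. Neither comes from a real library, and `fake_generate` only simulates a flaky model so the example runs on its own.

```python
import random
from typing import Callable

def evaluate(
    prompt: str,
    inputs: list[str],
    generate: Callable[[str, str], str],
    passes: Callable[[str], bool],
    runs_per_input: int = 5,
) -> dict[str, float]:
    """Run every input several times and return the pass rate per input."""
    rates = {}
    for text in inputs:
        results = [passes(generate(prompt, text)) for _ in range(runs_per_input)]
        rates[text] = sum(results) / runs_per_input
    return rates

# Stand-in for a real model call (hypothetical): fails roughly 30% of the time.
rng = random.Random(0)

def fake_generate(prompt: str, text: str) -> str:
    return "ok" if rng.random() < 0.7 else "bad"

inputs = [f"ticket {i}" for i in range(20)]
rates = evaluate("Summarize the ticket.", inputs, fake_generate, lambda out: out == "ok")

overall = sum(rates.values()) / len(rates)
never_failed = sum(rate == 1.0 for rate in rates.values())
print(f"overall pass rate: {overall:.2f}")
print(f"inputs that never failed: {never_failed} of {len(rates)}")
```

Looking at the per-input rates rather than one output is the point: a single run of any one of these inputs would have told you "ok" or "bad" and nothing else.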
Mistake 2: Tuning the Prompt Against Your Test Set
This one is subtle because it looks like diligence. You test, the prompt fails on case 7, you tweak it until case 7 passes, repeat for every case, and end with a perfect score.
The problem is that the prompt has now effectively memorized your test set rather than improved. The score is inflated and will not hold on new inputs. The fix is to hold out a separate evaluation set that you never look at while editing: tune on one set and measure on another, the way machine learning practitioners separate training and test data.
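One way to set up the split, sketched under the assumption that your test cases are a simple list of strings. The function name `split_cases`, the 30 percent holdout fraction, and the seed are illustrative choices, not a standard.

```python
import random

def split_cases(
    cases: list[str], holdout_fraction: float = 0.3, seed: int = 13
) -> tuple[list[str], list[str]]:
    """Shuffle once with a fixed seed, then split into (tuning, held-out) sets."""
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

cases = [f"case {i}" for i in range(1, 31)]
tuning_set, holdout_set = split_cases(cases)

# Edit the prompt against tuning_set only; score holdout_set when you are done editing.
print(len(tuning_set), len(holdout_set))
```

The fixed seed keeps the split stable between sessions, so a case cannot drift from the held-out set into the set you tune against.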
Mistake 3: Vague Success Criteria
If "good" is undefined, every output looks acceptable in the moment because you judge it on whatever standard feels right at that second.
Vague criteria make scores meaningless and irreproducible — two people score the same output differently, and so do you on different days. Write specific, testable criteria before evaluating. "Three bullets, each under 20 words, no marketing language" can be checked; "high quality" cannot.
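The checkable criterion above can be turned into code. This is a sketch, not a definitive checker: it assumes bullets start with "- ", and the `MARKETING_WORDS` list is a made-up stand-in for whatever definition of "marketing language" you settle on.

```python
import re

# Illustrative banned-word list (an assumption); replace with your own definition.
MARKETING_WORDS = {"revolutionary", "best-in-class", "game-changing", "seamless"}

def check_summary(output: str) -> dict[str, bool]:
    """Check bullet count, bullet length, and banned words, one result per criterion."""
    bullets = [line[2:].strip() for line in output.splitlines() if line.startswith("- ")]
    words = set(re.findall(r"[a-z][a-z-]*", output.lower()))
    return {
        "three_bullets": len(bullets) == 3,
        "each_under_20_words": bool(bullets) and all(len(b.split()) < 20 for b in bullets),
        "no_marketing_language": not (words & MARKETING_WORDS),
    }

good = "- Refunds post within five days.\n- Contact support for exceptions.\n- Keep your receipt."
bad = "- Our revolutionary refund flow is seamless.\n- Keep your receipt."

print(check_summary(good))
print(check_summary(bad))
```

Returning one boolean per criterion, rather than a single pass or fail, shows which rule an output broke and gives two reviewers the same answer.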
Mistake 4: Trusting a Model Grader You Never Validated
Using another model to score outputs scales beautifully, which is exactly why people adopt it without checking it.
The risk is that the grader has its own biases and blind spots, so your whole evaluation rests on an unverified judge. The corrective practice is to validate the grader against human ratings on a sample of 30 to 50 outputs before trusting it at scale, and to spot-check it periodically afterward.
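A sketch of that validation step, assuming you have pass/fail labels from a human and from the grader for the same outputs. It reports raw agreement and Cohen's kappa, which corrects agreement for chance. The function name and the 40 labels are invented for illustration.

```python
from collections import Counter

def agreement_and_kappa(human: list[str], grader: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) between human and grader labels."""
    if len(human) != len(grader) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == g for h, g in zip(human, grader)) / n
    human_counts, grader_counts = Counter(human), Counter(grader)
    # Agreement expected by chance, given how often each rater uses each label.
    expected = sum(human_counts[label] * grader_counts[label] for label in human_counts) / (n * n)
    # Kappa is undefined when chance agreement is 1; this sketch reports 1.0 in that case.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa

# Made-up labels for 40 outputs, in the same order for both raters.
human = ["pass"] * 25 + ["fail"] * 15
grader = ["pass"] * 22 + ["fail"] * 3 + ["pass"] * 4 + ["fail"] * 11

observed, kappa = agreement_and_kappa(human, grader)
print(f"agreement: {observed:.3f}, kappa: {kappa:.3f}")
```

What counts as enough agreement is a judgment call for your use case. Reading the disagreements one by one is usually more informative than the summary number.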
For the broader scoring picture, see "What Separates a Reliable Prompt From a Lucky One."
Mistake 5: Ignoring Consistency
Many evaluations measure whether a prompt can produce a good answer, not whether it reliably does.
A prompt that passes 60 percent of the time but happened to pass on your single run looks identical to a rock-solid prompt if you never measure variance. The cost shows up as intermittent production failures that are maddening to debug. Run each input multiple times and track the spread, treating a prompt's worst-case behavior as seriously as its best.
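The arithmetic behind that claim is short, assuming runs are independent. The sketch below shows how quickly a 60 percent prompt stops looking perfect as you add runs, and one possible way to summarize spread across inputs. The function names and the sample results are illustrative.

```python
def chance_of_clean_run(pass_rate: float, runs: int) -> float:
    """Probability of passing every one of `runs` independent runs."""
    return pass_rate ** runs

for runs in (1, 3, 5, 10):
    print(runs, round(chance_of_clean_run(0.6, runs), 3))

def summarize(results: dict[str, list[bool]]) -> dict[str, float]:
    """Mean, worst, and best per-input pass rate; each input needs at least one run."""
    rates = [sum(runs) / len(runs) for runs in results.values()]
    return {"mean": sum(rates) / len(rates), "worst": min(rates), "best": max(rates)}

# Made-up results: five runs for each of three inputs.
results = {
    "short question": [True, True, True, True, True],
    "long email": [True, False, True, True, False],
    "empty input": [False, False, True, False, False],
}
print(summarize(results))
```

A single run hides a 60 percent prompt more often than not, while ten clean runs in a row would be rare for it. The `worst` figure is the one to watch, since the mean can look acceptable while one input type fails most of the time.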
Mistake 6: A Stale or Unrepresentative Test Set
Test sets rot. The inputs you curated six months ago may no longer reflect the messy, evolving reality of production traffic.
When your test set drifts from real inputs, a high score gives false comfort while real users hit cases you never tested. The fix is to refresh your test set periodically with real production samples, especially the failures and the new input types you did not anticipate.
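One rough way to spot drift is to bucket both sets the same way and look for buckets that are common in production but missing from the test set. In this sketch, the length-based `length_bucket` function, its thresholds, and the 5 percent cutoff are all assumptions; bucketing by topic or format may suit your traffic better.

```python
from collections import Counter
from typing import Callable

def length_bucket(text: str) -> str:
    """Crude illustrative bucketing by word count."""
    n = len(text.split())
    if n < 20:
        return "short"
    if n < 200:
        return "medium"
    return "long"

def coverage_gaps(
    test_inputs: list[str],
    production_inputs: list[str],
    bucket: Callable[[str], str] = length_bucket,
    min_share: float = 0.05,
) -> list[str]:
    """Buckets with at least `min_share` of production traffic and no test case."""
    test_buckets = {bucket(t) for t in test_inputs}
    prod_counts = Counter(bucket(p) for p in production_inputs)
    total = sum(prod_counts.values())
    return sorted(
        b for b, count in prod_counts.items()
        if count / total >= min_share and b not in test_buckets
    )

test_inputs = ["reset my password", "where is my order"]
production_sample = ["reset my password"] * 8 + [" ".join(["word"] * 50)] * 2

print(coverage_gaps(test_inputs, production_sample))
```

Any bucket this reports is a candidate for new test cases drawn from real production samples.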
Mistake 7: Optimizing Only for Quality, Ignoring Cost and Latency
It is easy to fixate on the quality score and pick the variant with the highest number.
But a prompt that scores two points higher while tripling cost and doubling latency may be the wrong choice for a high-volume feature. Evaluate quality alongside token cost, latency, and failure rate, and choose the cheapest, fastest variant that clears your quality floor rather than the single highest scorer.
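The selection rule can be written down directly. This sketch assumes you already have a quality score, cost per call, and latency for each variant; the `Variant` fields and all numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Variant:
    name: str
    quality: float        # score on the held-out set, 0 to 100
    cost_per_call: float  # dollars
    latency_s: float      # seconds

def pick_variant(variants: list[Variant], quality_floor: float) -> Optional[Variant]:
    """Cheapest variant that clears the quality floor; ties broken by lower latency."""
    eligible = [v for v in variants if v.quality >= quality_floor]
    return min(eligible, key=lambda v: (v.cost_per_call, v.latency_s), default=None)

variants = [
    Variant("A", quality=88, cost_per_call=0.002, latency_s=1.0),
    Variant("B", quality=90, cost_per_call=0.006, latency_s=2.0),  # +2 quality, 3x cost, 2x latency
    Variant("C", quality=79, cost_per_call=0.001, latency_s=0.8),  # cheapest, but below the floor
]

choice = pick_variant(variants, quality_floor=85)
print(choice.name if choice else "no variant clears the floor")
```

With a floor of 85 this picks A rather than the top scorer B. Returning `None` when nothing clears the floor makes lowering the floor an explicit decision. Failure rate could be added as another field and filter in the same way.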
To build these corrections into a durable routine, read "Evaluating Prompt Quality: Best Practices That Actually Work," and to keep yourself honest each time, use "The Evaluating Prompt Quality Checklist for 2026."
What These Mistakes Share
Step back and a pattern emerges across all seven. Each one is a way of making an evaluation feel rigorous while quietly measuring the wrong thing. Testing once feels concrete but measures luck. Tuning against the test set feels diligent but measures memorization. Vague criteria feel flexible but measure your mood. An unvalidated grader feels scalable but measures the grader's biases. Ignoring consistency feels efficient but measures a single lucky draw. A stale test set feels established but measures a world that no longer exists. Chasing quality alone feels principled but measures a number that may not be the constraint that matters for your product.
The defense against all of them is the same instinct: ask what your evaluation is actually measuring, not what it appears to measure. Every time you produce a score, you should be able to name the specific way it could be fooling you and confirm you have guarded against it. That habit of suspicion toward your own numbers is what separates evaluations that protect you from evaluations that merely reassure you.
A useful practice is to write a short note alongside each evaluation that states its limits explicitly: what the test set does not cover, which criteria were scored subjectively, and where the result might not generalize. This feels pessimistic, but it is the opposite of self-defeating. By naming the gaps, you turn unknown risks into known ones, and known risks can be watched, tested, or accepted deliberately rather than discovered painfully in production. An evaluation that honestly states its blind spots is far more trustworthy than one that presents a clean number with no caveats, because the clean number is almost always hiding something.
Finally, remember that none of these mistakes are signs of carelessness. They are the default behaviors of a reasonable person under time pressure, which is exactly why they are so common. Avoiding them is not about being smarter; it is about building habits and artifacts that make the honest path the easy one.
Frequently Asked Questions
Why is tuning against the test set such a serious problem?
Because it inflates your score without improving the prompt. When you edit until every test case passes, you have effectively memorized those examples, and the prompt will likely fail on inputs it has not seen. Keeping a separate held-out set that you never tune against is the most dependable way to measure how the prompt performs on new inputs.
How do I know if my test set has gone stale?
Compare it against recent production traffic. If real inputs include formats, lengths, or topics your test set does not cover, or if production failures involve cases you never tested, the set has drifted. Refreshing it with real samples, especially recent failures, keeps your evaluation honest.
Is using a model to grade outputs a bad idea?
No, it is a useful technique that scales far better than human grading. The mistake is trusting it blindly. Validate the grader against human ratings on a sample of 30 to 50 outputs first, confirm it agrees with people most of the time, and spot-check it periodically. Done that way, model grading is dependable enough for bulk scoring.
How much should cost and latency influence my choice?
As much as the use case demands. For a rare, high-stakes call, a small quality gain may justify higher cost. For a high-volume background task, tripling cost for two extra quality points is usually a bad trade. Set a quality floor, then optimize cost and latency beneath it.
Key Takeaways
- Never judge a prompt on a single output; test against a set and run inputs multiple times.
- Keep a held-out evaluation set you never tune against to avoid inflated scores.
- Write specific, testable success criteria before you evaluate anything.
- Validate any model grader against human ratings before trusting it at scale.
- Measure consistency and refresh your test set with real production samples.
- Balance quality against cost and latency rather than chasing the highest score alone.