You already have a held-out set, a rubric, and a habit of scoring blind. That gets you a long way. But once you push evaluation into agentic systems, high-stakes decisions, and adversarial conditions, the simple recipe starts to leak. A single-turn accuracy number cannot tell you whether an agent took a reckless path to the right answer. An LLM judge that worked last quarter may have silently drifted. A benchmark you trust may be quietly contaminated. This is where evaluation stops being a checklist and becomes a craft.
This article is for practitioners past the fundamentals. We will go deep on advanced topics in AI model leaderboards and evaluation: scoring trajectories rather than outputs, calibrating and stress-testing LLM-as-judge, defending against contamination and gaming, and handling the statistical edge cases that trip up confident teams. The assumption is that you have read the framework and the best practices and want the hard parts.
Trajectory Evaluation for Agentic Systems
When a model takes multiple steps and calls tools, the final answer is only part of the story. The path matters.
Why output-only scoring fails for agents
An agent can reach a correct answer through a dangerous or wasteful route: deleting data it should have read, making twelve API calls where three would do, or recovering from a self-inflicted error by luck. Output-only scoring rewards all of that. To evaluate agents honestly, you score the trajectory.
What to score on the path
- Action safety: did any step take an irreversible or out-of-scope action? A single unsafe action can outweigh a correct result.
- Step efficiency: how many steps and tool calls did it take? Fewer steps cost less and leave fewer places for the run to fail.
- Error recovery quality: when the agent went wrong, did it detect and correct intelligently, or stumble into success?
- Tool-use correctness: were the right tools called with the right arguments, independent of the final answer?
Building this looks more like writing integration tests than grading essays. You assert on intermediate states, not just the end.
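The idea of asserting on intermediate states can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Step` record, the tool names, and the policy sets are all hypothetical, and the sketch assumes a read-only task where any irreversible call counts as unsafe. Error recovery quality is reduced here to a count of failed calls; judging whether a recovery was intelligent or lucky still needs a reviewer or a judge.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str        # name of the tool the agent called
    args: dict       # arguments passed to the tool
    ok: bool = True  # whether the call succeeded


# Hypothetical policy for a read-only task (assumption, not from any real system).
IRREVERSIBLE_TOOLS = {"delete_record", "send_email"}
ALLOWED_TOOLS = {"read_record", "search", "delete_record", "send_email"}


def score_trajectory(steps, expected_calls, answer_correct, step_budget):
    """Score the path, not just the answer."""
    # Action safety: flag irreversible or out-of-scope calls.
    unsafe = [
        s for s in steps
        if s.tool in IRREVERSIBLE_TOOLS or s.tool not in ALLOWED_TOOLS
    ]
    # Tool-use correctness: share of expected (tool, args) calls that were made.
    made = [(s.tool, s.args) for s in steps]
    matched = sum(1 for call in expected_calls if call in made)
    tool_use = matched / len(expected_calls) if expected_calls else 1.0
    within_budget = len(steps) <= step_budget
    return {
        "action_safety": not unsafe,
        "within_step_budget": within_budget,
        "failed_calls": sum(1 for s in steps if not s.ok),
        "tool_use_correctness": tool_use,
        # A correct answer does not pass if the path was unsafe or over budget.
        "passed": answer_correct and not unsafe and within_budget,
    }
```

Note the last field: a correct answer reached through an unsafe step still fails, which is the behavior output-only scoring cannot express.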
Mastering LLM-as-Judge
Automated judging scales evaluation, but a careless judge manufactures false confidence at scale, which is worse than no judge at all.
Calibrate against humans, repeatedly
Before trusting a judge, score a sample by hand and measure agreement. Then re-measure on a cadence, because both the judge model and your task drift. A judge is an instrument; uncalibrated instruments lie.
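One common way to quantify agreement is Cohen's kappa, which discounts the agreement two raters would reach by chance. The article does not prescribe a specific statistic, so treat this as one reasonable choice; the function name and label values are illustrative.

```python
from collections import Counter


def cohen_kappa(human_labels, judge_labels):
    """Chance-corrected agreement between human and judge labels."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human_labels)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Logging this value each time you re-measure gives you a trend line, so a drifting judge shows up as a falling number rather than as a surprise.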
Defend against known judge biases
LLM judges have documented biases: they favor longer responses, prefer the first option presented, and reward confident tone over correctness. Counter these by randomizing position, normalizing for length where possible, and writing rubrics that explicitly reward accuracy over fluency. Our risks article treats these failure modes as governance concerns.
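A stricter cousin of randomizing position is to ask the judge twice with the order swapped and accept a verdict only when both orderings agree. The sketch below assumes a hypothetical `judge(prompt, first, second)` callable that returns the string `"first"` or `"second"`; it stands in for whatever judge call you actually use.

```python
def position_swapped_verdict(judge, prompt, response_a, response_b):
    """Return "a", "b", or "inconsistent" after judging both orderings.

    `judge` is a hypothetical callable: judge(prompt, first, second)
    returning "first" or "second".
    """
    a_wins_when_first = judge(prompt, response_a, response_b) == "first"
    a_wins_when_second = judge(prompt, response_b, response_a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "a"
    if not a_wins_when_first and not a_wins_when_second:
        return "b"
    # The verdict flipped with the ordering, so position drove the outcome.
    return "inconsistent"
```

The rate of `"inconsistent"` verdicts is a rough measure of how much position bias your judge carries. This doubles judge cost, so random ordering may be the better trade for routine scoring.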
Use a jury for high-stakes calls
For decisions that matter, aggregate multiple judge models or multiple runs rather than trusting a single pass. Disagreement among judges is itself a useful signal that a case is genuinely hard.
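A jury can be as simple as a majority vote that also reports how strongly the judges agreed. In this sketch the vote labels and the `min_agreement` threshold of 0.75 are illustrative assumptions; pick a threshold that matches how costly an error is for you.

```python
from collections import Counter


def jury_verdict(votes, min_agreement=0.75):
    """Aggregate judge votes and flag low-agreement cases for human review."""
    if not votes:
        raise ValueError("at least one vote is required")
    label, top_count = Counter(votes).most_common(1)[0]
    agreement = top_count / len(votes)
    return {
        "label": label,
        "agreement": agreement,
        # Disagreement is a signal that the case is hard, not just noise.
        "needs_human_review": agreement < min_agreement,
    }
```

The votes can come from different judge models or from repeated runs of one judge; either way, the flagged cases are where human attention pays off most.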
Defending Against Contamination and Gaming
If your eval can be gamed or memorized, its number is theater.
Detecting contamination
Suspect contamination when a model scores unusually well on a public set but poorly on near-identical private variants. A practical defense is to perturb examples, such as changing names, numbers, or phrasing, and check whether performance collapses. Memorized answers do not survive perturbation.
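A perturbation check can start with simple surface edits. The sketch below swaps names and shifts integers, then compares accuracy on original and perturbed versions; the helper names are hypothetical. One important caveat: changing a number usually changes the correct answer, so the expected answers for perturbed examples must be recomputed rather than copied from the originals.

```python
import re


def perturb(text, name_map, number_shift):
    """Swap whole-word names and shift every standalone integer."""
    for old, new in name_map.items():
        text = re.sub(rf"\b{re.escape(old)}\b", new, text)
    return re.sub(
        r"\b\d+\b",
        lambda m: str(int(m.group()) + number_shift),
        text,
    )


def contamination_gap(original_correct, perturbed_correct):
    """Accuracy on originals minus accuracy on perturbed variants.

    Inputs are lists of booleans, one per example. A large positive gap
    suggests reliance on memorized surface forms.
    """
    original_acc = sum(original_correct) / len(original_correct)
    perturbed_acc = sum(perturbed_correct) / len(perturbed_correct)
    return original_acc - perturbed_acc
```

A small gap is expected from ordinary noise, so judge the gap against the variance you see between repeated runs before calling a benchmark contaminated.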
Keeping a private holdout truly private
The instant your test set appears in a prompt, a fine-tune, or a shared doc, it is burned. Rotate a fresh slice you never expose, and treat the sealed set with the discipline of a secret. The getting started guide introduces this; at the advanced level you enforce it ruthlessly.
Synthetic and adversarial generation
Generate fresh test cases targeting specific weaknesses rather than reusing fixed sets. Adversarial generation, where you deliberately construct inputs designed to break the model, surfaces failure modes that representative sampling misses.
The Statistics Practitioners Get Wrong
Confident teams make subtle statistical errors.
- Ignoring variance. A 52 percent win rate on 30 examples is well within noise: at that sample size the 95 percent margin of error is roughly 18 percentage points. Report confidence intervals and run enough samples to distinguish signal from chance.
- Multiple comparisons. Test enough models or prompts and one will look great by luck. Correct for the number of comparisons, for example with a Bonferroni or Holm adjustment, or you will ship noise.
- Aggregation hiding regressions. A flat overall score can mask a collapse on a critical segment. Always evaluate by segment, not just in aggregate.
- Optimizing the metric instead of the goal. When a metric becomes a target, models and teams learn to satisfy it without satisfying the underlying intent. Keep a qualitative review in the loop.
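The first two points in this list are easy to check numerically. The sketch below uses the Wilson score interval for a win rate and a Bonferroni adjustment for multiple comparisons; both are standard choices rather than the only ones, and the function names are illustrative.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95 percent)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


def bonferroni_alpha(alpha, num_comparisons):
    """Per-comparison significance level under a Bonferroni correction."""
    if num_comparisons <= 0:
        raise ValueError("num_comparisons must be positive")
    return alpha / num_comparisons
```

For 16 wins out of 30, `wilson_interval(16, 30)` spans roughly 36 to 70 percent, which comfortably includes a coin flip. Comparing ten prompt variants at an overall 0.05 level means each individual comparison has to clear 0.005.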
Designing Evaluations That Survive Model Upgrades
A subtle advanced concern is longevity. You invest heavily in an eval, then a model upgrade changes behavior in ways your test set never anticipated, and your carefully built evaluation quietly stops measuring the things that now matter. Robust evaluation design plans for this.
Separate stable criteria from volatile ones
Some of what you measure is durable, such as "never fabricate a number" or "never take an irreversible action without confirmation." Other criteria are tied to a specific model's quirks. Structure your rubric so the durable criteria form a stable spine that survives upgrades, while model-specific checks live in a clearly separated, easily revised layer. When a new model arrives, you revise the volatile layer without rebuilding the spine.
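One way to make the separation concrete is to keep the two layers as distinct collections of checks. This is a structural sketch only: the class name is hypothetical, and real checks would be far richer than the toy string tests shown in the usage note.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredRubric:
    spine: dict                                   # durable criteria: name -> check(output) -> bool
    volatile: dict = field(default_factory=dict)  # model-specific criteria

    def evaluate(self, output):
        """Run both layers and report failures separately."""
        spine_failures = [n for n, check in self.spine.items() if not check(output)]
        volatile_failures = [n for n, check in self.volatile.items() if not check(output)]
        return {
            "spine_failures": spine_failures,
            "volatile_failures": volatile_failures,
            "passed": not spine_failures and not volatile_failures,
        }

    def for_new_model(self, volatile):
        """Replace the volatile layer while reusing the spine untouched."""
        return LayeredRubric(spine=self.spine, volatile=volatile)
```

As a toy usage, a spine check might be `lambda out: "DELETE" not in out or "CONFIRMED" in out`, while a volatile check might penalize a preamble habit specific to one model. Reporting the two failure lists separately makes it obvious whether an upgrade broke something durable or merely invalidated a model-specific check.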
Maintain a regression suite of past failures
Every real failure you catch should become a permanent test case. Over time this regression suite becomes your most valuable asset: a memory of every way your system has broken, which a new model must clear before it ships. This is how evaluation compounds rather than resets with each upgrade.
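A regression suite does not need special tooling to start. The sketch below stores each caught failure as one JSON line and reports which cases a candidate model still fails. The file layout, the field names, and the `must_not_contain` check are all assumptions made for illustration; `model` is any callable that maps a prompt to an output string.

```python
import json
from pathlib import Path


def load_suite(suite_path):
    """Read every recorded failure case from a JSON Lines file."""
    path = Path(suite_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def record_failure(suite_path, case_id, prompt, must_not_contain):
    """Append a caught failure as a permanent test case (duplicate ids are skipped)."""
    if case_id in {case["id"] for case in load_suite(suite_path)}:
        return
    case = {"id": case_id, "prompt": prompt, "must_not_contain": must_not_contain}
    with Path(suite_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")


def failing_cases(suite, model):
    """Return ids of past failures that the candidate model reproduces."""
    return [
        case["id"]
        for case in suite
        if case["must_not_contain"] in model(case["prompt"])
    ]
```

A release gate then reduces to requiring that `failing_cases` returns an empty list before a new model ships.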
Re-validate judges and thresholds after every major upgrade
A judge calibrated against one generation of model outputs may misjudge a newer one, and a threshold tuned to old behavior may be too loose or too strict. Treat a major upgrade as a trigger to re-validate both, not as a free pass.
Frequently Asked Questions
Why is output-only scoring insufficient for agents?
Because an agent can reach a correct answer through an unsafe or wildly inefficient path, and output-only scoring rewards that. You need to evaluate the trajectory: action safety, step efficiency, error recovery, and tool-use correctness. A correct result obtained by a reckless route is not actually a passing result.
How do I keep an LLM judge trustworthy over time?
Calibrate it against human scores on a sample before trusting it, then re-measure agreement on a regular cadence, since both the judge and your task drift. Also counter known biases toward length, position, and confident tone. Treat the judge as an instrument that needs ongoing calibration, not as ground truth.
How can I tell if a benchmark is contaminated?
Look for a model that scores very well on a public set but poorly on near-identical private variants, and perturb your examples by changing names, numbers, or phrasing. If performance collapses under perturbation, the model was likely relying on memorization. Genuine capability survives small surface changes.
When should I use multiple judges instead of one?
For high-stakes decisions where a single judge's error is costly. Aggregating multiple judge models or multiple runs reduces idiosyncratic mistakes, and disagreement among them flags genuinely hard cases worth human review. For routine, low-stakes scoring, a single calibrated judge is usually fine.
What is the most common statistical mistake in evaluation?
Ignoring variance and over-reading small differences. A narrow win rate on a few dozen examples is often pure noise, yet teams ship on it. Report confidence intervals, run enough samples, correct for multiple comparisons, and segment results so an aggregate does not hide a regression.
Key Takeaways
- For agentic systems, score the trajectory, including action safety, step efficiency, error recovery, and tool use, not just the final output.
- Treat LLM-as-judge as an instrument: calibrate against humans repeatedly, counter length and position biases, and use a jury for high-stakes calls.
- Defend against contamination by perturbing examples, keeping holdouts truly private, and generating fresh adversarial test cases.
- Respect the statistics: report variance, correct for multiple comparisons, segment before concluding, and keep a human in the loop so the metric does not replace the goal.