Teams spend a lot of energy arguing about whether a prompt change helped, and almost none measuring it. Instruction hierarchy is especially prone to this, because priority failures are intermittent: a rule that loses one time in five feels fine in casual testing and causes real damage in production. The only way out is to put numbers on the behavior.
This article defines the metrics that actually reveal how well your hierarchy is working, how to instrument them without building a research lab, and how to read the signal each one gives. The instrumentation is lighter than you might expect, because the core idea is just running conflict-probing inputs repeatedly and counting outcomes.
We will cover the headline metric, the diagnostic metrics that locate problems, and the operational metrics that catch regressions in production, then how to read the signal and set targets that match the stakes. The throughline is that a metric you cannot act on is noise. Every number here exists to drive a specific decision: ship or block, which conflict to fix first, whether an edit helped or quietly broke something. If a metric does not change a decision, drop it.
The Headline Metric: Priority Win Rate
The single most important number is the fraction of conflict cases where the higher-priority instruction wins.
How to Define It
Build a set of test inputs, each designed to pit two instructions against each other with a known correct winner. Run each input multiple times. Priority win rate is the share of runs where the correct rule won. A rule that should always win but wins ninety percent of the time has a ten percent failure rate that production will find.
How to Read It
Track it per conflict type, not just in aggregate. An overall ninety-five percent can hide one conflict type sitting at sixty percent. The conflict-probing test suite from our case study is exactly the instrument that produces this number.
Diagnostic Metrics: Locating the Problem
Once the headline tells you there is a problem, these metrics tell you where.
Conflict Coverage
The fraction of your prompt's known instruction pairs that have at least one test case. Low coverage means your win rate is measured on a biased sample and the real number could be worse. Aim to cover every pair you identified during conflict enumeration.
Variance Across Runs
How much the output changes when you run the same conflict input repeatedly. High variance is itself a signal: it means the model has no stable tiebreaker and is guessing, which points straight at a missing precedence statement. The enumeration method behind these pairs is described in A Working Checklist for Keeping Prompt Instructions in Order.
Constraint Violation Rate
For Tier 1 hard constraints specifically, the fraction of runs where the constraint was violated at all. This number should be effectively zero. Any nonzero value is a higher priority to fix than any preference-level conflict.
Operational Metrics: Catching Regressions
These run against production traffic rather than test inputs.
Post-Edit Win Rate Delta
The change in priority win rate before and after a prompt edit. This catches the common case where fixing one conflict quietly breaks another. Run your full conflict suite on every edit and watch the delta, not just the absolute number.
Sampled Production Conflict Rate
Periodically sample real production outputs and check whether any show signs of a lower-priority rule winning. This catches conflicts your test suite never anticipated, especially from novel user inputs. Tooling that supports this sampling is surveyed in What to Look For in Tooling That Catches Prompt Conflicts.
Instrumenting Without Overbuilding
You do not need a heavy platform to get these numbers.
The Minimum Viable Setup
A list of conflict test inputs with expected winners, a script that runs each input several times, and a simple grader, whether a string check or a model-based judge, that records whether the correct rule won. That alone produces priority win rate, variance, and constraint violation rate.
Where to Invest Next
Once the basics run, add per-conflict-type breakdowns and wire the suite into your edit workflow so the post-edit delta is automatic. The structure of what you are measuring maps directly onto The Tiered Precedence Model for Untangling Prompt Conflicts.
Reading the Signal Correctly
Numbers are only useful if you interpret them honestly, and conflict metrics have a few traps worth naming.
Aggregate Win Rate Hides the Cases That Matter
A ninety-five percent overall win rate sounds healthy, but if all five percent of failures land on one safety-critical conflict, you have a serious problem dressed up as a good number. Always read win rate sliced by conflict type and by tier, never as a single headline figure. The worst conflict, not the average, is what determines whether you can ship.
Low Variance Is Not the Same as Correct
A prompt can fail a conflict the same way every time, producing low variance and a stable but wrong result. Variance tells you whether the model has a stable tiebreaker, not whether the tiebreaker is the one you wanted. Pair variance with win rate: low variance plus low win rate means the model consistently prefers the wrong rule, which is a precedence-statement problem.
A Rising Win Rate Can Still Hide a Regression
Because fixing one conflict can break another, an improved aggregate win rate can mask a brand-new failure in a different conflict type. This is exactly why the post-edit delta must be computed per conflict type, not just overall. A green aggregate with one newly red category is a regression, not a win.
Setting Targets That Match Stakes
Not every metric deserves the same threshold, and treating them uniformly wastes effort or hides risk.
Tier 1 Constraints: Zero Tolerance
Hard constraints should target a constraint violation rate of zero across all runs. There is no acceptable rate of safety or policy violations, so any nonzero value blocks release until resolved. These get the most test runs because rare failures still count.
Preferences: High but Pragmatic
Tone, formatting, and length preferences can tolerate occasional losses, especially when they lose to a legitimate higher-priority rule. Chasing perfection here costs effort better spent on constraints. Set a sensible floor, monitor the trend, and move on. Matching thresholds to stakes is the same risk-weighting logic that governs tool selection in What to Look For in Tooling That Catches Prompt Conflicts.
Frequently Asked Questions
Why measure win rate instead of just reading outputs?
Because priority failures are intermittent. Reading a handful of outputs that happen to look fine tells you nothing about the one-in-five case that fails, and that case is where production damage comes from.
How many times should I run each conflict input?
Enough to make an intermittent failure visible. Five to ten runs per input is a reasonable starting point; increase it for hard constraints where even rare failures are unacceptable.
What is a good target for priority win rate?
For preferences, high but not necessarily perfect. For Tier 1 hard constraints, effectively one hundred percent, because constraint violations are failures rather than tradeoffs and should be driven to zero.
How do I grade outputs without manual review of every run?
Use a programmatic check where the correct behavior is detectable by string or structure, and a model-based judge where it requires interpretation. Reserve manual review for spot-checking the grader itself. The trick is to design conflict cases so the correct outcome is mechanically detectable wherever possible: if completeness should win, test with an input where a complete answer contains a specific step a brief answer would omit, then check for that step. Mechanical grading scales to thousands of runs in a way human review never will.
Should priority metrics gate a release, or just inform it?
Both, but with different thresholds. Tier 1 constraint violations should hard-gate a release, because shipping a known safety failure is never acceptable. Preference-level win rates should inform rather than block, surfacing in a dashboard so you can weigh a minor regression against the value of shipping. Conflating the two either blocks releases over trivia or lets real violations through.
Key Takeaways
- Priority failures are intermittent, so they must be measured, not eyeballed.
- Priority win rate, the share of conflict cases where the right rule wins, is the headline metric.
- Track win rate per conflict type, since aggregates hide individual weak spots.
- Use conflict coverage, run variance, and constraint violation rate to locate and prioritize problems.
- Watch the post-edit win rate delta to catch fixes that break other conflicts.
- A list of conflict inputs, a multi-run script, and a simple grader are enough to start.