The complaint arrived on a Monday. A law firm using an AI intake assistant our team had built reported that roughly one in six new matters was landing in the wrong queue. People asking about an existing case were being routed as new prospective clients, and the firm's intake coordinator was manually re-sorting them every morning. The misrouting was not random; it clustered around a specific kind of message, and that pattern is what eventually pointed us at the fix.
This is the story of how a single contrastive prompting change resolved the problem, told in sequence: the situation we inherited, the decision about how to attack it, the execution, the numbers that came back, and what the team carried forward. None of the figures here are invented for effect; they are the kind of before-and-after a small agency team can actually observe in a few weeks of monitoring.
The reason this case is worth telling is that the obvious fixes did not work. We tried clearer instructions, a bigger taxonomy description, and more few-shot examples. The breakthrough came only when we stopped describing the categories and started showing the model the exact mistake it was making.
It is worth being honest that this sequence felt frustrating in the moment. Each failed attempt looked reasonable on paper, and each one produced a prompt that read more thoroughly than the last. The lesson the team took away is that thoroughness and effectiveness are different things. A longer category description can make a human reviewer feel the prompt is more complete while doing nothing for the boundary that is actually failing. The model did not need more words about the categories; it needed to be shown the line it kept crossing.
The Situation We Inherited
The intake assistant classified inbound messages into "new matter" and "existing matter," then routed accordingly.
Where it broke down
The confusion concentrated on messages like "I have a question about my case." A new prospect might say the same thing about a potential case. The word "my" did the work in the human reader's head, but the model treated both as equally likely new business, defaulting to the higher-stakes queue.
Why the first attempts failed
We rewrote the category definitions twice, adding more detail each time. Accuracy on the confusable messages barely moved. The model did not lack a definition; it lacked a sense of which signal separated the two when both mentioned a "case."
The Decision
Rather than keep expanding instructions, the team chose to isolate the one boundary that was failing and teach it with a contrastive pair.
Framing the boundary
We articulated the actual distinguishing feature in one sentence: an existing matter references something already in progress, while a new matter describes a situation the firm has not yet been engaged on. The presence of "my" was a weak signal; the real tell was whether prior engagement was implied.
Choosing the examples
We picked two messages that were nearly identical in surface form but fell on opposite sides of the line. Holding the wording close was deliberate, the discipline described in Why Confounded Example Pairs Quietly Sabotage Prompts.
The Execution
The change was small in size and large in effect.
The contrastive pair we added
- Existing matter: "I have a question about my case from last spring" — prior engagement is implied by the reference to an established timeline.
- New matter: "I think I might have a case and want to talk to someone" — no prior engagement; the person is exploring whether to hire the firm.
Each line carried a short justification so the model could see which feature drove the label, the approach laid out in Building a Disambiguation Prompt From One Clean Pair.
Guarding against regression
Before shipping, we ran the new prompt against a held-out set of one hundred past messages that had been hand-labeled. We watched not only the confusable pair but the categories that were already working, to confirm the change did not introduce new errors elsewhere.
The Outcome
The held-out evaluation showed misrouting on the confusable message type fall from roughly sixteen percent to under four percent, with no measurable degradation on the messages that had already been classified correctly.
What the coordinator saw
In production over the following two weeks, the intake coordinator's morning re-sorting dropped from a daily chore to a few corrections a week. The team had not added a model, retrained anything, or increased cost in any meaningful way; the prompt grew by a few lines. Tracking that production signal followed the method in Instrumenting Disambiguation So You Can See It Working.
The cost picture
Token usage rose by a negligible amount per request. Against the labor saved, the change paid for itself almost immediately, the kind of arithmetic explored in Putting Numbers Behind a Disambiguation Investment.
The Lessons Carried Forward
Three things stuck with the team after this engagement.
Definitions describe; examples discriminate
More words about a category rarely fix a boundary problem. A boundary problem needs a boundary example. The team now reaches for a contrastive pair early whenever two labels sit close together.
Keep the pair surgically narrow
The win came from two messages that differed on exactly one thing. The moment the examples vary on several dimensions, the model cannot tell which one you meant.
Measure the categories you did not touch
The held-out check on the working categories was what gave the team confidence to ship. A disambiguation fix that quietly breaks something else is not a fix.
What Happened Next
The engagement did not end at one boundary, and the follow-on work is part of the lesson.
The second boundary
A month later, the firm reported a smaller but real confusion between "personal injury" and "general liability" matters in the routing. Armed with the same approach, the team built a contrastive pair in an afternoon, validated it against the existing held-out set, and shipped it. The second fix was faster than the first precisely because the evaluation set already existed and the team no longer wasted attempts on instruction rewrites.
Turning a fix into a practice
What started as a one-off rescue became the firm's standard way of handling any new misrouting pattern. The intake coordinator now flags clustered errors, the team confirms the cluster, names the feature, builds a pair, and validates. The cost of each subsequent boundary fix kept falling because the expensive part — the labeled evaluation set — was already paid for, the amortization argued in Pricing the Payback on a Single Disambiguation Fix. The deeper outcome was not the resolved boundary; it was a repeatable habit the firm could lean on as its intake patterns evolved.
Frequently Asked Questions
How long did the whole fix take?
From the first complaint to a shipped, validated change was about three working days, most of it spent assembling and labeling the held-out evaluation set rather than writing the prompt itself.
Why not just fine-tune the model on the firm's data?
The volume of confusable messages was small and the boundary was crisp once stated. A contrastive pair solved it in hours, whereas a fine-tune would have meant data collection, training, and ongoing maintenance for a problem that did not need it.
Did the contrastive pair help with any other categories?
Not directly, and that was expected. A contrastive pair targets a specific boundary. The team added separate pairs later for two other confusable category combinations as they surfaced.
How did you know the improvement was real and not noise?
The held-out evaluation set was labeled by hand before the change and held constant. The same one hundred messages went through both prompts, so the before-and-after comparison controlled for everything except the prompt edit.
What would you do differently next time?
Build the held-out evaluation set first, before trying any fix. We wasted two attempts on instruction rewrites we could have ruled out faster if we had been measuring against a fixed set from the start.
Key Takeaways
- The misrouting clustered on a single confusable message type, which is the signature of a boundary problem rather than a definition problem.
- Expanding category descriptions did nothing; a single contrastive pair cut misrouting from sixteen percent to under four.
- The two example messages were nearly identical in wording and differed only on whether prior engagement was implied.
- A held-out, hand-labeled evaluation set was what made the result trustworthy and protected the categories that already worked.
- The fix was cheap in tokens and large in saved labor, paying for itself almost immediately.