Showing the Model Both Wrong and Right Reads

Agency Script Editorial

Editorial Team

February 23, 2020 · 7 min read

Ambiguity is the quiet killer of prompt reliability. A model does not announce that it misread your intent. It produces a fluent, confident answer to a question you did not ask, and the gap only surfaces later when a client flags the output. Contrastive prompting attacks this problem directly: instead of describing what you want in the abstract, you show the model a wrong interpretation next to a right one, so the boundary between them becomes explicit.

This article walks through concrete scenarios where that technique either rescued an ambiguous task or quietly failed. The point is not to hand you a template. It is to develop intuition for when a contrastive pair sharpens behavior and when it adds noise. Each example below comes from the kinds of work AI agencies actually ship: classification, extraction, tone control, and routing.

We will look at what made each case succeed or break down, because the failures are more instructive than the wins. A contrastive example placed badly can anchor the model on the wrong axis of difference, and you will spend hours wondering why a "clear" instruction produced garbage.

Disambiguating a Classification Boundary

The most common place ambiguity bites is when two labels are close. A support-ticket classifier had to separate "billing question" from "refund request." Plain instructions kept collapsing the two, because every refund mentions money.

What the contrastive pair looked like

The fix was to show one example of each, side by side, with a one-line note on why each landed where it did:

  • Billing question: "I was charged twice this month." The user wants an explanation, not money back.
  • Refund request: "I want my money back for last month." The user is demanding action on funds already paid.

Why it worked

The pair isolated the single distinguishing feature — intent to reclaim funds versus intent to understand a charge. The model had been keying on the word "charge" before. Once the contrast made intent the salient axis, accuracy on the confusable pair jumped, and the rest of the taxonomy stayed stable.
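A minimal sketch of how a pair like this might be assembled into a prompt, assuming Python and a plain-text prompt format. The names `LabeledExample` and `build_contrastive_prompt`, the task wording, and the incoming ticket are illustrative assumptions, not part of any library or of the original classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledExample:
    text: str
    label: str
    reason: str  # one-line note on why the text gets this label


def build_contrastive_prompt(task: str, examples: list[LabeledExample], ticket: str) -> str:
    """Render the task, each example with its reason, then the new ticket."""
    lines = [task, "", "Examples:"]
    for ex in examples:
        lines.append(f'- "{ex.text}" -> {ex.label}, because {ex.reason}')
    lines += ["", f'Ticket: "{ticket}"', "Label:"]
    return "\n".join(lines)


# The two examples from the article, one per confusable label.
PAIR = [
    LabeledExample(
        "I was charged twice this month",
        "billing question",
        "the user wants an explanation, not money back.",
    ),
    LabeledExample(
        "I want my money back for last month",
        "refund request",
        "the user is demanding action on funds already paid.",
    ),
]

# The task sentence and the incoming ticket are hypothetical.
prompt = build_contrastive_prompt(
    "Classify the support ticket as 'billing question' or 'refund request'. "
    "Decide by the user's intent, not by whether money is mentioned.",
    PAIR,
    "Why is there a second line item on my invoice?",
)
print(prompt)
```

Keeping the reason next to each example is the point of the sketch: the reason names the axis (intent) that the model should attend to, rather than leaving it to infer the axis from the labels alone.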

Pinning Down an Extraction Target

A second case involved pulling the "decision maker" from meeting notes. The notes mentioned several people, and the model kept extracting whoever spoke most.

The ambiguity

"Decision maker" is underspecified. Loudest? Highest title? The person who said "let's do it"? Without a contrast, the model guessed, and it guessed inconsistently across documents.

The contrastive fix

We showed a note where the VP talked the most but the director gave final sign-off, and labeled the director as the answer with a note: the decision maker is whoever grants approval, regardless of speaking time. A second pair reinforced that title alone does not settle it. Extraction stabilized once the model had a clear negative — "not the most talkative" — to push against.
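One way to encode this kind of example is to attach both the tempting wrong answer and the intended answer to the same input. The sketch below assumes Python; `ContrastiveRead`, `render_read`, and `build_extraction_prompt` are hypothetical names, and both meeting notes are invented for illustration rather than taken from the original documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastiveRead:
    notes: str          # the input that both answers refer to
    wrong_answer: str   # the plausible misreading
    wrong_because: str
    right_answer: str   # the intended extraction
    right_because: str


def render_read(read: ContrastiveRead) -> str:
    """Show the wrong answer first so the right one reads as a correction."""
    return "\n".join([
        f"Notes: {read.notes}",
        f"Wrong answer: {read.wrong_answer} ({read.wrong_because})",
        f"Right answer: {read.right_answer} ({read.right_because})",
    ])


def build_extraction_prompt(rule: str, reads: list[ContrastiveRead], new_notes: str) -> str:
    parts = ["Extract the decision maker from the meeting notes.", f"Rule: {rule}", ""]
    for read in reads:
        parts += [render_read(read), ""]
    parts += [f"Notes: {new_notes}", "Decision maker:"]
    return "\n".join(parts)


RULE = "The decision maker is whoever grants approval, regardless of speaking time."

# Invented note that mirrors the article's case: the VP talks, the director approves.
example = ContrastiveRead(
    notes="The VP walked through the roadmap for most of the call. "
          "The director closed with: 'Approved, go ahead.'",
    wrong_answer="the VP",
    wrong_because="spoke the most, but approved nothing",
    right_answer="the director",
    right_because="gave final sign-off",
)

extraction_prompt = build_extraction_prompt(
    RULE,
    [example],
    "The analyst presented three pricing options. The CFO said she would sign on Friday.",
)
print(extraction_prompt)
```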

Controlling Tone Without Over-Steering

Tone instructions are notoriously slippery. A client wanted "confident but not arrogant" marketing copy, and the model swung between bland and boastful.

Pairing two near-misses

Rather than define confidence abstractly, we gave two rejected drafts and one accepted draft:

  • Rejected as arrogant: "No competitor comes close to what we deliver."
  • Rejected as timid: "We think we might be able to help with some of your needs."
  • Accepted: "We have shipped this for forty agencies, and we will show you the results before you commit."

The accepted line carried evidence; the rejected ones carried either empty superlatives or hedging. The contrast taught the model that confidence meant specificity, not volume.

The lesson

Tone is easier to teach by triangulation than by definition. Two flanking failures plus one success draws a tighter box than any adjective. For more on this pattern see Writing Negative Examples That Actually Constrain a Model.
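The triangulation can be sketched as a small prompt block that refuses to build unless the one accepted draft is flanked by at least two rejected ones. This assumes Python; `build_tone_block` and the closing instruction line are illustrative assumptions, while the three drafts are the ones quoted above.

```python
# (verdict, tone label, draft text)
DRAFTS = [
    ("rejected", "arrogant", "No competitor comes close to what we deliver."),
    ("rejected", "timid", "We think we might be able to help with some of your needs."),
    ("accepted", "confident",
     "We have shipped this for forty agencies, and we will show you the results before you commit."),
]


def build_tone_block(drafts: list[tuple[str, str, str]]) -> str:
    """Render flanking failures and one success as prompt lines."""
    verdicts = [verdict for verdict, _, _ in drafts]
    if verdicts.count("accepted") != 1 or verdicts.count("rejected") < 2:
        raise ValueError("need exactly one accepted draft and at least two rejected drafts")
    lines = []
    for verdict, tone, text in drafts:
        if verdict == "rejected":
            lines.append(f'Rejected as {tone}: "{text}"')
        else:
            lines.append(f'Accepted: "{text}"')
    # Hypothetical closing instruction that names what the accepted draft has in common.
    lines.append("Write new copy like the accepted draft: specific evidence, no superlatives, no hedging.")
    return "\n".join(lines)


tone_block = build_tone_block(DRAFTS)
print(tone_block)
```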

When the Contrast Made Things Worse

Not every pairing helped. A routing prompt that sorted leads by industry got worse after we added contrastive examples.

What went wrong

The two examples we picked differed on three things at once — industry, company size, and region. The model could not tell which difference mattered, so it started weighting region. We had introduced a confound. A contrastive pair only works if it varies on exactly the dimension you care about and holds everything else constant.

The repair

We rebuilt the pair so both examples were the same size and region and differed only by industry. The confusion cleared immediately. This is the single most important discipline in the technique, and it is covered in depth in Trade-offs Between Contrastive Pairs and Plain Instructions.
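A check like the following could catch a confounded pair before it reaches the prompt. It is a sketch under assumptions: Python, examples annotated by hand as attribute dictionaries, and hypothetical attribute values (the article does not say which industries, sizes, or regions were involved).

```python
def differing_axes(a: dict, b: dict) -> set:
    """Return every attribute on which the two examples disagree."""
    return {key for key in a.keys() | b.keys() if a.get(key) != b.get(key)}


def confounds(a: dict, b: dict, target_axis: str) -> list:
    """Axes other than target_axis on which the pair differs. Empty means clean."""
    return sorted(differing_axes(a, b) - {target_axis})


# Hypothetical annotations for a pair that varies three things at once.
confounded_a = {"industry": "healthcare", "size": "enterprise", "region": "EMEA"}
confounded_b = {"industry": "retail", "size": "small business", "region": "APAC"}

# The rebuilt pair: same size and region, different industry.
clean_a = {"industry": "healthcare", "size": "enterprise", "region": "EMEA"}
clean_b = {"industry": "retail", "size": "enterprise", "region": "EMEA"}

print(confounds(confounded_a, confounded_b, "industry"))  # ['region', 'size']
print(confounds(clean_a, clean_b, "industry"))            # []
```

The annotation step is manual, which is a limitation: the check only catches confounds on the attributes someone thought to write down.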

Stacking Examples Without Diluting the Signal

Another scenario: a team kept adding contrastive pairs, hoping more would help. By the eighth pair, performance had plateaued and latency was climbing.

Diminishing returns

Each pair after the first three addressed an edge case that occurred in under one percent of traffic. The added tokens cost money and slowed responses without moving the metric that mattered. We trimmed back to four high-value pairs and recovered both speed and clarity. Measuring this trade-off is the subject of Reading the Signal From Disambiguation KPIs.
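The trimming decision can be framed as ranking pairs by how much traffic their failure mode touches and counting what the rest cost. The sketch below assumes Python; `PairStats` and `prune_pairs` are hypothetical names, and every number in it is made up for illustration, not measured from the project described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairStats:
    name: str
    traffic_share: float  # fraction of traffic that hits the failure this pair addresses
    tokens: int           # prompt tokens the pair adds to every request


def prune_pairs(pairs: list[PairStats], max_pairs: int):
    """Keep the max_pairs pairs that cover the most traffic; return (kept, dropped)."""
    ranked = sorted(pairs, key=lambda p: p.traffic_share, reverse=True)
    return ranked[:max_pairs], ranked[max_pairs:]


# Illustrative numbers only: eight pairs, 120 tokens each.
shares = [0.12, 0.08, 0.05, 0.009, 0.004, 0.003, 0.002, 0.001]
pairs = [PairStats(f"pair_{i + 1}", share, 120) for i, share in enumerate(shares)]

kept, dropped = prune_pairs(pairs, max_pairs=4)
tokens_saved = sum(p.tokens for p in dropped)
share_given_up = sum(p.traffic_share for p in dropped)

print([p.name for p in kept])
print(f"tokens saved per request: {tokens_saved}")
print(f"traffic no longer covered by a pair: {share_given_up:.1%}")
```

Traffic share is only a proxy. A rare failure that is expensive when it happens may still justify its tokens, so the cutoff is a judgment call rather than something the ranking decides on its own.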

Disambiguating Sentiment From Topic

One more scenario is worth including because it shows a subtler kind of confusion. A product-feedback classifier was supposed to tag the feature a comment referred to, but it kept letting the comment's sentiment leak into the tag.

The leak

Angry comments about checkout were being tagged "checkout bug," while calm comments about the same checkout flow were tagged "checkout feedback," even when both described the identical issue. The model had fused two independent axes — topic and tone — into one decision.

Separating the axes with a pair

We showed two comments about the same checkout problem, one furious and one measured, both labeled with the same topic tag, plus a note: the tag reflects what the comment is about, not how the writer feels. A second pair held tone constant and varied the topic: two comments in the same register about different features received different tags. Forcing the model to see topic held constant while tone varied, and tone held constant while topic varied, broke the fusion. This is the single-axis discipline pushed in Building a Disambiguation Prompt From One Clean Pair, applied to a case where two axes had collapsed into one.

The broader lesson

When a model conflates two independent properties, one contrastive pair is rarely enough. You need pairs that vary each axis while holding the other fixed, so the model learns they are separable. A single pair can teach a boundary; teaching independence takes a small matrix.
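The "small matrix" can be sketched as a grid over the two axes, from which only the pairs that hold exactly one axis fixed are kept. This assumes Python; the topic names, tone names, and tags are hypothetical placeholders, and `build_matrix` and `single_axis_pairs` are illustrative helpers rather than an established API.

```python
from itertools import combinations, product

TOPICS = ["checkout", "search"]   # hypothetical feature tags
TONES = ["furious", "measured"]


def build_matrix(topics: list[str], tones: list[str]) -> list[dict]:
    """One cell per (topic, tone); the tag depends on topic only."""
    return [{"topic": topic, "tone": tone, "tag": topic} for topic, tone in product(topics, tones)]


def single_axis_pairs(cells: list[dict]) -> list[tuple[dict, dict]]:
    """Keep pairs that differ on exactly one axis and share the other."""
    pairs = []
    for a, b in combinations(cells, 2):
        same_topic = a["topic"] == b["topic"]
        same_tone = a["tone"] == b["tone"]
        if same_topic != same_tone:  # exactly one axis held fixed
            pairs.append((a, b))
    return pairs


cells = build_matrix(TOPICS, TONES)
for a, b in single_axis_pairs(cells):
    fixed = "topic" if a["topic"] == b["topic"] else "tone"
    verdict = "same tag" if a["tag"] == b["tag"] else "different tags"
    print(f'{a["topic"]}/{a["tone"]} vs {b["topic"]}/{b["tone"]}: {fixed} fixed, {verdict}')
```

The diagonal pairs, where both topic and tone change, are left out on purpose: they are the confounded case, and including them would not tell the model which axis drives the tag.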

Frequently Asked Questions

How many contrastive examples should a single prompt include?

Start with one well-chosen pair and add only when a failure mode demands it. Most ambiguous tasks resolve with two or three pairs. Beyond five, you usually pay in tokens and latency without gaining accuracy, and you risk burying the primary signal.

What is the difference between a contrastive example and a regular few-shot example?

A few-shot example shows a correct input-output pair. A contrastive example deliberately pairs a wrong interpretation with a right one so the model learns the boundary, not just the target. The negative is the load-bearing part.

Why did my contrastive example make accuracy drop?

Almost always because the two examples differed on more than one dimension. The model latched onto the wrong difference. Rebuild the pair so it varies only on the axis you care about and holds everything else constant.

Can contrastive prompting replace fine-tuning for disambiguation?

For many boundary problems, yes. Contrastive prompts are faster to iterate and cheaper to maintain than a fine-tune. Fine-tuning still wins when the ambiguity appears across thousands of subtle cases that no handful of examples can cover.

Should the wrong example be a plausible mistake or an obvious one?

A plausible mistake. The whole value comes from showing the model the error it is actually likely to make. An obviously wrong example teaches nothing because the model was never going to produce it.

Key Takeaways

  • Contrastive examples work by making the distinguishing feature explicit, not by adding more correct samples.
  • Pair a wrong interpretation with a right one and vary only the single dimension that matters; confounds wreck the technique.
  • Tone and other fuzzy targets are best taught by triangulating two flanking failures around one success.
  • More pairs are not better; two or three high-value pairs resolve most ambiguous tasks, and beyond five you usually pay in latency and cost without gaining accuracy.
  • The negative example should reflect the mistake the model is genuinely prone to make, or it teaches nothing.
