The best way to understand negative prompting is to watch it solve a real problem over several rounds, including the missteps. This is a composite case study, assembled from the kind of work that happens when a team takes a misbehaving AI feature and tightens it through disciplined constraint-writing. The names are generic, but the arc, the dead ends, and the eventual fix mirror how this actually plays out.
The situation: a small product team had shipped an AI support assistant that answered customer questions inside their app. It was accurate but exhausting. Every answer ran four paragraphs, restated the question, apologized preemptively, and ended by inviting the user to reach out again. Support satisfaction was dropping even though the answers were correct. The team suspected the verbosity was the problem and decided to fix it with constraints rather than retraining anything.
What follows is the decision, the execution across several iterations, the measured outcome, and what they took away. The method they used is the same one laid out in Build a Working Exclusion in Six Concrete Steps.
The Situation and the Decision
The team had a working assistant and a clear complaint: answers were too long and too hedgy. They could not change the model, but they fully controlled the system prompt.
Defining the problem in observable terms
Rather than say "make it less annoying," they listed concrete symptoms: answers averaged 180 words when 50 would do, restated the user's question verbatim, opened with an apology, and closed with a generic invitation. Each symptom was something they could count, which made it a candidate for a constraint.
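To make "something they could count" concrete, here is a minimal sketch of how each symptom might be turned into a per-answer measurement. The phrase lists and function names are hypothetical, invented for illustration; the source does not say how the team actually counted. The restating check only catches verbatim repeats, so reworded restating still needs a human eye.

```python
import re

# Hypothetical phrase lists for illustration; build real ones from your own transcripts.
APOLOGY_PHRASES = ("sorry", "apologize", "apologies")
INVITATION_PHRASES = ("reach out", "let me know", "feel free to contact")


def _sentences(text: str) -> list[str]:
    """Split on sentence-ending punctuation and drop empty fragments."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def _normalize(text: str) -> str:
    """Lowercase and collapse to space-separated word tokens."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def word_count(answer: str) -> int:
    return len(answer.split())


def opens_with_apology(answer: str) -> bool:
    sentences = _sentences(answer)
    return bool(sentences) and any(p in sentences[0].lower() for p in APOLOGY_PHRASES)


def restates_question(question: str, answer: str) -> bool:
    """Detects verbatim restating only; paraphrases still need human review."""
    q = _normalize(question)
    return bool(q) and f" {q} " in f" {_normalize(answer)} "


def closes_with_invitation(answer: str) -> bool:
    sentences = _sentences(answer)
    return bool(sentences) and any(p in sentences[-1].lower() for p in INVITATION_PHRASES)


def symptoms(question: str, answer: str) -> dict:
    """One countable record per answer, covering the four listed symptoms."""
    return {
        "word_count": word_count(answer),
        "opens_with_apology": opens_with_apology(answer),
        "restates_question": restates_question(question, answer),
        "closes_with_invitation": closes_with_invitation(answer),
    }
```

Running `symptoms` over a sample of logged question and answer pairs yields one record per answer, which is the raw material for any later comparison.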
Choosing constraints over alternatives
They considered fine-tuning and few-shot examples but chose negative prompting first because it was the fastest, cheapest thing to try and fully reversible. If it failed, they would have lost an afternoon, not a sprint.
Round One: The Naive Attempt
Their first system prompt addition was blunt: "Do not be verbose. Do not apologize. Do not restate the question."
What happened
Mixed results. Apologies mostly stopped. But "verbose" was too vague, so answer length barely moved, and the model still restated questions in slightly reworded form, obeying the letter of the rule while missing its intent.
The diagnosis
Two of three constraints were not gradeable. The team recognized this as the most common failure mode, the vague prohibition described in 7 Reasons Your Exclusions Get Ignored, and went back to rewrite.
Round Two: Making the Constraints Gradeable
The team rewrote each fuzzy negative into something measurable and paired it with a positive direction.
The revised rules
- Answer in 50 words or fewer.
- Do not restate or paraphrase the user's question; answer it directly.
- Do not open with an apology; begin with the answer itself.
- Do not end with a generic invitation to reach out.
What happened
A large improvement. Length dropped sharply, the restating stopped, and the apologies were gone. But a new problem appeared: some answers now felt curt to the point of rudeness, especially when the user was clearly frustrated. The team had overcorrected.
Round Three: Fixing the Overcorrection
The curtness was a textbook overcorrection: a constraint hitting its target while damaging tone.
The adjustment
They softened a single constraint rather than several, following the one-variable rule. They changed the opening rule to: "Begin with the answer. If the user expresses frustration, lead with one brief sentence of acknowledgment, then the answer." This carved out an exception for the case that mattered without reintroducing blanket apologies.
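The one-variable rule can also be checked mechanically. The sketch below stores each round's rules as a list and reports which positions changed. It assumes the new wording replaced the round-two opening rule outright, which is how the text reads; `changed_rules` and `render_rules` are hypothetical helper names, not part of any library.

```python
ROUND_TWO = [
    "Answer in 50 words or fewer.",
    "Do not restate or paraphrase the user's question; answer it directly.",
    "Do not open with an apology; begin with the answer itself.",
    "Do not end with a generic invitation to reach out.",
]

# Assumption: only the opening rule (index 2) was reworded for round three.
ROUND_THREE = [
    "Answer in 50 words or fewer.",
    "Do not restate or paraphrase the user's question; answer it directly.",
    "Begin with the answer. If the user expresses frustration, lead with one "
    "brief sentence of acknowledgment, then the answer.",
    "Do not end with a generic invitation to reach out.",
]


def changed_rules(old: list[str], new: list[str]) -> list[int]:
    """Return the positions where two same-length rule lists differ."""
    if len(old) != len(new):
        raise ValueError("rule lists must be the same length to compare by position")
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]


def render_rules(rules: list[str]) -> str:
    """Format the rules as a bulleted block for a system prompt."""
    return "\n".join(f"- {rule}" for rule in rules)
```

A check such as `len(changed_rules(ROUND_TWO, ROUND_THREE)) == 1` before each test run makes it hard to slip in a second change by accident.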
What happened
The curtness resolved for frustrated users while concise answers held everywhere else. The team had reached a stable prompt after three measured iterations, each changing as little as possible.
The Measured Outcome
Because the team had defined the problem in countable terms, they could check whether the fix worked.
What they tracked
They compared a sample of answers before and after. Average answer length fell from roughly 180 words to under 60. Apologetic openers dropped to near zero. Question restating disappeared. Crucially, they also reviewed a set of frustrated-user conversations by hand to confirm the tone exception was working, since that was the dimension hardest to count.
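A before and after comparison of this kind could be sketched as follows. The record fields (`words`, `apology`, `restated`) are assumed names for per-answer measurements, and the source does not describe the team's actual tooling. Tone is deliberately left out, since that dimension was reviewed by hand.

```python
from statistics import mean


def summarize(samples: list[dict]) -> dict:
    """Aggregate per-answer measurements into sample-level figures."""
    if not samples:
        raise ValueError("need at least one answer to summarize")
    n = len(samples)
    return {
        "avg_words": mean(s["words"] for s in samples),
        # Booleans sum as 0/1, so these are the fraction of answers affected.
        "apology_rate": sum(s["apology"] for s in samples) / n,
        "restate_rate": sum(s["restated"] for s in samples) / n,
    }


def compare(before: list[dict], after: list[dict]) -> dict:
    """Pair each metric's before and after values with the change between them."""
    b, a = summarize(before), summarize(after)
    return {
        key: {"before": b[key], "after": a[key], "change": a[key] - b[key]}
        for key in b
    }
```

Each sample is a list of records such as `{"words": 170, "apology": True, "restated": False}`, one per answer. A negative `change` means the symptom became less frequent or the answers got shorter.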
Why the comparison mattered
Without the before sample, they could not have shown that the change helped rather than merely felt different. The before-and-after discipline is what separated this from guesswork, a point emphasized in Opinionated Rules for Constraints That Hold.
The Lessons They Took Away
The team distilled their experience into a few transferable principles.
Define before you constrain
The work only became tractable once the vague complaint became a list of countable symptoms. Observable problems yield gradeable constraints.
Expect overcorrection
Every aggressive negative risks damaging something adjacent. Planning for an overcorrection pass turned a surprise into a routine step.
Iterate one variable at a time
Changing a single constraint per round let them attribute each effect to a cause, turning debugging into diagnosis instead of guesswork.
Save the result
They added the final rules block to their internal prompt library, labeled with the model and date, so the next assistant they built started from a tested baseline rather than zero.
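One possible shape for such a library entry is a small record that carries the rules together with the model and date they were tested against. This is a sketch only: the class name, field names, and JSON storage are assumptions, and the source does not describe the team's actual library format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PromptLibraryEntry:
    name: str
    model: str              # identifier of the model the rules were tested against
    tested_on: str          # ISO date string, e.g. "2024-01-15"
    rules: tuple[str, ...]
    notes: str = ""


def new_entry(name: str, model: str, rules: list[str],
              tested_on: Optional[date] = None, notes: str = "") -> PromptLibraryEntry:
    """Create an entry, defaulting the test date to today."""
    day = tested_on or date.today()
    return PromptLibraryEntry(name, model, day.isoformat(), tuple(rules), notes)


def to_json(entry: PromptLibraryEntry) -> str:
    return json.dumps(asdict(entry), indent=2)


def from_json(text: str) -> PromptLibraryEntry:
    data = json.loads(text)
    data["rules"] = tuple(data["rules"])  # JSON arrays load as lists
    return PromptLibraryEntry(**data)
```

Keeping the model and date on the record matters because a rules block tuned against one model may behave differently on another.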
What Changed in How the Team Worked
Beyond the prompt itself, the exercise shifted how the team approached every AI feature afterward.
From hoping to measuring
Before this, the team had treated prompt tweaks as a matter of taste: someone would adjust wording, eyeball the result, and ship if it felt better. The support-bot episode replaced that with a habit of defining the problem in countable terms and keeping a before sample. The same discipline carried into later features, where they now started by asking what they could measure rather than what felt off.
A shared vocabulary for failures
The team also gained shorthand for the failure modes they had hit. When a later prompt produced a vague non-improvement, someone would say it was not gradeable and reach for a measurable rewrite without a long debate. When an output went curt, they named it overcorrection and softened a single rule. Naming the patterns turned what had been frustrating guesswork into a quick, shared diagnostic, and it made onboarding new teammates to prompt work substantially faster.
Frequently Asked Questions
Why did the first attempt mostly fail?
Because two of its three constraints were not gradeable. "Do not be verbose" relies on a definition of "verbose" the model does not share, so length barely moved, and "do not restate the question" was loose enough that the model paraphrased instead. Only the apology rule, which was concrete enough to check, clearly worked. The fix was rewriting each fuzzy negative into something measurable.
How did the team know they had overcorrected?
They reviewed the output holistically rather than only checking that the forbidden behaviors were gone. Answers were now short and direct, satisfying every rule, yet some read as curt to frustrated users. Because they judged tone and not just compliance, they caught the regression that a shallow pass-fail check would have missed.
Why change only one constraint between round two and round three?
So they could attribute the result to a specific cause. By softening just the opening rule and leaving the rest untouched, they confirmed that the single change resolved the curtness without affecting the other improvements. Changing several at once would have made it impossible to tell which adjustment mattered.
What made the outcome measurable rather than just a feeling?
The team defined the problem upfront in countable terms, such as average word count and the presence of apologies, and kept a before sample. That let them compare before and after on hard numbers and supplement it with a manual review of the one dimension, tone, that resisted counting. The before sample is what turned impressions into evidence.
Key Takeaways
- Translate a vague complaint into countable symptoms before writing any constraint, since observable problems produce gradeable negatives.
- Expect a naive first attempt to fail on the fuzzy constraints and plan to rewrite them into measurable form.
- Aggressive negatives risk overcorrection, so build in a pass to catch tone or quality damage the rules did not intend.
- Change one constraint per iteration so each effect can be traced to its cause.
- Keep a before sample for an honest before-and-after comparison and save the final tested rules to a labeled library.