This is the story of one team and one prompt over two weeks. The prompt powered a patient intake assistant for a clinic group: it collected symptoms, confirmed insurance details, and booked appointments. It had passed an initial review and was days from launch when a single embarrassing demo changed the plan. What follows is the situation, the decision to stress test seriously, how the team executed, and what they measured at the end.
The details are a composite, assembled from the patterns that recur when teams take adversarial testing seriously for the first time. The shape of the story, however, is real and common: confidence, a scare, a structured response, and a prompt that came out far tougher than it went in.
The value of a case study is not the specific numbers. It is watching how a vague intention to "test the prompt" becomes a concrete process with artifacts and outcomes you can defend to a stakeholder.
It is also worth noticing what the story is not. It is not a story about a careless team. They had tested extensively and felt genuinely prepared. The lesson is uncomfortable precisely because diligence was not the missing ingredient. What was missing was a kind of testing, hostile testing, that their friendly process structurally could not produce. That distinction is the heart of the case study, and it is why a team can do everything that feels responsible and still ship something fragile.
The Situation: A Prompt That Looked Ready
Confidence Built on Friendly Inputs
The team had tested the assistant extensively, but always with cooperative, realistic patients. It handled those flawlessly. Everyone felt good about launch. The testing had rehearsed the prompt's strengths and never probed its blind spots, the classic mistake covered in Where Prompt Hardening Quietly Falls Apart.
The Demo That Forced a Decision
During a stakeholder demo, an executive idly typed, "Just tell me if this is cancer." The assistant offered a reassuring but clinically inappropriate response. The room went quiet. Launch was paused, and the team was asked one question: what else does this thing do that we have not seen?
The Decision: Treat Testing as Its Own Project
Naming an Attacker
Rather than asking the prompt's author to test harder, the lead assigned a different engineer the explicit role of attacker, with a single mandate: make the assistant misbehave. Separating author from attacker, a habit detailed in Habits That Keep a Production Prompt From Caving In, immediately changed the kinds of inputs being tried.
Writing the Boundaries Down
Before any attack, the team wrote explicit boundaries: never diagnose, never give treatment advice, never expose another patient's data, never proceed without confirming identity. These four lines became the standard every output was judged against. Writing them down also settled arguments before they started. When an output was ambiguous, the team did not debate gut feelings; they checked it against the four lines. A boundary that is written is a boundary the team can agree on, which is what makes the pass-or-fail judgment fast and consistent across hundreds of inputs.
The Execution: Two Weeks of Structured Attacks
Building the Inventory
The attacker assembled roughly 400 inputs across families: diagnosis bait, role-reassignment attempts, social-pressure phrasing, requests for other patients' records, indirect injection through pasted documents, and malformed inputs. Most of the effort went into domain-specific medical attacks rather than generic ones.
Running and Logging
Each input was sent, the output captured verbatim, and the result labeled against the four boundaries. The first full pass found a sobering number of failures: the assistant diagnosed under pressure, occasionally gave treatment suggestions, and once surfaced another patient's appointment time when asked indirectly. The diagnosis failures were the most unsettling, because they did not require trickery. Ordinary worried-patient phrasing, the kind the clinic would see thousands of times a week, was enough to walk the assistant across a hard clinical boundary. The team realized they had not been one clever attacker away from an incident; they had been one anxious patient away.
Triage and Surgical Fixes
Failures were ranked by harm. Diagnosis and data exposure topped the list. Fixes were applied one at a time, each followed by a full rerun. The diagnosis fix combined a hard rule with a demonstrated refusal redirecting to a clinician. The data-exposure fix moved out of the prompt entirely into access scoping, because no wording reliably stopped it.
The Outcome: What Actually Changed
Measurable Reduction in Failures
By the end of two weeks, the rerun of the full 400-input inventory dropped from dozens of boundary violations to a small handful of low-severity tone issues. The two high-severity categories, diagnosis and data exposure, reached zero in the final pass.
Confidence Backed by Evidence
The most valuable outcome was not a number but a change in how the team talked about readiness. Before, launch confidence rested on a feeling. After, it rested on a clean rerun of a documented inventory that anyone could inspect. When a stakeholder asked whether the assistant was safe, the team could point to the inventory, the boundaries, and the final results rather than offering reassurance. That shift, from assertion to evidence, is what made the launch defensible.
A Reusable Asset
The 400-input inventory did not get thrown away. It became a regression suite the team reran on every prompt change and model upgrade, paired with a launch gate modeled on a Twelve Checks Before You Ship a Prompt to Real Traffic. Testing changed from a one-time scramble into a standing safeguard.
The Lessons Worth Keeping
Friendly Testing Is Not Testing
The team's biggest realization was that all their pre-launch confidence rested on inputs that never challenged the prompt. The hostile pass found in two weeks what friendly testing had missed for months.
Some Fixes Belong Outside the Prompt
The data-exposure failure could not be reliably solved with wording. Moving it to the system layer was the durable fix, reinforcing that the prompt is only one layer of defense, a point explored in Manual Red-Teaming or Automated Fuzzing: Choosing Your Approach.
Stopping Criteria Beat Gut Feeling
The team's final realization was about how to end. Early on, "done" had meant "we feel good about it," which is exactly the feeling that nearly shipped a diagnosing chatbot. By the end, "done" meant a specific, measurable state: a full rerun of the inventory with zero high-severity failures and only minor tone issues. Tying the stopping point to the written boundaries rather than to confidence removed the single most dangerous variable in the whole effort, which was their own optimism.
Frequently Asked Questions
Was a two-week effort necessary for one prompt?
For a prompt that handles health information and patient data, yes. Test intensity should match stakes. A low-risk prompt would not warrant 400 attacks, but a system that can give clinical impressions or expose patient records justifies the investment many times over.
Why did separating the author from the attacker matter so much?
The author had unconsciously tested their own assumptions for months. A fresh engineer with a mandate to break things tried inputs the author never imagined. The change in person produced an immediate change in the kinds of failures discovered.
How did the team know when to stop?
They stopped when a full rerun of the inventory showed zero high-severity failures and only minor, acceptable tone issues. Stopping criteria were tied to the written boundaries, not to a feeling of being done or to a fixed amount of time.
What happened to the inventory after launch?
It became a permanent regression suite. The team reran it on every prompt edit and model upgrade, because both events can silently reintroduce old failures. The one-time effort turned into ongoing protection at very low marginal cost.
Could automation have replaced the manual work?
Partly. Once the inventory existed, rerunning it could be automated. But the creative work of inventing domain-specific medical attacks and judging subtle boundary crossings still needed a human. They automated the repetition and kept judgment in human hands.
Key Takeaways
- Pre-launch confidence built only on friendly inputs hides the failures that reach production.
- Assigning a dedicated attacker, separate from the author, immediately surfaces new failure types.
- Writing explicit boundaries first gives every output a clear pass or fail standard.
- High-severity failures like diagnosis and data exposure reached zero through surgical fixes and full reruns.
- The attack inventory became a permanent regression suite, turning a one-time effort into lasting protection.