The clearest way to understand prompt injection defense is to follow one team through a real fix from start to finish. This case study traces a composite scenario drawn from the way these incidents actually unfold: an AI agent that worked beautifully in demos, quietly carried a serious vulnerability into production, got exploited, and was rebuilt to be defensible. The names and specifics are generalized, but the arc—situation, discovery, decision, execution, outcome—reflects the genuine shape of this work.
Read it as a template. The team's reasoning at each fork is more valuable than the particular tools they reached for, because your system will differ in detail while facing the same fundamental choices.
What follows is the situation they inherited, the moment they realized something was wrong, the decisions they debated, what they actually built, and the results they could measure afterward.
The Situation
A mid-sized software company built an internal AI assistant to help its operations team. The assistant could read tickets from the support queue, look up customer records, draft responses, and—the feature everyone loved—automatically apply small account adjustments like extending a trial or issuing a modest credit.
How It Was Built
The whole thing ran as a single model loop. The assistant read the ticket, pulled relevant records, decided on an action, and executed it. The system prompt instructed it to apply credits only up to a small limit and to escalate anything larger. In demos and early use, it worked flawlessly and saved the team real time.
The Latent Flaw
Tickets are written by customers. They are untrusted input. And the same model that read those tickets also held the authority to adjust accounts. Untrusted input and a powerful action lived in one undivided loop, protected only by an instruction in the prompt. Nobody had framed it that way during the build.
The Discovery
The problem surfaced when an analyst noticed a cluster of unusually large credits applied to several accounts over a single weekend.
Tracing the Incident
Reviewing the logs—which, fortunately, captured the assistant's actions—the team found that the affected tickets all contained a similar passage: text instructing the assistant to disregard its credit limit and apply a large credit "as previously authorized by the account manager." The model had read the instruction inside the ticket and followed it, treating customer-supplied text as a command.
The Realization
This was textbook indirect prompt injection. The attackers never accessed any system directly. They simply submitted support tickets, and the assistant did the rest. The credit limit in the prompt had been worthless because the same text channel carrying the data also carried the override.
The Decision
Under pressure to restore the feature safely, the team debated three paths.
The Options on the Table
The first option was to harden the prompt with stronger wording forbidding overrides. The second was to add a keyword filter that blocked tickets mentioning credits and authorization. The third, more invasive, was to re-architect so the model reading tickets could no longer apply adjustments at all.
Why They Chose Re-Architecture
The team recognized that the first two options treated symptoms. Prompt wording could be paraphrased past, and keyword filters could be evaded by rephrasing. Only the structural change addressed the root cause—untrusted input wired directly to a powerful action. They accepted the larger effort in exchange for a defense that did not depend on outguessing attackers.
The Execution
The rebuild centered on privilege separation, with supporting layers around it.
Splitting Read From Act
They divided the assistant into two stages. A reading stage processed tickets and produced a structured recommendation—action type, amount, and a justification—but had no power to execute anything. A separate acting stage took only that structured recommendation, never the raw ticket text, and applied the action against hard-coded limits enforced in code rather than in a prompt.
Adding Validation and Gates
Any recommended credit above the small limit was routed to a human queue, enforced by the acting stage's code regardless of what the recommendation claimed. Outputs from the reading stage had to pass schema validation, so malformed or out-of-range values were rejected outright. The team also kept and expanded the action logging that had made the incident traceable in the first place.
Red-Teaming Before Relaunch
Before turning the feature back on, they assembled a set of injection attempts modeled on the original attack plus variations—encoded payloads, different phrasings, instructions split across multiple tickets—and confirmed that none could push an action past the code-enforced limits.
The Outcome
The rebuilt assistant returned to production with measurably different properties.
What Changed Measurably
Unauthorized adjustments dropped to zero in the months after relaunch, because the limit was now enforced in code that untrusted input could not reach. The adversarial test suite, run on every change, caught two regressions during a later model upgrade before they shipped. The action logs, now standard practice, cut incident investigation time from days to hours.
The Lessons That Generalized
The team's takeaway was that the original feature had not been insecurely worded—it had been insecurely structured. No amount of prompt cleverness would have fixed a design that fused untrusted input with a powerful action. The durable fix was architectural, and it made future incidents survivable rather than catastrophic.
How the Team Changed Its Process
The incident reshaped more than one feature. It changed how the team built every AI capability that followed.
A New Design Checkpoint
The team added a standing question to every AI feature design review: does this component read untrusted content, and if so, what is the worst action it can take on its own? Any feature that combined the two had to justify a containment plan before it could ship. This converted the painful lesson into a repeatable gate rather than relying on anyone remembering the incident.
Logging and Testing Became Defaults
Action logging, which had been an afterthought that happened to save them, became a non-negotiable requirement for any feature that could take an action. The adversarial test suite became a shared asset that every new feature contributed to and ran against. What had been one team's hard-won fix turned into the organization's default posture.
What Other Teams Can Borrow
The specifics of this case—support tickets, account credits—are particular, but the reasoning transfers directly to almost any AI feature.
Find Your Version of the Same Flaw
Most AI applications have a place where untrusted content and a consequential action meet. It might be tickets and credits, or documents and approvals, or messages and outbound email. The exercise is to locate that meeting point in your own system and ask whether anything but a prompt instruction stands between them. If the answer is no, you have found your version of this incident before it happens.
Apply the Same Sequence of Fixes
The team's path—separate reading from acting, enforce limits in code, validate the handoff, gate high stakes to humans, and confirm with adversarial testing—is a template you can follow regardless of domain. The order matters: containment first, then detection and testing around it. Borrowing the sequence is more valuable than borrowing the particular tools, because your tools will differ while the structure stays the same.
This narrative puts the principles from Prompt Injection Defense: Best Practices That Actually Work into motion, follows the build order in A Step-by-Step Approach to Prompt Injection Defense, and avoids the traps catalogued in 7 Common Mistakes with Prompt Injection Defense (and How to Avoid Them).
Frequently Asked Questions
Could a better-written prompt have prevented this incident?
No. The attack worked precisely because prompt instructions are suggestions the model can be talked out of. A stronger limit in the prompt would have been bypassed by rephrasing. Only enforcing the limit in code, outside the model's reach, closed the hole.
Why was logging so important to the response?
The action logs were what let the team trace the incident to its source and understand the attack within hours instead of guessing for days. Without them, the cluster of large credits would have been far harder to explain. Logging turns silent compromises into investigable events.
Was the re-architecture worth the extra effort over a quick patch?
Yes. The quick patches—prompt hardening and keyword filtering—would have failed against a motivated attacker and given false confidence. The structural fix eliminated the root cause and made the system resilient to attack variations the team had not anticipated.
How did they know the fix actually worked?
They built an adversarial test suite based on the real attack plus variations and confirmed none could push an action past the code-enforced limits. Running that suite continuously also caught two later regressions during a model upgrade.
Key Takeaways
- The assistant was insecurely structured, not insecurely worded—untrusted ticket text was wired directly to a powerful action.
- A credit limit living in the prompt was worthless because the same channel that carried data carried the override.
- The durable fix was privilege separation: a reading stage with no power, and an acting stage enforcing limits in code on validated input.
- Action logging made the incident traceable in hours and became standard practice afterward.
- A continuous adversarial test suite confirmed the fix and later caught two regressions during a model upgrade.