Most teams discover prompt injection the hard way: a customer pastes a webpage into a chatbot, the page contains hidden instructions, and the bot dutifully leaks a system prompt or calls a tool it should never have touched. The uncomfortable truth is that you cannot prompt your way out of this with a single clever sentence. Defense is a stack of small, boring controls that each close one door.
This checklist is meant to be used, not admired. Run it against any feature that feeds untrusted text—user input, retrieved documents, web pages, emails—into a language model. Each item has a one-line reason so you can decide whether it applies to your situation rather than cargo-culting the whole list.
If you are new to the topic, start with Prompt Injection Defense: A Beginner's Guide and then come back here to operationalize what you learned.
Before You Build
Define the trust boundary
- Write down what counts as untrusted input. Anything you did not author—user messages, RAG documents, tool outputs, scraped pages—is hostile until proven otherwise. If you cannot name the boundary, you cannot defend it.
- Decide what the model is allowed to do. A model that can only return text is a smaller target than one that can send email or run SQL. Scope the capabilities before scoping the prompt.
- Identify the worst realistic outcome. Data exfiltration, unauthorized tool calls, and reputational damage all need different controls. Name the failure you are actually trying to prevent.
Separate instructions from data
- Never concatenate untrusted text directly into your instruction block. Mixing the two is the root cause of most injections. Keep system guidance and user-supplied content in clearly different positions.
- Delimit untrusted content explicitly. Wrap retrieved or user text in markers and tell the model that everything inside is data to analyze, not commands to follow.
At the Prompt Layer
Harden the system prompt
- State that instructions inside user content must be ignored. This is weak on its own but raises the cost of trivial attacks.
- Avoid putting secrets in the system prompt. If leaking the prompt would hurt, assume it will leak. Treat the prompt as public.
- Pin the output format. A model constrained to return JSON with a fixed schema has fewer ways to go off-script than one writing free prose.
Constrain the model's reach
- Use an allowlist for tools, not a denylist. Enumerate exactly which tools the model can call in which states. Everything else is denied by default.
- Require human or deterministic approval for irreversible actions. Sending money, deleting records, or emailing customers should pass through a gate the model cannot bypass.
For a structured way to think about these layers together, see A Framework for Prompt Injection Defense.
At the System Layer
Validate inputs and outputs
- Scan inputs for known injection patterns. Pattern matching catches the lazy attacks and frees your attention for the clever ones.
- Validate model output before acting on it. If the model returns a tool call, confirm the arguments are within expected bounds before execution. Never trust the model to police itself.
- Strip or neutralize active content in retrieved documents. HTML comments, invisible Unicode, and markdown link tricks are common carriers for hidden instructions.
Enforce least privilege
- Run tool calls with the user's permissions, not the agent's. If the agent can read every customer record, an injection can too. Scope credentials down to the requesting user.
- Rate-limit and log every tool invocation. A sudden spike in tool calls is often the first visible sign of a successful injection.
Monitor and respond
- Log full prompts and completions for high-risk flows. You cannot investigate what you did not record. Redact secrets but keep enough to reconstruct an attack.
- Alert on anomalies, not just errors. Unexpected tool sequences and unusual output lengths are signals worth watching.
- Have a kill switch. Be able to disable a tool or an entire agent without a deploy. Speed matters more than elegance during an incident.
During Operation
Treat retrieval as a live threat surface
Retrieval-augmented features are where indirect injection thrives, because the model reads content nobody approved at request time. A support agent that pulls from a knowledge base, a research assistant that browses the web, and a summarizer that ingests uploaded files all share the same exposure: hostile instructions can ride in on the very documents the feature exists to process.
- Re-establish trust at every fetch. A document your agent retrieves may link to a page it then fetches automatically. Do not assume a second-hop source is safer than a first-hop one; it is often the opposite, because attackers hide there expecting weaker scrutiny.
- Bound retrieval depth and breadth. An agent that follows links indefinitely will eventually follow a poisoned one. Cap how far and how wide it roams, and prefer curated sources over open browsing wherever the use case allows.
- Normalize before you trust. Decode encodings, strip zero-width characters, and canonicalize Unicode before any content reaches the model or your scanners. A payload written in homoglyphs or base64 slips past filters tuned for plain English.
Plan the incident before it happens
The teams that recover fastest from an injection are the ones who decided in advance who does what. An incident is a bad time to discover that nobody can disable a tool without a deploy or that your logs lack the context to reconstruct the attack.
- Name an owner for AI security incidents. Ambiguity about who responds costs hours you do not have during a live attack.
- Write the runbook now. Document how to disable a tool, revoke a credential, and roll back a release, so the response is muscle memory rather than improvisation.
- Rehearse with a tabletop. Walk through a realistic injection scenario once a quarter so the runbook is tested before reality tests it.
Before You Ship
Test like an attacker
- Maintain a red-team prompt suite. Collect injection payloads and run them on every release. A defense that is not tested is a hope, not a control.
- Test the indirect path, not just direct input. Most real attacks arrive through documents and web content, not the chat box.
- Track your block rate over time. If you cannot measure whether defenses improved, you are guessing. See How to Measure Prompt Injection Defense: Metrics That Matter for how to instrument this.
Right-size the rigor
Not every item here applies at full strength to every feature. A read-only summarizer with no tools and no sensitive data needs separation and basic logging but little else. An agent that can move money or email customers needs the entire list at maximum intensity. The fastest way to waste effort is to apply payment-grade controls to a feature whose worst outcome is a slightly odd sentence.
- Set the bar from the worst realistic outcome. Let the damage a successful injection could cause decide how many items you enforce and how hard.
- Document what you skipped and why. A deliberate, recorded decision to omit a control is defensible. A control nobody considered is the gap that gets exploited.
- Re-evaluate on every capability change. The moment a feature gains a tool or a new data source, its worst outcome shifts and items you safely skipped may now be mandatory.
Frequently Asked Questions
Is a hardened system prompt enough on its own?
No. Instructional defenses raise the cost of casual attacks but are routinely bypassed by determined ones. They belong in your stack as one layer among several, never as the only line of defense. The durable controls live at the system layer: least privilege, output validation, and tool gating.
How often should I run this checklist?
Treat it as a release gate. Run the build-time and ship-time items on every change to a model-facing feature, and re-run the full list whenever you add a tool, change the trust boundary, or integrate a new data source. Quarterly audits catch drift that incremental reviews miss.
What is the single highest-leverage item here?
Least privilege on tool calls. Even if an injection fully hijacks the model, scoping credentials to the requesting user and gating irreversible actions limits the blast radius to what that user could already do. It turns a breach into an annoyance.
Does this apply to internal-only tools?
Yes, with adjusted severity. Internal agents still ingest untrusted documents and tickets, and insiders can be compromised. The trust boundary shifts but does not disappear, so keep input separation and tool gating even when the audience is your own staff.
Key Takeaways
- Prompt injection defense is a stack of small controls, not one magic instruction.
- Separating untrusted data from instructions is the foundational move; everything else builds on it.
- System-layer controls—least privilege, output validation, tool gating—outlast any prompt-level trick.
- Treat the checklist as a release gate and re-run it whenever the trust boundary or tool set changes.
- You cannot improve what you do not measure, so pair this list with a metrics program and a red-team suite.