Hardening an AI App Against Injection, One Step at a Time

Reading about prompt injection is one thing; actually hardening a live application is another. This walkthrough gives you an ordered sequence you can start working through this afternoon. Each step builds on the one before it, and the order matters—doing them out of sequence tends to waste effort on polish before the foundation is in place.

The process assumes you already have a working AI feature: a chatbot, an agent, a summarizer, something that takes input and produces output. We are going to make it defensible. You do not need to complete every step in one sitting, but you should not skip ahead, because later steps depend on decisions you make in earlier ones.

Work through these in order. By the end you will have an inventory of your exposure, a set of layered controls, and a repeatable test that tells you whether the controls hold.

Step 1: Inventory Every Untrusted Input

Before you defend anything, write down where outside text enters your system. You cannot protect inputs you have not identified.

List Your Sources

Open a document and list every channel that feeds text to your model: user chat messages, uploaded files, retrieved web pages, database fields, API responses, tool outputs. For each one, note who controls the content. If anyone other than you can influence it, mark it untrusted.

Flag the Surprises

Pay special attention to sources you instinctively trusted—an internal knowledge base, a partner's API, comments in a code repository. If those can be edited by people outside your direct control, they are untrusted too. This list becomes the map for everything that follows.

Step 2: Map What the Model Can Do

Now list every action your model can trigger, directly or through tools.

Rate Each Action by Consequence

For each capability—reading a record, sending a message, modifying data, making a payment—rate how bad it would be if an attacker triggered it. Sort them from harmless to catastrophic. This ranking tells you where to spend your strongest defenses.

Cut Anything You Do Not Need

If the model has a capability it does not actually require for its job, remove it now. The fastest way to prevent a dangerous action is to make it impossible. Reducing the capability list is the highest-leverage move in the entire process.

Step 3: Separate Privilege From Untrusted Input

This is the architectural heart of the work. The goal is that no model exposed to untrusted text can perform a high-consequence action on its own.

Add a Confirmation or Verification Gate

For every action you rated as serious in Step 2, insert a checkpoint: a human approval, or a separate model call that only sees trusted input and decides whether the action is allowed. The contaminated path can propose; only a clean path can approve.

Split Read and Act Roles

Where possible, design two distinct stages. One stage reads and processes untrusted content but has no power to act. A second stage takes only validated, structured results from the first and is the only place where actions happen. The injection can travel only as far as the structured handoff allows.

Step 4: Frame and Isolate Untrusted Text

With the architecture in place, tighten how content reaches the model.

Wrap Data in Explicit Delimiters

Surround untrusted content with clear markers and state in your system prompt that everything inside those markers is data to analyze, never commands to obey. Restate the trust hierarchy: your instructions win over anything in the data block.

Avoid Blending Roles in One Message

Keep system instructions, user input, and retrieved data in clearly distinct segments rather than mashed into a single paragraph. Structure gives both the model and your downstream checks a boundary to work with.

Step 5: Constrain and Validate the Output

Decide what a valid response looks like, then enforce it.

Use Structured Outputs

Where the response feeds into code, require a defined schema or a fixed set of allowed values. Reject anything that does not match before acting on it. A hijacked answer that does not fit the expected shape gets stopped here.

Allowlist Tool Calls

If the model can call tools, validate every requested call against an allowlist of permitted actions and argument ranges. A request to a tool or parameter outside the allowlist is blocked and logged, not executed.

Step 6: Add Detection and Logging

Assume some attempts will slip through and make them visible.

Log Every Tool Call and Decision

Record what the model tried to do, with what arguments, in response to what input. These logs turn silent compromises into investigable events and give you the data to improve.

Run a Second-Pass Classifier

Add a lightweight check—a classifier or a separate model call—that scans incoming content and model output for instruction-override patterns. Use it to raise alerts, not as your only line of defense.

Step 7: Red-Team and Re-Test

A control you have not attacked is unproven. Finish by trying to break your own work.

Build an Adversarial Test Set

Assemble a collection of injection attempts: override phrasings, encoded payloads, role-play framings, and multi-document attacks. Run them against your app and treat any success as a failing test that must be fixed before release.

Re-Run on Every Change

Add the test set to your continuous checks so it runs whenever you change prompts, tools, or models. A model upgrade can quietly reopen a closed hole, and only repeated testing catches that.

Putting the Steps in the Right Order

The sequence above is deliberate, and it helps to understand why so you do not optimize the wrong thing first.

Architecture Before Polish

Steps one through three—inventory, capability mapping, and privilege separation—are foundational because they determine the worst-case outcome of any future injection. If you jump straight to writing clever delimiters or tuning a detection classifier while leaving a model wired directly to a payment action, you have polished the surface of a structurally unsound system. Always settle the architecture before refining the details, because the details only matter once the foundation contains the damage.

Detection Comes After Containment

It is tempting to start with a detection classifier because it feels like active defense. But detection is a tripwire, not a wall, and a tripwire in front of an undefended action just tells you about the disaster as it happens. Build containment first—separation, validation, gates—so that when detection does fire, the thing it failed to stop was survivable anyway.

Keeping the Defense Alive Over Time

A hardening pass is not a finish line. The work has a maintenance phase that determines whether your effort holds up.

Schedule Periodic Reviews

Put a recurring reminder on the calendar to revisit your input inventory and capability map. Systems accumulate new integrations and new tools over time, and each addition can quietly introduce an untrusted source or a dangerous capability that bypasses your existing controls. A quarterly walk-through catches drift before it becomes exposure.

Grow the Attack Corpus Deliberately

Make a habit of adding every new injection technique you encounter—from research, from public incidents, from your own experiments—into the adversarial suite. The corpus should always be growing, because a test set frozen at launch slowly loses relevance as attackers innovate. A living suite is what keeps your continuous testing meaningful.

For the reasoning behind each layer, The Complete Guide to Prompt Injection Defense goes deeper, Prompt Injection Defense: Best Practices That Actually Work sharpens the judgment calls, and The Best Tools for Prompt Injection Defense points you to software that supports the testing and detection steps.

Frequently Asked Questions

How long does this process take?

The architectural steps—inventory, capability mapping, privilege separation—can take a few days for a moderate app. Framing, validation, and detection are faster once the architecture is set. Red-teaming is ongoing rather than one-and-done.

Can I skip the inventory if I think I already know my inputs?

Do not skip it. Teams almost always discover an untrusted source they had mentally filed as safe—an editable wiki, a third-party feed. The inventory is cheap insurance against defending the wrong things.

What if I cannot remove a high-risk capability?

Then it must go behind a confirmation or verification gate from Step 3. If a capability is both dangerous and necessary, the answer is never to leave it exposed to untrusted input without a check.

Do I need all seven steps for a simple internal tool?

Inventory, capability mapping, and privilege separation apply even to small tools. The detection and red-teaming steps scale with risk—do as much as the consequences of failure justify.

Key Takeaways

Start by inventorying every untrusted input and mapping every action the model can take—you cannot defend what you have not listed.
Cutting unnecessary capabilities is the fastest, highest-leverage protection available.
Privilege separation, where contaminated paths cannot act without a clean approval, is the architectural core of the defense.
Frame untrusted text with delimiters, constrain outputs with schemas and allowlists, and log every action for visibility.
Finish with an adversarial test set that runs on every change, because model upgrades can reopen closed holes.

Work through these in order. By the end you will have an inventory of your exposure, a set of layered controls, and a repeatable test that tells you whether the controls hold.

Step 1: Inventory Every Untrusted Input

Before you defend anything, write down where outside text enters your system. You cannot protect inputs you have not identified.

List Your Sources

Flag the Surprises

Step 2: Map What the Model Can Do

Now list every action your model can trigger, directly or through tools.

Rate Each Action by Consequence

Cut Anything You Do Not Need

Step 3: Separate Privilege From Untrusted Input

This is the architectural heart of the work. The goal is that no model exposed to untrusted text can perform a high-consequence action on its own.

Add a Confirmation or Verification Gate

Split Read and Act Roles

Step 4: Frame and Isolate Untrusted Text

With the architecture in place, tighten how content reaches the model.

Wrap Data in Explicit Delimiters

Avoid Blending Roles in One Message

Step 5: Constrain and Validate the Output

Decide what a valid response looks like, then enforce it.

Use Structured Outputs

Allowlist Tool Calls

Step 6: Add Detection and Logging

Assume some attempts will slip through and make them visible.

Log Every Tool Call and Decision

Record what the model tried to do, with what arguments, in response to what input. These logs turn silent compromises into investigable events and give you the data to improve.

Run a Second-Pass Classifier

Step 7: Red-Team and Re-Test

A control you have not attacked is unproven. Finish by trying to break your own work.

Build an Adversarial Test Set

Re-Run on Every Change

Add the test set to your continuous checks so it runs whenever you change prompts, tools, or models. A model upgrade can quietly reopen a closed hole, and only repeated testing catches that.

Putting the Steps in the Right Order

The sequence above is deliberate, and it helps to understand why so you do not optimize the wrong thing first.

Architecture Before Polish

Detection Comes After Containment

Keeping the Defense Alive Over Time

A hardening pass is not a finish line. The work has a maintenance phase that determines whether your effort holds up.

Schedule Periodic Reviews

Grow the Attack Corpus Deliberately

Frequently Asked Questions

How long does this process take?

Can I skip the inventory if I think I already know my inputs?

What if I cannot remove a high-risk capability?

Then it must go behind a confirmation or verification gate from Step 3. If a capability is both dangerous and necessary, the answer is never to leave it exposed to untrusted input without a check.

Do I need all seven steps for a simple internal tool?

Inventory, capability mapping, and privilege separation apply even to small tools. The detection and red-teaming steps scale with risk—do as much as the consequences of failure justify.

Key Takeaways

Start by inventorying every untrusted input and mapping every action the model can take—you cannot defend what you have not listed.
Cutting unnecessary capabilities is the fastest, highest-leverage protection available.
Privilege separation, where contaminated paths cannot act without a clean approval, is the architectural core of the defense.
Frame untrusted text with delimiters, constrain outputs with schemas and allowlists, and log every action for visibility.
Finish with an adversarial test set that runs on every change, because model upgrades can reopen closed holes.

Hardening an AI App Against Injection, One Step at a Time

Step 1: Inventory Every Untrusted Input

List Your Sources

Flag the Surprises

Step 2: Map What the Model Can Do

Rate Each Action by Consequence

Cut Anything You Do Not Need

Step 3: Separate Privilege From Untrusted Input

Add a Confirmation or Verification Gate

Split Read and Act Roles

Step 4: Frame and Isolate Untrusted Text

Wrap Data in Explicit Delimiters

Avoid Blending Roles in One Message

Step 5: Constrain and Validate the Output

Use Structured Outputs

Allowlist Tool Calls

Step 6: Add Detection and Logging

Log Every Tool Call and Decision

Run a Second-Pass Classifier

Step 7: Red-Team and Re-Test

Build an Adversarial Test Set

Re-Run on Every Change

Putting the Steps in the Right Order

Architecture Before Polish

Detection Comes After Containment

Keeping the Defense Alive Over Time

Schedule Periodic Reviews

Grow the Attack Corpus Deliberately

Frequently Asked Questions

How long does this process take?

Can I skip the inventory if I think I already know my inputs?

What if I cannot remove a high-risk capability?

Do I need all seven steps for a simple internal tool?

Key Takeaways

Agency Script Editorial

Related Articles

Prompt Quality Decides Whether AI Earns Its Keep

Counting the Real Cost of Every Token You Send

Rolling Out AI Hallucinations Across a Team

Ready to certify your AI capability?

Hardening an AI App Against Injection, One Step at a Time

Step 1: Inventory Every Untrusted Input

List Your Sources

Flag the Surprises

Step 2: Map What the Model Can Do

Rate Each Action by Consequence

Cut Anything You Do Not Need

Step 3: Separate Privilege From Untrusted Input

Add a Confirmation or Verification Gate

Split Read and Act Roles

Step 4: Frame and Isolate Untrusted Text

Wrap Data in Explicit Delimiters

Avoid Blending Roles in One Message

Step 5: Constrain and Validate the Output

Use Structured Outputs

Allowlist Tool Calls

Step 6: Add Detection and Logging

Log Every Tool Call and Decision

Run a Second-Pass Classifier

Step 7: Red-Team and Re-Test

Build an Adversarial Test Set

Re-Run on Every Change

Putting the Steps in the Right Order

Architecture Before Polish

Detection Comes After Containment

Keeping the Defense Alive Over Time

Schedule Periodic Reviews

Grow the Attack Corpus Deliberately

Frequently Asked Questions

How long does this process take?

Can I skip the inventory if I think I already know my inputs?

What if I cannot remove a high-risk capability?

Do I need all seven steps for a simple internal tool?

Key Takeaways

Agency Script Editorial

Related Articles

Prompt Quality Decides Whether AI Earns Its Keep

Counting the Real Cost of Every Token You Send

Rolling Out AI Hallucinations Across a Team

Ready to certify your AI capability?