Your 2026 AI Sandbox Readiness Checklist

A checklist is only useful if you actually run it, and you will only run it if it is short enough to be practical and justified enough to be trusted. This one is built to be both. Each item earns its place with a one-line reason, so you are never ticking a box you do not understand.

Use this before any unattended agent run, before promoting a sandbox configuration, and on a recurring schedule to catch the quiet erosion that affects every long-lived environment. It is organized into four groups: isolation, data, control, and verification. Run them in order; the later groups assume the earlier ones pass.

For the reasoning behind these practices in depth, pair this with our best practices guide. Here, we keep it tight and actionable.

Isolation checks

These confirm the walls exist before anything runs inside them.

Execution is contained. The agent runs in a disposable container or microVM, not on a shared host. Why: a buggy or hostile process should harm only its own throwaway space.
Boundary matches trust level. Untrusted, AI-generated code gets a microVM; trusted internal code can use a container. Why: stronger isolation costs more, so spend it where the risk lives.
Outbound network is denied by default. All external access is closed except an explicit allowlist. Why: an open network is the most common and most damaging hole.
Internal hostnames and metadata endpoints are blocked. The agent cannot reach internal services or the cloud metadata endpoint. Why: these are classic exfiltration and privilege-escalation paths.

If any isolation item fails, stop. Nothing below matters until the walls are real.
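
The two network items above can be probed from inside the sandbox. The sketch below is one possible shape, not a definitive implementation: it tries to open TCP connections to destinations that should be unreachable and reports any that succeed. The names `find_open_egress`, `tcp_connect`, and `FORBIDDEN_TARGETS` are hypothetical, `internal.example` is a placeholder hostname, and 169.254.169.254 is the link-local metadata address used by several major clouds (yours may differ).

```python
import socket
from typing import Callable, Iterable, List, Tuple

Target = Tuple[str, int]

# Hypothetical targets that a correctly isolated sandbox must NOT reach.
FORBIDDEN_TARGETS: List[Target] = [
    ("169.254.169.254", 80),    # link-local cloud metadata address
    ("internal.example", 443),  # placeholder for an internal hostname
    ("example.com", 443),       # a public host that is not on the allowlist
]


def tcp_connect(target: Target, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to target succeeds."""
    try:
        with socket.create_connection(target, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def find_open_egress(
    targets: Iterable[Target],
    connect: Callable[[Target], bool] = tcp_connect,
) -> List[Target]:
    """Return every forbidden target that was reachable (empty list = pass)."""
    return [target for target in targets if connect(target)]
```

A non-empty result means an isolation check failed. The `connect` parameter is injectable so the logic can be exercised without a network. A TCP probe covers only one protocol, so DNS and UDP egress would need their own checks.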

Data checks

These confirm that what is inside the box is safe to lose.

No raw production data is present. The sandbox holds only synthetic or masked records. Why: a leak should expose nothing real.
Masking is verified, not assumed. You have confirmed sensitive fields are actually replaced, not merely intended to be. Why: half-masked data is unmasked data where it counts.
Embedded references are scrubbed. Records contain no internal URLs, real hostnames, or live tokens. Why: an agent can act on a URL buried in a data field.
Data fidelity matches the test. Realism is high enough to surface real failures, low enough to stay safe. Why: too-clean data hides edge cases; too-real data raises the cost of a leak.
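
Part of "verified, not assumed" can be mechanical. As a minimal sketch, assuming records are available as dictionaries, the scan below flags fields that still contain a URL, an email address, or something shaped like a bearer token. The patterns and the name `scan_records` are illustrative assumptions and would need tuning to your own data.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Illustrative patterns only; extend them for your own identifiers and hosts.
PATTERNS: Dict[str, re.Pattern] = {
    "url": re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9._~+/=-]{16,}"),
}

Finding = Tuple[int, str, str]  # (record index, field name, pattern name)


def scan_records(records: Iterable[Dict[str, object]]) -> List[Finding]:
    """Return one finding per (record, field, pattern) that matched."""
    findings: List[Finding] = []
    for index, record in enumerate(records):
        for field_name, value in record.items():
            text = str(value)
            for pattern_name, pattern in PATTERNS.items():
                if pattern.search(text):
                    findings.append((index, field_name, pattern_name))
    return findings
```

An empty result is evidence, not proof: a pattern scan finds only what it knows how to look for, so a human spot check of sensitive fields is still worthwhile.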

Control checks

These confirm the agent can only do what you intend, at a bounded cost.

Permissions

Least privilege is enforced. The agent has only the tools the task requires and nothing more. Why: every extra capability is another escape route for a mistake.
Dangerous actions are mocked. High-stakes capabilities like payments are simulated, not live. Why: you want to observe intent without consequence.
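
Both permission items can be expressed in a few lines. The sketch below is an illustration under assumptions, not a real agent framework API: `ToolRegistry` exposes only the tools explicitly granted for a task, and `MockPayments` stands in for a payment tool by recording what the agent tried to do without moving any money. All class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class MockPayments:
    """Stand-in for a payment tool: records intent, moves no money."""

    calls: List[Dict[str, Any]] = field(default_factory=list)

    def charge(self, account: str, amount_cents: int) -> Dict[str, Any]:
        self.calls.append({"account": account, "amount_cents": amount_cents})
        return {"status": "simulated", "amount_cents": amount_cents}


class ToolRegistry:
    """Exposes only the tools explicitly granted for this task."""

    def __init__(self, granted: Dict[str, Callable[..., Any]]) -> None:
        self._granted = dict(granted)

    def call(self, name: str, **kwargs: Any) -> Any:
        # Anything not granted is refused rather than silently ignored.
        if name not in self._granted:
            raise PermissionError(f"tool not granted: {name}")
        return self._granted[name](**kwargs)
```

After a run, `MockPayments.calls` shows exactly which charges the agent attempted, which is the "intent without consequence" the item describes.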

Limits

Token spend is capped. A hard ceiling stops runaway loops. Why: a looping agent can burn a startling bill overnight.
Action rate is limited. The agent cannot hammer a tool or API unbounded. Why: rate limits are a circuit breaker, not a tuning knob.
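
A hard ceiling and a rate limit are both small pieces of code. The following is a minimal sketch, assuming the agent loop calls `spend` before each model request and `check` before each tool call; the names `TokenBudget`, `RateLimiter`, and `BudgetExceeded` are hypothetical, and the sliding window is just one of several reasonable limiting strategies.

```python
import time
from collections import deque
from typing import Callable, Deque


class BudgetExceeded(RuntimeError):
    """Raised when a hard limit would be crossed."""


class TokenBudget:
    """Hard ceiling on total tokens spent during a run."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def spend(self, tokens: int) -> None:
        # Check before recording so the ceiling is never exceeded.
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(f"token cap {self.max_tokens} would be exceeded")
        self.used += tokens


class RateLimiter:
    """Sliding-window limit: at most max_calls per window_seconds."""

    def __init__(
        self,
        max_calls: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock
        self._calls: Deque[float] = deque()

    def check(self) -> None:
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            raise BudgetExceeded("action rate limit reached")
        self._calls.append(now)
```

Raising an exception rather than returning a flag means a loop that forgets to inspect the result still stops.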

The reasoning behind mocking dangerous actions is illustrated well in our real-world examples.

Verification checks

These confirm the whole thing actually works, rather than appearing to.

Observability is on from the first run. Every prompt, tool call, command, and output is logged. Why: it is both your debugger and your audit trail, and you cannot add it retroactively.
Adversarial breakout was attempted. You instructed an agent to reach a forbidden endpoint and persist a file past teardown, and both failed. Why: a wall you have not tested is a wall you are only hoping holds.
Teardown is clean. Destroying and recreating the sandbox leaves no residual state. Why: leftover state contaminates the next run and can leave credentials behind.
Checks are scheduled, not one-time. This whole list runs on a recurring basis, not just at setup. Why: isolation erodes quietly as people add allowlist entries for convenience.
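
For the teardown item, one simple approach is to compare what a freshly provisioned sandbox contains with what it contains after a destroy-and-recreate cycle. The sketch below assumes the sandbox filesystem is visible at a local path; `snapshot` and `residual_paths` are hypothetical helper names, and the comparison covers file paths only, not file contents, environment variables, or state held outside the filesystem.

```python
from pathlib import Path
from typing import Set


def snapshot(root: Path) -> Set[str]:
    """Return the relative path of every file and directory under root."""
    return {path.relative_to(root).as_posix() for path in root.rglob("*")}


def residual_paths(baseline: Set[str], after_recreate: Set[str]) -> Set[str]:
    """Paths present after recreation that a fresh sandbox did not have."""
    return after_recreate - baseline
```

Take the baseline from a known-clean sandbox, run the breakout attempt, destroy and recreate, then snapshot again. Any path in the result survived teardown.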

If a verification check fails, you have found a hole while it is still cheap. That is the checklist doing its job. For the failure modes these catch, see our common mistakes guide, and for the full conceptual grounding, the complete guide.

How to actually use this checklist

Do not treat this as a one-time setup ritual. The value is in repetition.

Run the full list before any unattended overnight agent run.
Run the isolation and verification groups after every configuration change, since those are where drift creeps in.
Run the whole thing on a calendar schedule, monthly at minimum, regardless of whether anything seems wrong.

The teams that get burned are almost never the ones who lacked a checklist. They are the ones who ran it once, at setup, and assumed it stayed true. It does not. Make the checklist a habit, not an event.

Turning the checklist into a gate

A checklist that lives in a document gets read; a checklist that lives in your pipeline gets enforced. The strongest move is to convert as many of these items as possible into automated gates that block an unattended run when they fail.

Automate the isolation checks. A script can confirm that outbound networking is denied and that the metadata endpoint is unreachable, then refuse to launch the agent if either is open. This removes human forgetfulness from the most dangerous item.
Automate the adversarial breakout. The breakout test can run as part of provisioning, so every fresh sandbox proves its own walls before any real task touches it. A sandbox that cannot pass its own breakout test never gets used.
Automate the spend cap. Caps enforced in code cannot be skipped under deadline pressure the way a manual check can. Wire them into the environment itself, not into someone's memory.
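
A gate of this kind can be very small. The sketch below is one possible shape, with hypothetical names throughout: each check is a named function returning True on pass, checks run in order, and the agent is launched only if every one passes. A check that raises is treated as a failure, so a broken probe cannot wave a run through.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run checks in order; return the first failing check's name, or None."""
    for name, check in checks:
        try:
            passed = check()
        except Exception:
            # A check that crashes counts as a failure, not a pass.
            passed = False
        if not passed:
            return name
    return None


def launch_if_ready(checks: List[Check], launch: Callable[[], None]) -> bool:
    """Call launch only when every check passes; return whether it launched."""
    failed = first_failure(checks)
    if failed is not None:
        print(f"refusing to launch: {failed} failed")
        return False
    launch()
    return True
```

Listing the isolation checks first mirrors the order of the checklist: the gate stops at the first failure, so later checks never run against walls that are not real.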

Some items resist automation, such as judging whether data fidelity matches the test or deciding whether a capability should be mocked. Those are exactly the ones worth a human's deliberate attention. Automate the mechanical checks so people can spend their judgment where judgment is actually required.

Adapting the checklist to your context

This list is a baseline, not a ceiling. Three adjustments make it fit your situation.

First, weight the groups by your dominant risk. A team running untrusted generated code should expand the isolation group; a team running expensive autonomous loops should expand the control group. Add items where your specific danger lives.

Second, capture your own near-misses as new checklist entries. Every time your sandbox catches something the list did not anticipate, add the item that would have caught it sooner. The checklist should grow from your scars, not just from this article.

Third, keep it short enough to actually run. A checklist nobody completes protects nothing. If yours grows unwieldy, automate the mechanical items rather than deleting them, so the human-facing list stays runnable. Our framework article helps you decide which dimensions deserve the most attention.

Frequently Asked Questions

Which group should I never skip under time pressure?

Isolation, every time. The other groups assume the walls hold; if they do not, nothing else protects you. When you are rushed, run the four isolation checks at minimum and defer the rest only if you genuinely must. An uncontained run is the one to refuse.

How is this different from a generic security checklist?

This list targets the specific ways AI agents fail: looping spend, generated-code execution, and acting on data they were given. Generic security checklists miss these because they predate autonomous agents. The verification group, and adversarial breakout testing in particular, is tailored to agentic risk.

How long does running the full checklist take?

After the first run, most items take only minutes to confirm, provided you have scripted them. The adversarial breakout test takes the longest because you actually run an agent against the walls. The investment is upfront; recurring runs are fast.

What if I cannot mask my data realistically enough?

Then lower your test fidelity rather than reaching for raw data. A less realistic but fully safe synthetic set is almost always the right trade. If a test genuinely requires production realism, that test belongs in a more controlled staging environment, not your sandbox.

Should non-technical team members run this checklist?

The isolation and control setup is usually an engineering responsibility, but the verification group benefits from a second set of eyes. Anyone can confirm that an adversarial breakout test failed and that logs are present. Shared verification reduces the chance that one person's blind spot becomes a leak.

Key Takeaways

Run the checklist in order: isolation first, then data, control, and verification, since later groups assume earlier ones pass.
Never skip the isolation group; if the walls do not hold, no other check protects you.
Confirm data is masked and scrubbed in fact, not just in intent, and match fidelity to the test.
Enforce least privilege, mock dangerous actions, and cap both spend and action rate before unattended runs.
The checklist's value is in repetition; run it on a schedule, not once at setup, because isolation erodes quietly.

Agency Script Editorial
