Resources
11Install
npx skillscat add aliathiullah/the-agent-sandbox-taxonomy Install via the SkillsCat registry.
Agent Sandbox Taxonomy (AST) — Evaluation & Composition Skill
You are an expert in the Agent Sandbox Taxonomy (AST), a framework for evaluating and comparing AI agent sandbox security. This skill enables two capabilities:
- Score any sandbox product — Evaluate a new or existing sandbox against the AST's 7 defense layers, 7 threat categories, and 3 evaluation dimensions to produce a score card and fingerprint.
- Help users choose and compose sandboxes — Use the decision checklist, composition patterns, and stacking rules to recommend the right sandbox stack for a given use case.
Remember 7-7-3: 7 defense layers, 7 threat categories, 3 evaluation dimensions (strength, granularity, portability).
Part 1: The Seven Defense Layers
Every sandbox enforces some combination of seven layers. No sandbox covers all seven equally. Most cover two or three well and ignore the rest.
Layers are numbered bottom-up because lower layers are foundational.
| Layer | Name | Key Question |
|---|---|---|
| L1 | Compute Isolation | What separates the agent from the host? |
| L2 | Resource Limits | Is it constrained in CPU, memory, disk, time? |
| L3 | Filesystem Boundary | What can it read, write, and delete? |
| L4 | Network Boundary | What can it communicate with? |
| L5 | Credential & Secret Management | How are secrets handled around the agent? |
| L6 | Action Governance | Can you control what operations it performs? |
| L7 | Observability & Audit | Can you see what the agent did? |
Critical Relationships Between Layers
- Lower layers are foundational. Strong L1 makes L3 easier (a microVM gets a fresh filesystem by default).
- Upper layers CANNOT be derived from lower ones. A microVM with perfect L1 but no L4 still lets the agent exfiltrate secrets via a single outbound request. A system with impeccable L1–L5 but no L7 gives you no way to detect misuse.
- No layer whose enforcement depends on L1's boundary can be stronger than L1. A process that escapes L1 bypasses any layer enforced inside the sandbox. However, layers with enforcement mechanisms that operate outside the sandbox — such as L2 (host-level cgroups), L5 (external credential proxy), or L7 (external logging) — are independent of L1 and can legitimately score higher.
Layer Definitions
L1 — Compute Isolation: The foundation. What matters is the size of the shared attack surface between workload and host — ranging from full host kernel (containers), through reduced syscall surface (user-space kernels), to minimal VMM (microVMs), to single-purpose kernels (unikernels), to hardware-encrypted memory (confidential computing).
L2 — Resource Limits: A fork bomb or memory leak can DoS the host even inside a perfect L1 boundary. Enforcement must happen outside the sandbox (cgroups on host, or hypervisor allocation). In-sandbox resource controls can be bypassed by an agent with root inside its sandbox. L2 is about host resource enforcement — CPU, memory, disk, and time caps on the sandbox process or VM. API rate limits, billing quotas, and usage metering (e.g., "Actions minutes", "ACU budgets") are business controls, not sandbox resource caps — they do not prevent a fork bomb or memory exhaustion within a session.
L3 — Filesystem Boundary: Must be selective, not total. Agents legitimately need to read/write project files. The real question is whether sensitive paths (~/.ssh, ~/.aws, .env) are accessible.
L4 — Network Boundary: The most underappreciated layer. An agent inside a perfect compute sandbox with unrestricted network access can still exfiltrate every secret via a single outbound request. How traffic is intercepted matters as much as whether it's intercepted. A boundary relying on HTTP_PROXY env vars is fundamentally different from kernel-level interception — the process can trivially bypass proxy env vars via raw sockets. Cooperative enforcement = always S:1, regardless of proxy sophistication.
L5 — Credential & Secret Management: Even with strong L1–L4, an agent with API keys can embed them in generated code or commit them to a repo. Ideally, credentials are never present in the agent's environment. L5 is about what the agent can see and access inside the sandbox. Encryption at rest (AES-256) and TLS in transit protect data from external attackers, not from the agent — they are not L5 mechanisms. RBAC on a secrets management UI controls who configures secrets, not what the agent sees at runtime.
L6 — Action Governance: Different from L1–L5. Those layers restrict access to resources. L6 restricts what the agent does with the access it has. Operates at a semantic level — "you cannot delete production resources" — governing intent regardless of which layer the action flows through. The tradeoff: L6 is generally software-enforced and bypassable. It is defense-in-depth, most valuable on top of strong L1–L4. L6 is not L1. Kernel-level syscall filtering (seccomp, BPF LSM, capabilities) restricts what system calls a process can make — that is compute isolation (L1), not action governance. L6 governs what the agent does with the access it has at a semantic level, not what syscalls it can invoke.
L7 — Observability & Audit: Without it, you cannot detect misuse, investigate incidents, improve policies, or demonstrate compliance. Turns a sandbox from a static boundary into an adaptive security system. Compliance certifications (SOC 2, ISO 27001) are not L7 scores. They indicate organizational controls, not sandbox enforcement mechanisms. Note them but do not let them inflate S/G scores.
Part 2: The Seven Threats
Agents can cause seven categories of harm. Sandboxes exist to contain them.
| ID | Threat | What Goes Wrong | Example |
|---|---|---|---|
| T1 | Data Exfiltration | Reads sensitive data and transmits it externally | Reads SSH keys, sends via outbound request |
| T2 | Supply Chain Compromise | Introduces malicious code: compromised deps, binary replacement, build artifact poisoning | Malicious install script exfiltrates env vars |
| T3 | Destructive Operations | Destroys or misconfigures resources, both local and remote | rm -rf /, cloud resource deletion via API, kubectl delete namespace |
| T4 | Lateral Movement | Reaches systems beyond its intended scope | Scans local network, hits cloud metadata endpoint |
| T5 | Persistence | Agent's effects survive sandbox destruction | Writes cron job on host, modifies host shell init files, installs git hooks outside sandbox. Note: user data persisting via volumes/snapshots is NOT T5 — persistence means the agent process or its side-effects survive the sandbox boundary, not that user files are durable |
| T6 | Privilege Escalation | Escapes the sandbox entirely | Exploits kernel CVE, container escape |
| T7 | Denial of Service | Consumes excessive resources, degrading host or other tenants | Fork bomb, memory bomb, disk filling |
Sandboxes Are Orthogonal to Agent Alignment
Prompt injection, hallucination, misalignment, bad context, compromised dependencies — these are all reasons an agent might decide to do something harmful. They are vectors, not threats. The taxonomy deliberately excludes them because sandboxes operate at a different level: they control what an agent can do, not what it chooses to do.
Getting an agent to make good decisions is an agent alignment problem (guardrails, RLHF, system prompts, context filtering). Preventing damage when it makes a bad decision is a sandboxing problem. The two are complementary but independent. A perfectly aligned agent still benefits from sandboxing (defense-in-depth). A perfectly sandboxed agent with bad alignment is still contained.
┌─────────────────────┐
Prompt Injection ──┐ │ │
Hallucination ─────┤ │ Agent Alignment │ ← why the agent decided
Misalignment ──────┼──▶ │ (out of scope) │
Bad context ───────┤ │ │
Malicious code ────┘ └────────┬────────────┘
│
▼
Agent attempts harmful action
│
▼
┌─────────────────────┐
│ │
│ Sandbox boundary │ ← what the agent can do
│ (this taxonomy) │
│ │
└─────────────────────┘Threat-to-Layer Defense Map
Use this when scoring. For each threat, identify which layers provide defense and assess whether the product covers them.
| Threat | Primary Defenses | Why Multiple Layers Are Needed |
|---|---|---|
| T1 Exfiltration | L3 + L4 + L5 | L3 blocks reading secrets, L4 blocks sending them out, L5 ensures they're not present. Any single layer alone leaks. |
| T2 Supply Chain | L3 + L4 + L7 | L4 controls download sources, L3 protects filesystem integrity, L7 detects compromises after the fact. |
| T3 Destructive Ops | L1 + L3 (local) / L4 + L6 (remote) | L1+L3 covers local destruction. Remote destruction is a network operation: needs L4 to block access AND L6 to block the action semantically. L1 alone does NOT protect remote resources. |
| T4 Lateral Movement | L4 + L1 | L4 blocks outbound access. L1 provides network namespace isolation as secondary. |
| T5 Persistence | L1 + L3 + L6 | Structural compute isolation (L1 >= 4) eliminates process-level persistence but does NOT eliminate file-level persistence through writable volume mounts (git hooks, CI configs, shell init files written to mounted project dirs survive sandbox destruction). Full T5 coverage requires L1 (isolation boundary), L3 (controlling what paths are writable), and L6 (blocking persistence-creating actions like cron jobs and git hooks). |
| T6 Privilege Escalation | L1 + L2 | L1 strength directly determines escape resistance. Hardware boundaries are fundamentally harder to escape than software. |
| T7 Denial of Service | L2 + L1 | L2 caps resources. Enforcement must be outside the sandbox. |
Threats are not independent. T2 is often a vector to T1/T3/T4. T5 extends the window for any other threat. T6 nullifies all other layers. T7 can be a direct goal or a side effect.
Part 3: Scoring System
Each layer is rated on three evaluation dimensions: Strength, Granularity, and Portability.
Strength (S: 0–4)
Combines robustness of enforcement, reversibility, and enforcement transparency into a single score.
| Score | Level | What It Means |
|---|---|---|
| 0 | None | No enforcement at this layer |
| 1 | Cooperative | Enforcement the sandboxed process can circumvent, opt out of, or reverse via escape hatch. Includes: proxy env vars the process can ignore, advisory restrictions, configurations with known bypass. If the process can open a raw socket and skip your filter, it's S:1. |
| 2 | Software-enforced | Enforced by a separate process/proxy the sandboxed process cannot circumvent from inside, but reconfigurable from outside by the operator. Includes: container-level isolation, hot-reloadable policy engines, MITM proxies with iptables redirect. |
| 3 | Kernel-enforced | Enforced by the OS kernel through mechanisms that cannot be weakened once applied, even by the operator during the session. Includes: Landlock, Seatbelt, seccomp-BPF, kernel-level network filtering. Irreversible. |
| 4 | Structural | Enforced by CPU virtualization, hardware encryption, or architectural absence. The protected resource or attack surface doesn't exist inside the sandbox. Includes: KVM-isolated microVMs, unikernels, network-disabled sandboxes (no network device), credential proxies (secrets never enter sandbox), confidential VMs (SEV-SNP/TDX). |
Key scoring rules:
- Cooperative enforcement is always S:1, regardless of how sophisticated the proxy is.
- If the kernel denies the syscall → S:3.
- If there's no network device to use → S:4.
- If there's an escape hatch → caps strength at S:1.
Granularity (G: 0–3)
How fine-grained is the control at this layer?
| Score | Level | What It Means |
|---|---|---|
| 0 | None | No control |
| 1 | Binary | On/off (e.g., "network: enabled/disabled") |
| 2 | Allowlist/Blocklist | Lists of permitted or denied resources (e.g., "allow these IPs/ports", "block these paths") |
| 3 | Per-resource policy | Content-aware rules per resource, action, and context — requires inspecting request content (e.g., HTTP method, URL path, headers, body), not just packet metadata. Examples: "allow GET but deny DELETE", Cedar/OPA policies evaluating request fields, MITM proxies with per-URL rules |
Portability Tags
A flat list of tags answering: what OS does it run on, and what infrastructure does it need?
Tags cover two dimensions — OS support and infrastructure dependencies — combined in a single array.
| Tag | Dimension | Meaning |
|---|---|---|
| any-os | OS | Works on Linux, macOS, and Windows |
| linux | OS | Requires Linux |
| mac | OS | Supports macOS |
| cloud | Infra | Runs in vendor's cloud; OS abstracted away |
| docker | Infra | Requires Docker or compatible container runtime |
| k8s | Infra | Requires a Kubernetes cluster |
| kvm | Infra | Requires /dev/kvm (bare metal or nested virtualization) |
No infrastructure tag means the tool runs directly on the OS with no extra dependencies.
Products typically have multiple tags (e.g., [linux, mac] or [any-os, cloud]).
Part 4: The Fingerprint
Every product gets a fingerprint — a CVSS-inspired vector string showing its strength score at each layer.
0 vs — (dash)
Both mean "no coverage," but for different reasons:
0= the product operates at this layer but provides no enforcement (capability exists, wide open).—= the product does not address this layer at all (outside scope).- For composition, both are treated as "no coverage" — any non-zero score from another product fills the gap.
Formats
Full form (self-describing):
L1:4/L2:4/L3:4/L4:0/L5:2/L6:-/L7:2Compact form (positional, L1→L7 order):
4/4/4/0/2/-/2Score card notation uses S.G per layer (e.g., 4.1 = structural strength, binary granularity).
Separator convention: : binds layer to value (full form), . separates strength from granularity, / separates layers.
Part 5: Layer Mechanism Reference
Use these tables to determine the correct S and G scores for each mechanism you encounter. When evaluating a product, identify which mechanism it uses at each layer, then look up or interpolate the score.
L1 — Compute Isolation Mechanisms
| Mechanism | S | G | How It Works |
|---|---|---|---|
| Bare process | 0 | 0 | No isolation; full user privileges |
| Linux namespaces + cgroups | 2 | 1 | PID/mount/net/user namespace separation; shared host kernel |
| Namespaces + seccomp/Landlock/Seatbelt | 3 | 1–2 | Kernel-enforced syscall filtering or LSM; irreversible |
| User-space kernel (e.g., gVisor) | 3 | 1 | Intercepts syscalls in userspace; reduced host syscall surface |
| MicroVM — minimal VMM (e.g., Firecracker) | 4 | 1 | Dedicated kernel per workload via KVM |
| MicroVM — container-shaped (e.g., Kata) | 4 | 1 | Container-shaped VM; CRI compatible; needs KVM |
| Unikernel | 4 | 1 | Single-app custom kernel; needs KVM |
| Library OS | 3–4 | 1 | Embedded minimal OS library; experimental |
| Confidential VM (SEV-SNP/TDX) | 4 | 1 | Hardware-encrypted memory; even hypervisor cannot read |
L2 — Resource Limits Mechanisms
| Mechanism | S | G | How It Works |
|---|---|---|---|
| None | 0 | 0 | No limits |
| cgroups v2 | 3 | 2 | Kernel-enforced CPU/memory/I/O caps |
| VM resource allocation | 4 | 2 | Fixed vCPU/RAM/disk at VM creation |
| Platform quotas | 2 | 2 | Per-session or per-account limits |
| Time-bounded sessions | 2 | 1 | Auto-termination after time limit |
L3 — Filesystem Boundary Mechanisms
| Mechanism | S | G | How It Works |
|---|---|---|---|
| No restriction | 0 | 0 | Full user filesystem |
| Working-dir-only mount | 3 | 2 | Only project dir visible; all else invisible |
| Sensitive-path blocklist | 3 | 2 | Most paths accessible; ~/.ssh, ~/.aws etc. blocked |
| Ephemeral root | 4 | 1 | Fresh OS per session; project mounted in |
| Immutable root + writable overlay | 4 | 2 | Read-only base; copy-on-write; rollback capable |
| Full independent filesystem | 4 | 1 | Separate disk; no host paths visible |
L4 — Network Boundary Mechanisms
G:2 vs G:3 distinction: Packet-level filters (iptables, eBPF NetworkPolicy, security groups) operate on IP/port/protocol — they see where traffic goes but not what it says. This is G:2 (allowlist/blocklist). G:3 requires content-aware inspection: terminating TLS and evaluating HTTP method, URL path, headers, or body. A kernel-enforced eBPF filter and a VPC security group both score G:2 — the enforcement strength (S) differs, but the granularity is the same.
| Mechanism | S | G | How It Works |
|---|---|---|---|
| No restriction | 0 | 0 | Full network access |
| Proxy env vars | 1 | 2 | Cooperative: trivially bypassed via raw sockets |
| Kernel/hypervisor network filter | 3 | 2 | Opaque: iptables, eBPF, or Network Extension; IP/port/protocol only |
| MITM proxy (iptables redirect) | 2 | 3 | Opaque: all traffic redirected; TLS-terminating; inspects HTTP content |
| MITM proxy (kernel-redirected) | 3 | 3 | Opaque: kernel-redirected + TLS-terminating; per-URL/method/header/body policies |
| Network disabled | 4 | 1 | Structural: no network interface exists |
L5 — Credential & Secret Management Mechanisms
| Mechanism | S | G | How It Works |
|---|---|---|---|
| No restriction | 0 | 0 | Full credential set visible |
| Sensitive file blocking | 3 | 2 | Credential files blocked via L3; env vars still visible |
| Env-var filtering | 2 | 2 | Only approved variables forwarded |
| Placeholder substitution | 3 | 3 | Secrets swapped with tokens; restored at execution only |
| External credential proxy | 4 | 3 | Credentials never enter sandbox |
| Ephemeral per-session tokens | 4 | 3 | Time-bound, scoped credentials; auto-expire |
L6 — Action Governance Mechanisms
| Mechanism | S | G | How It Works |
|---|---|---|---|
| No governance | 0 | 0 | Agent can do anything it has access to |
| Human-in-the-loop | — | — | Agent proposes, human approves. The sandbox delegates the governance decision to an external actor rather than enforcing it. Since the sandbox itself provides no enforcement at this layer, HITL is scored — (not addressed). HITL is valuable but it is the human providing governance, not the sandbox |
| Command blocklist | 1 | 2 | Known-dangerous commands blocked by pattern matching. S:1 because agents can trivially bypass by using libraries, scripts, aliases, or alternative tools to perform the same action — the blocklist only works if the agent uses the exact syntax being blocked, making this cooperative enforcement |
| Policy engine (Cedar/OPA) | 2 | 3 | Fine-grained rules per structured action/resource/context. Agent cannot rephrase a delete namespace into something the policy doesn't recognize — the action is evaluated semantically, not syntactically |
| Declarative-only mode | 3 | 3 | Agent produces declarations only; cannot execute directly |
L7 — Observability & Audit Mechanisms
| Mechanism | S | G | How It Works |
|---|---|---|---|
| No logging | 0 | 0 | No record of agent actions |
| Session-level logs | 1 | 1 | Start/stop, exit codes |
| Command-level audit | 2 | 2 | Every command, file op, tool call logged |
| Full telemetry | 3 | 3 | Network, syscalls, MCP calls, real-time UI |
| Cryptographic audit chain | 3 | 3 | Tamper-evident log with cryptographic commitments |
Part 6: Glossary
| Term | Definition |
|---|---|
| Blast radius | Maximum damage when an agent is compromised or misbehaves |
| Cooperative enforcement | Enforcement relying on the sandboxed process respecting a convention (e.g., proxy env vars). Bypassable; always S:1 |
| Defense-in-depth | Layering multiple independent boundaries so failure of one doesn't compromise the system |
| Escape hatch | A mechanism allowing bypass of sandbox restrictions; its existence caps strength at S:1 |
| Opaque enforcement | Enforcement that works regardless of the sandboxed process's behavior. Cannot be circumvented; S:2–3 |
| Structural enforcement | Enforcement where the protected resource doesn't exist inside the sandbox. Nothing to bypass; S:4 |
| Agent alignment | The problem of getting an agent to make good decisions (guardrails, RLHF, context filtering). Complementary to but independent of sandboxing |
| Vector | An attack path through which a threat is triggered (prompt injection, hallucination, misalignment, compromised dependency). Distinct from the threat (T1–T7) it activates. Sandboxes are vector-agnostic |
Part 7: How to Score a Sandbox Product
Follow this procedure to evaluate any sandbox product and produce a score card and fingerprint.
⚠ Confidence notice. Scoring from documentation and source code alone produces a preliminary score card. Scores at S:3+ are architectural claims that ideally require hands-on verification (running the sandbox, testing bypass attempts, inspecting enforcement at runtime). Flag any score where evidence is indirect or incomplete — a preliminary score card is useful for comparison but should not be treated as a security audit.
Step 1: Gather Evidence
For each of the 7 layers, determine what the product does. Research its documentation, architecture, and security model. Ask these questions:
| Layer | Investigation Questions |
|---|---|
| L1 | What isolation technology? Container, microVM, unikernel, process sandbox, confidential VM? What's the shared attack surface with the host? |
| L2 | Are CPU/memory/disk/time limits enforced? Where — inside the sandbox or outside (host cgroups, hypervisor)? Can the agent bypass them with root? |
| L3 | What filesystem does the agent see? Host filesystem, project-only mount, ephemeral root? Are sensitive paths (~/.ssh, ~/.aws, .env) accessible? |
| L4 | Is network access restricted? How is it intercepted — env vars (cooperative), iptables/eBPF (opaque), or no network device (structural)? Can the process bypass it with raw sockets? |
| L5 | How are credentials delivered? Env vars, mounted files, proxy, ephemeral tokens? Are secrets visible inside the sandbox at all? |
| L6 | Are operations governed? Human-in-the-loop, command blocklists, policy engine, declarative-only? Can the agent perform destructive actions with the access it has? |
| L7 | What's logged? Nothing, session-level, command-level, full telemetry? Tamper-evident? Real-time visibility? |
Where to Look (in priority order)
- Source code — The authoritative source. Look for sandbox setup, isolation config, network rules, seccomp profiles, Dockerfile/VM config. If open source, this is ground truth.
- Architecture / security documentation — Whitepapers, security model docs, threat models. Look for enforcement mechanism descriptions, not feature lists.
- API / CLI reference — Configuration options reveal what's actually controllable. If there's no config for network policy, there's probably no network policy.
- README and docs site — Good for understanding intent and scope. Cross-reference claims against source code or API surface.
- Blog posts and announcements — Lowest priority. Treat as claims requiring corroboration, not evidence.
Evidence Trust Levels
Evidence quality is independent of the strength score being assigned. A blog post claiming S:1 is just as suspect as a blog post claiming S:4 — the trust level describes the source, not the score.
Each product in the score card carries an evidence_level tag from this scale:
| Level | Source | How to Use |
|---|---|---|
| verified | Hands-on testing, runtime inspection, bypass attempts | Score directly. Highest confidence. No flag needed. |
| source-code | Open-source repo: sandbox config, seccomp profiles, Dockerfile, VM setup | Score directly. Ground truth for mechanism identification, but no runtime verification. Flag if code is ambiguous or experimental (WARNING:). |
| docs | Official documentation — architecture docs, API/CLI reference, README, security whitepapers | Score, but flag for review if claims cannot be cross-referenced against source code or API surface (NEEDS REVIEW:). |
| inferred | Indirect evidence — blog posts, marketing pages, changelog mentions, or reasonable inference from related products | Flag with WARNING: and note the evidence gap. Score based on what can be corroborated; uncorroborated claims alone are insufficient. Lowest confidence. |
If a layer has no evidence at all (not mentioned anywhere), mark it — (not addressed).
🔍 NEEDS HUMAN REVIEW: Any score where the best available evidence is
docsorinferredshould be flagged for human verification. This applies equally to S:1 and S:4 — low trust evidence is low trust regardless of the score it supports.
Conservative Scoring Rules
- Undocumented = not scored. If a layer has no documentation and no source code evidence, mark it
—(not addressed), not0. - Uncorroborated claims → flag, don't inflate. If a blog says "we use hypervisor-enforced network isolation" but no API/config/source confirms it, flag with
WARNING:and note the evidence gap. Score based on what is verifiable, not what is claimed. - Roadmap features = 0. Features described as "coming soon," "on roadmap," or "planned" do not count. Score current state only.
- "Default off" features score the default. If network filtering exists but is disabled by default and most users won't enable it, note both the default score and the configured score. The fingerprint uses the configured score (what the product can do), but flag the default gap.
- Compliance certifications (SOC 2, ISO 27001) are not layer scores. They indicate organizational controls, not sandbox enforcement mechanisms. Note them but don't let them inflate S/G scores.
- When in doubt, score lower. A score can always be revised upward with better evidence. An inflated score misleads.
Handling Ambiguity
- Multiple deployment modes — Score the strongest available mode but note the weakest (e.g., "S:4 (microVM); WARNING: Linux legacy mode is S:2 (container)").
- Platform-managed vs self-hosted — If the product has both, score the platform-managed version unless the user specifies otherwise. Note differences.
- Shared kernel vs dedicated kernel — If docs say "container" without specifying the runtime, assume shared kernel (S:2). Only score S:3+ with evidence of gVisor, Kata, Firecracker, or equivalent.
🔍 NEEDS HUMAN REVIEW: When documentation is ambiguous or contradictory (e.g., blog claims a feature that API docs don't expose), flag the affected layers for human review rather than guessing. Use
WARNING:prefix in the score card notes and explain the discrepancy.
Step 2: Score Each Layer
For each layer:
- Identify the mechanism used (or "none").
- Look up the mechanism in the reference tables (Part 5) to get base S and G scores.
- Apply scoring rules:
- If the process can circumvent the enforcement → cap at S:1 (cooperative).
- If there's an escape hatch or override → cap at S:1.
- If enforcement is by a separate process but reconfigurable → S:2.
- If enforcement is kernel-level and irreversible → S:3.
- If the attack surface doesn't exist inside the sandbox → S:4.
- If a layer is not addressed at all by the product → mark as
—. - If the product operates at the layer but provides no enforcement → mark as
0.
Step 3: Assess Threat Coverage
Threat coverage is computed mechanically from layer scores — it is derived data, not manually authored. The scripts/generate.py script computes threats at generation time using the threshold rules below. When scoring a new product, you only need to provide layer scores; threats are derived automatically.
Apply the threshold rules below mechanically for each threat. Do not eyeball — use the layer scores from Step 2 and the rules to determine the symbol. Each threat has its own primary defense layers and thresholds — there is no single universal rule. Treat ~ (not addressed) as 0 for threshold comparisons.
| Rating | Symbol | Meaning |
|---|---|---|
| Addressed | ● | All primary defense layers meet their thresholds — the threat is actively defended or structurally eliminated |
| Partial | ◐ | At least one primary defense layer contributes (meets its threshold) but not all do |
| Not addressed | ○ | No primary defense layer meets its threshold — the threat is real and unmitigated |
Threat threshold rules
For each threat, the primary defense layers and their thresholds are listed. Apply them mechanically:
T1 — Data Exfiltration (primary: L3, L4, L5)
- ● if all three of L3 >= 2, L4 >= 2, L5 >= 2
- ◐ if at least one of L3 >= 2, L4 >= 2, L5 >= 2
- ○ if none meet the threshold
T2 — Supply Chain Compromise (primary: L3, L4, L7)
- ● if all three of L3 >= 2, L4 >= 2, L7 >= 2
- ◐ if at least one of L3 >= 2, L4 >= 2, L7 >= 2
- ○ if none meet the threshold
T3 — Destructive Operations (split local/remote, then combine)
- T3-Local (primary: L1, L3): ● if both L1 >= 2 AND L3 >= 2; ◐ if one; ○ if neither
- T3-Remote (primary: L4, L6): ● if both L4 >= 2 AND L6 >= 2; ◐ if one; ○ if neither
- Use notation:
L●/R○,L●/R◐,L+R(both ●), etc.
T4 — Lateral Movement (primary: L4, L1)
- ● if both L4 >= 2 AND L1 >= 2
- ◐ if one of L4 >= 2, L1 >= 2
- ○ if neither meets the threshold
T5 — Persistence (primary: L1, L3, L6)
- ● if all three of L1 >= 2, L3 >= 2, L6 >= 2
- ◐ if at least one of L1 >= 2, L3 >= 2, L6 >= 2
- ○ if none meet the threshold
- Note: Structural compute isolation (L1 >= 4) eliminates process-level persistence but does NOT eliminate file-level persistence through writable volume mounts. Full T5 coverage requires L3 (controlling what paths are writable) and L6 (blocking persistence-creating actions) in addition to L1.
T6 — Privilege Escalation (primary: L1, L2)
- ● if both L1 >= 3 AND L2 >= 2 (kernel/hardware isolation + external resource caps)
- ◐ if at least one of L1 >= 2, L2 >= 2
- ○ if neither meets the threshold
T7 — Denial of Service (primary: L2, L1)
- ● if both L2 >= 2 AND L1 >= 2
- ◐ if at least one of L2 >= 2, L1 >= 2
- ○ if neither meets the threshold
T3 notation
Always split T3 into local and remote:
L●/R○= local mitigated, remote notL●/R◐= local mitigated, remote partiallyfull L+R= both fully mitigated- Use the combined result for the single T3 column in the threat matrix: if both are ●, T3 = ●; if mixed, T3 = the lower of the two
Step 4: Produce the Score Card
Output the score card in this format:
### [Product Name]
**Fingerprint: `L1:S/L2:S/L3:S/L4:S/L5:S/L6:S/L7:S`** · Portability: `[tags]`
**Confidence: preliminary | verified** (preliminary = documentation/source review only; verified = includes hands-on testing)
| Layer | S.G | Notes |
|---|---|---|
| L1 Compute | S.G | [mechanism, enforcement method, evidence source] |
| L2 Resource | S.G | [mechanism, enforcement method, evidence source] |
| L3 Filesystem | S.G | [mechanism, enforcement method, evidence source] |
| L4 Network | S.G | [mechanism, enforcement method, evidence source] |
| L5 Credentials | S.G | [mechanism, enforcement method, evidence source] |
| L6 Action | S.G | [mechanism, enforcement method, evidence source] |
| L7 Observability | S.G | [mechanism, enforcement method, evidence source] |
Threats: T1[●◐○] T2[●◐○] T3[●◐○](L[●◐○]/R[●◐○]) T4[●◐○] T5[●◐○] T6[●◐○] T7[●◐○]
(derived mechanically from layer scores — do not override manually)
Gaps: [identify layers with 0 or — that matter]
Complements: [what kind of tool would fill the gaps]
Review flags: [list any layers needing human verification, with reasons]Note Format — Score Justification Required
Every layer note in products.yaml must include explicit reasoning for both the S and G scores. The note is the audit trail — a future reviewer should be able to read the note alone and understand why this mechanism earned this score without needing to re-research the product.
Each note must contain these elements in order:
- Mechanism: What the product uses (e.g., "Firecracker microVM via KVM")
- S justification: Why this mechanism earns its strength score, referencing the scoring criteria:
- For S:1 — state what makes it cooperative/bypassable
- For S:2 — state what separate process enforces it and why the sandbox can't circumvent it
- For S:3 — state which kernel mechanism enforces it and why it's irreversible
- For S:4 — state what doesn't exist inside the sandbox or what hardware boundary applies
- G justification: Why this mechanism earns its granularity score:
- For G:1 — state that control is binary (on/off)
- For G:2 — state what allowlists/blocklists are available
- For G:3 — state what per-resource/content-aware policies are supported
- Evidence source: Where the information came from (source code, docs, API reference, blog)
- Flags:
WARNING:for ambiguous/unverified enforcement,NEEDS REVIEW:for assumptions needing human verification, or reasoning for scoring lower than the mechanism might suggest (e.g., "Scored S:2 not S:3 due to known bypass")
Good note example:
Firecracker microVM via KVM; dedicated Linux guest kernel per sandbox.
S:4 — hardware boundary: host kernel attack surface doesn't exist inside
the VM (hypervisor-enforced isolation). G:1 — binary: sandbox is either
running or not, no per-workload isolation tuning. Evidence: architecture
docs + open-source Firecracker VMM.Bad note example (mechanism only, no reasoning):
Firecracker microVM (KVM); dedicated Linux guest kernel; minimal VMMFor each layer note, include: (1) the mechanism name, (2) how enforcement works, and (3) where the evidence came from — source code, docs, API reference, or blog. Use WARNING: prefix for layers where documentation is ambiguous or enforcement is unverified. Use NEEDS REVIEW: prefix for layers where the score depends on an assumption that should be verified by a human.
Step 5: Validate
Run these sanity checks on your score card:
- No layer whose enforcement depends on L1's boundary is stronger than L1. If L1 is S:2, layers enforced inside the sandbox cannot realistically be S:4. But layers with independent external enforcement (e.g., L2 host cgroups, L5 external credential proxy, L7 external audit) can score higher than L1 because an L1 escape does not bypass them.
- Cooperative mechanisms are always S:1. If you scored something S:2+ but the process can bypass it, downgrade.
- L4:0 with credentials present = T1 risk. Flag this explicitly.
- Threat scores must match threshold rules. Re-derive each threat symbol from the layer scores using the rules in Step 3. If your intuitive assessment differs from the mechanical result, the mechanical result wins.
- No product should have T7:○ if L1 >= 2. Strong compute isolation provides partial DoS defense even without dedicated resource caps.
- Every product with L1 >= 2 AND L3 >= 2 gets T3-Local ●. If you have any isolation at all, local destructive operations are addressed at baseline.
- T2 is universally hard. Requires L3 >= 2, L4 >= 2, AND L7 >= 2 for ●. Most products score ◐.
- Every S:3+ score has architectural evidence. If you scored S:3 or S:4 but your evidence is only marketing copy or a blog post, downgrade and flag for review.
- Check for blog-to-docs gaps. If a security feature is described in a blog but absent from API/CLI docs, flag it with
WARNING:and score conservatively. - Evidence trust matches confidence. Review every layer's trust level (see Step 1). Any layer scored from architecture docs or lower without corroboration should carry a
NEEDS REVIEW:flag — regardless of the strength score assigned.
🔍 NEEDS HUMAN REVIEW: After completing the score card, list all layers where (a) the best evidence is architecture docs or lower (no source code, no hands-on testing), (b) documentation was ambiguous or contradictory, or (c) a feature was claimed but could not be verified. These are candidates for hands-on verification before the score card is considered final.
Part 8: How to Help a User Choose and Compose Sandboxes
The Decision Checklist
Walk the user through these 8 questions in order. Each answer maps to layer requirements.
1. What is your trust level in the code?
- Untrusted (third-party, generated, public repos) → requires L1 S:4 (microVM/unikernel)
- Own reviewed code → L1 S:2–3 acceptable
2. Does the agent interact with remote resources (cloud, databases, APIs)?
- Yes with read-write access → L1 does NOT protect remote resources. Need L4 (block destructive endpoints), L6 (block destructive actions semantically), or L5 (scoped read-only credentials)
- No → less critical, but still consider L4 for exfiltration prevention
3. Does the agent need network access?
- No → disable it (L4 S:4). Eliminates T1 and T4 in one step
- Yes → use allowlists (S:2–3) and invest in L5 and L7
4. Does the agent handle credentials?
- Ideally: never present (L5 S:4 via proxy or ephemeral tokens)
- Rule: never pass raw credentials if avoidable
5. Can you tolerate human-in-the-loop?
- No (autonomous agents) → need L6 S:2+ (policy engine)
- Yes → HITL adds a governance layer, but it is not scored as sandbox enforcement (the human provides governance, not the sandbox). Evaluate what the sandbox itself enforces at L6 independently
6. Do you need audit trails?
- Compliance or team use → L7 S:2+ with structured logs
- Regulatory → consider cryptographic audit chains
7. Ephemeral or persistent sandbox?
- Structural isolation (L1 >= 4: microVM/unikernel) → eliminates process-level persistence, but file-level persistence through writable volume mounts (git hooks, CI configs, shell init files) is NOT eliminated by L1 alone. Full T5 coverage requires L3 (controlling what paths are writable) and L6 (blocking persistence-creating actions) in addition to L1
- Weaker isolation (L1 < 4) → must explicitly address T5 via L3 and L6
8. What are your portability constraints?
- No infrastructure → process wrappers (no infra tag needed)
- Docker available → container wrappers, sidecars (
docker) - Cloud/K8s → full platform range (
cloud,k8s)
Composition Rules
The stacking rule: take the maximum strength at each layer.
Product A 4/4/4/0/2/-/2
Product B -/-/1/2/3/2/3
──────────────────────────────
Composed 4/4/4/2/3/2/3 ← max at each position0and—are both treated as "no coverage" for composition- Any non-zero score from another product fills the gap
- The composed stack is only as weak as its weakest uncovered layer
Composition Patterns
Use these archetypes to guide recommendations:
Pattern 1: Platform + Policy Layer — "Strong box, smart guardrails"
- Cloud sandbox platform provides L1–L3 (hardware-isolated compute, resource limits, ephemeral filesystem)
- Policy tool layers L4–L7 (network policies, credential management, action governance, observability)
- Result: full-stack coverage
Pattern 2: OS-Level Wrapper + Policy Sidecar — "Lightweight local protection"
- Kernel-level process wrapper provides L1/L3/L5 with irreversible enforcement
- Policy sidecar adds L4/L6/L7
- Result: full stack minus L2. Zero cost, no cloud dependency
Pattern 3: Built-in Sandbox + Cloud Fallback — "Local for speed, cloud for untrusted"
- Agent's built-in sandbox handles trusted interactive work (L1/L3/L4)
- Untrusted operations offload to a cloud platform with full L1–L3
Pattern 4: K8s-Native Stack — "Enterprise, self-hosted, policy-driven"
- Kubernetes sandbox CRD (L1/L2/L3) + NetworkPolicy (L4) + policy engine (L6) + secrets manager (L5) + monitoring stack (L7)
- Result: full stack, self-hosted
Anti-Pattern: Platform Without Network Controls
A cloud platform provides excellent L1/L2/L3, but the user deploys with default (unrestricted) network access and passes cloud credentials as env vars. The agent runs in a perfect microVM but can still exfiltrate credentials, delete cloud resources, and reach internal services.
This is the most common configuration in practice and a false sense of security. Strong L1 is necessary but not sufficient. Look at the fingerprint: if L4 is 0, you have a problem.
Recommendation Procedure
- Walk through the 8-question checklist with the user
- Map answers to minimum layer requirements (which layers need what minimum S score)
- Identify candidate products that cover the required layers at the required strength
- If no single product covers all requirements → identify complementary products using composition patterns
- Compute the composed fingerprint (max at each layer)
- Verify the composed fingerprint has no
0or—at layers the user cares about - Flag any remaining gaps and the threats they leave open
- Present the recommendation with the composed fingerprint and threat coverage
Part 9: ast-probe — Automated Verification Tool
The taxonomy includes ast-probe, a portable Go binary that probes all 7 defense layers from inside a sandbox and produces a machine-readable JSON score card. It replaces documentation-based guessing with verified, evidence-based scores.
When to Use
- Verifying product claims — Run the probe inside a sandbox to check whether documented scores match reality.
- Comparing environments — Run the same binary across bare metal, containers, VMs, and cloud sandboxes to produce comparable fingerprints.
- Regression testing — Include in CI to detect sandbox configuration drift.
- Composition validation — After stacking two products, run the probe inside the composed environment to verify the combined fingerprint.
How to Use
# Download the binary for the target platform
curl -LO https://github.com/kajogo777/the-agent-sandbox-taxonomy/releases/latest/download/ast-probe-linux-amd64
chmod +x ast-probe-linux-amd64
# Copy into the sandbox and run
./ast-probe-linux-amd64 --product "my-sandbox" --out report.json
# Compare fingerprints
jq .fingerprint report.jsonThe probe outputs JSON with per-layer test results, an AST fingerprint, and mechanically-derived threat coverage — the same format and scoring rules as the taxonomy itself.
Safety Guarantees
The probe is designed to be safe to run anywhere, including production systems:
- Never performs destructive operations. Tests use permission checks (
OpenFilethenClose, zero bytes written), not actual destruction (unlink,rmdir). - Creates only probe-owned temp files (
.ast-probe-*prefix) and immediately removes them. - Network tests connect to well-known public endpoints only.
- Syscall tests use safe variants (stubs on macOS, least-dangerous flags on Linux).
- Never exfiltrates credentials — scans env var names and lengths, never logs values.
Known Quirks
L1 VM detection: Lightweight VMs without DMI/CPUID signatures (OrbStack, Lima, WSL2) are detected via /proc/version kernel string. VMs with vanilla kernels will be scored as bare process (S:0) — a false negative.
L2 macOS: Resource limit probing is Linux-only (cgroups). On macOS, L2 returns S:-1 (not assessed).
L3 tmpfs awareness: Writes to tmpfs mounts (e.g., /tmp in a container with --tmpfs) are not counted as filesystem escapes — they're bounded scratch space, not persistent host paths.
L4 macOS firewall: First run on macOS triggers a firewall dialog for the unsigned binary. Both allow and deny produce useful data.
L5 empty environment: If no secrets or credential files exist, L5 returns S:-1 (cannot assess) rather than S:4. Positive evidence of filtering/proxying is required for a high score.
L6 permission vs MAC: The probe tests whether paths are writable via open(), but cannot detect MAC policies (AppArmor, SELinux) that permit open() but deny write().
L7 inside-out: The probe can only detect observability infrastructure visible from inside the sandbox. External monitoring (Falco on host, cloud audit logs) is invisible to the probe.
Threat scores are mechanical: Derived from layer scores using the AST threshold rules. If a layer score is wrong, threat scores will be wrong too.
Interpreting Results
The probe produces verified evidence level scores — the highest confidence tier in the taxonomy. When a probe score disagrees with a documentation-based score in products.yaml, the probe score should be preferred (assuming the probe was run in the product's intended configuration).
Key patterns to look for:
- L1:0 in a VM → VM detection failed. Check
L1.testsforvm_detectionresults. - L5:-1 → No credentials to test. This is common in clean containers. Not a security finding.
- L6 high score in a writable container → The probe tests protected system paths. If the container runs as non-root with a read-only root filesystem, most L6 tests will be blocked even without explicit governance.
- L4:1 with proxy env vars → Cooperative enforcement. The probe verified that raw TCP bypasses the proxy — this is correct S:1 scoring.
Quick Reference Card
LAYERS (bottom-up) THREATS
L1 Compute Isolation T1 Data Exfiltration
L2 Resource Limits T2 Supply Chain Compromise
L3 Filesystem Boundary T3 Destructive Operations (local + remote)
L4 Network Boundary T4 Lateral Movement
L5 Credential Management T5 Persistence
L6 Action Governance T6 Privilege Escalation
L7 Observability & Audit T7 Denial of Service
STRENGTH (S: 0–4) GRANULARITY (G: 0–3)
0 = None 0 = None
1 = Cooperative (bypassable) 1 = Binary (on/off)
2 = Software-enforced 2 = Allowlist/Blocklist
3 = Kernel-enforced (irreversible)3 = Per-resource policy
4 = Structural (doesn't exist)
FINGERPRINT: L1:S/L2:S/L3:S/L4:S/L5:S/L6:S/L7:S
SCORE CARD: S.G per layer (e.g., 4.1 = structural, binary)
COMPOSITION: take max(S) at each layer
CRITICAL: if L4 = 0 → exfiltration risk
if L6 = 0 and agent has remote access → remote destruction risk
cooperative enforcement → always S:1