Guardrails: input/output filtering, sandboxing, and scoping.
"Add guardrails" is the most over-promised phrase in agent security. Guardrails are real and necessary, but they only help when you know which kind does which job and which ones fail open. This essay separates the four guardrail families (input, output, execution sandboxing, and capability scoping) and shows how to layer them so the strong ones cover the weak ones' gaps.
A taxonomy that prevents wishful thinking
Most guardrail disappointment comes from category confusion: teams deploy a content classifier (a probabilistic detector) and expect it to behave like a sandbox (a hard boundary). They are different tools with different failure modes.
- Probabilistic guardrails — input/output classifiers, LLM-based judges. They estimate "is this bad?" They fail open: an attacker who crafts around the model gets through, silently.
- Deterministic guardrails — schema validation, allowlists, sandboxes, capability scoping. They enforce "this is structurally not permitted." They fail closed: the disallowed action does not happen even against a novel attack.
Design rule: put your trust in deterministic guardrails and your visibility in probabilistic ones. A classifier is a smoke detector, not a fire door. Build the fire door.
Input guardrails
Input guardrails inspect content before it reaches the model: user messages, retrieved chunks, tool results. Useful jobs they can do:
- Detect and flag known injection patterns and jailbreak shapes (raise attacker cost, gather signal).
- Strip or neutralize structurally dangerous content: HTML comments, hidden Unicode, zero-width characters, embedded markup in retrieved text.
- Enforce hard input limits: length, allowed languages/encodings, schema for structured inputs, source allowlists for retrieval.
- Tag provenance: mark each span as trusted (operator) or untrusted (retrieved/tool) so downstream layers can treat them differently.
The deterministic parts (stripping hidden markup, enforcing schemas, provenance tagging) are durable. The classifier part is best-effort. Do not let "we have an input filter" stand in for the structural controls below.
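The deterministic parts can be sketched in a few lines. What follows is a minimal illustration, not a complete sanitizer: the `Span` type, the `MAX_CHARS` limit, and the `sanitize_untrusted` name are assumptions made for this example, and the list of hidden code points is deliberately short.

```python
import re
from dataclasses import dataclass

# Zero-width and bidi-control code points often used to hide text.
# Illustrative, not exhaustive.
_HIDDEN_CHARS = frozenset({
    "\u200b", "\u200c", "\u200d", "\u2060", "\ufeff",  # zero-width characters
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # bidi embeddings/overrides
    "\u2066", "\u2067", "\u2068", "\u2069",            # bidi isolates
})
_HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
MAX_CHARS = 8000  # hypothetical hard length limit


@dataclass(frozen=True)
class Span:
    """A piece of context with its provenance attached."""
    text: str
    trusted: bool  # True only for operator-authored content
    source: str


def _is_hidden(ch: str) -> bool:
    # Also drop the Unicode "tag" block (U+E0000..U+E007F), which renders invisibly.
    return ch in _HIDDEN_CHARS or 0xE0000 <= ord(ch) <= 0xE007F


def sanitize_untrusted(raw: str, source: str) -> Span:
    """Enforce the length limit, strip hidden markup, and tag as untrusted."""
    if len(raw) > MAX_CHARS:
        raise ValueError("input exceeds hard length limit")
    text = _HTML_COMMENT.sub("", raw)
    text = "".join(ch for ch in text if not _is_hidden(ch))
    return Span(text=text, trusted=False, source=source)
```

Stripping zero-width joiners can alter legitimate text in some scripts and emoji sequences, so a real deployment would likely tune this set per content type. The point of the sketch is that none of these steps depends on a model's judgment.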
Output guardrails
Output guardrails inspect what the model produced before anything acts on it. There are two distinct kinds, and the second matters far more for agents.
Content output checks
Scan generated text for leaked secrets, PII, disallowed content, or exfiltration markup (e.g. external image URLs encoding context). Strip or block before the response reaches a user or a renderer.
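As a rough sketch of such a check, the function below drops Markdown images whose host is not on an allowlist and redacts one secret shape. The allowlisted host, the `scrub_output` name, and the single secret pattern (strings shaped like an AWS access key ID) are assumptions for illustration. A real scanner would cover many more patterns and output formats.

```python
import re
from urllib.parse import urlparse

ALLOWED_IMAGE_HOSTS = {"assets.example.com"}  # hypothetical allowlist

# Markdown image syntax: ![alt](url "optional title")
_MD_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")
# One illustrative secret shape: "AKIA" followed by 16 uppercase alphanumerics.
_SECRET = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def scrub_output(text: str) -> str:
    """Remove non-allowlisted images and redact secret-shaped strings."""

    def _replace_image(match: re.Match) -> str:
        host = urlparse(match.group(2)).hostname or ""
        # Relative URLs have no host and are removed as well.
        return match.group(0) if host in ALLOWED_IMAGE_HOSTS else "[image removed]"

    text = _MD_IMAGE.sub(_replace_image, text)
    return _SECRET.sub("[redacted]", text)
```

The image rule is deterministic (an allowlist), while the secret pattern is best-effort detection. Only the first should be counted on to hold against a novel payload.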
Action output checks (the critical one)
Before any tool call executes, validate it deterministically: is this tool allowed in this state? Do the arguments match a strict schema? Is the destination on the allowlist? Is the operation within rate/volume limits? This is where you stop the agent from acting on a successful injection, and it must not be implemented as another LLM the same attacker prompt can talk past.
    # Action guardrail: deterministic check before execution.
    # deny(), allow(), and schema_valid() are helpers assumed to be defined elsewhere.
    def allow_tool_call(call, state):
        if call.name not in state.allowed_tools:
            return deny("tool not permitted in this state")
        if not schema_valid(call.name, call.args):
            return deny("argument schema violation")
        if call.name == "send" and call.args.to not in state.allowlist:
            return deny("destination not allowlisted")
        return allow()
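The check above does not cover the rate/volume question. One way to add it, sketched here under assumptions, is a sliding-window counter kept per tool or per destination. The class name and the injectable `clock` parameter are inventions for this example. The clock is injectable only so the behavior can be tested without sleeping.

```python
import time
from collections import deque


class SlidingWindowLimit:
    """Allow at most max_calls within any window_seconds interval."""

    def __init__(self, max_calls: int, window_seconds: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock
        self._stamps = deque()  # timestamps of calls still inside the window

    def try_acquire(self) -> bool:
        now = self._clock()
        # Forget calls that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_calls:
            return False  # over the limit: deny, do not queue
        self._stamps.append(now)
        return True
```

A guardrail would call `try_acquire()` just before allowing the tool call and deny when it returns `False`. An injected instruction cannot argue with a counter.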
Execution sandboxing
Some agents run code or shell commands. For these, the guardrail is not a classifier on the command string — it is the environment the command runs in. Assume the command is hostile and design the box so a hostile command is contained.
- Isolation: a disposable container/microVM per task, no host mounts, dropped privileges, killed after use.
- Network egress default-deny: the sandbox cannot reach arbitrary hosts; outbound is allowlisted or absent. This neutralizes most exfiltration even if code execution is achieved.
- Resource caps: CPU, memory, wall-clock, and disk limits to contain resource abuse and runaway loops.
- No ambient credentials: the sandbox holds no secrets or cloud metadata access by default; capabilities are injected narrowly and per-task.
A sandbox protects the host. It does not protect the data you deliberately handed into the sandbox, and it does not stop misuse of tools the agent legitimately calls from inside it. Sandboxing is necessary, not sufficient — pair it with capability scoping.
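To make the list above concrete, here is one possible mapping onto `docker run` flags, written as a function that only builds the argument list. The function name, the specific limits, and the choice of Docker are assumptions for illustration. A microVM or another runtime would express the same properties differently.

```python
from typing import List


def build_sandbox_argv(image: str, command: List[str]) -> List[str]:
    """Build a `docker run` invocation for a disposable, locked-down container."""
    return [
        "docker", "run",
        "--rm",                                 # disposable: removed after exit
        "--network", "none",                    # egress default-deny
        "--read-only",                          # read-only root filesystem
        "--tmpfs", "/tmp:rw,size=64m",          # small scratch space only
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--user", "65534:65534",                # run as an unprivileged UID:GID
        "--memory", "256m",                     # memory cap
        "--cpus", "1",                          # CPU cap
        "--pids-limit", "64",                   # contain fork bombs
        # No -v mounts and no -e variables: no host paths, no ambient credentials.
        image, *command,
    ]
```

The wall-clock limit is not covered by these flags. Killing the `docker` client process does not necessarily stop the container, so a time limit likely needs to be enforced inside the container or by stopping the container explicitly.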
Capability scoping — the guardrail that subsumes the rest
The most effective guardrail is the action you never made possible. Capability scoping means each agent (and each sub-agent) is granted exactly the tools, data scopes, and credentials its task requires, and nothing more — enforced at the boundary, not requested in the prompt.
- Per-task, time-bound, narrowly-scoped credentials instead of long-lived broad ones.
- Read tools that are parameterized and result-limited rather than general query interfaces.
- Trust separation: a tool-less reader for untrusted content; a privileged actor that never ingests raw untrusted text.
- State machines: tools available only in states where they are valid, so an injected instruction cannot summon an out-of-phase capability.
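Two of the items above, time-bound grants and state-gated tools, fit in a short sketch. The state names, tool names, and the `Grant` and `issue_grant` identifiers are hypothetical. The idea is only that the set of permitted tools is computed from the task state by code, never from what the prompt asks for.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Hypothetical workflow: which tools exist in which state.
TOOLS_BY_STATE = {
    "research": frozenset({"search_docs", "read_doc"}),
    "draft": frozenset({"read_doc"}),
    "send": frozenset({"send_email"}),
}


@dataclass(frozen=True)
class Grant:
    """A per-task, time-bound set of permitted tools."""
    task_id: str
    scopes: frozenset
    expires_at: float

    def permits(self, scope: str, now: float) -> bool:
        # Both conditions must hold: not expired, and explicitly granted.
        return now < self.expires_at and scope in self.scopes


def issue_grant(task_id: str, state: str, ttl_seconds: float = 300.0,
                now: Optional[float] = None) -> Grant:
    """Issue a grant limited to the tools valid in the given state."""
    now = time.time() if now is None else now
    return Grant(task_id, TOOLS_BY_STATE[state], now + ttl_seconds)
```

An agent in the "research" state holding such a grant cannot call `send_email` however the request is phrased, because that tool was never granted. The grant also expires on its own.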
An independent judge model is worth having as a probabilistic outer layer: it catches some failures and adds signal. It should not be the control that decides whether a destructive tool runs, because it shares the prompt-injection failure mode of the agent it is judging, and a payload that fools one can be crafted to fool both. The decision to execute a high-impact action must rest on deterministic code (schema, allowlist, state), not on a second model's opinion.
Do not try to build an input filter that catches every malicious input. Input guardrails should be allowlist-shaped where the input is structured (enforce a schema, a source list, an encoding) and best-effort-detection-shaped where it is free text (flag, log, raise cost). Accept that the free-text classifier will miss things and design so that a miss is survivable. That survivability comes from the deterministic inner layers, not from a perfect filter.