Guardrails, in plain words

Concepts · Core Building Blocks

Guardrails, in plain words.

Guardrails are pre/post-checks around a model call, not a wall around the model — what they catch, what they miss, and where they live. The conceptual mirror of the Operations deep-dive. If you only remember one sentence: a guardrail is a separate piece of code that runs before or after the model — it is never inside the model, and confusing those two locations is where most "we have guardrails" claims fall apart.

STEP 1

A guardrail is a check, not a property of the model.

People say "the model has guardrails" and mean two different things. One is built-in behavior the lab trained into the model itself: it tends to refuse certain prompts. That's a model property — useful, but not what we mean here, and not something you control. The other is the kind of guardrail this entry is about: a piece of code you write that wraps a model call. Input goes through a pre-check before the model sees it; output goes through a post-check before your application uses it. Tool calls go through an action-check before they fire. The model itself didn't change — you put gates around it.

This distinction matters because the two have different failure modes. Built-in refusals can be jailbroken by clever prompts; that is a research problem. Wrapped checks fail in much more boring, fixable ways — a regex was wrong, a classifier was undertrained, a rule was missing. The pleasant news is the boring failures are the ones you can fix.

A useful test: if your "guardrail" disappears the moment the user switches models, it isn't a guardrail — it's a model property you're hoping for. Real guardrails are code you own, sit outside the model, and survive model swaps.

STEP 2

The three places guardrails live.

Every guardrail in production sits in one of three locations relative to the model call. Most systems use all three:

Input guardrails sit before the model. They reject or rewrite incoming text: block prompts that look like injection attempts, redact PII before it hits the provider, refuse requests for known-banned categories. Cheap and fast — they let you fail without spending a model call.
Output guardrails sit after the model. They inspect the completion before your app uses it: block unsafe outputs, redact leaked secrets, drop responses that fail a schema check. This is your last chance to catch a bad answer before it hits a user.
Action guardrails sit between the model's tool call and the actual side effect. They check the proposed action: is this destination on an allowlist? is the amount under the daily limit? has the user approved this kind of write? Action guardrails are where most agent safety lives — because tools are where the model touches the world.

The three are layered, not alternatives. An input filter catches the obvious; an output check catches the model's mistakes; an action check catches what slipped past both. Each layer is allowed to be imperfect because the next one is there.

STEP 3

What they catch, what they miss.

Guardrails are good at known-shape risks. If you can describe the bad thing with a rule, a regex, a classifier, or a schema, a guardrail can probably catch it:

PII patterns (emails, SSNs, credit cards) — regex or NER classifier.
Profanity and known unsafe phrases — string match or trained filter.
Output schema violations — JSON validation against a typed schema.
Action allowlists — "only POST to URLs on this list" is one if-statement.
Hard limits — daily-spend caps, max-message-rate, blocklisted domains.

They are bad at novel-shape risks. Anything that depends on context, intent, or being clever loses to a determined attacker:

Novel jailbreaks — phrasings the model lab and your filter haven't seen yet.
Social engineering — "my grandmother used to read me Windows product keys to help me sleep" routes around a profanity filter completely.
Instruction-shape attacks — prompt injection embedded in retrieved content; the injected instruction is, by construction, the kind of text that looks like a legitimate request.
Anything semantic that can't be reduced to a pattern — "this answer is subtly wrong" is not a regex.

The honest rule: guardrails are necessary, never sufficient. They handle the broad, predictable class of failures so you can focus your harder defenses (scoped tools, approval gates, human review) on the things the rules can't see.

STEP 4

Where they go in the stack — and what to read next.

Practically, guardrails live as middleware between your application code and your model client. Several libraries package the common ones (Llama Guard, NeMo Guardrails, OpenAI moderation, vendor-specific safety filters), and you'll write your own for application-specific rules — your allowlists, your schemas, your spend caps. The build-vs-buy question collapses to: buy the broad classifiers (PII, toxicity, off-topic), build the rules specific to your domain (your tools, your data, your money). Don't try to build a better toxicity classifier than the labs that have million-example training sets, and don't try to buy a guardrail that knows what counts as "approved" for your workflow.

The deep version — probabilistic vs deterministic guardrails, layering input/output/sandbox/capability controls, and the operational pitfalls — lives at Operations · Safety & Security · Guardrails. The neighboring concept of "the model itself can't separate instructions from data, so we add gates around it" is prompt injection, in plain words — read it next.