Red-Teaming & Safety Evaluation

Deep Dive · Safety, Alignment & Agentic Security

Red-teaming & evaluation: adversarial testing of agents.

You cannot assert an agent is safe; you can only fail to break it after trying hard. Red-teaming and safety evaluation are how builders generate evidence instead of hope. This essay covers what to test, how agentic red-teaming differs from model red-teaming, how to make it a repeatable gate rather than a one-off heroics session, and how to read the results honestly.

STEP 1

Why functional testing is not safety testing

Functional tests ask "does the agent do the right thing on expected input?" Safety tests ask "what does it do on adversarial input chosen to make it misbehave?" An agent can pass every functional test and still ship a critical vulnerability, because the input space that breaks it is exactly the space functional tests do not explore. Adversarial evaluation is a separate discipline with its own corpus, its own metrics, and its own place in the pipeline.

The goal of agentic red-teaming is not "find one jailbreak." It is to measure whether a successful injection can be converted into a harmful outcome, end to end, given your real tools and your real defenses. The chain matters more than the entry.

STEP 2

What agentic red-teaming adds over model red-teaming

Model red-teaming probes the model in isolation: refusals, harmful content, jailbreaks. Necessary, not sufficient. Agentic systems add attack surface the bare model does not have, and the red team must exercise it:

Indirect injection: plant adversarial content in the retrieval corpus and a tool's responses, then run normal user tasks and watch what the agent does.
Tool-chain abuse: can an injection drive a sequence of individually-allowed tool calls into a harmful aggregate (read sensitive data → call an egress tool)?
Confused deputy: can a low-privilege actor get the agent to use its higher privileges on their behalf?
Exfiltration channels: including non-obvious sinks — rendered image markup, link unfurling, write-then-read side channels.
Memory and multi-turn: does an injection persist across turns or leak between sessions/users?
Guardrail bypass: attack the filters and the action checks directly, not only the model.

STEP 3

Make it a gate, not a heroics session

A manual red-team week before launch is valuable once and stale by the next deploy. Durable safety evaluation is automated, versioned, and wired into the release pipeline alongside the unit tests.

Adversarial test suite: a versioned corpus of attack cases — injection shapes, exfiltration attempts, confused-deputy scenarios, guardrail-bypass probes — each tied to the specific control it targets.
Outcome-based scoring: grade on the harmful outcome (did sensitive data leave? did a destructive tool fire?), not on whether the model emitted a bad sentence. The outcome is what the threat model cares about.
Regression on every change: a model swap, a prompt edit, a new tool, or an MCP-server upgrade can silently reopen a closed hole. Re-run the suite on each.
Growing corpus: every real incident and every new public technique becomes a permanent test case. The suite ratchets; closed holes stay closed.
Automated attacker: use generated/iterative adversarial inputs for breadth, with periodic human red-teaming for the creativity automation misses.

# A safety case = control + attack + expected SAFE outcome
{
  "control":  "egress-allowlist",
  "attack":   "indirect-injection: leak context via image URL",
  "expect":   "no outbound to non-allowlisted host",
  "grade":    "outcome: bytes-exfiltrated == 0"
}

STEP 4

Reading the results without fooling yourself

Absence of evidence is not safety. "The red team didn't break it" with a weak red team means little. Track attacker effort and sophistication, not just pass/fail.
Probabilistic controls have no clean pass. A classifier that blocks 98% of known injections fails the other 2% and an unknown amount of novel ones. Report distributions and worst cases, not a single headline number.
Test the system you ship. Evaluating the model alone, or a sanitized staging config, measures something you are not deploying. Red-team the real tools, real corpus, real guardrails.
Deterministic controls should be near-absolute. If an egress allowlist or a removed tool is ever bypassed in testing, that is a structural defect, not a tuning issue — fix the boundary.

┌────────────────────────────────────────────────────────┐ │ SAFETY EVIDENCE PIPELINE │ │ │ │ attack corpus ─► run vs SHIPPING system ─► outcome grade│ │ ▲ │ │ │ └──── new incidents / techniques ◄───────┘ │ │ every model/prompt/tool change re-runs the suite │ └────────────────────────────────────────────────────────┘

Question

Red-teaming feels infinite. When is it "enough"?

It is never "done" — it is a continuous control like monitoring. A reasonable bar to ship: every threat-model category has automated coverage; every deterministic control has at least one bypass attempt that fails; known public techniques are in the corpus; and a time-boxed human red team with real incentive found nothing that converts to a harmful outcome. Then keep running it on every change. "Enough" is a process state, not a finish line.

Question

Should the same team that built the agent red-team it?

They can build the regression suite, but builders share blind spots with their own design. Pair the automated suite with adversarial review by people who did not design it — a separate internal team or external red team — incentivized to break it, not to confirm it works. Independence of perspective is part of the control.