The Agentic Threat Model

Deep Dive · Safety, Alignment & Agentic Security

The agentic threat model: why autonomy widens the attack surface.

A chatbot answers a question and stops. An agent reads, decides, calls tools, and acts — often in a loop, often unattended. Every capability you grant an agent is also a capability an attacker can borrow. This essay builds the threat model from first principles so the later defenses have something concrete to defend.

STEP 1

From "answer" to "act": the capability gap

The security properties of a plain language model are modest because its only output is text. The worst a compromised prompt can do is make it say something undesirable. That changes completely once you wrap the model in an agent loop with tools. Now the model's text is interpreted as intent: a function call, a shell command, an HTTP request, a database write. The blast radius is no longer "bad sentence" — it is "whatever your tools can do."

Three properties of agents create the gap that classic application-security thinking does not cover:

Untrusted input becomes control flow. In a normal program, data and code are separated. In an agent, retrieved documents, tool outputs, and user messages all flow into the same context window the model uses to decide its next action. There is no architectural boundary between "data the agent processes" and "instructions the agent follows."
Autonomy removes the human checkpoint. A human reviewing each step is a powerful (if slow) safety control. Multi-step autonomous loops are valuable precisely because they remove that human — which also removes the control.
Composition multiplies trust. An agent that uses three tools, two retrieval sources, and one downstream MCP server inherits the weakest trust assumption of all of them. Attack surface composes; it does not average.

The single sentence to internalize: an agent treats all text in its context as potentially authoritative, and an attacker only needs to get text into that context. Every vector below is a way to do exactly that.

STEP 2

The four ways text gets into an agent's context

Mapping the threat is mostly about enumerating every channel that can deposit attacker-influenced text where the model will read it. There are four, and they need different defenses.

1. Direct input

The user is the attacker and types adversarial text. This is the classic case and, increasingly, the least common one in production incidents — because it requires the attacker to be a user, and because modern models resist crude versions.

2. Retrieved content (indirect)

The attacker plants text in something the agent will fetch — a web page, an uploaded document, a wiki entry, a support ticket. The attacker never talks to the agent. They just need to influence one source the agent trusts. This is the dominant production vector once retrieval is involved.

3. Tool and API results

The attacker controls or influences a tool the agent calls — a third-party API field, a scraped page, a connected MCP server's response. The agent typically treats tool output as fully trusted, so adversarial text in a JSON field is read as instruction.

4. Persisted history and memory

An injection that succeeds once can be written into the agent's memory or conversation log, then fire on a later turn — potentially a different user's turn if storage is shared. By then it looks like the agent's own prior reasoning.

┌──────────────────────────────────────────────────────────┐ │ ATTACKER-INFLUENCED TEXT → SHARED CONTEXT → ACTION │ │ │ │ direct input ─┐ │ │ retrieved doc ─┤ │ │ tool / API ─┼──► one context window ──► tool call │ │ memory/history ─┘ (no trust labels) │ └──────────────────────────────────────────────────────────┘

STEP 3

Modelling impact: what an attacker actually wants

Vectors are how. Impact is why. For agentic systems the realistic objectives cluster into four categories, and a useful threat model names them explicitly so you can rank mitigations against the ones that matter for your deployment.

Data exfiltration. Trick the agent into reading sensitive data it can access and routing it somewhere the attacker controls — an email, an outbound request, a rendered link.
Unauthorized action. Use the agent's tools to do something the attacker could not do directly — issue a refund, delete records, open a pull request, send a message as the user.
Privilege escalation via confused deputy. The agent has more authority than the attacker. The attack consists of getting the agent to wield that authority on the attacker's behalf.
Resource abuse and denial. Loop the agent, exhaust its budget, poison its memory, or drive it into expensive or destructive tool calls.

Write these four objectives next to your agent's actual tool list. For each tool, ask: "if an attacker controlled the inputs to this call, which objective could they achieve?" That single exercise produces a better mitigation backlog than any generic checklist.

STEP 4

Why this is not solved by a better model

It is tempting to assume that a more capable, better-aligned model removes the problem. It reduces some failure rates but does not change the structural issue: the model has no reliable, tamper-proof signal telling it which span of its context is a trusted instruction and which is attacker-controlled data. They arrive as the same token stream. Asking the model to "just ignore injected instructions" is asking it to solve, by judgment alone, a problem the architecture refuses to encode.

This is the central design principle for everything that follows: security controls must live in your code and your architecture, not in the model's good behavior. The model is one defensive layer among several, and the least reliable one. Treat it accordingly.

Question

Isn't this just AppSec with extra steps? Why a new threat model?

It reuses AppSec principles — least privilege, input distrust, defense in depth — but the primitives differ. Classic AppSec assumes you can separate code from data and validate inputs against a grammar. An agent deliberately erases the code/data boundary and accepts open-ended natural language. The principles transfer; the naive techniques (regex an injection away, sanitize the input) mostly do not. The threat model exists to make that gap explicit before you ship.

Question

Where do I start if I already have an agent in production?

Inventory the four input channels and the tool list. Most teams have defended channel 1 (direct input) and ignored 2–4, while granting tools far broader than the task needs. Closing the gap between "tools the agent has" and "tools this task requires" is usually the highest-leverage first move — it shrinks every impact category at once.