0.2
Part 0 / Foundations · The discipline, focused on what matters for agents

Prompt engineering for agents: opinionated, structural, measured.

Prompt engineering is one of the most-written-about topics in the field, and most of what's written is general ChatGPT advice that doesn't transfer to agent work. This chapter teaches the agent-specific practice: what your system prompt is actually doing (defining the agent's identity, encoding workflow discipline, setting refusal posture), how to use examples without calcifying behavior, when structural formatting (XML-style tags) earns its place, and the iteration discipline that turns prompts from "vibes" into measurable engineering. The chapter assumes you've absorbed dozens of system prompts already from earlier chapters — its job isn't to repeat that material but to surface the underlying principles so you can write good agent prompts deliberately.

STEP 1

The three things a system prompt has to do.

Most advice on prompt engineering treats "the prompt" as one object — write it well, the model performs better. For agents this framing is too coarse. An agent's system prompt is doing three distinct kinds of work simultaneously, and each kind has its own discipline. Confusing them — trying to encode workflow in identity, or refusal in examples — produces prompts that are long, brittle, and underperform their length.

The three roles, named:

┌──────────────────────────────────────────────────────────────────┐ │ WHAT AN AGENT SYSTEM PROMPT DOES │ │ │ │ Role 1: IDENTITY / FRAMING │ │ Who is the agent? What's its purpose? What domain does it work │ │ in? This sets the cognitive frame for everything the model │ │ produces. │ │ │ │ Role 2: WORKFLOW DISCIPLINE │ │ How should the agent approach its work? What's the order of │ │ operations? When does it search, when does it verify, when │ │ does it stop? This is the operating procedure. │ │ │ │ Role 3: SAFETY / REFUSAL POSTURE │ │ What does the agent refuse? When does it escalate? What │ │ guardrails apply? This is the boundary of acceptable behavior. │ └──────────────────────────────────────────────────────────────────┘

The trap: most "improvements" to a system prompt accidentally pile new content into the wrong role. A regression in workflow gets fixed by adding rules to the identity section. A missed refusal gets fixed by adding workflow steps. Over months, the prompt becomes a 3000-token mess where no role is clean. The discipline is to keep the roles distinct and edit each on its own terms.

Role 1: identity and framing

The shortest section in the prompt, and the most powerful. The model behaves dramatically differently depending on what frame it adopts. "You are an AI assistant" and "You are a senior backend engineer who has worked on Postgres internals for ten years" produce different behavior on the same Postgres question, even with no other instruction changes.

What belongs in identity:

  • The role. What kind of agent. "Customer support agent for a SaaS billing system." "Research assistant specializing in semiconductor industry analysis." "Code reviewer for our open-source Rust project."
  • The domain. What field the agent operates in, with enough specificity that the model knows which conventions apply. "Customer support for a B2B SaaS where customers are enterprise admins, not end-users."
  • The audience. Who the agent's outputs serve. "Responses will be read by paying enterprise customers; tone should be professional, not casual."

What doesn't belong here, but often shows up by mistake: workflow steps ("always search the docs first"), refusal rules ("never share API keys"), output formatting ("respond in JSON"). These are the next two roles' jobs. Mixing them in identity bloats the section and obscures the framing.

A well-structured identity section is typically 50–200 tokens. If yours is longer, you're almost certainly conflating roles. Concrete example:

# Identity section — clean

You are a research analyst at a financial-services firm.  You produce
analytical briefings on companies, industries, and market events for
internal use by investment professionals.  Your readers are sophisticated
financial analysts who care about specific evidence and verifiable claims,
not high-level summaries.

That's about 60 tokens. It tells the model: who you are (analyst), what context (financial services), what you produce (analytical briefings), for whom (sophisticated analysts), with what bar (specific evidence, verifiable claims). Every word does work; nothing is decorative.

Role 2: workflow discipline

The bulk of a well-designed agent system prompt. This is where the agent's operating procedure lives — how it approaches its work, in what order, with what verification.

What belongs in workflow:

  • Sequencing rules. "Before answering, verify the user's account status using get_account." "After making any change, run the test suite." The model learns the order of operations.
  • Verification expectations. "If you cite a fact, the citation must reference a source you've actually retrieved." "Before declaring the task complete, confirm that the test suite passes."
  • Stop conditions. "When you've answered the user's question with citations, stop. Do not continue to elaborate." "If the answer requires more than 5 tool calls, escalate to human review."
  • Tool-use guidance. "Use search_docs for product-specific questions and web_search only for industry-wide questions." Same-toolkit ambiguity resolution.

The pattern: imperatives describing what the agent does in different situations. Each rule should be specific, actionable, and verifiable. Vague workflow rules ("be thorough") produce nothing measurable; specific ones ("if the user's question references a feature, look it up before answering") shift behavior.

This is also where chapter 4.1's "verification ladder" gets encoded for code agents, where chapter 4.3's effort-scaling rules get encoded for research agents, where chapter 4.2's verification-by-observation gets encoded for computer use. The chapters that taught those patterns were teaching what to put in workflow.

# Workflow section — sketch for a research analyst agent

When the user asks for analysis on a company, industry, or event:

1.  Decide whether the question can be answered from your training data
    alone, or whether it requires current information.  Anything about
    quarterly results, recent news, ongoing events, or current market
    state requires fresh sources.

2.  If fresh sources are required: search for them using the available
    tools.  Prefer primary sources (SEC filings, company press releases,
    earnings call transcripts) over secondary aggregators.

3.  Every substantive claim in your output must include a citation
    to a source you've actually retrieved.  Do not produce claims from
    training data without sourcing them; if you can't find a source,
    omit the claim.

4.  Structure outputs as: brief summary (2-3 sentences) → key findings
    with citations → caveats and gaps.  Do not add executive-style
    formatting (headings, bullet lists with bold) unless the user
    explicitly requested it.

5.  When you've answered the user's question with appropriate citations,
    stop.  Do not continue with related topics they didn't ask about.

This is the section that gets long. 5-15 numbered rules is typical; more than 20 is usually a sign the workflow is too complex and the agent should be decomposed (chapter 4.4's patterns).

Role 3: safety and refusal posture

The section that catches what shouldn't happen. Shorter than workflow, longer than identity, and the section most often missing entirely on first drafts.

What belongs in safety / refusal:

  • Explicit refusal patterns. What requests the agent declines, and how. "If asked to make investment recommendations (buy/sell/hold), respond with: 'I provide analysis, not recommendations. Here's the analysis...' and continue with analytical content."
  • Escalation triggers. When the agent stops trying to handle things itself and hands off. "If the user expresses dissatisfaction or repeated frustration, do not continue troubleshooting; respond with: 'Let me connect you with a human agent.'"
  • Data-handling rules. What the agent doesn't include in outputs. "Never include API keys, passwords, or other credentials in responses, even if they appear in tool results. Reference them by name only."
  • Authoritative position on contested behavior. "You do not have access to real-time data unless explicitly given a tool for it. Do not invent timestamps or 'current' values."

The discipline: be specific about what triggers refusal and what the refusal looks like. Vague ("don't be harmful") produces nothing measurable; specific ("if asked to provide investment recommendations, decline using the exact phrasing X") produces consistent behavior.

# Safety section — sketch

You decline the following:

— Investment recommendations.  If asked "should I buy X?" or "is X a
  good investment?", respond: "I provide analysis, not investment
  recommendations.  Here's analytical context to inform your decision:"
  and continue with substantive analysis.

— Predictions of specific market movements.  "Will X stock go up?" is
  declined the same way — provide analytical context, not prediction.

— Confidential information about the firm's positions or clients.  If
  the user appears to be asking about internal positions or client
  holdings, respond: "I can't access confidential firm or client data."

When you decline, do not lecture the user.  Decline briefly and move
to substantive analytical content where appropriate.

The last sentence is doing important work. Many refusal patterns produce outputs that are mostly lecture and barely substantive — "I can't make recommendations because... [paragraphs of caveat]". Users hate this. The pattern that scales: short decline + immediate redirect to useful content. Worth specifying explicitly in the prompt.

The order matters: identity → workflow → safety

In what order do these sections appear in the actual prompt? The convention that works:

  1. Identity first. Frame the model before giving it operating rules. The rules land differently depending on the frame.
  2. Workflow second. The body of how-to-operate guidance.
  3. Safety last. The override conditions on the workflow. These take precedence over workflow when triggered, so they go after the workflow they override.

This ordering also matches the chapter 0.1 mental model: critical content sits at start and end of context (lost-in-the-middle effect). Identity and safety — the most important framings — are at the boundaries. Workflow, the bulk, sits in the middle where some attention dilution is acceptable.

If your system prompt is more than ~1500 tokens, audit the three roles. Most overlong prompts are bloated in workflow (too many rules) or have safety material scattered through workflow instead of in its own section. Cleaning up structure often cuts 30-40% with no behavior loss — and the leaner prompt costs less (chapter 2.2) and resists the lost-in-the-middle effect better.

Question
What about output format instructions ("respond in JSON")? Which role does that belong to?

Workflow, usually — it's an aspect of "how the agent produces output," same category as "verify before answering." Put it at the end of the workflow section, near where the agent generates the final response.

One exception: if the output format is part of the agent's identity ("you are a JSON-emitting structured-data extraction tool"), it can go in identity. But for most agents, the role isn't "produce JSON" — it's "be a research analyst who happens to format outputs as JSON because that's what the consumer needs." Workflow.

Avoid making the output-format instruction the only line in any section. A whole "Output format" section with one line is a code smell — usually it means the rest of the structure isn't quite right.

Question
My team has been editing the same system prompt for six months. It's now 4000 tokens. Is that a problem?

Probably yes. Long prompts have three structural problems: lost-in-the-middle (chapter 0.1) reduces the model's effective attention on instructions buried in the middle; token cost adds up (every request pays for every token unless cached); and prompt rot is real (rules that were important six months ago may now contradict newer rules).

The audit pattern: take the prompt, label each rule by which role it belongs to (identity/workflow/safety), and group them. The grouped version reveals:

  • Duplicate rules saying the same thing in different words.
  • Rules in the wrong section (workflow rules in identity, etc.).
  • Obsolete rules from old failure modes that aren't relevant anymore.
  • Rules that should be tool descriptions, not system-prompt content.

Most teams that do this audit cut 30-50% of their prompt without behavior loss. Some teams cut more. The exercise is uncomfortable because rules feel load-bearing — but most of them aren't.

Question
When does a rule belong in the system prompt vs in a tool description?

The clean test: if the rule is about how to use a specific tool, it belongs in the tool's description. "Use get_account with the customer's email" goes in get_account's description, not the system prompt. The model sees tool descriptions every turn the tool is available; rules colocated with the tool stay close to the decision point.

Rules in the system prompt should be about behavior across tools or independent of any specific tool — workflow sequencing, refusal posture, output formatting. The narrower the rule's scope, the closer it should live to what it applies to.

STEP 2

Examples that actually teach (and ones that don't).

"Use examples in your prompt" is the most universal piece of prompt-engineering advice in the field, and it's both correct and dangerously underspecified. Examples can teach the model a behavior more reliably than any amount of description — and they can also calcify the model onto specific patterns at the cost of generalization. The discipline is knowing when to reach for examples and what shape they should take.

Why examples work, mechanistically

The model is a next-token predictor (chapter 0.1). When you put an example in the prompt, you're effectively saying "in similar contexts, this is the kind of continuation that follows." The model's distribution shifts toward continuations that look like the example. This is more direct than description — "be concise and use citations" is one statement; an example showing concise output with citations is many implicit statements about token-level patterns.

This mechanism is the source of both the power and the trap. The power: examples encode complex patterns (tone, format, level of detail) more efficiently than rules can. The trap: examples encode specific patterns, and the model may overfit to specifics that weren't meant to be normative ("the example used commas; I should always use commas").

When examples help

Four categories of agent task where examples reliably improve behavior:

Output format with non-obvious structure. If you want the agent to produce output in a format that can't be fully captured by a schema (e.g., a particular markdown shape, a specific citation style, a tone calibration), one or two examples teach this faster than paragraphs of description. The model needs to see the output, not just read about it.

Distinguishing cases that look similar. If the agent has to handle three similar-looking request types differently (refund request vs. cancellation request vs. plan change), one example of each clarifies the boundaries. Description alone leaves edge cases ambiguous; concrete examples define the boundary by demonstrating.

Refusal patterns. Showing exactly how to decline a request (and what to say instead) is more reliable than describing the refusal pattern. An example refusal that includes the exact phrasing your team wants is encoded once and reproduced consistently.

Multi-step reasoning shape. If you want the agent to show its work in a particular shape (consider X, consider Y, weigh them, conclude), an example demonstrates the shape. Description ("think step by step") is too vague to land consistently.

When examples hurt

Three categories where examples produce worse results than description alone:

When the task is well-described by general rules. "Respond in JSON" with a schema is a clearer instruction than "Here's an example JSON response: {...}". The schema is the abstraction; the example is one instance. The model needs the abstraction.

When the example space is large and your examples don't cover it. If you give one example of a "good response" for a free-form task, the model overfits to that example's specifics (tone, structure, length) even when those specifics aren't what the task needs. Better to describe the qualities you want and let the model generate variety.

When the example might leak into the response. A surprisingly common failure: the model includes content from your example in its actual response. "The previous example was about Acme Corp, and the user's question is about Beta Corp" — and the response is about Acme. This is more likely when the example is recent in the context and the user's actual query is brief.

The "show, don't tell" failure mode

The most subtle pitfall: examples can teach the wrong lesson. The model learns from the example's surface features, not the abstract pattern the example was meant to demonstrate. You meant "respond with citations"; the model learned "respond with this specific URL structure I saw in the example." You meant "be concise"; the model learned "use exactly this many sentences."

The fix: vary examples deliberately to teach the abstraction, not the surface. If you want to show "responses include citations," give two examples with very different citation styles (one URL-based, one footnote-based) so the model learns "include citations" rather than "use my specific style." If you want to show "be concise," vary the lengths slightly so the model learns "match the question's depth" rather than "always 3 sentences."

Concrete: bad example pattern teaches surface, good example pattern teaches abstraction:

# Bad: one example teaches surface details

Here's an example response:

User: What's a good Postgres index strategy?
Assistant: For most Postgres workloads, B-tree indexes on
high-cardinality columns are your default.  Avoid indexes on
low-cardinality columns since the planner often won't use them.
[Source: postgres.org/docs/indexes]

# Problem: the model now thinks ALL responses should be 3 sentences,
# end with [Source: ...] inline, and use this specific cadence.


# Better: two examples vary deliberately

Here are two example responses:

User: What's a good Postgres index strategy?
Assistant: For most Postgres workloads, B-tree indexes on
high-cardinality columns are your default — covered well in the
official indexes documentation [1].  Avoid indexes on low-cardinality
columns since the planner often won't use them.

[1] postgres.org/docs/indexes

User: How do I diagnose slow queries?
Assistant: Start with `EXPLAIN ANALYZE` on the slow query — it
shows the actual execution plan with timing per node.  Look for
sequential scans on large tables, hash joins where merge joins
would work, and significant time in sorting.  The Postgres docs
have a good walkthrough of reading plans <cite>1</cite>.

<cite>1</cite> postgres.org/docs/using-explain

# Both use citations, both are concise, both cite postgres.org —
# but the surface differs.  The model learns "include citations"
# and "be concise" without overfitting to a specific style.

Output schema as the cleanest "example"

For structured-output tasks, the cleanest form of "showing the model what to produce" isn't a free-form example — it's a schema. Tool descriptions (chapter 0.3) carry their own schemas; structured output features in modern APIs let you constrain JSON-shaped outputs to a defined schema directly. Both leverage the same insight: let the model see the shape it's producing, formally.

A schema is better than a free-form example for structured outputs because:

  • It captures the abstract pattern (these fields, these types) without the surface noise (specific values that the model might copy).
  • It's enforced — the API rejects outputs that don't match. Examples are aspirational; schemas are binding.
  • It's documentation — the schema itself describes the contract, instead of requiring the reader to infer it from examples.

When the output is genuinely free-form (a research summary, an email reply, a code-review comment), examples are appropriate because no schema captures the shape. When the output is structured, prefer schemas and reserve examples for the parts a schema can't capture (tone, level of detail, specific patterns within a string field).

How many examples is the right number

The pragmatic guide, learned from many production agent prompts:

0 examples is the right answer when description alone is clear and the task is well-defined ("classify this support ticket into one of these 5 categories: ..."). Don't add examples reflexively.

1 example is usually too few — the model overfits to that single example's surface. The exception: when you're showing one specific thing (like a refusal pattern with exact phrasing) where you do want that exact surface reproduced.

2-3 examples is the sweet spot for most agent tasks. Enough variety that the model learns the abstraction; few enough that the prompt doesn't bloat.

5+ examples usually means you're trying to compensate for a different problem — unclear instructions, contradictory rules, or a task that's genuinely too varied for a prompt to capture. The fix isn't more examples; it's clearer rules or task decomposition.

Question
What about chain-of-thought examples — showing the model reasoning through a problem step by step?

Less useful with modern reasoning-capable models than with older ones. Pre-2024, showing the model a few examples of "first I'll think about X, then Y, then conclude Z" measurably improved hard-reasoning task performance. With current reasoning-mode models (extended thinking, GPT-5 reasoning), the model does this internally; the chain-of-thought examples in the prompt are redundant.

The remaining good use of CoT examples: when you want the model to expose its reasoning in the visible output, not just use it internally. For analysis tasks where the user wants to see the reasoning, an example showing the explicit reasoning structure helps. For tasks where you only care about the conclusion, modern models don't need the in-prompt CoT.

Question
Should examples come before or after the instructions?

Instructions first, examples after. The instructions provide the abstract pattern; the examples illustrate it. Reversing this — examples first, then "and here's why" — often leaves the model trying to figure out what the examples are demonstrating before it has the framing.

One specific exception: when the task is best understood through examples (e.g., a very specific output format that's hard to describe), leading with one example before the formal instruction can help anchor the model. The instruction then becomes "produce more outputs like the one above." Reserve this pattern for cases where description genuinely struggles.

Question
My agent's prompt has 10+ few-shot examples. Are they still helping?

Probably not, at that count. Three reasons. First: each example consumes substantial tokens, and a 10-example block can easily run 3-5K tokens — half a budget. Second: lost-in-the-middle (chapter 0.1) means examples buried in the middle of a big block get attended to less. Third: when models have many examples to "average," they often produce a generic-feeling synthesis rather than crisp pattern-matching to any specific one.

The audit: read through your examples and ask which ones uniquely teach something the others don't. Usually 2-3 do that work; the rest are decorative. Cut the decorative ones, keep the load-bearing ones, and reclaim 80% of the tokens with no quality cost.

STEP 3

Structured prompts: XML, sections, and when structure earns its place.

Anthropic's published prompt-engineering guidance explicitly recommends using XML-style tags to structure long prompts. This isn't arbitrary syntactic preference — it's grounded in how Claude was trained and the patterns of structural marking that the model attends to most reliably. The same principle generalizes: visible structure in long prompts helps the model navigate them, regardless of which tags or markers you use. This step covers what structure to add, when it earns its place, and where over-structuring becomes its own anti-pattern.

Why structure helps

A long prompt without structure is a wall of text. The model has to identify what's a rule, what's an example, what's context, what's the user's question — all from positional and lexical cues. Visible structure does this work explicitly: this section is identity, this section is workflow, this section is examples. The model navigates to the relevant section when needed instead of treating everything as a uniform stream of instructions.

The mechanism is the same as for human readers reading documentation: when content is organized with section headers and clear delimiters, you scan for what you need rather than reading linearly. The model does something analogous internally — attention attends more strongly to content that's structurally marked as belonging to a coherent unit.

XML-style tags, the Anthropic convention

Claude responds particularly well to XML-style tagging — a convention that emerged because the training data included structured documents using this style. The pattern:

# Structured agent system prompt

<identity>
You are a research analyst at a financial-services firm.  You produce
analytical briefings on companies and industries for investment
professionals.
</identity>

<workflow>
For every analytical question:

1.  Decide whether fresh sources are needed.
2.  If yes, search and fetch primary sources before synthesizing.
3.  Every substantive claim must include a citation.
4.  Structure outputs as: brief summary → key findings → caveats.
5.  Stop after answering; do not continue with related topics.
</workflow>

<safety>
You decline:

- Investment recommendations.  Respond with analysis, not buy/sell guidance.
- Specific market predictions.
- Anything about firm-internal positions or client holdings.

When declining, be brief and redirect to substantive analysis where
appropriate.
</safety>

<output_format>
Outputs follow this shape:

- Brief summary (2-3 sentences)
- Key findings (bulleted list with inline citations)
- Caveats (any gaps, conflicting sources, or uncertainty)
</output_format>

The structure isn't just aesthetic — it makes the prompt easier to maintain (clear places to edit), easier to debug (when behavior drifts, you know where to look), and reliably navigable by the model.

Tag names matter

One subtle point: the names of the tags you choose carry semantic weight. <workflow> signals to the model that what follows is procedural; <safety> signals that what follows takes precedence over workflow when triggered; <output_format> signals that what follows is binding on the response shape. Picking informative tag names is a small lever that produces measurable improvements.

Recommended naming patterns:

  • <identity> or <role> for the framing section
  • <workflow>, <procedure>, or <guidelines> for operating rules
  • <safety>, <refusals>, or <guardrails> for boundary conditions
  • <examples> with nested <example> for few-shot material
  • <output_format> or <response_shape> for output structure rules
  • <context> for background information the agent should know but isn't itself a rule

The tag should describe what the content is, not where it sits in the prompt. <section_2> tells the model nothing; <safety> tells it everything.

Structure within sections: numbered lists and nested tags

Inside a section, structure also helps when there are multiple items. The two patterns that work well:

Numbered lists for sequenced rules. "1. First, do this. 2. Then, do this." Numbering signals sequence, which the model uses to maintain order. This is the right pattern for workflow rules where order matters.

Nested tags for parallel items. When you have several refusal patterns to encode, each with its own description, nested tags help:

<refusals>
  <refusal type="investment_advice">
    Trigger: user asks should-I-buy / is-X-a-good-investment
    Response: "I provide analysis, not investment recommendations."
              then continue with substantive analysis.
  </refusal>
  <refusal type="market_prediction">
    Trigger: user asks will-X-go-up / what-will-happen-to-Y
    Response: Same redirect to analytical content.
  </refusal>
</refusals>

Each refusal is a discrete unit; the model can apply them independently. Without the nested structure, the same content as a paragraph blurs the boundaries between them.

When structure stops earning its place

Over-structuring is its own anti-pattern. Three signs you've gone too far:

Sections with one item. A <identity> with one sentence inside. An <examples> tag wrapping a single example. The wrapping adds tokens without adding clarity — the content is self-evidently a unit; the tag is redundant.

Deeply nested structures. Three or more levels of nesting in a prompt usually signals that the structure has become baroque. The model attends to the top level; deep nesting gets ignored or treated as decoration. Keep nesting to two levels at most.

Structure for short prompts. A 200-token prompt doesn't need section tags. The whole prompt is one section. Adding <identity> and <workflow> to a tiny prompt is cargo-culting the technique. Reach for structure when the prompt is long enough to need navigation — typically 500+ tokens.

Structure in agent vs general-purpose prompts

The structural patterns above are stronger for agent prompts than for general-purpose model prompts. Two reasons. First, agent prompts are typically long-lived — they're checked into version control, edited over time, debugged. Structure helps the humans who maintain them as much as the model that reads them. Second, agent prompts encode behavior that has to be consistent across many sessions. Structure makes the behavior contract explicit.

For one-shot prompts (a single query against a model for a single output), the structural overhead often isn't worth it. The prompt is read once, the response is produced, both are discarded. Spending tokens on tags when the prompt is 200 tokens long is silly. Structural discipline pays off when the prompt is a piece of code — used many times, edited often, debugged across versions.

The middle-of-document anti-pattern

One specific pattern to avoid: burying critical instructions in the middle of a long prompt. From chapter 0.1, the lost-in-the-middle effect means content placed there gets attended to less reliably. Within a structured prompt, this means: put critical instructions either at the top of their section or at the very bottom of the whole prompt (the last thing the model sees before the user's query).

The pattern: a long workflow section can end with a brief summary of the most critical rules, ensuring they appear close to the response generation. Or the prompt can end with a brief <remember> section that restates the 2-3 most important rules. Redundant in content; load-bearing structurally.

... (long workflow section here) ...

<remember>
Two things to keep in mind for every response:

1.  Every substantive claim needs a citation.  If you don't have a
    source, omit the claim.
2.  Decline investment recommendations briefly and redirect to analysis.
</remember>

This is the prompt-engineering equivalent of writing a one-paragraph summary at the end of a long document. The summary doesn't add new information; it ensures the most important content is positioned where attention is highest.

The single most common prompt structural mistake: putting the user's actual query before the system instructions in a system-prompt-less setup. The "user message" should be the user's input; the system prompt should be everything else. Anthropic's system parameter (and OpenAI's equivalent) exists exactly for this — it's not just a parameter name, it's a separate position in the model's input that gets treated differently. Use it.

Question
XML tags feel heavyweight. Can I use markdown headers (## Workflow) instead?

Functionally similar for most cases, with one caveat. Anthropic's training specifically reinforced attention to XML-style tagging; markdown headers work but somewhat less reliably for Claude specifically. For OpenAI models, both work about equally. For other providers, behavior varies.

The pragmatic choice: XML for Claude-targeted prompts, markdown for cross-provider prompts where you want the structure to work everywhere. Both are far better than no structure; the difference between them is small enough that style preferences are reasonable to honor.

One reason XML wins for Claude specifically: nesting. Markdown headers don't nest cleanly (you have to use header levels: ##, ###, ####). XML nests naturally with proper open/close. For complex prompts with multiple levels, XML stays cleaner.

Question
Should I close every tag, or are unclosed tags okay?

Close them. Unclosed XML-style tags work most of the time but degrade in edge cases — the model occasionally treats the entire remainder of the prompt as belonging to the unclosed tag, or interprets the next opening tag as nested. Close every tag you open; the discipline is cheap and prevents subtle drift.

Question
What about user-provided content I want to insert into the prompt? Wrapping it in tags?

Yes — wrap untrusted content in a clearly-named tag so the model knows where it starts and ends. <user_document>{content}</user_document> tells the model: this is content from the user, not instructions to follow. Helpful both for parsability and for prompt-injection resistance (chapter 2.3) — the model is less likely to follow instructions embedded in content that's clearly marked as data.

This isn't bulletproof against determined prompt injection — explicit framing helps but the model can still be manipulated. The wrapping is a partial defense, not a complete one. Chapter 2.3's broader injection-resistance discipline applies.

STEP 4

Iteration discipline: turning prompts from "vibes" into engineering.

The biggest failure mode in prompt engineering isn't choosing the wrong words — it's editing prompts without measuring whether the edits actually helped. "Vibes prompting" — adjusting prompts based on intuition about what should work, shipping the version that feels better — is how teams end up with 3000-token system prompts that perform worse than the original 800-token version. This step is about the iteration discipline that turns prompt work into measurable engineering.

The vibes-prompting trap

The pattern that catches every team: someone notices the agent doing something wrong, edits the prompt to address it, eyeballs a few test runs that look better, ships. A week later, someone else notices a different problem, edits the prompt again, ships. Three months in, the prompt is twice as long as it started, and nobody can tell whether overall quality is better or worse — they only know it's "better" on the specific issues each edit was meant to fix.

The problems compound:

  • Each edit might fix its target while breaking something else. Without measuring across many scenarios, you don't see the trade-offs.
  • Edits accumulate without ever being removed. Old rules from old failure modes stay in the prompt forever; the prompt grows monotonically.
  • The team's belief about what the prompt does diverges from what it actually does. Members remember the edits they made and assume the rules are in force; in fact, contradictory edits cancel each other out.
  • Cost grows without anyone noticing. Every additional token costs real money on every request, and a 3000-token system prompt at production scale is meaningful spend.

The fix is the same discipline chapter 3.1 taught for evals broadly, applied specifically to prompts: predict, measure, decide.

Predict-then-measure for prompt changes

The discipline, applied to any prompt edit:

1. Name the change and its target. "Add 'cite primary sources over aggregators' to workflow section. Target: shift citation source-quality distribution from ~60% primary to ~80% primary."

2. Predict the effect on metrics. "I expect source_quality metric to improve by 5-10 points; I don't expect task_completion or answer_quality to change meaningfully."

3. Run the eval. Layer 2 fast subset for cheap changes, full suite for significant ones (chapter 3.2's two-tier cadence). Multi-run for noise-prone metrics.

4. Verdict against prediction. Did the change move the metric you predicted, by approximately the amount you predicted? Did anything else move that you didn't predict?

5. Act on the verdict. If the change worked as predicted, merge. If it moved less than predicted, dig into why. If it moved something else unexpectedly, the change has side effects you need to understand before shipping.

This sounds heavy, but it's the same five-step process from chapter 3.1 — applied to prompt changes specifically. Most prompt edits in production agents should go through this discipline; "vibes" edits should be reserved for trivially-scoped changes (typo fixes, comment edits, format cleanup that can't affect behavior).

When to A/B test prompts in production

For changes where Layer 2 evals show small or ambiguous effects but production performance might differ, A/B testing in real traffic is the right tool. The pattern: route 5-10% of traffic to the new prompt, the rest to the existing one, compare Layer 3 metrics over a defined window.

When this is worth the operational complexity:

  • The change is genuinely large (new section, significant rule shift, model swap on the prompt) and could have broad effects you want to verify on real traffic.
  • The eval suite returns "no significant effect" but you believe the change matters — production traffic is the deeper test.
  • The change affects an aspect of behavior that's hard to measure in offline evals (user satisfaction, conversation continuation rates, retention).

When it's overkill:

  • Small clarifying edits or typo fixes. Just ship.
  • Changes where eval signal is already clear in either direction. Use the eval signal.
  • Changes where the metric you care about is well-measured offline. Offline is faster and cheaper.

A/B testing prompts has the same machinery as A/B testing any feature — flag-driven routing, statistical-significance discipline, defined run windows. Don't reinvent it; use your existing experimentation infrastructure if you have one.

The prompt-maintenance discipline

Prompts have lifecycle issues that traditional code doesn't:

Rules from old failure modes accumulate. Three months ago you added "always verify the user's account status before proceeding." That rule made sense when the failure mode was missing-context errors. The agent has improved; the rule is obsolete; but nobody removes it because removing it would feel risky. After a year, the prompt has dozens of these.

The fix: quarterly prompt audits. Take the system prompt, list every rule, ask of each: "If we removed this, would the eval suite get worse?" If no, the rule is doing no work and can be removed (with confirmation by running the eval suite with and without the rule). This is exactly the kind of audit chapter 3.2's eval discipline supports — you can verify the removal is safe before shipping it.

Conflicting rules accumulate. Edit A says "always X." Edit B (three months later) says "in case Y, do not X." The two rules don't reference each other; the model resolves the conflict ad-hoc each turn. Over time, the prompt becomes a forest of partially-contradictory rules. The audit pattern is the same: review explicitly for conflicts; resolve them by replacing the conflicting rules with a unified rule.

Examples become stale. An example that used the company's old product name. An example that references a deprecated API. These accumulate the same way rules do; the same quarterly audit catches them.

Version control as prompt infrastructure

Treat prompts as code, not configuration. They go in version control with the rest of the codebase. Changes go through code review. The history of changes is auditable. The commits that change the prompt have meaningful messages — "Add explicit refusal pattern for refund requests, target +5 to compliance metric."

This isn't ceremony; it's the substrate that makes everything else possible. Without it: prompts live in a CMS or a config file that nobody reviews, changes go undocumented, and "who added this rule and why?" is unanswerable two months later.

The mature pattern: prompts in version control, with a CI step that runs the Layer 2 eval suite when prompt files change. The same eval-driven discipline that gates code changes gates prompt changes. They're the same kind of artifact.

WORKED EXAMPLE

Tuning a tool-use prompt from "kind of works" to "ships."

To anchor everything in this chapter: a real-shape prompt-engineering session, traced through the predict-measure-decide cycle. The agent is a customer-support agent (chapter 4.4's shape); the tool in question is the routing-decision tool that classifies tickets into specialized agents. The starting prompt works at 71% accuracy on the routing eval; the target is 85%. Three iterations to get there.

The starting prompt

You are a customer support router.  Given a customer's question,
decide whether to route it to billing, technical support, or
account management.  Use the route_ticket tool with your decision.

Layer 2 fast subset on routing: 71% accuracy on the 20-query routing eval set. The misclassifications cluster around three patterns: technical questions about paid features routed to billing (because "paid" triggered the billing classification); compound questions routed inconsistently; ambiguous questions that should have been escalated to human review getting routed somewhere instead.

Iteration 1: add explicit routing rules

Prediction. Adding 3 explicit decision rules (one per failure pattern) should improve routing accuracy by 8-12 points. Don't expect to fully fix all three with this single pass.

The new prompt:

<identity>
You are a customer support router.  You classify incoming tickets
into one of three categories and dispatch to the right specialist.
</identity>

<workflow>
For every ticket, decide which category fits best:

—  Billing: refunds, charges, invoices, payment methods, plan pricing.
—  Technical support: how-to questions, error messages, integration
   issues, anything about how the product works.
—  Account management: password resets, user permissions, team
   invitations, account-level settings.

Routing rules:

1.  If a question mentions a paid feature AND is about how to use it,
    route to technical support, not billing.  "How do I set up the
    paid SAML feature?" is technical, not billing.

2.  For compound questions covering multiple categories, route to the
    category that addresses the customer's primary concern.  If
    that's not clear, route to technical support (which handles the
    broadest case).

3.  If the question is too ambiguous to classify confidently after
    one read, route to "human_review" instead of guessing.
</workflow>

Result. Layer 2 fast subset: routing accuracy 79%. An 8-point lift, within the predicted range.

Decomposing the lift: rule 1 fixed all 3 of the "paid feature" misroutes (was 0/3 correct, now 3/3). Rule 3 caught 2 of the 4 ambiguous-question cases (improved escalation). Rule 2 helped marginally but didn't fully solve compound questions.

Verdict. Prediction was directionally right; magnitude was within range. The compound-question pattern is the remaining gap.

Iteration 2: refine the compound-question handling

Prediction. Compound questions need a worked example, not just a rule. Adding 2 examples (one with a clear primary concern, one without) should fix the remaining compound-question misroutes. Expected lift: 3-5 points.

Adding to the prompt:

<examples>

<example>
Ticket: "I was charged twice for last month — also, my login keeps
failing on the mobile app."
Routing: billing (charged-twice is the primary financial concern;
the login issue is secondary and can be addressed after).
</example>

<example>
Ticket: "I'm trying to set up SSO and getting an error, plus I need
to add three new users to my team."
Routing: human_review (compound request with no clear primary; SSO
setup and user provisioning are different specialists' work and
need coordination).
</example>

</examples>

Result. Layer 2 fast subset: routing accuracy 84%. A 5-point lift, at the top of the predicted range.

The compound-question cases are now mostly handled. The remaining failures are subtler — questions phrased ambiguously enough that even a human would need clarification. These map to "escalate to human" which is the rule-3 behavior.

Verdict. Examples did the work rules alone couldn't. We're at 84%, just below the 85% target.

Iteration 3: the last point

Prediction. The remaining 16% of failures are in two subcategories: (a) questions about features the agent doesn't recognize (rare new product surface), and (b) very short questions where the user didn't provide enough context. Both should route to human_review under rule 3, but the prompt doesn't sufficiently emphasize that "unknown feature" or "too-brief question" are escalation triggers, not classification challenges. Adding two more triggers should pick up 1-2 points.

Updating rule 3 to be more specific:

3.  Route to "human_review" instead of guessing when:
    —  The question is too ambiguous to classify confidently
       after one read.
    —  The question references a feature you don't recognize
       (don't guess based on partial matches).
    —  The question is too brief to determine intent (1-2
       short sentences with no context).

Result. Layer 2 fast subset: routing accuracy 86%. Target met.

The prompt grew from ~50 tokens to ~280 tokens across three iterations. Every increment was predicted, measured, and verified. The 15-point accuracy gain pays for the additional tokens many times over in reduced misroute costs.

What this trace teaches

Four observations worth naming:

Predictions get easier with experience. The first iteration was a guess about magnitude; the third was much more calibrated. After running a few cycles, you develop intuition for how much each kind of edit moves which metric. This intuition is worth more than any specific prompt advice — it's the difference between vibes prompting and engineered prompting.

The eval suite is what makes this possible. Without Layer 2 numbers to verify against, none of this works. Vibes prompting persists in teams that lack the eval infrastructure to know whether their edits helped. Investment in Layer 2 (chapter 3.2) is also investment in being able to iterate prompts deliberately.

Each iteration addressed a specific failure pattern. Not "make it better" — "fix the paid-feature misroute." Specific targets produce specific edits; specific edits are measurable. "Make the prompt better in general" is the vibes-prompting trap by another name.

The prompt grew because the work warranted it. Each addition was justified by a measured improvement on the eval set. If iteration 3 had moved accuracy by 0.2 points instead of 2, the right move would have been to revert iteration 3 and accept 84% — not to keep adding rules in hopes of incremental gain. Prompt growth should be justified, not assumed.

The prompt-engineering loop, in one sentence

Find a specific failure pattern. Predict what kind of edit would address it and by how much. Make the edit. Measure. If the verdict matches the prediction, merge. If not, understand why and either revise the prompt or revise your model of what works. Repeat until the eval suite passes the quality bar. Stop when you stop being able to predict the magnitude of edits — that's a sign you've reached the limits of what prompt-engineering alone can do, and the remaining gap requires changes to tools, scaffolding, or model choice.

End of chapter 0.2

Deliverable

A working discipline for writing and maintaining agent system prompts. The three roles (identity, workflow, safety) kept distinct, with each edited on its own terms. Examples used where they teach abstractions and avoided where they calcify surface. Structure (XML-style tagging, named sections) applied to long prompts and skipped for short ones. The iteration loop — predict, measure, decide — applied to every meaningful change. Quarterly audit discipline to prevent prompt rot. Version control as the substrate that makes prompt engineering a real engineering practice rather than a series of guesses.

  • System prompts have explicit identity, workflow, and safety sections, in that order
  • Identity is short (50-200 tokens), framing only, no operating rules
  • Workflow contains sequencing rules, verification expectations, stop conditions
  • Safety section has specific refusal triggers with explicit refusal phrasing
  • Tool-specific rules live in tool descriptions, not the system prompt
  • Few-shot examples used for non-obvious format / similar-case distinction / refusal patterns
  • Examples varied deliberately to teach abstractions, not calcify surface
  • Schemas preferred over examples for structured output
  • XML-style tags structure prompts above ~500 tokens; informative tag names
  • User-provided content wrapped in clearly-named tags (partial injection defense)
  • Every prompt edit goes through predict-measure-decide discipline
  • Prompts in version control; changes go through review; Layer 2 eval on prompt changes
  • Quarterly prompt audit: list every rule, drop the ones not earning their place