Code Agents — The Agentic AI Field Guide

4.1

Part IV / Specialize · The agent type that changed what agents are for

Code agents: the shape that made agents real.

Code agents — Claude Code, Cursor, Aider, Cline, the Agent SDK — are the agent category that proved the medium. They produce real work-product (code that ships), they verify themselves (tests run or they don't), and they operate in a state-rich environment (the filesystem) rather than the stateless turn-based world of chat. The architecture is different enough from a research or chat agent that techniques transfer poorly without adaptation. This chapter teaches what makes code agents distinct, the action space they actually operate in (filesystem operations, not "write code"), why the verification loop is the whole game, and how Anthropic's Agent SDK and the Skills system fit together. By the end you'll have a working mental model for designing code agents and a clear view of when to build one vs. when to adopt one.

STEP 1

What makes a code agent different.

A chat agent answers questions. A research agent finds information and synthesizes it. A code agent modifies a system. That shift in role changes the architecture in four specific ways that compound — and that distinguish the design problem from everything we've covered so far.

Property 1: Persistent filesystem state

In a chat agent, state lives in the conversation history. The agent's "world" resets between turns; whatever it produced exists only as text inside the message stream. In a code agent, state lives in files on disk. The agent edits a file in turn 1, and that file is still edited in turn 2 — not because the agent remembers, but because the filesystem remembers. The agent reads what it (or its predecessor session) wrote, builds on it, and the user can inspect the result with ls, git diff, or by opening the file in their editor.

This is the difference between an agent that produces conversational output (read once, used immediately, discarded) and an agent that produces artifacts (saved, owned, modified later, possibly by humans). The implication for design: the agent's job isn't to generate text the user reads — it's to converge a project's filesystem to a desired state. The "output" of a code agent task isn't what it said in chat; it's what files now exist that didn't, what files changed, and what state they're in.
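One way to make "the output is the filesystem delta" concrete is to snapshot the tree before and after a task and report the difference. The sketch below is illustrative only: `snapshot` and `diff_snapshots` are hypothetical helper names, and in a real repository `git status` or `git diff` would normally do this job.

```python
import hashlib
import os


def snapshot(root: str) -> dict[str, str]:
    """Map each file's root-relative path to a SHA-256 digest of its contents."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control internals; they are not part of the work product.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as f:
                digests[os.path.relpath(full, root)] = hashlib.sha256(f.read()).hexdigest()
    return digests


def diff_snapshots(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Summarize a task's effect as created, modified, and deleted paths."""
    return {
        "created": sorted(p for p in after if p not in before),
        "modified": sorted(p for p in after if p in before and after[p] != before[p]),
        "deleted": sorted(p for p in before if p not in after),
    }
```

Taking `snapshot(repo)` before the run and again after it, then passing both to `diff_snapshots`, gives a task summary that does not depend on anything the agent said in chat.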

Property 2: Programmable verification

The thing that distinguishes code from prose: you can execute it. A research agent that writes a paragraph about Postgres can't check whether the paragraph is correct without an LLM judge. A code agent that writes a function can run the tests against the function. If the tests pass, the function works. If they fail, the function doesn't. The grader is deterministic, the feedback is immediate, and the standard is non-negotiable.

This is the single biggest architectural advantage code agents have over every other agent type. Chapter 3.3's whole apparatus — LLM-as-judge, calibration sets, position bias — exists because most agent outputs can't be deterministically verified. Code can. A test suite is the kind of grader research agents wish they had: it disagrees with humans rarely, runs in seconds, costs cents to invoke, and never has self-preference bias.

The design implication: your code agent should be structured around the verification loop, not the generation step. The interesting question isn't "did Claude write code?" — it's "did the code pass the tests?" A code agent that writes a 500-line refactor and never runs the tests is a research agent that happens to output Python. A code agent that writes 20 lines, runs the tests, sees one fail, fixes it, and runs the tests again until green is a different shape entirely.

Property 3: A bounded action space

Research agents have, in principle, the whole internet as their action space. Each tool call could surface arbitrary new content. Reasoning about what they'll do next, or what they did last week, is hard.

Code agents operate inside a known filesystem at a known repository. The action space is enumerable: every action is some combination of read-a-file, write-a-file, edit-a-file, run-a-command, search-the-tree. The set of files is finite. The set of commands the agent can run is whatever you grant it. This boundedness has two consequences:

First, you can predict and audit what the agent did. git diff shows you every byte of change, file by file. There's no "the agent decided to email someone" branch hiding outside the repo. The agent's effects are inspectable, in detail, by tools the user already knows.

Second, you can constrain the agent's capabilities precisely by controlling which tools and commands are allowed. Typical modes: read-only (the agent can browse but not modify); writes scoped to a subdirectory (the agent can modify src/ but not the rest of the repo); no-network (the agent can't curl anywhere). These constraints would be hard to enforce on a general-purpose agent; on a code agent they're a one-line config change.
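A minimal sketch of what such a grant could look like in code, assuming the tool names used later in this chapter. `ToolPolicy`, `READ_ONLY`, and `SRC_ONLY` are hypothetical names for illustration, not part of any SDK. A no-network mode would additionally need sandboxing at the OS level, which this sketch does not attempt beyond withholding `bash`.

```python
import os
from dataclasses import dataclass
from typing import Optional

# Tools that modify files; everything else is treated as read-only here.
WRITE_TOOLS = {"write_file", "str_replace"}


@dataclass(frozen=True)
class ToolPolicy:
    """Hypothetical per-task capability grant for a code agent."""
    allowed_tools: frozenset
    writable_prefixes: tuple = ()  # repo-relative directories the agent may modify

    def check(self, tool: str, path: Optional[str] = None) -> None:
        """Raise PermissionError if the call falls outside the grant."""
        if tool not in self.allowed_tools:
            raise PermissionError(f"tool {tool!r} is not granted for this task")
        if tool in WRITE_TOOLS and path is not None:
            norm = os.path.normpath(path)  # collapses "src/../x" to "x"
            if not any(norm == p or norm.startswith(p + os.sep)
                       for p in self.writable_prefixes):
                raise PermissionError(f"{path!r} is outside the writable directories")


# Read-only: browse and search, no edits, no shell.
READ_ONLY = ToolPolicy(frozenset({"read_file", "glob", "grep"}))

# Writes confined to src/; still no shell, so no curl.
SRC_ONLY = ToolPolicy(
    frozenset({"read_file", "glob", "grep", "write_file", "str_replace"}),
    writable_prefixes=("src",),
)
```

Calling `policy.check(tool_name, path)` before dispatching each tool call is enough to enforce the grant in a hand-rolled loop.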

Property 4: Tight feedback loops

Every action a code agent takes can be checked, fast. A syntax error fires on save (the file won't even parse). A type error fires on type-check (seconds). A test failure fires on test run (seconds to minutes). A runtime error fires on execution (seconds). Each level of check is cheaper and faster than the next, and they form a feedback ladder the agent can climb deliberately.

Compare this to a research agent: it produces an answer, and the only feedback is either "the user accepted it" (which the agent never sees in the same session) or "the eval judge graded it" (after the run, in a separate process). The agent itself doesn't learn within a single task; it can't self-correct because it has no signal that it's wrong.

A code agent has signal continuously. It can write the function, run the test, see the failure message, fix the function, rerun the test, see another failure, fix again — all within one task. The agent that uses this feedback well is far more effective than the agent that just generates and hopes.

The four properties together

These four properties — persistent state, programmable verification, bounded action space, tight feedback loops — combine into a specific shape that other agent types don't have. Code agents are convergent systems: they keep iterating until a deterministic check passes. Research agents are divergent systems: they explore until a deadline or budget runs out.

┌─────────────────────────────────────────────────────────────┐
│  RESEARCH AGENT vs. CODE AGENT                              │
│                                                             │
│  State: in conversation          State: on disk             │
│  Verification: LLM judge         Verification: tests/build  │
│  Action space: open-ended        Action space: bounded      │
│  Feedback: end of task           Feedback: continuous       │
│                                                             │
│  Optimizes: helpful synthesis    Optimizes: green build     │
│  Failure mode: hallucination     Failure mode: red test     │
│  Recovery: re-search             Recovery: re-edit + rerun  │
└─────────────────────────────────────────────────────────────┘

This is why patterns that work for research agents (LLM-as-judge evaluation, RAG-heavy designs) need adaptation for code, and patterns that work for code agents (test-driven verification, file-level state) don't transfer directly to research. The chapter from here on focuses on the code-specific shape; later chapters in Part IV cover Computer Use, Research, and Multi-Agent variants.

Question

Is "code agent" different from "AI pair programmer" like Copilot?

Yes — meaningfully. Copilot (and the 2021–2023 generation of "AI coding assistants") was autocomplete on steroids: the model suggested completions inline as you typed. The user wrote most of the code; the AI filled in tedious bits. This is a useful tool, but it's not an agent. The user is still the one driving — making decisions about what to build, navigating the codebase, running tests.

A code agent inverts that relationship. The user states a goal ("add OAuth login to this app") and the agent drives — it reads the existing code to understand structure, edits multiple files, runs the tests, fixes failures, iterates. The user reviews the result rather than authoring it. Same models underneath, very different system around them.

The four properties in this step are what make it an agent rather than a completion tool. Persistent filesystem state, self-verification, bounded action space, feedback loops: inline-completion Copilot has none of these; Claude Code, Cursor's agent mode, Aider, and the Agent SDK all do.

Question

"Self-verification via tests" sounds great in theory. Doesn't it break when there are no tests, or when the tests are bad?

It does. The architectural advantage of code agents over research agents is real when the verification ladder works: tests exist, run fast, cover the changes you're making, and accurately reflect correctness. When any of those breaks down, code agents lose much of their edge. A codebase with no tests is a codebase where the agent is generating and hoping just like a research agent.

The practical consequence: code agents are most effective on well-tested codebases, and one of the most valuable things you can do before turning a code agent loose on a project is shore up the test suite. Conversely, an agent that contributes a feature also needs to contribute tests for that feature — otherwise the next agent (or human) working on it has no signal.

Step 3 of this chapter covers what to do when tests are weak: cheaper checks (type checkers, linters, the build itself) form lower rungs of the verification ladder that still provide useful signal when full tests don't.

Question

"Bounded action space" sounds restrictive. What about agents that need to make external API calls, query databases, etc.?

The boundedness is about what tools you grant, not about what's theoretically possible. A code agent absolutely can have a curl tool, a psql tool, an HTTP-fetch tool — but each of those is an explicit grant in the configuration, not a default. The point isn't that code agents can't reach beyond the filesystem; it's that the reach is enumerable and configurable per-task.

Contrast with a free-form research agent: by default it has web search, web fetch, and possibly other "general" tools that span unknown territory. Constraining a research agent to only-touch-these-domains is harder than configuring a code agent to only-edit-these-files because the constraint surface is more open.

STEP 2

The action space: edit files, run commands, read what you wrote.

The most common mistake when building a code agent: thinking the tool surface is "generate code." It isn't. The right tool surface is the set of operations a developer performs on a project — read a file, edit a file, search for text across files, navigate the directory tree, run commands. The model produces code as the content of edit operations, but the operations themselves are filesystem actions. This distinction is what separates an agent that operates on a real project from a glorified single-prompt code generator.

The six tools that cover 95% of code-agent work

Almost every modern code agent — Claude Code, Aider, Cline, the Agent SDK reference implementation — converges on roughly the same tool set. The exact names vary; the shapes are nearly identical.

read_file: Load file contents into context. Read with optional line range so partial reads of huge files are cheap.
write_file: Create a new file or overwrite. For new files only; using it to "edit" existing files is an anti-pattern (see "Why str_replace beats 'rewrite the file'" below).
edit_file (str_replace): Replace one substring with another. The workhorse. Atomic, reviewable, minimizes context churn.
glob: Find files by name pattern. "Find all *.test.ts files under src/". Cheap and very high signal.
grep: Search file contents. "Find references to OAuthProvider across the codebase." Regex-aware.
bash / run_command: Run shell commands. Tests, builds, type-checks, package install, git operations. The verification ladder.

You'll see additional tools in production agents — task tracking, sub-agent dispatch, structured search — but these six are the foundation. Everything else is an optimization or specialization on top.

Why str_replace beats "rewrite the file"

One design decision is worth its own discussion because it's where naive implementations go wrong and the right answer isn't obvious: how the agent edits an existing file.

The naive approach: the agent reads the file, generates the new version, writes it back. Three failure modes follow.

Failure 1: context cost. Reading a 500-line file costs ~3K tokens. Writing it back costs another ~3K output tokens. Doing this for 10 edits across 5 files in a single agent run burns through 30K+ tokens of work that could have been a few hundred tokens of str_replace operations. At scale this is real money.

Failure 2: silent regressions. The agent reads the file, makes the edit it intended, and re-emits the file. But it also rewrites the import order, drops a comment it thought was redundant, or modifies an unrelated function "for consistency." The user opens git diff and sees changes they didn't ask for. The trust hit is severe even when the changes are technically improvements.

Failure 3: review burden. A diff of "old file (500 lines) → new file (497 lines)" is unreviewable in detail. The user can't tell what changed without doing their own diff. Compare to a str_replace with a 5-line old_str and a 7-line new_str — that's an atomic, scoped, reviewable change. Code review tools display it natively. The user knows exactly what to look at.

The fix is the str_replace operation: the agent passes the exact substring to find and the exact substring to replace it with. The tool implementation finds the substring (failing if it's not unique), replaces it, and writes the file. Three properties fall out:

Atomic. Either the substring matched and the replacement was applied, or the operation failed with an error. No partial updates.
Minimal context. The agent only needs to read enough of the file to identify a unique substring around the change site. Often a 20-line window of context is enough; full-file reads become rare.
Reviewable. The operation's old_str and new_str are the diff. Audit logs of code agent runs are readable; what changed is right there.

This is why every modern code agent ships str_replace (or an equivalent, often called edit_file or apply_diff) as its primary editing primitive, and reserves write_file for new files only.
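To put rough numbers on the context-cost argument above, here is the arithmetic as a small script. The per-line token rate is an assumption derived from the "500 lines is about 3K tokens" figure in the text; the 20-line read window and the 5-line and 7-line strings are the sizes mentioned above. Real token counts vary by language and tokenizer.

```python
TOKENS_PER_LINE = 6  # assumption: ~3K tokens / 500 lines, as in the text above


def rewrite_cost(file_lines: int, edits: int) -> int:
    """Tokens spent if every edit reads the whole file and writes it back."""
    per_edit = 2 * file_lines * TOKENS_PER_LINE  # full read + full re-emit
    return edits * per_edit


def str_replace_cost(window_lines: int, old_lines: int, new_lines: int, edits: int) -> int:
    """Tokens spent if every edit reads a small window and emits old_str + new_str."""
    per_edit = (window_lines + old_lines + new_lines) * TOKENS_PER_LINE
    return edits * per_edit


full = rewrite_cost(file_lines=500, edits=10)
scoped = str_replace_cost(window_lines=20, old_lines=5, new_lines=7, edits=10)
print(full, scoped, round(full / scoped))  # prints: 60000 1920 31
```

Under these assumptions the rewrite path spends about 60K tokens across ten edits (30K of them output), against roughly 2K for scoped replacements.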

The minimal skeleton: 150 lines of code agent

Pulling all six tools together with the agent loop from chapter 1.1, here's what a working code agent looks like at minimum. This is not Claude Code or the Agent SDK — it's the shape underneath them. Building this yourself is the right exercise for understanding the design.

# agent/code_agent.py
import os, subprocess, glob as globmod, re
from anthropic import AsyncAnthropic

REPO_ROOT = os.environ.get("AGENT_REPO_ROOT", os.getcwd())
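client = AsyncAnthropic()

# Placeholder: the real prompt carries project conventions, test commands, and layout.
SYSTEM_PROMPT = "You are a code agent working inside a single repository."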

# --- Tool definitions (full descriptions omitted for brevity; see chapter 0.3) ---
TOOLS = [
    {"name": "read_file",    "description": ..., "input_schema": {...}},
    {"name": "write_file",   "description": ..., "input_schema": {...}},
    {"name": "str_replace",  "description": ..., "input_schema": {...}},
    {"name": "glob",          "description": ..., "input_schema": {...}},
    {"name": "grep",          "description": ..., "input_schema": {...}},
    {"name": "bash",          "description": ..., "input_schema": {...}},
]

# --- Handlers ---
def safe_path(path: str) -> str:
    """Resolve path within the repo root; refuse to escape it."""
    full = os.path.realpath(os.path.join(REPO_ROOT, path))
    if not full.startswith(os.path.realpath(REPO_ROOT) + os.sep):
        raise ValueError(f"Path {path!r} escapes repo root")
    return full

async def read_file(path: str, start_line: int = 1, end_line: int = -1) -> str:
    with open(safe_path(path)) as f:
        lines = f.readlines()
    end = len(lines) if end_line == -1 else end_line
    return "".join(f"{i:5d}  {ln}"
                    for i, ln in enumerate(lines[start_line-1:end], start_line))

async def write_file(path: str, content: str) -> str:
    p = safe_path(path)
    if os.path.exists(p):
        raise ValueError(f"{path} exists; use str_replace to edit")
    os.makedirs(os.path.dirname(p), exist_ok=True)
    with open(p, "w") as f: f.write(content)
    return f"Created {path} ({len(content)} bytes)"

async def str_replace(path: str, old_str: str, new_str: str) -> str:
    p = safe_path(path)
    with open(p) as f: text = f.read()
    count = text.count(old_str)
    if count == 0:
        raise ValueError(f"old_str not found in {path}")
    if count > 1:
        raise ValueError(f"old_str matches {count} times; add context to make it unique")
    with open(p, "w") as f: f.write(text.replace(old_str, new_str))
    return f"Replaced 1 occurrence in {path}"

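async def glob_files(pattern: str) -> str:
    root = os.path.realpath(REPO_ROOT)
    matches = globmod.glob(os.path.join(root, pattern), recursive=True)
    # Drop anything a pattern like "../*" resolved outside the repo root.
    inside = [m for m in matches if os.path.realpath(m).startswith(root + os.sep)]
    return "\n".join(os.path.relpath(m, root) for m in inside[:200]) or "(no matches)"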

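async def grep_files(pattern: str, glob_filter: str = "**/*") -> str:
    rx = re.compile(pattern)
    root = os.path.realpath(REPO_ROOT)
    hits = []
    for path in globmod.glob(os.path.join(root, glob_filter), recursive=True):
        if len(hits) >= 100: break          # cap applies across files, not per file
        if not os.path.isfile(path): continue
        if not os.path.realpath(path).startswith(root + os.sep): continue
        rel = os.path.relpath(path, root)
        try:
            with open(path) as f:           # close the handle when done
                for i, line in enumerate(f, 1):
                    if rx.search(line):
                        hits.append(f"{rel}:{i}: {line.rstrip()}")
                        if len(hits) >= 100: break
        except (UnicodeDecodeError, OSError): continue   # skip binary or unreadable files
    return "\n".join(hits) or "(no matches)"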

async def run_bash(command: str, timeout: int = 120) -> str:
    p = subprocess.run(command, shell=True, cwd=REPO_ROOT,
                       capture_output=True, text=True, timeout=timeout)
    return f"exit={p.returncode}\nstdout:\n{p.stdout}\nstderr:\n{p.stderr}"

HANDLERS = {"read_file": read_file, "write_file": write_file,
            "str_replace": str_replace, "glob": glob_files,
            "grep": grep_files, "bash": run_bash}

# --- The loop (the same shape from chapter 1.1) ---
async def run_code_agent(task: str, max_steps: int = 40):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = await client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            system=SYSTEM_PROMPT,   # <-- conventions, test commands, project layout
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            return response

        results = []
        for block in response.content:
            if block.type != "tool_use": continue
            try:
                output = await HANDLERS[block.name](**block.input)
                results.append({"type": "tool_result",
                                "tool_use_id": block.id, "content": output})
            except Exception as e:
                results.append({"type": "tool_result",
                                "tool_use_id": block.id,
                                "content": f"Error: {e}",
                                "is_error": True})
        messages.append({"role": "user", "content": results})

    raise RuntimeError("step budget exceeded")

That's a working code agent. Roughly 100 lines plus tool descriptions; the structure is identical to the agent loop from Build, with code-specific tools. Adding a streaming layer (chapter 2.4), prompt caching on the system prompt (chapter 2.2), and observability (chapter 2.1) gets you to production shape. None of those additions change the core architecture; they just make it run well.
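The skeleton elides the tool descriptions. For orientation, here is a sketch of what a single filled-in entry could look like, using the `name` / `description` / `input_schema` shape from the TOOLS list above with a JSON Schema body. The description wording is illustrative, not taken from any shipped agent.

```python
# Hypothetical filled-in definition for one tool; the wording is an example only.
STR_REPLACE_TOOL = {
    "name": "str_replace",
    "description": (
        "Replace exactly one occurrence of old_str in the file at path with new_str. "
        "Fails if old_str is missing or matches more than once; in that case, "
        "include more surrounding lines to make it unique."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "File path relative to the repo root."},
            "old_str": {"type": "string", "description": "Exact text to find; must match once."},
            "new_str": {"type": "string", "description": "Text that replaces old_str."},
        },
        "required": ["path", "old_str", "new_str"],
    },
}
```

The property names match the handler's keyword arguments, which is what lets the loop dispatch with `HANDLERS[block.name](**block.input)`.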

The system prompt is where most of the work happens

One subtle thing: in a code agent, the system prompt does heavy lifting that's often invisible. It tells the agent:

What conventions the project uses (TypeScript strict mode, prefer functional style, conventional commits, etc.)
How to run the tests (npm test vs pytest vs cargo test)
How to run the type-checker, the linter, the build
What directories are off-limits
What workflow to follow (small commits, run tests after every change, etc.)

Claude Code reads this from a CLAUDE.md file at the repo root. Aider reads it from .aider.conf.yml and convention files. The Agent SDK takes it as a system prompt option, and the hand-rolled loop above passes it as system=. The mechanism varies; the purpose is the same: tell the agent how this specific project wants to be worked on. A code agent without project conventions in its system prompt is a generic developer dropped into your repo; with conventions it's a contributor who knows the house style.

The discipline: treat your CLAUDE.md (or equivalent) as documentation that compounds. Every time you have to correct the agent's behavior on a recurring issue ("we use 2-space indent, not 4"), add a line to the conventions file. Over a few weeks, the agent's hit rate on first attempt climbs from acceptable to excellent.
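For a hand-rolled loop like the skeleton above, loading the conventions file can be a few lines. This is a sketch under assumptions: `build_system_prompt` and `BASE_PROMPT` are hypothetical names, and this is not a description of how Claude Code itself loads CLAUDE.md.

```python
from pathlib import Path

BASE_PROMPT = "You are a code agent working inside one repository."  # placeholder wording


def build_system_prompt(repo_root: str, conventions_file: str = "CLAUDE.md") -> str:
    """Append the project's conventions file to a base prompt, if one exists."""
    path = Path(repo_root) / conventions_file
    if not path.is_file():
        return BASE_PROMPT  # no conventions yet: fall back to the generic prompt
    conventions = path.read_text(encoding="utf-8").strip()
    return f"{BASE_PROMPT}\n\n# Project conventions ({conventions_file})\n\n{conventions}"
```

Because the file is re-read on each run, every line added after a correction takes effect on the next task with no code change.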

Question

Why grep and glob as separate tools? Couldn't bash do both with the right command?

It could. But there are three reasons to expose them as first-class tools rather than letting the agent find/grep via bash.

First, structured output: glob and grep handlers can return results in a consistent format (path:line: text) that's easy for the model to parse and act on. The model doesn't have to figure out whether the bash output has a header, trailing newlines, etc.

Second, controllable: handlers can enforce result limits, redact sensitive paths, and filter binary files automatically. A raw bash grep -r can produce a multi-megabyte dump if the pattern is too broad; a wrapped handler caps it at 100 hits or 50KB.

Third, safer: granting bash access is a security decision (the agent can now run arbitrary commands); granting glob and grep on their own adds only read access inside the repo. Many production deployments enable read tools without enabling bash, exactly for this reason.
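The "controllable" point can be sketched as a small output-capping helper that any handler calls before returning. The defaults mirror the 100-hit and 50KB figures above; `cap_output` is a hypothetical name.

```python
def cap_output(text: str, max_lines: int = 100, max_bytes: int = 50_000) -> str:
    """Truncate tool output to a line and byte budget, and say so when it happens."""
    lines = text.splitlines()
    kept, used = [], 0
    for line in lines[:max_lines]:
        size = len(line.encode("utf-8")) + 1  # +1 for the newline
        if used + size > max_bytes:
            break
        kept.append(line)
        used += size
    if len(kept) < len(lines):
        # Tell the model the result is partial so it can narrow the pattern.
        kept.append(f"[truncated: showing {len(kept)} of {len(lines)} lines]")
    return "\n".join(kept)
```

The explicit truncation marker matters: without it the model cannot tell a complete result from a clipped one.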

Question

str_replace fails when old_str isn't unique. What happens when the agent needs to make the same edit in multiple places?

Three options, ordered by how often production agents use them.

Most common: the agent makes the old_str more specific so it becomes unique. If you want to rename a variable that appears in 5 places, you don't try to replace just the name; you replace each instance together with 2-3 lines of surrounding context, one at a time. The agent learns this pattern quickly from the failure message.

Sometimes useful: a separate str_replace_all tool (or a replace_all flag on str_replace) that signals "I know this matches multiple times, replace all of them." Risky because it can over-match; usually not worth the footgun.

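For renames specifically: use a real refactoring tool via bash (a language-server-based rename, or a scripted replacement such as rg -l piped into sed -i). Note that rg --replace only rewrites ripgrep's printed output; it does not modify files. The agent should learn that renames are a structured operation, not a text substitution.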
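If you do expose a replace-all variant, one way to blunt the over-match risk is to make the caller state how many matches it expects. The `expected_count` parameter below is a hypothetical design, not something the tools named in this chapter are known to implement, and the function works on text rather than a file to keep the sketch self-contained.

```python
def str_replace_all(text: str, old_str: str, new_str: str, expected_count: int) -> str:
    """Replace every occurrence, but only if the caller predicted the match count."""
    if not old_str:
        raise ValueError("old_str must not be empty")
    count = text.count(old_str)  # non-overlapping matches, same as str.replace
    if count != expected_count:
        raise ValueError(
            f"old_str matches {count} times, expected {expected_count}; "
            "re-read the file before retrying"
        )
    return text.replace(old_str, new_str)
```

This is still plain text substitution: it will match `timeout` inside `timeout_ms`, which is why a structured rename remains the safer choice for identifiers.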

Question

What about edits that span large code blocks — say, replacing a 200-line function?

str_replace handles this fine: old_str and new_str can be hundreds of lines. The constraint is that old_str has to appear exactly once in the file (which a 200-line block almost certainly does).

The token cost question is real, though: replacing 200 lines means the agent has 200 lines of old_str and 200 lines of new_str in its output for that single tool call. For very large refactors, sometimes the cheaper approach is to write a new file alongside the old one, then update the imports. Or write a small script that does the transformation and run it via bash. The agent picks the right approach if its system prompt gives it the latitude.

STEP 3

The verification loop is the whole game.

If you take one lesson from this chapter, take this: a code agent that doesn't verify its own work isn't a code agent. It's a research agent that produces plausible-looking code. The difference between the two — between "the agent generated this function" and "the agent generated this function, ran the tests, saw a failure, fixed it, and got to green" — is the entire reason code agents work better than chat agents at coding.

This step is about how to structure that verification loop and how to make sure the agent actually uses it.

The verification ladder

Not all verification is equally cheap. A type-check is fast; a test run might take a minute; a full integration test might take five. The agent should use the cheapest check that gives signal on what it just did, climbing the ladder only when needed.

┌─────────────────────────────────────────────────────────────┐
│  THE VERIFICATION LADDER (cheap → expensive)                │
│                                                             │
│  Rung 0: File written successfully (str_replace returned)   │
│  Rung 1: Syntax / parser check (cheapest signal)            │
│          ─ Python: `python -c "import ast; ast.parse(...)"` │
│          ─ TS/JS:  `tsc --noEmit`                           │
│          ─ Rust:   `cargo check`                            │
│  Rung 2: Linter / formatter                                 │
│          ─ ruff, eslint, golangci-lint                      │
│  Rung 3: Type check (when separate from compile)            │
│          ─ mypy, pyright, flow                              │
│  Rung 4: Unit tests (fast subset first)                     │
│          ─ pytest tests/unit/                               │
│          ─ vitest run --reporter=basic                      │
│  Rung 5: Full test suite                                    │
│  Rung 6: Integration / e2e tests (slowest, most signal)     │
│                                                             │
│  Each rung is cheaper than the next AND catches different   │
│  bugs. Skipping rungs means catching bugs later, when       │
│  the feedback loop is slower.                               │
└─────────────────────────────────────────────────────────────┘

The instinct most developers have — and the right instinct for the agent — is "run the cheapest check that could conceivably catch what I just broke." After editing a function signature: run the type check. After editing test setup: run one test. After a multi-file refactor: run the full suite. The agent that does this well finishes tasks 3–5× faster than the one that runs the full suite after every change.

The system prompt should teach this. Concretely:

# From CLAUDE.md, system prompt for the agent

# Verification workflow

After making any change, run the appropriate verification:

- **Single-function change in a typed language**: `tsc --noEmit` (or
  `mypy src/`).  Type errors here block everything else.
- **Behavioral change to a single module**: run that module's tests.
  `pytest tests/path/to/module/ -x`
- **Cross-cutting change** (touches more than 3 files): run the full
  unit suite. `npm run test:unit`
- **Before declaring "done"**: full test suite passes, including
  integration. `npm test`

Always run the verification before reporting the task complete.
If verification fails, fix the failure and verify again.  Do not
report success on the basis of "the code looks right".

That last sentence — "do not report success on the basis of 'the code looks right'" — is doing real work. It explicitly closes off the failure mode where the agent generates code, doesn't run anything, and reports back "done." A reasonable-sounding but unverified completion is the most common failure mode of unsupervised code agents.

What goes wrong when verification is skipped

Three failure modes to recognize, all variants of "the agent said it was done but it wasn't":

Confident-but-broken edits. The agent reads the relevant file, edits it carefully, and reports completion. The user pulls the change, runs the tests, and sees four failures. The agent's edit was wrong in a way that's obvious from running the tests — but the agent didn't run them. This is the failure that the verification-ladder discipline is designed to prevent.

Edited the wrong thing. The agent searched for a function name, found it in two files (one being a test file), and edited the wrong copy. Running the tests would have shown the actual function wasn't modified. Skipping the test run hides the mistake until the user discovers it.

Forgot the side effect. The agent modified a function and the tests pass — but the type checker would have flagged a downstream caller that needs to be updated. The agent ran the tests (good!) but not the type checker (bad). The whole ladder matters.

The pattern: each rung of the ladder catches different bugs. Skipping rungs means bugs sneak through. The agent that runs only the cheapest check ships fast but ships broken; the agent that runs all rungs after every micro-change is glacially slow but reliable. The right behavior is climbing the ladder appropriately for each change — and that's a discipline the system prompt can encode and the tool descriptions can reinforce.
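One way to encode "climb the ladder, stop at the first failure" outside the prompt is a small runner that the bash tool or a wrapper script can call. This is a sketch: `run_ladder` and `RungResult` are hypothetical names, and the commands in `PYTHON_LADDER` are placeholders that assume ruff, mypy, and pytest are installed and that the project uses `src/` and `tests/unit/`.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class RungResult:
    name: str
    passed: bool
    output: str


def run_ladder(rungs: list, cwd: str = ".", timeout: int = 300) -> list:
    """Run (name, argv) checks cheapest-first and stop at the first failing rung."""
    results = []
    for name, argv in rungs:
        proc = subprocess.run(argv, cwd=cwd, capture_output=True, text=True, timeout=timeout)
        passed = proc.returncode == 0
        results.append(RungResult(name, passed, proc.stdout + proc.stderr))
        if not passed:
            break  # fix this rung before paying for the slower ones
    return results


# Example ladder for a Python project; swap in your project's own commands.
PYTHON_LADDER = [
    ("syntax", [sys.executable, "-m", "compileall", "-q", "src"]),
    ("lint", ["ruff", "check", "src"]),
    ("types", ["mypy", "src"]),
    ("unit", ["pytest", "tests/unit", "-x"]),
]
```

Returning each rung's output alongside the pass flag lets the agent report what was run and what it printed, not only an exit code.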

The fast-feedback loop in action

What the verification loop looks like in a real agent trace — this is the kind of pattern you want to see in the agent's behavior:

══ TASK: Fix the bug where the date parser crashes on empty strings ══

[turn 1: agent]
→ grep "parse_date" --glob "**/*.py"

[result]
src/datetime/parser.py:12: def parse_date(s: str) -> date:
tests/datetime/test_parser.py:34: def test_parse_date_iso():
[further matches trimmed]

[turn 2: agent]
→ read_file src/datetime/parser.py

[result]
[20 lines of code, parse_date function visible]

[turn 3: agent]
→ read_file tests/datetime/test_parser.py 30 60

[result]
[shows existing tests; notes no empty-string test]

[turn 4: agent]
→ bash "pytest tests/datetime/test_parser.py -x"

[result]
3 passed in 0.4s
  // agent confirms baseline is green before changing anything

[turn 5: agent]
→ str_replace tests/datetime/test_parser.py
   old_str: def test_parse_date_iso():\n    assert parse_date("2025-01-01") == ...
   new_str: def test_parse_date_iso():\n    assert parse_date("2025-01-01") == ...\n\n
            def test_parse_date_empty_string_raises():\n    with pytest.raises(ValueError):\n
                parse_date("")

[result]
Replaced 1 occurrence in tests/datetime/test_parser.py

[turn 6: agent]
→ bash "pytest tests/datetime/test_parser.py::test_parse_date_empty_string_raises -x"

[result]
FAILED tests/datetime/test_parser.py::test_parse_date_empty_string_raises -
IndexError: string index out of range
  // agent: good — test fails for the right reason. now fix it.

[turn 7: agent]
→ str_replace src/datetime/parser.py
   old_str: def parse_date(s: str) -> date:\n    if s[4] != "-":
   new_str: def parse_date(s: str) -> date:\n    if not s:\n        raise ValueError(
            "empty date string")\n    if s[4] != "-":

[result]
Replaced 1 occurrence in src/datetime/parser.py

[turn 8: agent]
→ bash "pytest tests/datetime/test_parser.py -x"

[result]
4 passed in 0.4s
  // agent: green. one more check.

[turn 9: agent]
→ bash "mypy src/datetime/"

[result]
Success: no issues found in 1 source file
  // agent: clean. done.

[turn 10: agent]
"Done. Added empty-string validation to parse_date with a test that
verifies the ValueError is raised. Tests pass, mypy clean."

Notice what the agent didn't do: it didn't read the whole codebase. It didn't write a 500-line refactor. It didn't make multiple edits across files before running anything. It made one targeted change, verified it, observed the failure (which was the expected failure of the new test), made one more targeted change, verified it, and stopped.

This is the texture of effective code-agent work — small steps, fast feedback, fail-and-fix loops. The system prompt and the tool design both push toward this rhythm.

Test-first as a workflow choice for agents

Notice also that the agent wrote the failing test before writing the fix. This isn't accidental — it's the test-driven discipline applied to agents. The advantages, all amplified for agents:

The test is the spec. Writing the test first forces the agent to specify behavior precisely. If the agent can't write the test, it doesn't understand the requirement well enough to write the code.
Definition of done is mechanical. Task is complete when the new test passes and the old tests still pass. No ambiguity about whether the agent "really" finished.
Regression protection is automatic. The test that just passed is the test that catches the bug if it recurs. No separate "add regression test" step.

You don't have to enforce TDD universally — for simple changes it's overhead. But for bug fixes specifically, the pattern "reproduce as a failing test, fix until passing" is the cleanest discipline and the easiest one to encode in the system prompt.
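As a minimal sketch of that pattern, here is the empty-string case from the trace above written out as plain Python. Only the `if not s` guard and the `year = int(s[0:4])` line come from the trace; the rest of `parse_date` (a `YYYY-MM-DD` layout) and the test name are assumptions made for illustration.

```python
from datetime import date


def parse_date(s: str) -> date:
    """Parse a YYYY-MM-DD string (the layout is an assumption for this sketch)."""
    if not s:
        # The fix from the trace: reject empty input with an explicit message.
        raise ValueError("empty date string")
    year = int(s[0:4])
    month = int(s[5:7])
    day = int(s[8:10])
    return date(year, month, day)


def test_parse_date_rejects_empty_string() -> None:
    """Regression test: written first, and failing until the guard exists."""
    try:
        parse_date("")
    except ValueError as exc:
        # Check the message, not just the exception type.
        assert "empty date string" in str(exc)
    else:
        raise AssertionError("parse_date('') should raise ValueError")
```

Asserting on the message matters in this sketch: without the guard, `int("")` already raises a ValueError with a different message, so a test that only checked the exception type would pass before the fix and prove nothing.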

The trap: faking verification

The most insidious failure mode: the agent runs something that returns success but doesn't actually verify what you wanted.

Examples you'll see:

Agent runs pytest -k test_nothing_relevant and reports "tests pass."
Agent runs npm test, which silently skips the failing test because of a misconfigured matcher.
Agent runs true as a placeholder when the test command isn't available, gets exit code 0, declares success.
Agent modifies the test to make it pass instead of fixing the code. The test "passes" but doesn't test what it should.

Defenses against these:

Inspect outputs, don't trust exit codes alone. The agent's prompt should require reporting what was run and what the output was, not just "tests passed." Reviewers (human or LLM) should be able to verify the right thing was run.
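One way a harness might inspect output rather than trust the exit code is sketched below. It assumes pytest-style summary text such as "4 passed in 0.4s"; the function names and the strict "no skips, no deselections" policy are hypothetical choices for illustration, not something pytest or Claude Code enforces.

```python
import re
from dataclasses import dataclass


@dataclass
class Summary:
    passed: int = 0
    failed: int = 0
    skipped: int = 0
    deselected: int = 0


# Matches fragments such as "4 passed" or "2 skipped" in a pytest summary line.
_COUNT = re.compile(r"(\d+) (passed|failed|skipped|deselected)")


def parse_summary(output: str) -> Summary:
    """Pull the per-outcome counts out of the reported test output."""
    summary = Summary()
    for count, label in _COUNT.findall(output):
        setattr(summary, label, int(count))
    return summary


def looks_like_real_verification(exit_code: int, output: str, min_tests: int = 1) -> bool:
    """Accept a run only if tests actually ran and none were skipped or filtered out."""
    s = parse_summary(output)
    return (
        exit_code == 0
        and s.failed == 0
        and s.passed >= min_tests
        and s.skipped == 0
        and s.deselected == 0
    )
```

A placeholder like `true` exits 0 with no output, so it fails the `min_tests` check; a `-k` filter that deselects tests fails the deselection check. It does not catch an agent that weakens the test itself, which is what diff review is for.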

Make verification commands explicit in CLAUDE.md. Don't let the agent guess "what's the test command for this project?" — tell it. "npm run test:strict is the verification command. Other test commands skip slow tests and shouldn't be used for verification."

Add a meta-check. For high-stakes changes, the agent's final step should be a diff of the test outputs: "Before my change, X passed. After my change, Y passed. Y must include X plus the new test, not be a different set." This forces the agent to compare baseline-vs-final, not just look at the final output in isolation.
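The baseline-vs-final comparison reduces to set arithmetic over test identifiers. A minimal sketch, assuming you can collect the IDs of passing tests before and after the change (how you collect them is up to your test runner; `check_test_sets` is a hypothetical helper):

```python
from __future__ import annotations


def check_test_sets(
    baseline_passed: set[str],
    final_passed: set[str],
    new_tests: set[str],
) -> list[str]:
    """Return a list of problems; an empty list means the meta-check passed."""
    problems: list[str] = []

    # Every test that passed before the change must still pass afterwards.
    lost = baseline_passed - final_passed
    if lost:
        problems.append(f"no longer passing: {sorted(lost)}")

    # Every test the agent claims to have added must be present and passing.
    missing = new_tests - final_passed
    if missing:
        problems.append(f"new tests not passing: {sorted(missing)}")

    return problems
```

An agent that "fixed" a failure by deleting or renaming a test shows up here as a lost baseline test, even though the final run on its own looks green.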

Human review of the diff, always. For agent work that ships to production, a human looks at git diff before merge. This is not because the agent is untrustworthy, but because misplaced trust is expensive: a subtly faked test pass can ship a bug. This isn't a code-agent-specific point, but it's where the value compounds: a human reviewing 30 lines of diff is far more efficient than a human authoring those 30 lines from scratch.

If your code agent reports "done" without showing a verification step in its trace, treat that as a red flag. Either it didn't run the verification (so you don't know if the change works), or it ran something that doesn't actually verify what you care about (so you don't know what it ran). In either case, the right move is to require the verification step explicitly and treat agents that skip it as broken — not as "fast."

Question

What about codebases where running the tests takes 20+ minutes? The verification loop sounds nice but it's not free.

Two responses, depending on whether the slow tests are essential or accidental.

Essential slow tests (integration, e2e, browser tests) usually sit alongside a fast unit subset that runs in seconds. The agent uses the fast subset during the inner loop (every change verifies against unit tests), and runs the slow tests only at "task complete" gates. The slow tests still catch what they catch; they just don't slow down the inner-loop rhythm.

Accidental slow tests (slow because of bad fixtures, unnecessary DB rebuilds, missing parallelism) are a separate problem. The code agent isn't the right tool to fix them, but it might be the right pressure to force fixing them — "the agent's iteration speed is gated on this; we need it to be faster" is a legitimate engineering priority.

One specific trick for slow suites: parallel test execution where the framework supports it. Most modern test runners (pytest-xdist, vitest, etc.) can fan out across cores, which can cut a 20-minute suite to a few minutes without changing what the tests check, provided the tests don't depend on shared state or execution order.

Question

What if the codebase has zero tests? Does a code agent become useless?

Less useful, definitely, but not useless. You lose the most reliable verification rung, but the others remain: syntax/parse, type check, linter, build. These catch a meaningful fraction of bugs — anything that doesn't compile or doesn't type-check stays out of the codebase regardless of whether tests exist.
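As a rough sketch of how those remaining rungs might be chained: each rung is a callable that reports success plus its output, and the ladder stops at the first failure. Only the syntax rung is implemented here (using Python's `ast` module); in a real project the type-check, lint, and build rungs would shell out to your own tools. `run_ladder` and `syntax_rung` are hypothetical names.

```python
import ast
from typing import Callable

# A rung is a name plus a check returning (ok, output).
Rung = tuple[str, Callable[[], tuple[bool, str]]]


def syntax_rung(source: str) -> tuple[bool, str]:
    """Cheapest rung: does the source even parse?"""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, "parsed"


def run_ladder(rungs: list[Rung]) -> list[tuple[str, bool, str]]:
    """Run rungs in order, stopping at the first failure."""
    results: list[tuple[str, bool, str]] = []
    for name, check in rungs:
        ok, output = check()
        results.append((name, ok, output))
        if not ok:
            break  # Later rungs are pointless until this one is fixed.
    return results
```

Keeping the output alongside the boolean means the agent's report can show what each rung said, not only that it passed.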

The agent can also be tasked with creating tests as a separate step before doing the real work. "Add a test that demonstrates the current behavior" is a sensible first turn before "now change the behavior." After a few rounds, you have the test scaffolding the codebase was missing.

One pattern worth knowing: agents are often very effective at writing characterization tests (tests that capture what the code does, regardless of whether that's correct). These are cheap to write and create a safety net for refactors. A code agent can produce hundreds of these from observation, then human review picks the ones to keep.
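A characterization test can be as simple as recording what a function does today and flagging any drift later. The sketch below assumes single-argument functions and uses a made-up `legacy_slug` function as the code under observation; all names here are hypothetical.

```python
from typing import Any, Callable, Iterable


def observe(fn: Callable[[Any], Any], value: Any) -> str:
    """Describe what fn does for one input: its return value or its exception type."""
    try:
        return f"returns {fn(value)!r}"
    except Exception as exc:
        return f"raises {type(exc).__name__}"


def characterize(fn: Callable[[Any], Any], inputs: Iterable[Any]) -> list[tuple[Any, str]]:
    """Record current behavior for each input, correct or not."""
    return [(value, observe(fn, value)) for value in inputs]


def drift(fn: Callable[[Any], Any], recorded: list[tuple[Any, str]]) -> list[Any]:
    """Return the inputs whose behavior no longer matches the recording."""
    return [value for value, outcome in recorded if observe(fn, value) != outcome]


def legacy_slug(title: str) -> str:
    """Stand-in for untested legacy code."""
    return title.strip().lower().replace(" ", "-")
```

The recording makes no claim that the behavior is right. It only pins it down, so a refactor that changes it gets noticed.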

Question

Should the agent run tests in parallel with making more edits, or strictly sequential?

Strictly sequential, almost always. The reason: parallelism here is false economy. The agent can't safely make a second edit before knowing whether the first one worked — if it did, the second edit might be unnecessary; if it didn't, the second edit might be wrong. Each cycle of edit → verify → react is dependent.

The exception: when verification is slow and the agent has a high-confidence next step that's logically independent. For example, the agent might run a long test suite while reading documentation in parallel. But running two simultaneous edits is asking for trouble — diff hell, conflicting changes, no clean rollback.

STEP 4

Skills, the Agent SDK, and the landscape you'll work in.

Steps 1–3 cover what a code agent is, architecturally. This step covers what you'll actually use: Anthropic's Skills system (the way Claude Code packages procedural knowledge), the Agent SDK (the Python/TypeScript library that wraps the agent loop), and the Managed Agents service (a hosted version for production). Plus the decision of when to build your own code agent vs. adopt one of these.

Skills: packaged procedural knowledge

Step 2's CLAUDE.md captures project conventions. Skills are the next layer up: portable bundles of instructions and resources that teach the agent how to do a specific kind of task, that work across projects and across the Claude product surface (Claude.ai, Claude Code, the API, the Agent SDK).

A skill is a folder. The folder must contain a SKILL.md file with YAML frontmatter. The frontmatter has two required fields: name and description. Everything else is optional, and the rest of the folder can contain whatever resources support the skill — scripts the agent can execute, reference docs the agent can load when needed, templates.

# skills/postgres-migration/SKILL.md
---
name: postgres-migration
description: Generate, review, and run PostgreSQL migrations following the
  project's conventions (alembic with autogenerate, named revisions, paired
  upgrade/downgrade). Use when the user asks to add/modify/remove columns,
  tables, indexes, or constraints, or when schema changes are needed.
---

# Postgres migration workflow

This project uses Alembic for migrations. All schema changes go through it.

## When to use

- Adding/removing/renaming columns
- Adding/dropping tables
- Adding/dropping indexes
- Modifying constraints

## Standard workflow

1. Inspect the current state: `alembic current` and `alembic history --verbose`
2. Generate the migration: `alembic revision --autogenerate -m "<short_desc>"`
3. **Review the generated migration**: autogenerate is imperfect. Specifically check:
   - Does it correctly detect type changes? (Often misses ENUM modifications)
   - Are downgrade() operations the inverse of upgrade()?
   - For renames, does it generate add+drop instead of an actual rename?
     (Fix manually — add+drop loses data.)
4. Run locally: `alembic upgrade head`
5. Run the test suite: migrations must not break tests
6. Test downgrade: `alembic downgrade -1 && alembic upgrade head`

## Special cases

- **Adding a non-nullable column to an existing table**: must include a
  default OR do this in two migrations (add nullable, backfill, alter not null).
  See examples/non_null_column.py for the template.
- **Large table operations**: `ALTER TABLE` on a large table locks it.
  Use the pg_repack pattern documented in references/large_table_ops.md.
- **Index creation**: always use `CREATE INDEX CONCURRENTLY` for production
  tables. The autogenerate skips the CONCURRENTLY hint; add it manually.

## What this skill does not cover

- Data migrations (logic, not schema): write a separate one-off script
- Production deployment: handed off to ops via SECURITY-RELEASE.md

This is a real shape — descriptive enough that Claude knows when to use it, opinionated enough to encode the project's actual workflow, with references to deeper documentation that get loaded only when relevant.
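To make the two required fields concrete, here is a small stdlib-only sketch that reads a SKILL.md frontmatter block and checks for `name` and `description`. It handles only flat `key: value` fields with indented continuation lines, as in the example above; it is not a YAML parser, and `parse_frontmatter` and `validate_skill` are hypothetical helpers rather than anything shipped with the Skills system.

```python
from __future__ import annotations


def parse_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md into (frontmatter fields, markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter block")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter block is not closed") from None

    fields: dict[str, str] = {}
    key = None
    for line in lines[1:end]:
        if line[:1].isspace() and key is not None:
            # Indented line: continuation of the previous field's value.
            fields[key] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            fields[key] = value.strip()

    body = "\n".join(lines[end + 1:]).strip()
    return {k: v.strip() for k, v in fields.items()}, body


def validate_skill(text: str) -> dict[str, str]:
    """Raise if either required field is missing or empty."""
    fields, _ = parse_frontmatter(text)
    missing = [name for name in ("name", "description") if not fields.get(name)]
    if missing:
        raise ValueError(f"missing required frontmatter fields: {missing}")
    return fields
```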

Progressive disclosure: why skills don't bloat context

The key design choice that makes skills scalable: progressive disclosure. The agent doesn't load every skill's full contents into context at startup. Instead, three levels:

┌────────────────────────────────────────────────────────────────┐
│  PROGRESSIVE DISCLOSURE                                        │
│                                                                │
│  Level 1 (always loaded):                                      │
│    Each skill's `name` and `description` from frontmatter.     │
│    Tens of tokens per skill. Tells the agent what exists.      │
│    Loaded into system prompt at session start.                 │
│                                                                │
│  Level 2 (loaded on match):                                    │
│    When a user request matches a skill's description, the      │
│    agent reads the full SKILL.md body. Hundreds to thousands   │
│    of tokens. Loaded as the first action when the skill        │
│    becomes relevant.                                           │
│                                                                │
│  Level 3 (loaded on demand):                                   │
│    References (examples/, references/) and executable scripts  │
│    are loaded only if the SKILL.md says to. Allows skills      │
│    to bundle large reference material without paying for it    │
│    on every session.                                           │
│                                                                │
│  Effect: 50 skills total cost ~2K tokens at startup,           │
│  not 200K. Each invocation costs only what that skill          │
│  actually needs.                                               │
└────────────────────────────────────────────────────────────────┘

This is the design choice that makes "agents with hundreds of specialized skills" practical. Without progressive disclosure, every skill you add would consume context for every session whether the skill was used or not; with it, skills are essentially free until invoked.
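The first two levels can be sketched as an index that always exposes metadata and loads a body only when asked. This is an illustration of the idea under simplifying assumptions, not how Claude Code or the SDK implements it: in practice the model decides when a description matches, and `Skill`, `SkillIndex`, and `activate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    description: str
    load_body: Callable[[], str]  # Level 2: called only when the skill is activated.


class SkillIndex:
    def __init__(self, skills: list[Skill]) -> None:
        self._skills = {skill.name: skill for skill in skills}
        self._loaded: dict[str, str] = {}

    def startup_prompt(self) -> str:
        """Level 1: one short line per skill, included in every session."""
        return "\n".join(f"- {s.name}: {s.description}" for s in self._skills.values())

    def activate(self, name: str) -> str:
        """Level 2: read the full body the first time the skill is needed."""
        if name not in self._loaded:
            self._loaded[name] = self._skills[name].load_body()
        return self._loaded[name]
```

The arithmetic in the box follows the same split: 50 skills at roughly 40 tokens of metadata each is about 2K tokens, while 50 full bodies at roughly 4K tokens each would be about 200K.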

The Agent SDK

The Agent SDK is Anthropic's library (Python and TypeScript) for building agents that look architecturally similar to Claude Code, with the same skills system, tool loop, and conventions — but for any task, not just coding. The SDK provides:

The agent loop, abstracted. You don't write the while True: call_model; run_tools loop yourself; the SDK does it. You define tools and skills; the SDK runs the conversation.
Built-in tool runner. The SDK handles tool dispatch, error wrapping, parallel execution. You provide handler functions; the SDK wires them up to the model.
Skill discovery. Skills placed in .claude/skills/ (project-scoped) or ~/.claude/skills/ (user-scoped) are auto-discovered and made available to the agent following the progressive-disclosure pattern.
Streaming events. The SDK's query() function is an async generator that yields events as the agent works — useful for the streaming endpoint pattern from chapter 2.4.

Minimal usage in Python:

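import asyncio

from claude_agent_sdk import query, ClaudeAgentOptions

options = ClaudeAgentOptions(
    cwd=".",
    setting_sources=["user", "project"],
    allowed_tools=["Skill", "Read", "Edit", "Bash", "Glob", "Grep"],
    model="claude-sonnet-4-5",
)

def handle(event) -> None:
    # Placeholder handler: replace with your own rendering or logging.
    print(event)

async def main() -> None:
    # query() is an async generator, so it has to be consumed inside a coroutine.
    async for event in query(prompt="Add a /health endpoint to the API", options=options):
        # Each event is a streamed update: token, tool_use, tool_result, status
        handle(event)

asyncio.run(main())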

What the SDK gives you over building from scratch (Step 2's 150 lines): polished tools with the same shape Claude Code uses, integrated skills system, security defaults that match Claude Code's defaults, and the ability to upgrade to a Managed Agents deployment without reshaping your code.

What you give up: customization of the loop itself. If you need to interleave model calls with custom logic in unusual ways (multi-model cascading per-step, custom retry policies, complex state machines around the loop), the SDK can feel constraining. Build from scratch for those cases; use the SDK when your needs fit the standard agent loop, which is most cases.
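For those cases, the loop you would be writing yourself is small. A minimal sketch follows, driven by any `call_model` callable you supply. The message and reply shapes are simplified assumptions for illustration, not the real API format, and `run_agent` is a hypothetical name.

```python
from typing import Callable

Tool = Callable[[dict], str]


def run_agent(
    call_model: Callable[[list], dict],
    tools: dict[str, Tool],
    prompt: str,
    max_turns: int = 10,
) -> str:
    """Call the model, run any requested tool, feed the result back, repeat."""
    messages: list = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})

        if reply["type"] == "text":
            return reply["text"]  # No tool requested: the agent is done.

        handler = tools.get(reply["name"])
        try:
            result = handler(reply["input"]) if handler else f"unknown tool: {reply['name']}"
        except Exception as exc:
            # Report tool failures to the model instead of crashing the loop.
            result = f"tool error: {exc}"
        messages.append({"role": "user", "content": {"type": "tool_result", "content": result}})

    raise RuntimeError("agent did not finish within max_turns")
```

Custom retry policies, per-step model selection, or a state machine around the loop would all go inside this `for` body, which is exactly the part the SDK keeps out of your hands.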

Managed Agents: the hosted shape

The newest piece of the landscape (in beta as of mid-2026): Managed Agents, where Anthropic hosts the agent runtime entirely. Instead of running the loop in your code, you create an Agent config (system prompt, tools, model) and start Sessions against it. Each session gets a sandboxed container as workspace; the agent runs server-side and acts on the container via tools.

The shape:

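from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# 1. Create the agent once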
agent = client.beta.agents.create(
    model="claude-sonnet-4-5",
    system="You are a code reviewer for the foo-api project...",
    tools=[...],
    name="foo-api-reviewer",
)

# 2. Start a session per task
session = client.beta.sessions.create(agent_id=agent.id)

# 3. Send messages; the server runs the loop
for event in client.beta.sessions.messages.stream(
    session_id=session.id,
    content="Review PR #123",
):
    handle(event)

Pricing is the standard token cost plus $0.08 per session-hour of runtime (only while actively running). Useful when the operational complexity of running an agent loop server-side isn't worth your time, or when you want Anthropic-managed sandboxing as the default. Less useful if you have specific infrastructure needs the hosted runtime doesn't accommodate.

The decision tree:

Building from scratch: when your agent loop has unusual control flow, when you need maximum customization, or when you're learning. Step 2's 150 lines is the right starting point.
Agent SDK in your own infrastructure: the default for production agents. Standard loop, full control over deployment, works in your VPC, integrates with your observability.
Managed Agents: when you want zero infrastructure, when sandboxed execution is the value, when you're prototyping fast and don't want to build a worker process.

When to build your own code agent vs. use Claude Code

Claude Code is the polished, opinionated code agent Anthropic ships. The Agent SDK lets you build similar systems for your own needs. The decision between them:

Use Claude Code when your need is "help engineers ship code faster on their own machines." Claude Code is the right tool for individual developer productivity, exploration tasks, and tightly supervised work. It's installed locally, integrated with editors, and ships with the right defaults for interactive use.

Build with the Agent SDK when your need is "code-agent-shaped automation that's not just an engineer at a terminal." Examples: a CI bot that fixes flaky tests automatically, a service that responds to user issues by drafting fix PRs, an internal tool that runs nightly maintenance tasks. These have different UX, different security boundaries, different stakeholders than Claude Code is designed for.

The two are complementary, not competing. A team using Claude Code for individual productivity might also build SDK-based agents for CI automation; same underlying technology, different surfaces.

WORKED EXAMPLE

End-to-end: a code agent fixes a real bug.

To anchor the chapter in something concrete, here's a complete trace of a code agent fixing a representative bug. Not a toy example — the kind of bug that takes a human 20 minutes and an agent maybe 90 seconds. The agent uses Step 2's tools, the verification ladder from Step 3, and a project-level skill from Step 4.

The setup

A Python web API. Endpoint GET /users/{user_id}/orders returns a paginated list of a user's orders. A user reports: "When I pass page=1 the API returns my orders; when I pass page=0 the response is empty but no error." Looking at the spec, page=0 should be a 400 error — the API uses 1-indexed pages.

The task given to the agent: "Fix the /users/{user_id}/orders endpoint so page=0 returns a 400 with a clear error, and add a regression test."

The trace

══ Turn 1 — orient ═════════════════════════════════════════════════
agent → glob "**/orders.py"
result:
  src/api/routes/orders.py
  tests/api/routes/test_orders.py

agent → read_file src/api/routes/orders.py
[shows the get_user_orders handler, ~30 lines]

agent → read_file tests/api/routes/test_orders.py
[shows existing tests, ~80 lines]

══ Turn 2 — baseline ═════════════════════════════════════════════════
agent: "Confirming current state. Run the existing tests to be sure
they're green before my changes."

agent → bash "pytest tests/api/routes/test_orders.py -x"
result: 6 passed in 0.31s

══ Turn 3 — reproduce ═══════════════════════════════════════════════
agent: "Writing the failing test that captures the bug. page=0 should
400 with a clear error message."

agent → str_replace tests/api/routes/test_orders.py
   old_str: [last existing test, used as anchor]
   new_str: [same test + new test_page_zero_returns_400]
result: Replaced 1 occurrence

agent → bash "pytest tests/api/routes/test_orders.py::test_page_zero_returns_400 -x"
result:
  FAILED ... assert 200 == 400
  E       AssertionError: response status 200 (expected 400)
  E       Response body: {"orders": [], "page": 0, "total": 0}

agent: "Confirmed — the test fails for exactly the right reason.
Now the fix."

══ Turn 4 — fix ═════════════════════════════════════════════════════
agent → read_file src/api/routes/orders.py 14 24
[shows the handler more precisely]

agent → str_replace src/api/routes/orders.py
   old_str:
     @router.get("/{user_id}/orders")
     async def get_user_orders(user_id: str, page: int = 1, ...):
   new_str:
     @router.get("/{user_id}/orders")
     async def get_user_orders(user_id: str, page: int = 1, ...):
         if page < 1:
             raise HTTPException(
                 status_code=400,
                 detail="page must be 1 or greater (pagination is 1-indexed)",
             )

result: Replaced 1 occurrence

══ Turn 5 — verify ══════════════════════════════════════════════════
agent → bash "pytest tests/api/routes/test_orders.py -x"
result: 7 passed in 0.32s
  // new test passes, no regressions

agent → bash "mypy src/api/routes/orders.py"
result: Success: no issues found in 1 source file

agent → bash "ruff check src/api/routes/orders.py"
result: All checks passed!

══ Turn 6 — done ════════════════════════════════════════════════════
agent: "Done. Added a 400 response for page<1 with a clear message,
and a regression test (test_page_zero_returns_400) that captures the
behavior. All existing tests still pass, mypy and ruff clean.

Diff summary:
- src/api/routes/orders.py: +6 lines (page<1 check)
- tests/api/routes/test_orders.py: +10 lines (new test)

Recommend reviewing the error message text — I went with 'page must
be 1 or greater (pagination is 1-indexed)' but the project might have
a convention for error messages I haven't seen."

What this trace shows

Read the trace once for content, then read it again for texture. A few things deserve to be named:

The agent reproduced first, then fixed. Turns 1–3 are about understanding and reproducing; turns 4–5 are about fixing and verifying. This is the test-first discipline applied to bug fixing — confirm what's broken before changing anything.

The agent confirmed baseline-green before changing anything. Turn 2 verifies the existing tests pass. Without this, a later "the tests pass" doesn't mean what it sounds like — they might have been failing already. Always check baseline.

The agent climbed the verification ladder. After the fix, the agent ran the targeted tests (catching behavioral correctness), then mypy (catching type errors), then ruff (catching style/lint issues). Three rungs, each catching different bugs, all fast enough to do in turn 5.

The agent surfaced its own uncertainty. The closing message includes "I went with X but the project might have a convention for error messages I haven't seen." This is the agent telling you what to look at in review. A confident-and-wrong agent would have skipped this; the agent that flags genuine uncertainty earns trust faster.

The agent did not bloat the change. Two files touched, +16 lines total. No drive-by refactoring, no "while I'm in here" additions. The diff is exactly what was asked for. This is reviewable in 30 seconds.

This is what good agent work looks like. The same task, done by a code agent that doesn't follow these patterns, would typically produce a much larger diff and far less evidence that it actually works.

The trace is the deliverable

One subtle point worth surfacing: the trace itself is part of the work-product. A reviewer reading this trace knows exactly what the agent did, why, and what to scrutinize. That's a very different experience from "here's a PR with no context." For agent-generated code that's reviewed by humans, the legibility of the trace is half the value — and it's why the system prompt and verification discipline matter as much as the code itself.

End of chapter 4.1

Deliverable

A working mental model for code agents as a distinct architecture: persistent filesystem state, programmable verification, bounded action space, tight feedback loops. The six-tool surface (read, write, str_replace, glob, grep, bash) that covers most code work. The verification ladder discipline that turns "the agent wrote code" into "the agent shipped working code." Familiarity with Skills as the portable knowledge-packaging mechanism. Knowing when to build with the Agent SDK vs. adopt Claude Code. You can build, ship, evaluate, and reason about a code agent — and you understand why the architecture is shaped the way it is, not just what to copy.

Six core tools implemented: read_file, write_file, str_replace, glob, grep, bash
str_replace as the editing primitive; write_file reserved for new files
Sandboxing: safe_path enforcement, no escapes from repo root
System prompt with project conventions and explicit verification commands
Verification ladder: syntax → lint → types → unit tests → full suite, climbing as needed
Test-first discipline encoded in the system prompt for bug fixes
Trace legibility: agent reports what it ran, what passed, what it's uncertain about
Skills folder for portable procedural knowledge with YAML frontmatter
Progressive disclosure: skill metadata in system prompt, body loaded on match
Decision: build from scratch vs. Agent SDK vs. Managed Agents matches your shape
Human review of diffs as the final gate, even on well-verified work