Tool Use — The Agentic AI Field Guide

0.3

Part 0 / Foundations · The chapter that demystifies tool use

Tool use, at the protocol level.

Most tutorials treat tool use as "and then the SDK calls your function." That's not the abstraction — it's a lie of convenience. The real picture: the model emits a structured request, your code interprets it, your code returns a structured result, the model continues. Once this clicks, every weird tool-use bug becomes obvious. This chapter walks the protocol below the SDK, in both Anthropic and OpenAI shapes, with every field named and every common bug surfaced. After it you'll be the person on the team who can debug a "the model called my tool wrong" issue in under a minute.

STEP 1

Anatomy of a tool definition.

A tool definition is three things in a trench coat: a name, a description, and an input schema. Together they constitute the entire contract between your code and the model. There is no other channel through which the model learns how to use your tool. No docstrings. No source code. Just those three fields, plus whatever you say in the system prompt.

That's the first surprising thing to internalize. The model isn't introspecting your Python. It cannot see your function body. Everything it knows about search_docs comes from a definition like this, shown first in Anthropic's shape and then in OpenAI's:

{
  "name": "search_docs",
  "description": "Search the indexed documentation corpus for chunks relevant to a query. Returns the top 5 chunks by relevance. Use this when the user asks about technical topics, API usage, configuration, or anything that might be in the docs. Do NOT use for casual conversation or questions about the user's personal data.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "A search query in natural language. Should be specific enough to retrieve relevant chunks. Example: 'how to configure autovacuum naptime' rather than just 'vacuum'."
      },
      "section": {
        "type": "string",
        "enum": ["admin", "developer", "reference"],
        "description": "Optional. Restrict search to one section."
      }
    },
    "required": ["query"]
  }
}

{
  "type": "function",
  "name": "search_docs",
  "description": "Search the indexed documentation corpus for chunks relevant to a query. Returns the top 5 chunks by relevance. Use this when the user asks about technical topics, API usage, configuration, or anything that might be in the docs. Do NOT use for casual conversation or questions about the user's personal data.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "A search query in natural language. Should be specific enough to retrieve relevant chunks. Example: 'how to configure autovacuum naptime' rather than just 'vacuum'."
      },
      "section": {
        "type": ["string", "null"],
        "enum": ["admin", "developer", "reference"],
        "description": "Restrict search to one section. Pass null to search all sections."
      }
    },
    "required": ["query", "section"],
    "additionalProperties": false
  },
  "strict": true
}

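Two providers, the same three ideas, slightly different shapes. Anthropic puts the JSON Schema under input_schema. OpenAI puts it under parameters, tags the definition with "type": "function", and, when strict: true is set, requires additionalProperties: false plus every property listed in required (an optional field becomes a nullable type). Other than that, the contract is identical.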

The description is doing 90% of the work

The most common mistake people make with tool definitions is treating the description as a label. It isn't. The description is the prompt for that tool — it's what the model reads when deciding whether to call this tool, with what arguments, and how. A bad description is the single most common cause of tools being used wrong.

Look at the difference. Here's the same tool with a description written carelessly:

{
  "name": "search_docs",
  "description": "Searches the docs",
  "input_schema": { ... "query": {"type": "string"} ... }
}

That description fails in two ways. It gives the model no signal about when to use the tool, so the model may call it for anything, because nothing forbids it. And it gives no guidance on how to format the query, so the model will sometimes pass "vacuum" and sometimes "PostgreSQL vacuum autovacuum tuning configuration", with no consistency. Compare the verbose version above, which establishes when ("technical topics, API usage..."), when not ("Do NOT use for casual conversation"), and how ("specific enough to retrieve relevant chunks", followed by an example).

Anthropic's tool-use documentation makes the same point: it calls detailed descriptions by far the most important factor in tool performance. Take that literally. Descriptions absorb the prompt-engineering effort you'd normally put into a system prompt for plain text generation.

A useful rule of thumb

If you can write your tool's description in fewer than 50 words, you probably haven't written it carefully enough. A good description usually has three pieces: what the tool does (one sentence), when to use it (positive and negative examples), and how to format the inputs (per-parameter, in their own description fields). All three matter; skip any one and you'll see bad calls.

Parameter descriptions matter just as much

The same principle applies one level down. Every parameter in the schema has its own description field, and the model reads it. If your query field has no description, the model has to guess what a good query looks like. If it has a description that includes a concrete example ("e.g., 'how to configure autovacuum naptime'"), the model anchors on that example and produces queries shaped like it.

This is one of the biggest leverage points for tool quality. Carefully written parameter descriptions cut the rate of malformed arguments sharply, because the model copies the shape of your example instead of guessing.
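The rule of thumb and the parameter-description rule are easy to check mechanically. The sketch below is one way to do that, assuming tool definitions are plain dicts in either provider's shape; lint_tool and its 50-word threshold are illustrative names and values taken from the rule of thumb above, not part of any SDK.

```python
def lint_tool(tool: dict, min_words: int = 50) -> list[str]:
    """Return human-readable warnings for a tool definition dict."""
    warnings = []
    name = tool.get("name", "?")

    # Rule of thumb: a description under ~50 words is probably too thin.
    word_count = len(tool.get("description", "").split())
    if word_count < min_words:
        warnings.append(f"{name}: description has {word_count} words (< {min_words})")

    # Anthropic keeps the schema under "input_schema"; OpenAI under "parameters".
    schema = tool.get("input_schema") or tool.get("parameters") or {}
    for param, spec in schema.get("properties", {}).items():
        if not spec.get("description"):
            warnings.append(f"{name}.{param}: no description")
    return warnings


careless = {
    "name": "search_docs",
    "description": "Searches the docs",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

for warning in lint_tool(careless):
    print(warning)
```

Running a check like this in CI catches the careless definition before the model ever sees it. It can't judge whether a description is good, only whether one is obviously missing or too short.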

Strict mode (both providers now have it)

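Both Anthropic and OpenAI support a strict mode that constrains the model's tool-call arguments to match your JSON Schema. Turn it on whenever the schema is well-defined. The OpenAI version requires additionalProperties: false and every property listed in required, with optional fields expressed as nullable types; Anthropic's requirements differ, so check the current docs for the supported schema subset. With strict mode on, you no longer have to check types and enum values in your handler. You still have to check anything the schema can't express: that an ID exists, that a date range is sane, that the caller is allowed to do this.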

The case for strict mode is simple: it eliminates a whole class of bugs (the model returning "limit": "5" as a string when you wanted an int). The case against: it sometimes constrains the model into worse behavior on truly ambiguous cases. The default should be strict on; turn it off only if you have a specific reason.
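If you keep one definition per tool and need both shapes, the strict-mode differences can be applied mechanically. Here is a minimal sketch, assuming a flat (non-nested) schema and assuming OpenAI strict mode expects optional fields to be listed in required and typed as nullable. The helper name to_openai_strict is hypothetical, and nested objects would need the same treatment applied recursively.

```python
import copy


def to_openai_strict(anthropic_tool: dict) -> dict:
    """Convert an Anthropic-shaped tool dict into an OpenAI strict-mode one.

    Only handles top-level properties; this is a sketch, not a full converter.
    """
    schema = copy.deepcopy(anthropic_tool["input_schema"])
    originally_required = set(schema.get("required", []))

    for name, spec in schema.get("properties", {}).items():
        declared = spec.get("type")
        if name in originally_required or declared is None:
            continue
        # Optional field: it will be listed in "required", so allow null instead.
        types = declared if isinstance(declared, list) else [declared]
        if "null" not in types:
            spec["type"] = types + ["null"]

    schema["required"] = list(schema.get("properties", {}))
    schema["additionalProperties"] = False
    return {
        "type": "function",
        "name": anthropic_tool["name"],
        "description": anthropic_tool.get("description", ""),
        "parameters": schema,
        "strict": True,
    }


anthropic_tool = {
    "name": "search_docs",
    "description": "Search the indexed documentation corpus.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "section": {"type": "string", "enum": ["admin", "developer", "reference"]},
        },
        "required": ["query"],
    },
}
openai_tool = to_openai_strict(anthropic_tool)
print(openai_tool["parameters"]["required"])
```

The handler then has to treat a null section the same way it treats a missing one.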

Naming

Tool names appear in the model's context and in your code. The conventions that don't bite later:

snake_case, verb_noun. search_docs, fetch_user, send_email. Not DocsSearch or searchDocs or search (too generic).
Distinct prefixes for related tools. user_get, user_create, user_delete. The prefix helps the model group them mentally and helps you organize handlers.
No abbreviations. retrieve_documents rather than ret_docs. The model handles full words better and your future self will thank you.
Match the handler name exactly. If the tool is called search_docs, the Python function is also search_docs. Avoid translation layers between names; they make debugging harder.

One advanced feature worth mentioning: input_examples

As of late 2025 Anthropic added input_examples to tool definitions: concrete examples of how to call the tool, in the same shape as the arguments. This is useful when neither the schema nor the description can fully communicate a usage pattern (e.g., "this optional field correlates with that one").

{
  "name": "create_ticket",
  "input_schema": { ... /* schema */ ... },
  "input_examples": [
    {
      "title": "Login page returns 500 error",
      "priority": "critical",
      "labels": ["bug", "production"],
      "escalation": { "level": 2, "sla_hours": 4 }
    },
    {
      "title": "Add dark mode support",
      "labels": ["feature-request", "ui"]
      // no priority, no escalation — that's the pattern
    }
  ]
}

Examples typically cost 50–200 prompt tokens. Worth it for tools with complex shapes. Not worth it for simple tools where the description and schema are unambiguous.

Question

Why not just trust the model to figure out how to use my tool from a short description?

The model will figure it out... sometimes. The question is what failure rate you'll accept. With a careless description, expect 5-15% of tool calls to be malformed, irrelevant, or skipped when they should have been made. With a careful description, expect <1%. The economics: spending 20 minutes on a careful description saves you hundreds of debugging sessions over the tool's lifetime. It's the highest-leverage 20 minutes you can spend on agent quality.

Question

Tool descriptions are limited to 1024 chars on some Azure deployments. Won't long descriptions hit that?

The 1024-char limit is Azure-specific (last we checked). Direct Anthropic and OpenAI APIs have much higher effective limits — what bites you first is your context budget, not a hard cap. The careful descriptions in this chapter are 200–500 characters; the careless ones are under 50. Fit easily. If you're on Azure and hitting the 1024 cap, your descriptions are likely too long anyway.

Question

Should I generate JSON schemas from Pydantic models, or write them by hand?

Generate them. Pydantic models give you a single source of truth: handler signature, runtime validation, and schema all stay in sync. Both APIs take plain JSON Schema, which is what Pydantic's model_json_schema() emits, so the generated schema drops into input_schema or parameters with little post-processing. Writing the 20 lines of boilerplate once is the right investment, because hand-written schemas drift from handler signatures within a week.

The one case to write by hand: when you want the parameter descriptions and tool description to be different from what Pydantic would generate from docstrings (often you do — Pydantic field descriptions are usually too terse for tool use).

STEP 2

The on-the-wire shape.

Now we zoom into what actually flies over HTTP when the model decides to call a tool. Understanding this is the difference between treating the SDK as magic and being able to fix it when it breaks. Every tool-use bug you'll ever encounter manifests at this level.

The trip is a round trip. You send the model a request that includes tools and a user message. The model sends back a response that may contain one or more tool-call blocks. You execute those calls, package up the results, and send them back as a follow-up turn. The model then either calls more tools or produces a final answer. The structure of tool-call blocks and tool-result blocks is where the two providers diverge cosmetically but agree conceptually.

Anthropic: tool_use and tool_result content blocks

In Anthropic's Messages API, the model's response has a content array. Each entry has a type. Plain text is {"type": "text", "text": "..."}. When the model wants to call a tool, you get one or more tool_use blocks interleaved with the text:

// Response from messages.create when the model calls a tool
{
  "id": "msg_01ABCdef...",
  "model": "claude-sonnet-4-5",
  "role": "assistant",
  "stop_reason": "tool_use",       // ← key signal
  "content": [
    {
      "type": "text",
      "text": "I'll search the docs for that."
    },
    {
      "type": "tool_use",
      "id": "toolu_01XyzAbc...",     // ← critical: the call ID
      "name": "search_docs",
      "input": {
        "query": "autovacuum naptime configuration"
      }
    }
  ],
  "usage": { "input_tokens": 1842, "output_tokens": 47 }
}

Three fields matter on a tool_use block: id (the call ID, used to correlate the result), name (which tool), and input (the arguments, already parsed into a dict). To respond, you append an assistant message containing that exact content array, then append a user message that contains a tool_result block for each tool_use:

# Your follow-up turn
messages.append({
  "role": "assistant",
  "content": response.content    # ← echo back the original blocks unchanged
})
messages.append({
  "role": "user",
  "content": [
    {
      "type": "tool_result",
      "tool_use_id": "toolu_01XyzAbc...",  # ← MUST match the tool_use.id
      "content": "Found 5 results: [chunk_id=routine-vacuuming::5, text=...]"
    }
  ]
})

The tool_use_id on the result must match the id on the original tool_use. This pairing is how Anthropic correlates "you asked me to call X, here's the result of X" across the round trip. Get it wrong and the API returns an error.

OpenAI: function_call items and function_call_output items in the Responses API

OpenAI's modern API is Responses, which OpenAI recommends for new projects (Chat Completions remains supported). In Responses, the analog of a content block is an item. The model's response has an output array of items. Function calls are items of type function_call:

// Response from responses.create when the model calls a tool
{
  "id": "resp_5g2a...",
  "model": "gpt-5.5",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [{"type": "output_text", "text": "I'll search the docs for that."}]
    },
    {
      "type": "function_call",
      "id":      "fc_67d2e3...",          // ← item ID (rarely used)
      "call_id": "call_Co8dkB8h7N...",    // ← CRITICAL: the correlation ID
      "name": "search_docs",
      "arguments": "{\"query\":\"autovacuum naptime configuration\"}"
                                       // ← STRING: you must json.loads() it.
    }
  ],
  "usage": { "input_tokens": 1842, "output_tokens": 47 }
}

Two things that catch people. First, arguments is a JSON-encoded string, not a parsed object. You need to json.loads(call.arguments) before you can use it. Second, there are two IDs: id (item ID, used internally) and call_id (correlation ID, the one you actually need). Mixing these up is the most common OpenAI tool-use bug and produces the cryptic 400 No tool call found for function call output with call_id error.

To return the result, you send a function_call_output item back as input, keyed by call_id:

# Your follow-up turn in OpenAI Responses
client.responses.create(
    model="gpt-5.5",
    previous_response_id=response.id,    # ← stateful threading
    input=[
        {
            "type": "function_call_output",
            "call_id": "call_Co8dkB8h7N...",   # ← MUST match function_call.call_id
            "output": "Found 5 results: [chunk_id=routine-vacuuming::5, text=...]"
        }
    ],
    tools=TOOLS,
)

The Responses API maintains conversation state server-side via previous_response_id, so you don't replay the full message history. You just send the new function output and the model resumes from where it left off.

Side-by-side: the field correspondence

Concept | Anthropic (Messages) | OpenAI (Responses)
Call block type | {"type": "tool_use", ...} | {"type": "function_call", ...}
Correlation ID | id (on tool_use) | call_id (NOT id)
Arguments | input (parsed dict) | arguments (JSON string)
Result block type | {"type": "tool_result", ...} | {"type": "function_call_output", ...}
Result correlation | tool_use_id | call_id
Result payload field | content | output
Continuation | resend full messages[] | previous_response_id
Stop reason for tool | stop_reason: "tool_use" | presence of function_call items
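The correspondence is regular enough to hide behind one small adapter. The sketch below assumes call blocks arrive as plain dicts shaped like the JSON shown earlier; ToolCall, normalize_call, and make_result are illustrative names, not SDK types.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    provider: str   # "anthropic" or "openai"
    call_id: str    # the ID the result must echo back
    name: str
    args: dict


def normalize_call(block: dict) -> ToolCall:
    """Map a provider-specific call block onto one common shape."""
    if block["type"] == "tool_use":        # Anthropic: input is already a dict
        return ToolCall("anthropic", block["id"], block["name"], block["input"])
    if block["type"] == "function_call":   # OpenAI: arguments is a JSON string
        return ToolCall("openai", block["call_id"], block["name"],
                        json.loads(block["arguments"]))
    raise ValueError(f"not a tool call block: {block['type']!r}")


def make_result(call: ToolCall, payload: str) -> dict:
    """Build the result block in the shape the originating provider expects."""
    if call.provider == "anthropic":
        return {"type": "tool_result", "tool_use_id": call.call_id, "content": payload}
    return {"type": "function_call_output", "call_id": call.call_id, "output": payload}


openai_block = {
    "type": "function_call",
    "id": "fc_1",
    "call_id": "call_1",
    "name": "search_docs",
    "arguments": "{\"query\": \"autovacuum naptime\"}",
}
call = normalize_call(openai_block)
print(make_result(call, "Found 5 results"))
```

Because the correlation ID is picked once, inside normalize_call, the id versus call_id mix-up can only happen in one place.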

The three bugs at this level

Almost every tool-use bug is one of three.

Bug 1: orphan tool results. You returned a tool_result (Anthropic) or function_call_output (OpenAI) whose ID doesn't match any tool call in the prior turn. The API rejects the request with a 400. Cause is usually one of: you mutated the assistant message before resending it; you accidentally dropped a tool_use block; you used id instead of call_id in OpenAI.

Bug 2: missing tool result for a tool call. The model called three tools in one turn and you only returned two results. Anthropic and OpenAI both require every call to be answered before the next turn. Cause is usually one of: you only iterated over the first tool_use block; you filtered out a call you didn't recognize.

Bug 3: malformed arguments. The model produced JSON that doesn't fit your schema: a string where a number was expected, or a typo in an enum value. Without strict mode, this happens occasionally and your handler has to validate. With strict mode on, this is supposed to be impossible. If you see it anyway, check first that strict mode is actually enabled on the tool that misbehaved and that its schema meets the provider's requirements. On OpenAI that means additionalProperties: false is set and required lists every field, with no exceptions.

When you see 400 No tool call found for function call output on OpenAI, 99% of the time it's because you used function_call.id when you should have used function_call.call_id. The two IDs look similar (fc_abc... vs call_abc...) and the field name id feels like the obvious choice. It isn't.
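Bugs 1 and 2 can be caught locally before the request goes out. A minimal sketch for the Anthropic shape, assuming both turns are lists of plain dict blocks; check_pairing is an illustrative helper, not an SDK function.

```python
def check_pairing(assistant_content: list[dict], user_content: list[dict]) -> list[str]:
    """Compare tool_use IDs in the assistant turn with tool_result IDs in the reply."""
    call_ids = {b["id"] for b in assistant_content if b.get("type") == "tool_use"}
    result_ids = {b["tool_use_id"] for b in user_content
                  if b.get("type") == "tool_result"}

    problems = []
    for orphan in sorted(result_ids - call_ids):
        problems.append(f"orphan result: {orphan} matches no tool_use (bug 1)")
    for missing in sorted(call_ids - result_ids):
        problems.append(f"missing result: {missing} was never answered (bug 2)")
    return problems


assistant_turn = [
    {"type": "text", "text": "I'll get both."},
    {"type": "tool_use", "id": "toolu_A", "name": "get_weather", "input": {}},
    {"type": "tool_use", "id": "toolu_B", "name": "get_time", "input": {}},
]
reply = [
    {"type": "tool_result", "tool_use_id": "toolu_B", "content": "14:02"},
    {"type": "tool_result", "tool_use_id": "toolu_C", "content": "stale"},
]
for problem in check_pairing(assistant_turn, reply):
    print(problem)
```

The same set comparison works for OpenAI if you read call_id from the function_call items and from the function_call_output items.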

The minimal correct dispatcher, both providers

For reference. The 30 lines of code that handle the round trip correctly. You'll write this exactly once per agent.

# agent/dispatch.py — Anthropic
async def handle_turn(messages, response):
    """Given an assistant response with tool_use blocks, run handlers
    and return the next user message with tool_results."""
    if response.stop_reason != "tool_use":
        return None  # no tools called; agent is done

    # 1. Echo back the assistant content unchanged
    messages.append({"role": "assistant", "content": response.content})

    # 2. Run EVERY tool_use block and collect results
    results = []
    for block in response.content:
        if block.type != "tool_use":
            continue
        try:
            result = await HANDLERS[block.name](**block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(result),
            })
        except Exception as e:
            # Return errors as tool_results, not raises. The model can recover.
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": f"Error: {type(e).__name__}: {e}",
                "is_error": True,
            })

    # 3. Send all results back as one user message
    messages.append({"role": "user", "content": results})
    return messages

# agent/dispatch.py — OpenAI Responses
import json

async def handle_turn(previous_response_id, response):
    """Given a response with function_call items, run handlers
    and return inputs for the next responses.create() call.
    The caller passes previous_response_id to that next call."""
    function_calls = [item for item in response.output
                      if item.type == "function_call"]
    if not function_calls:
        return None  # no tools called; agent is done

    # Run EVERY function_call and collect outputs
    outputs = []
    for call in function_calls:
        try:
            # Parse inside the try: truncated or invalid JSON then becomes
            # an error output instead of crashing the loop.
            args = json.loads(call.arguments)  # ← string → dict
            result = await HANDLERS[call.name](**args)
            outputs.append({
                "type": "function_call_output",
                "call_id": call.call_id,  # ← NOT call.id
                "output": str(result),
            })
        except Exception as e:
            outputs.append({
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": f"Error: {type(e).__name__}: {e}",
            })

    # Send outputs as input to the next turn. Server holds state via
    # previous_response_id, so you don't replay history.
    return outputs

This is the heart of any agent loop. Memorize the shape. The rest of the agent is just deciding what tools to expose and what handlers to write.

Question

Why does Anthropic pre-parse input but OpenAI leaves arguments as a string?

Historical reasons. OpenAI's function calling launched with arguments-as-string when the streaming story was less mature; the string form lets you see partial arguments as they stream in. Anthropic launched later with pre-parsed inputs because by then streaming-of-partial-JSON was a solved problem. The OpenAI shape is now legacy weight — it would break too many integrations to change. Just remember to json.loads().

Also note: when you stream tool calls on Anthropic, the input arrives as input_json_delta events carrying fragments of a JSON string, and you accumulate and parse them yourself. Same situation, exposed only when you stream.

Question

If the model calls a tool that doesn't exist (typo in name), what happens?

Strict mode prevents this — the model can only call tools defined in the request. Without strict mode it's still rare because the model is conditioned on the tool list, but it can happen on edge cases. Your dispatcher should treat unknown tool names as an error and return a tool_result with is_error: true and a clear message like "Error: unknown tool 'sarch_docs'. Available tools: search_docs, fetch_doc, send_email." The model can read this and self-correct on the next turn.
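A minimal sketch of that unknown-tool branch, in the Anthropic result shape. It assumes handlers live in a dict keyed by tool name, as in the Step 2 dispatcher; unknown_tool_result is an illustrative helper name.

```python
def unknown_tool_result(tool_use_id: str, name: str, handlers: dict) -> dict:
    """Build an error result that tells the model which tools actually exist."""
    available = ", ".join(sorted(handlers))
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": f"Error: unknown tool {name!r}. Available tools: {available}.",
        "is_error": True,
    }


# Stand-in handlers; only the names matter for this example.
handlers = {"search_docs": None, "fetch_doc": None, "send_email": None}
result = unknown_tool_result("toolu_01", "sarch_docs", handlers)
print(result["content"])
```

In the dispatcher, check `name in handlers` before the lookup and return this result in place of the bare KeyError message.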

Question

Can I respond with a tool_result for some calls and not others?

No. Both APIs require every tool call in a turn to be answered with a result (or error) in the next turn. The reason: the model's next response is conditioned on the assumption that all calls completed; an absent result is a protocol violation. If you genuinely can't run a tool, return an error result. That's what error results are for.

STEP 3

Parallel calls, errors, and the edge cases that bite.

Steps 1 and 2 give you the happy path. Production is where the edges show up. This step walks through the four most common patterns that break naive implementations: parallel tool calls, tool errors, repeated calls, and malformed arguments. Each has a specific shape and a specific fix.

Parallel tool calls

The model can call multiple tools in a single turn. This is good — when the user asks "what's the weather in SF and the time in Tokyo," the model can issue both calls in one round trip instead of two sequential ones, halving latency.

The shape is exactly what you'd expect. Multiple tool_use blocks in the response content (Anthropic) or multiple function_call items in the output (OpenAI). Your dispatcher must handle all of them in one turn. The minimal dispatcher in Step 2 already does this correctly (the for ... in loop), but there's a question of how: sequentially or in parallel?

The default for naive code is sequential — your for loop awaits each handler before starting the next. For independent tools (search and another search; weather and time) this is wasted latency. The fix is asyncio.gather:

import asyncio

# Both snippets run inside your async dispatcher.

# Sequential (naive)
results = []
for block in tool_use_blocks:
    result = await HANDLERS[block.name](**block.input)
    results.append({"type": "tool_result", "tool_use_id": block.id,
                    "content": str(result)})
# 3 tool calls of 400ms each = 1.2s

# Parallel (production)
async def run_one(block):
    try:
        result = await HANDLERS[block.name](**block.input)
        return {"type": "tool_result", "tool_use_id": block.id, "content": str(result)}
    except Exception as e:
        return {"type": "tool_result", "tool_use_id": block.id,
                "content": f"Error: {e}", "is_error": True}

# gather returns results in input order, so they line up with tool_use_blocks
results = await asyncio.gather(*[run_one(b) for b in tool_use_blocks])
# 3 tool calls of 400ms each ≈ 450ms (longest one + overhead)

The win is roughly linear in the number of parallel calls. For a tool-heavy agent that fires off 5 retrievals in a research turn, this is the difference between a 5-second wait and a 1-second wait. Free latency.

One subtlety: tools with side effects (writes, sends, deletes) should not be parallelized blindly. If the model fires two delete_record calls in one turn and you run both at once, you might get race conditions. The safe pattern is to run read-only tools in parallel and state-changing tools sequentially. Easy to encode: keep two lists.
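One way to encode the two lists is sketched below. It assumes each call is a (name, args) pair and that you maintain a set of read-only tool names yourself; READ_ONLY and run_calls are hypothetical names. Note the design choice: all reads finish before any write starts, even if the model listed a write first.

```python
import asyncio

# Hypothetical classification; in a real agent this lives in your tool registry.
READ_ONLY = {"search_docs", "fetch_doc"}


async def run_calls(calls, run_one):
    """Run read-only calls concurrently, then state-changing calls one by one.

    `calls` is a list of (name, args) pairs; `run_one(name, args)` is async.
    Results come back in the same order as `calls`.
    """
    results = [None] * len(calls)
    reads = [i for i, (name, _) in enumerate(calls) if name in READ_ONLY]
    writes = [i for i, (name, _) in enumerate(calls) if name not in READ_ONLY]

    # Independent reads overlap; gather returns results in input order.
    read_results = await asyncio.gather(*(run_one(*calls[i]) for i in reads))
    for i, result in zip(reads, read_results):
        results[i] = result

    # Writes run sequentially, in the order the model emitted them.
    for i in writes:
        results[i] = await run_one(*calls[i])
    return results


async def fake_handler(name, args):
    await asyncio.sleep(0)  # stand-in for real I/O
    return f"{name}:ok"


calls = [("delete_record", {}), ("search_docs", {}), ("fetch_doc", {})]
print(asyncio.run(run_calls(calls, fake_handler)))
```

If a write must happen before a read in the same turn, this ordering is wrong for you; fall back to fully sequential execution for turns that contain any write.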

Errors as data, not exceptions

The single biggest mindset shift for tool-using agents: tool errors are not failures, they are information the model can act on. When your search_docs handler can't reach the database, the right thing to do is not to crash the agent — it's to return an error result and let the model decide what to do.

The dispatcher from Step 2 already does this: it catches exceptions and returns a tool_result with is_error: true and a human-readable message. What the model sees:

[turn 3]
[assistant]: I'll fetch that document.
→ fetch_doc(doc_id="invoices-2025-q3")

[tool_result for fetch_doc]:
{"is_error": true, "content": "Error: DatabaseTimeout:
 connection to docs-db timed out after 5s. Retry may succeed."}

[turn 4]
[assistant]: That document fetch timed out. Let me try search
instead and see if I can find it that way.
→ search_docs(query="Q3 2025 invoices")

[tool_result for search_docs]:
[5 chunks returned, including invoices-2025-q3 as top result]

[turn 5]
[assistant]: Found it. Your Q3 invoices...

This is the agent recovering on its own. Without error-as-data, your code would have raised an exception, the run would have aborted, and the user would have seen "an error occurred." With error-as-data, the model gets a chance to try a different approach. Modern models are good at this kind of recovery — give them the chance.

What makes error messages useful to the model:

Be specific. "Error: DatabaseTimeout: connection to docs-db timed out after 5s" is actionable. "Error: tool failed" is not.
Include suggested alternatives. "Retry may succeed" or "try search_docs instead" or "the document_id 'foo' was not found; valid IDs start with 'INV-'" — these guide the model toward recovery.
Don't include stack traces. They consume tokens, don't help the model, and may leak implementation details. The exception type and message are enough.
Use is_error: true on Anthropic. It's a structural hint that this result is an error. OpenAI doesn't have an equivalent flag — just put "Error:" at the start of the output string.
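These rules fit in one small formatter. The sketch below assumes your handlers raise exceptions that may carry an optional hint attribute; ToolError, DatabaseTimeout, and format_tool_error are illustrative names, not library types.

```python
from typing import Optional


class ToolError(Exception):
    """Hypothetical base class: an error with an optional recovery hint."""

    def __init__(self, message: str, hint: Optional[str] = None):
        super().__init__(message)
        self.hint = hint


class DatabaseTimeout(ToolError):
    pass


def format_tool_error(exc: Exception, max_chars: int = 300) -> str:
    """Render an exception as a short, model-readable message with no traceback."""
    message = f"Error: {type(exc).__name__}: {exc}"
    hint = getattr(exc, "hint", None)
    if hint:
        message += f". Hint: {hint}"
    # Cap the length so a verbose exception can't flood the context.
    return message[:max_chars]


err = DatabaseTimeout("connection to docs-db timed out after 5s",
                      hint="Retry may succeed.")
print(format_tool_error(err))
```

Plain exceptions without a hint still produce the type and message, so the formatter is safe to use in the dispatcher's catch-all except branch.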

The "model called the same tool 5 times" failure

You'll see this in production. The model calls search_docs, gets a result, calls it again with a slightly different query, gets another result, calls it a third time... and a fourth, and a fifth. The trace looks like an agent that lost its mind.

This is almost always one of two causes:

Cause 1: the tool description doesn't include a stop condition. The model doesn't know when to stop searching. Your description says "search the docs for relevant chunks" — it doesn't say "you should rarely need to call this more than twice; if two searches haven't found what you need, the document probably doesn't exist." Add the stop condition to the description; the behavior will change.

Cause 2: the results aren't actually answering the model's question and it doesn't know what else to do. The model is in a loop because it's stuck. The fix is upstream — your retrieval is bad, your corpus is missing the document, the query reformulation isn't helping. The dispatch loop isn't the problem; the loop is a symptom.

The defense in either case is your step budget (chapter 1.1) — your agent loop caps total tool calls at some number (20 is common) and bails out if exceeded. The budget catches the symptom; the description-fix or corpus-fix solves the cause.
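A step budget is a few lines in the outer loop. The sketch below assumes Anthropic-shaped responses as plain dicts and two caller-supplied callables, call_model and run_tools; both are hypothetical stand-ins for your API call and the Step 2 dispatcher.

```python
def run_agent(call_model, run_tools, messages, max_tool_calls=20):
    """Drive the tool loop until the model finishes or the budget is spent.

    call_model(messages) -> dict with "stop_reason" and "content".
    run_tools(tool_calls) -> list of tool_result blocks, one per call.
    """
    calls_made = 0
    while True:
        response = call_model(messages)
        tool_calls = [b for b in response["content"] if b.get("type") == "tool_use"]

        if response["stop_reason"] != "tool_use" or not tool_calls:
            messages.append({"role": "assistant", "content": response["content"]})
            return {"status": "done", "tool_calls": calls_made}

        if calls_made + len(tool_calls) > max_tool_calls:
            # Bail out before running anything. The unanswered assistant turn
            # is deliberately not appended, so `messages` stays valid.
            return {"status": "budget_exceeded", "tool_calls": calls_made}

        messages.append({"role": "assistant", "content": response["content"]})
        messages.append({"role": "user", "content": run_tools(tool_calls)})
        calls_made += len(tool_calls)


def looping_model(messages):
    """Fake model that never stops searching."""
    return {"stop_reason": "tool_use",
            "content": [{"type": "tool_use", "id": "toolu_x",
                         "name": "search_docs", "input": {"query": "q"}}]}


def fake_run_tools(tool_calls):
    return [{"type": "tool_result", "tool_use_id": c["id"], "content": "no match"}
            for c in tool_calls]


print(run_agent(looping_model, fake_run_tools, [], max_tool_calls=3))
```

What to do on budget_exceeded is a product decision: surface a partial answer, ask the user to narrow the question, or log the trace for the description or corpus fix.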

Malformed arguments

The model can produce JSON that doesn't fit your schema. With strict mode on, this is rare to the point of being a bug. Without strict mode it happens regularly enough that your handler should be defensive.

The two patterns that work:

# Pattern 1: validate explicitly in the handler
async def search_docs(query: str, section: str | None = None):
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    if section and section not in {"admin", "developer", "reference"}:
        raise ValueError(f"section must be one of admin|developer|reference, got {section!r}")
    # ... real work

# Pattern 2: use Pydantic models as the handler signature
from typing import Literal

from pydantic import BaseModel, Field

class SearchDocsArgs(BaseModel):
    query: str = Field(min_length=1)
    section: Literal["admin", "developer", "reference"] | None = None

async def search_docs(**raw):
    args = SearchDocsArgs(**raw)  # raises if bad
    # ... use args.query, args.section

Either way, validation errors become exception messages that the dispatcher catches and returns as tool errors — which the model can then read and self-correct on the next turn. The pattern is the same as for any tool error: return information, let the model recover.

One subtle thing: max_tokens cutoff during tool argument generation

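If the model is generating arguments for a tool and you've set max_tokens (max_output_tokens on OpenAI Responses) too low, the response can end mid-argument with invalid JSON. Both providers document this, particularly for streamed tool arguments. Symptoms on Anthropic: stop_reason: "max_tokens" instead of "tool_use", with an incomplete tool input. Symptoms on OpenAI: a response marked incomplete and an arguments string that doesn't parse as JSON.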

The fix is to set max_tokens high enough to comfortably exceed your largest expected tool argument plus the surrounding text. For most agents 4096 is generous; for agents that pass large structured payloads through tools (rare), more.
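Until the limit is raised, it helps to make the failure name its own cause. A minimal sketch, assuming you have the raw arguments string and the stop reason in hand; parse_tool_arguments is an illustrative helper, and the "max_tokens" value is the Anthropic stop reason named above.

```python
import json


def parse_tool_arguments(raw: str, stop_reason: str) -> dict:
    """Parse a JSON arguments string, naming truncation as the likely cause."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        if stop_reason == "max_tokens":
            raise ValueError(
                "tool arguments were cut off by max_tokens; raise the limit and retry"
            ) from exc
        raise ValueError(f"tool arguments are not valid JSON: {exc.msg}") from exc


print(parse_tool_arguments('{"query": "autovacuum naptime"}', "tool_use"))
```

A truncated call is worth retrying with a higher limit. Returning it to the model as a tool error rarely helps, because the model didn't choose to stop there.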

Question

How does the model decide when to call multiple tools in parallel vs sequentially?

It depends on the model's reading of dependency. If the model thinks the inputs of one call depend on the output of another (e.g., "first look up the user ID, then fetch that user's invoices"), it'll do them sequentially. If it thinks they're independent ("look up the weather in SF and the time in Tokyo"), it'll do them in parallel. You can nudge this with prompt-level guidance — "if calls are independent, batch them in one turn" — but mostly the model gets it right.

One thing that confuses the model is when calls look independent but actually aren't (because of side effects or ordering constraints you didn't tell it about). The fix is to make the constraint explicit in the tool descriptions.

Question

What about tool calls with truly huge results — say, 100KB of retrieved chunks?

Three options, in order of preference:

Summarize before returning. The tool handler does the work of compressing 100KB of raw output into 5KB of the most relevant excerpts. The model rarely needs the raw 100KB; it needs the answer that's in the 100KB.
Return a handle, let the model fetch by ID. Tool returns "saved 100 chunks, IDs are chunk_001..chunk_100"; the model calls fetch_chunk(id) on the specific ones it wants. Costs an extra round trip but saves enormous context.
Programmatic tool calling (Anthropic, late 2025). The model writes a small Python program that calls tools, processes results, and only returns the final summary to the model's context. Best for cases where the intermediate data is large and the final answer is small.
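The second option, returning a handle, is the easiest to sketch. The example below assumes an in-memory dict as the store and a fixed preview length; search_with_handles and the store are hypothetical, and fetch_chunk mirrors the fetch_chunk(id) tool mentioned above.

```python
# In-memory stand-in for whatever cache or database you would really use.
_CHUNK_STORE: dict[str, str] = {}


def search_with_handles(chunks: list[str], preview_chars: int = 80) -> str:
    """Store full chunks and return only IDs plus short previews to the model."""
    lines = []
    for text in chunks:
        chunk_id = f"chunk_{len(_CHUNK_STORE) + 1:03d}"
        _CHUNK_STORE[chunk_id] = text
        lines.append(f"{chunk_id}: {text[:preview_chars]}")
    header = f"Saved {len(chunks)} chunks. Call fetch_chunk(id) for full text."
    return header + "\n" + "\n".join(lines)


def fetch_chunk(chunk_id: str) -> str:
    """Second tool: return the full text for one stored chunk."""
    if chunk_id not in _CHUNK_STORE:
        raise KeyError(f"unknown chunk id {chunk_id!r}")
    return _CHUNK_STORE[chunk_id]


summary = search_with_handles(["a" * 1000, "b" * 1000])
print(len(summary), "characters returned instead of 2000")
```

The previews are what let the model pick which handles to fetch, so they should carry the most distinguishing part of each chunk, not just its first characters as in this sketch.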

STEP 4

Build intuition by reading traces.

You can read all of the above and still not have working intuition. Intuition comes from reading actual traces and recognizing what each pattern means. This step walks through three traces, increasing in complexity, with annotations that show what to look at and why. By the end you should be able to triage a tool-use bug by reading the trace in 30 seconds.

Trace A: the clean call

The simplest possible interaction: one tool call, one result, final answer. The shape every other case is a variation on.

══ TURN 1 ══════════════════════════════════════════════════════════
[user]: how do I tune autovacuum naptime?

══ MODEL RESPONSE ═══════════════════════════════════════════════════
stop_reason: tool_use
content:
  [0] text "I'll check the documentation for that."
  [1] tool_use
      id:    toolu_01ABc...
      name:  search_docs
      input: {"query": "autovacuum naptime tuning"}

══ DISPATCH ═════════════════════════════════════════════════════════
→ HANDLERS["search_docs"](query="autovacuum naptime tuning")
← 5 chunks, top result: routine-vacuuming::5 (score 0.87)

══ TURN 2 (sent to model) ═══════════════════════════════════════════
[assistant]: <echo of turn 1 content unchanged>
[user]: tool_result
       tool_use_id: toolu_01ABc...
       content: "Found 5 chunks: [{...routine-vacuuming::5..."

══ MODEL RESPONSE ═══════════════════════════════════════════════════
stop_reason: end_turn
content:
  [0] text "Autovacuum naptime is controlled by the
            autovacuum_naptime config setting. Default is 1 minute..."

What to look at in trace A

Three things you scan every clean trace for: (1) does stop_reason match what you expect (tool_use for the call turn, end_turn for the final turn); (2) does the tool_use_id on the result match the id from the call turn (a mismatch is the orphan-result bug); (3) is the tool input shape what you expected (query is a non-empty string, no surprises). If all three pass, the trace is clean.
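Those three checks are mechanical enough to script. The sketch below assumes each turn is available as a plain dict in the Anthropic shape used in trace A; the function name `check_clean_trace` and its argument layout are hypothetical, and the input check is hard-coded to search_docs-style calls with a `query` field.

```python
def check_clean_trace(call_response: dict, result_message: dict, final_response: dict) -> list[str]:
    """Return a list of problems; an empty list means the trace looks clean."""
    problems = []

    # (1) stop_reason on each model turn
    if call_response.get("stop_reason") != "tool_use":
        problems.append(f"call turn stop_reason is {call_response.get('stop_reason')!r}")
    if final_response.get("stop_reason") != "end_turn":
        problems.append(f"final turn stop_reason is {final_response.get('stop_reason')!r}")

    # (2) every tool_use id needs a matching tool_result, and vice versa
    calls = [b for b in call_response["content"] if b["type"] == "tool_use"]
    call_ids = {b["id"] for b in calls}
    result_ids = {
        b["tool_use_id"] for b in result_message["content"] if b["type"] == "tool_result"
    }
    for orphan in sorted(result_ids - call_ids):
        problems.append(f"orphan tool_result: {orphan}")
    for missing in sorted(call_ids - result_ids):
        problems.append(f"tool_use with no result: {missing}")

    # (3) input shape: here, 'query' must be a non-empty string
    for block in calls:
        query = block["input"].get("query")
        if not isinstance(query, str) or not query.strip():
            problems.append(f"{block['id']}: query is not a non-empty string")

    return problems
```

Running this over a batch of stored traces is a cheap way to find the ones worth reading by hand.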

Trace B: parallel calls with one error

Two tools fired at once, one of them errors, the model recovers. This is the recovery loop in action.

══ TURN 1 ══════════════════════════════════════════════════════════
[user]: what's the weather in Tokyo and the current time there?

══ MODEL RESPONSE ═══════════════════════════════════════════════════
stop_reason: tool_use
content:
  [0] text "I'll get both."
  [1] tool_use  id=toolu_W1...  name=get_weather    input={"location":"Tokyo"}
  [2] tool_use  id=toolu_T1...  name=get_time       input={"location":"Tokyo"}

══ DISPATCH (parallel) ══════════════════════════════════════════════
asyncio.gather(
  get_weather("Tokyo"),  → "62°F, partly cloudy"
  get_time("Tokyo"),     → ✗ TimeAPIError: rate limit exceeded
)

══ TURN 2 ═══════════════════════════════════════════════════════════
[user]: [
  tool_result  tool_use_id=toolu_W1...  content="62°F, partly cloudy"
  tool_result  tool_use_id=toolu_T1...  content="Error: TimeAPIError:
              rate limit exceeded. Retry in 30s recommended."
              is_error: true
]

══ MODEL RESPONSE ═══════════════════════════════════════════════════
stop_reason: tool_use
content:
  [0] text "Weather is in; time API is rate-limited. Let me try
           an alternative."
  [1] tool_use  id=toolu_T2...  name=get_time
                input={"location":"Asia/Tokyo"}
                                // the model tried a different format

══ DISPATCH ═════════════════════════════════════════════════════════
→ get_time(location="Asia/Tokyo")  → "11:42 PM JST"

══ TURN 3 ═══════════════════════════════════════════════════════════
[user]: tool_result  tool_use_id=toolu_T2...  content="11:42 PM JST"

══ MODEL RESPONSE ═══════════════════════════════════════════════════
stop_reason: end_turn
"It's 62°F and partly cloudy in Tokyo, currently 11:42 PM local time."

What this trace teaches

The model used the error message ("rate limit exceeded. Retry in 30s recommended.") to decide what to do. It didn't repeat the identical call; it tried a different input format instead. That's a recovery the agent could only make because the error was returned as data with useful context. If your handler had raised an exception that aborted the run, the user would have seen "an error occurred." Instead they got the right answer at the cost of one extra round trip.

The lesson: every minute you spend writing better error messages compounds over thousands of runs. The model uses what you give it.
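A dispatcher that behaves like the one in trace B can be sketched in a few lines. The handlers here are stubs that reproduce the trace's outputs, and `run_one` and `dispatch` are hypothetical names; the point is only that every tool_use block yields a tool_result, with failures returned as is_error data instead of being raised.

```python
import asyncio


class TimeAPIError(Exception):
    """Stand-in for the error seen in trace B."""


async def get_weather(location: str) -> str:
    return "62°F, partly cloudy"  # stubbed result


async def get_time(location: str) -> str:
    raise TimeAPIError("rate limit exceeded. Retry in 30s recommended.")  # stubbed failure


HANDLERS = {"get_weather": get_weather, "get_time": get_time}


async def run_one(block: dict) -> dict:
    """Run one tool_use block; always return a tool_result, never raise."""
    try:
        output = await HANDLERS[block["name"]](**block["input"])
        return {"type": "tool_result", "tool_use_id": block["id"], "content": output}
    except Exception as exc:
        return {
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": f"Error: {type(exc).__name__}: {exc}",
            "is_error": True,
        }


async def dispatch(tool_use_blocks: list[dict]) -> list[dict]:
    # gather preserves input order, so results line up with the calls
    return await asyncio.gather(*(run_one(b) for b in tool_use_blocks))
```

Because the try/except lives inside `run_one`, one failing tool cannot cancel its siblings, and an unknown tool name surfaces as an error result too.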

Trace C: the malformed call you'll see in production

Without strict mode, this is the pattern that catches teams off guard. The model produces an argument that looks valid but isn't.

══ TURN 1 ══════════════════════════════════════════════════════════
[user]: search for the top 5 admin docs about backups

══ MODEL RESPONSE ═══════════════════════════════════════════════════
stop_reason: tool_use
content:
  [0] tool_use
      id:    toolu_5_X1...
      name:  search_docs
      input: {
        "query": "backups",
        "section": "administrator",    // schema enum is ["admin","developer","reference"]
        "limit": "5"                   // schema doesn't define this; type is string anyway
      }

══ DISPATCH ═════════════════════════════════════════════════════════
search_docs(query="backups", section="administrator", limit="5")
→ ✗ ValueError: section must be one of admin|developer|reference,
                got 'administrator'

══ TURN 2 ═══════════════════════════════════════════════════════════
[user]: tool_result
       tool_use_id: toolu_5_X1...
       content: "Error: ValueError: section must be one of
                admin|developer|reference, got 'administrator'.
                Did you mean 'admin'?"
       is_error: true

══ MODEL RESPONSE ═══════════════════════════════════════════════════
stop_reason: tool_use
content:
  [0] text "Right, the enum is 'admin'. Retrying."
  [1] tool_use  id=toolu_5_X2...  name=search_docs
                input={"query":"backups","section":"admin"}

══ DISPATCH ═════════════════════════════════════════════════════════
search_docs(query="backups", section="admin")  → 5 chunks

══ TURN 3 ═══════════════════════════════════════════════════════════
[end_turn answer with the backup docs]

Three things to notice

First, the model expanded "admin" to "administrator", a reasonable natural-language inflation that violated the schema's enum. Without strict mode, the API let this through; the validation happened in the handler.

Second, the model also tried to pass "limit": "5" — a parameter not defined in the schema, with a value that's a string instead of a number. This is the kind of thing that often goes unnoticed (the handler just ignores unknown kwargs) but can cause subtle bugs.

Third, the recovery worked because the error message was good. "Did you mean 'admin'?" is the single most useful phrase in a tool error. Add it whenever you have a small set of valid values; the model picks up on it.
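One way to produce that "Did you mean" hint is the standard library's `difflib`. This is a sketch, not the handler the trace came from: `validate_section` is a hypothetical helper, and the similarity cutoff is lowered from difflib's 0.6 default because "administrator" vs "admin" scores just under it.

```python
import difflib

VALID_SECTIONS = ["admin", "developer", "reference"]


def validate_section(value: str) -> str:
    """Return value if it is valid; otherwise raise ValueError with a suggestion."""
    if value in VALID_SECTIONS:
        return value
    message = f"section must be one of {'|'.join(VALID_SECTIONS)}, got {value!r}."
    # cutoff=0.5 is an assumption tuned so 'administrator' still suggests 'admin'
    close = difflib.get_close_matches(value, VALID_SECTIONS, n=1, cutoff=0.5)
    if close:
        message += f" Did you mean {close[0]!r}?"
    raise ValueError(message)
```

When nothing is close enough, the message still lists the valid values, which is usually enough for the model to correct itself.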

The fix at the source: turn on strict mode and add additionalProperties: false to the schema. Both problems disappear at the protocol level.
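As a rough illustration of what that schema change rules out, the sketch below tightens a search_docs schema and runs trace C's arguments through a tiny local check. The `required` list and the `schema_violations` helper are assumptions for this example; the helper covers only the two failure modes in trace C and is not a full JSON Schema validator. How strict mode itself is switched on depends on the provider.

```python
SEARCH_DOCS_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "section": {"type": "string", "enum": ["admin", "developer", "reference"]},
    },
    "required": ["query"],  # assumed for this sketch
    "additionalProperties": False,  # rejects undeclared keys such as "limit"
}


def schema_violations(args: dict, schema: dict = SEARCH_DOCS_SCHEMA) -> list[str]:
    """Hand-rolled check for undeclared keys and enum violations only."""
    problems = []
    props = schema["properties"]
    if schema.get("additionalProperties") is False:
        for key in args:
            if key not in props:
                problems.append(f"unexpected property {key!r}")
    for key, value in args.items():
        allowed = props.get(key, {}).get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"{key!r} must be one of {allowed}, got {value!r}")
    return problems
```

Fed the arguments from trace C, this reports both the undeclared `limit` key and the bad `section` value; the corrected retry passes with no findings.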

The three-step triage protocol

When you get pinged with "the agent did something weird," here's the order to scan a trace. Practiced agent engineers do this in under a minute.

1. What was the stop_reason on the FINAL turn?
   - end_turn   → agent finished. Read the final text.
   - max_tokens → answer truncated. Raise max_tokens.
   - tool_use   → agent stuck in loop. Check step budget.
   - refusal    → safety triggered. Read the refusal text.

2. For each tool_use block in the trace:
   - Does the input match the schema? No → fix description or turn on strict mode.
   - Does the tool_result/function_call_output ID match? No → orphan bug. Check your dispatcher's ID handling.
   - Was the result is_error: true? Yes → check the message. If it's actionable, did the model recover well next turn?

3. Look at the tool_use names in order. Does the sequence make sense?
   - Same tool called >3 times → description has no stop condition, or retrieval is failing upstream.
   - Tool sequence doesn't match user intent → tool descriptions are ambiguous; the model picked the wrong one.
   - Expected tool never called → its description didn't trigger; rewrite to be more specific about when to use.
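Steps 1 and 3 of the protocol lend themselves to a first-pass script. The sketch below assumes you have the model responses of one run as a list of dicts in the Anthropic shape; `triage`, `STOP_REASON_HINTS`, and the `repeat_threshold` default are hypothetical names chosen for this example.

```python
from collections import Counter

STOP_REASON_HINTS = {
    "end_turn": "agent finished; read the final text",
    "max_tokens": "answer truncated; raise max_tokens",
    "tool_use": "run ended mid-loop; check the step budget",
    "refusal": "safety triggered; read the refusal text",
}


def triage(model_responses: list[dict], repeat_threshold: int = 3) -> list[str]:
    """Summarize the final stop_reason and the tool-call sequence of one run."""
    if not model_responses:
        return ["empty trace"]
    notes = []

    # Step 1: stop_reason on the final turn
    final_reason = model_responses[-1].get("stop_reason")
    hint = STOP_REASON_HINTS.get(final_reason, "unrecognized; check the provider docs")
    notes.append(f"final stop_reason={final_reason}: {hint}")

    # Step 3: tool names in call order, plus any tool called too often
    names = [
        block["name"]
        for response in model_responses
        for block in response.get("content", [])
        if block.get("type") == "tool_use"
    ]
    notes.append("tool sequence: " + (" -> ".join(names) or "(none)"))
    for name, count in Counter(names).items():
        if count > repeat_threshold:
            notes.append(
                f"{name} called {count} times: look for a missing stop condition or failing retrieval"
            )
    return notes
```

Step 2 still needs the schema and the tool_result messages, so treat this as a filter for which traces to open, not a replacement for reading them.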

That's the chapter. The protocol-level mental model — what a tool definition contains, what flies over HTTP, how parallel and error cases work, and how to read a trace fast — is the unlock for every later chapter. When chapter 1.1 walks you through building "the smallest possible agent," you'll now understand why the loop is shaped the way it is. When chapter 2.1 wraps spans around tool_use blocks, you'll know what every span attribute means. When chapter 2.3 talks about tool-result injection, you'll understand exactly what the attack surface is.

The fastest way to internalize this material is to spend an afternoon reading traces from your own agent. Pick 10 successful runs and 10 failed runs. For each, run through the 3-step triage above. The fifth or sixth trace will start clicking. By the tenth you'll be triaging in seconds.

Question

I keep using SDKs that hide all this. Do I really need to understand the wire format?

For day-to-day work, no — the SDK abstractions are good. For debugging, yes. The SDK gives you objects with nice attribute access; the moment something goes wrong, you'll be looking at raw JSON from the API response trying to understand why the SDK is throwing. The thirty minutes you spend internalizing the wire shape pays back the first time you're staring at a 400 error at 11pm.

The other reason: SDKs differ. The Anthropic SDK and OpenAI SDK have different ergonomics, and tools like Vercel AI SDK / LangChain / LiteLLM add their own layers. Knowing the underlying protocol means you can debug any of them without having to learn each SDK's idiosyncrasies.
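As an illustration of how thin the difference is at the wire level, here is a sketch that normalizes both providers' tool calls into one internal type and converts results back. The `ToolCall` dataclass and the four helper names are hypothetical; the field names follow the tool_use / tool_result and function_call / function_call_output shapes as this chapter describes them.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    call_id: str
    name: str
    arguments: dict


def from_anthropic(block: dict) -> ToolCall:
    # Anthropic tool_use block: "input" is already a parsed object
    return ToolCall(block["id"], block["name"], block["input"])


def from_openai(item: dict) -> ToolCall:
    # OpenAI function_call item: "arguments" arrives as a JSON-encoded string
    return ToolCall(item["call_id"], item["name"], json.loads(item["arguments"]))


def to_anthropic_result(call: ToolCall, output: str, is_error: bool = False) -> dict:
    result = {"type": "tool_result", "tool_use_id": call.call_id, "content": output}
    if is_error:
        result["is_error"] = True
    return result


def to_openai_result(call: ToolCall, output: str) -> dict:
    return {"type": "function_call_output", "call_id": call.call_id, "output": output}
```

With an adapter like this at the edge, the dispatcher and handlers in the middle never need to know which provider produced the call.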

Question

How does this map to MCP (Model Context Protocol)?

MCP is one layer above. The on-the-wire shape we covered is between your code and the model API. MCP is between your code (acting as an MCP client) and an MCP server (which provides the tools). MCP servers expose tools to your code; your code then forwards those tools' definitions to the model API in the shape we covered.

The relationship: MCP gives you a way to consume tools that someone else maintains. The model API gives you a way to let the model call those tools. They're complementary, not alternatives. We touch on this lightly in chapter 4.x; the full MCP story would be its own chapter.
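The forwarding step is mostly a field rename. The sketch below assumes the MCP tool arrives as a plain dict with `name`, an optional `description`, and `inputSchema`, and converts it to the Anthropic-style definition shown at the start of this chapter; `mcp_tool_to_model_tool` is a hypothetical helper name.

```python
def mcp_tool_to_model_tool(mcp_tool: dict) -> dict:
    """Convert one MCP tool listing entry into a model-API tool definition."""
    return {
        "name": mcp_tool["name"],
        # description is optional on the MCP side; the model still benefits from one
        "description": mcp_tool.get("description", ""),
        # MCP uses camelCase "inputSchema"; the Anthropic shape uses "input_schema"
        "input_schema": mcp_tool["inputSchema"],
    }
```

An empty description is a warning sign in its own right: the model will have only the name and schema to go on when deciding whether to call the tool.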

Question

Anthropic has tool_search and programmatic tool calling now. Do I need to learn those?

Eventually, but not yet. Both are recent additions (late 2025) optimized for specific scaling problems: tool_search is for agents with hundreds or thousands of tools where shipping all the definitions in every request would blow your context budget; programmatic tool calling is for cases where tool I/O is huge and you want intermediate data to stay out of the model's context entirely.

If your agent has <50 tools and you're not passing megabytes through them, the basic patterns in this chapter are what you need. Learn the advanced features when you hit the specific scaling problem they solve, not preemptively.

End of chapter 0.3

Deliverable

A working mental model of tool use at the protocol level — the shape that every later chapter assumes. You can debug a "tool was called wrong" issue from a trace alone. You write tool definitions where the description and parameter docs do most of the work, with strict mode on by default. You know the difference between Anthropic's tool_use_id and OpenAI's call_id and won't confuse the two at 2am. You treat tool errors as data the model can recover from, not as exceptions that abort the run. The protocol below the SDK is now visible to you.

Tool definitions with careful descriptions (≥3 sentences) and per-parameter docs with examples
Pydantic models or hand-written schemas; strict mode enabled by default
Dispatcher that runs all tool calls in a turn, returns errors as is_error tool_results
Parallel execution for independent tools via asyncio.gather; sequential for state-changing ones
Validation in handler signatures (Pydantic) that produces actionable error messages
Awareness of the three common bugs: orphan results, missing results, malformed args
Triage protocol practiced on 10+ real traces from your own agent
Mental side-by-side of Anthropic vs OpenAI shapes; you can swap providers in <30 min