Cost and latency: where the dollars and the seconds actually go.
Most production agent problems eventually surface as either "this is too expensive" or "this is too slow." Both have the same underlying cause — a misunderstanding of where the cost and latency actually live inside an agent loop — and the same fix shape: build an accurate mental model first, then apply targeted techniques. This chapter shows you the breakdown of a real agent turn (the answer is not what most people guess), then teaches the three highest-leverage optimizations (prompt caching, the model cascade, and parallel/streaming patterns) with the discipline to measure each one. By the end you'll know exactly which lever to pull for a given problem, and how much to expect from pulling it.
Where the dollars and seconds actually go.
Before you can optimize, you have to know where the bill is coming from. The intuition most engineers carry — that output tokens dominate cost and that tool execution dominates latency — is wrong, and being wrong about it leads to optimization effort spent in the wrong places.
The shape of a real agent turn
A concrete example: a research agent answering a question. Four model calls, three tool calls, ending in a final answer. Here's where the money goes:
Turn 1: User asks "How do I tune Postgres autovacuum for write-heavy workloads?"
──────────────────────────────────────────────────────────────────────────────
Step   Type         Input tokens   Output tokens   Cost (Sonnet 4.5)   Latency
──────────────────────────────────────────────────────────────────────────────
1      model call          3,200             120              $0.011      1.8s
       (decides to search)
2      tool                    -               -               $0.00      0.4s
       (search_docs)
3      model call          4,800              80              $0.015      2.1s
       (decides to fetch one)
4      tool                    -               -               $0.00      0.6s
       (fetch_doc)
5      model call          7,200              90              $0.022      2.6s
       (decides to search again)
6      tool                    -               -               $0.00      0.5s
       (search_docs)
7      model call          9,400             640              $0.038      4.1s
       (synthesis: final answer)
──────────────────────────────────────────────────────────────────────────────
TOTAL                     24,600             930              $0.086     12.1s
──────────────────────────────────────────────────────────────────────────────
Stare at that table for a minute. Three things stand out:
Input tokens dominate. 24,600 input vs 930 output. The ratio is 26:1. At Sonnet pricing of $3 / $15 per million tokens (output costs 5× as much as input per token), input tokens still account for about 84% of the bill (roughly $0.074 of input against $0.014 of output). Most cost optimizations should target input, not output. The intuition "output tokens are 5× more expensive" is true per-token, but in agent loops you generate 20× to 30× fewer of them, so input wins the overall bill.
Input tokens grow with every model call. The first model call uses 3,200 input tokens; by the synthesis step it's up to 9,400, almost 3×. Why? Because every call carries forward the entire conversation history, including all prior tool results. Each tool result added to the context becomes input on every subsequent model call. Context is paid for on every call it is present in, not once.
Tool execution is fast; model calls are slow. Model calls account for 10.6s of the 12.1s total (88%). Tools account for 1.5s. If you want to make the agent faster, optimizing the model calls (or running them in parallel) is the lever; tool optimization is rounding error.
The cost equation in one line
For any agent turn, the cost is approximately:
cost ≈ Σ (input_tokens_at_step_i × input_price)
+ Σ (output_tokens_at_step_i × output_price)
The dominant term is almost always the first sum — and the steps
near the end of the loop dominate it because they carry the most
context forward.
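As a sanity check on the equation, the sketch below recomputes the example turn from the table above. It assumes the Sonnet rates quoted earlier ($3 and $15 per million tokens); the function and variable names are illustrative and not part of any SDK.

```python
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens (rate quoted above)
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens (rate quoted above)

# (input_tokens, output_tokens) for the four model calls in the example table.
MODEL_CALLS = [(3_200, 120), (4_800, 80), (7_200, 90), (9_400, 640)]


def turn_cost(calls, input_price=INPUT_PRICE_PER_MTOK, output_price=OUTPUT_PRICE_PER_MTOK):
    """Return (input_cost, output_cost) in USD for a list of model calls."""
    input_cost = sum(i for i, _ in calls) * input_price / 1_000_000
    output_cost = sum(o for _, o in calls) * output_price / 1_000_000
    return input_cost, output_cost


input_cost, output_cost = turn_cost(MODEL_CALLS)
input_share = input_cost / (input_cost + output_cost)
print(f"input ${input_cost:.4f}  output ${output_cost:.4f}  input share {input_share:.0%}")
```

The unrounded total comes out slightly above the table's $0.086 because the per-step costs in the table were rounded before being summed; the input share is the same either way.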
The four levers that move this equation, in order of effectiveness for a typical agent:
- Prompt caching on the stable prefix (system prompt, tool definitions, retrieved documents that persist). 90% discount on cached input tokens. Step 2.
- Model cascade: use Haiku for the routing steps where Sonnet is overkill. Step 3.
- Context budget discipline: keep accumulated context from ballooning across turns (summarize history, drop stale tool results).
- Output budget discipline: cap max_tokens appropriately; structured outputs are usually shorter than free-form.
For latency, the levers are different:
- Streaming: start showing output before the model finishes. Doesn't reduce total latency but makes perceived latency much better. Step 4.
- Parallel tool calls: when the model fires multiple independent tools in one turn, run them concurrently. Step 4.
- Model cascade: same lever as cost — smaller models are also faster. Step 3.
- Prompt caching: also reduces latency on cache hits (the model doesn't have to re-process cached tokens). Step 2.
Notice that prompt caching and model cascade appear on both lists. They're the two highest-leverage techniques — applying either reliably improves both cost and latency. The rest of this chapter walks through them in depth.
What to measure before optimizing
Before applying any technique, instrument four groups of numbers for every model call: token counts, cache counts, latency, and cost. Without them you can't tell whether an optimization helped or hurt:
# For every model call, log:
{
    "step": "synthesis",
    "model": "claude-sonnet-4-5",
    "input_tokens": 6200,                 # uncached input only (9,400 total with the cached 3,200)
    "cache_creation_input_tokens": 0,     # charged at 1.25x
    "cache_read_input_tokens": 3200,      # charged at 0.1x
    "output_tokens": 640,
    "latency_ms": 4100,
    "latency_to_first_token_ms": 820,     # if streaming
    "cost_usd": 0.0292,                   # computed from above
}
Record these as span attributes (chapter 2.1's observability infrastructure). The two cache-related fields are the ones most teams miss; without them you can't compute your effective input price or your cache hit rate. Once you have these numbers, every optimization claim becomes testable.
The single most common cost-optimization mistake: spending time tuning max_tokens to reduce output costs while the input bill is several times larger and untouched. The intuition "output is more expensive per token" leads people here, but output is a fraction of the total spend. Always measure first; optimize where the money actually is.
If output tokens dominate your bill, nothing is wrong; you have a different shape of agent. If your output is 90% of cost, you're either generating very long content (writing, code generation, long-form synthesis) or your agent loop is short (1-2 turns) with minimal accumulated context. Both are valid shapes; the breakdown just looks different.
The general principle still holds: measure where your cost is, then optimize there. The 20:1 input:output ratio is typical for multi-turn research/assistant agents. Single-turn generation agents skew the other way. Compute your actual ratio before choosing techniques.
The latencies in the example table are plausible for non-cached, non-streaming Sonnet calls with a few thousand input tokens. Latency scales roughly linearly with input length (more tokens means more processing) plus the time to generate output tokens. The 4.1s synthesis call works out to about 1s of input processing plus about 3s to generate 640 output tokens, roughly 200 tokens per second. That is a fast, lightly loaded case: at the 30-80 tokens per second Sonnet sustains under heavier load, generating those 640 tokens alone would take 8 to 21 seconds.
Two things shift these numbers significantly: cache hits cut input processing time by ~70%, and streaming makes the user perceive latency starting at first-token-out (typically 0.5-1s) instead of last-token-out. Both are covered later in this chapter.
Prompt caching: the single biggest cost lever.
Prompt caching is the optimization that most teams under-invest in, despite the fact that it's the single biggest cost lever available. Done right, it cuts input token costs by 60–90% on agent workloads. Done wrong, it does nothing and you don't notice because the API doesn't error — your bill is just higher than it could be.
How it works, mechanically
The mechanism is simple but worth understanding precisely: when you mark a portion of your prompt as cacheable, Anthropic stores the intermediate computation (the key-value attention cache, or "KV cache") for that prefix. The next request with the same prefix doesn't need to re-process it — the model resumes from the cached state.
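The pricing as of 2026, expressed as multipliers on the base input price:
Standard (uncached) input:   1×
Cache write, 5-minute TTL:   1.25×
Cache write, 1-hour TTL:     2×
Cache read (hit):            0.1×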
The break-even math: a 5-minute cache pays off after a single read. A 1-hour cache pays off after two reads. Both are nearly always worth it for any prefix that gets reused.
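The break-even claim can be checked with a few lines of arithmetic. The sketch below assumes the multipliers used in this chapter (1.25× and 2× for writes, 0.1× for reads) and measures cost in units of one uncached pass over the prefix; the function names are illustrative.

```python
def cost_with_cache(reads: int, write_multiplier: float, read_multiplier: float = 0.1) -> float:
    """One cache write plus `reads` cache reads, in units of one uncached pass."""
    return write_multiplier + reads * read_multiplier


def cost_without_cache(reads: int) -> float:
    """The same 1 + reads requests, each paying full price for the prefix."""
    return 1.0 + reads


def break_even_reads(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest number of reads at which caching is strictly cheaper.

    Only meaningful when read_multiplier < 1; otherwise caching never pays off.
    """
    reads = 0
    while cost_with_cache(reads, write_multiplier, read_multiplier) >= cost_without_cache(reads):
        reads += 1
    return reads


five_minute = break_even_reads(1.25)  # 1.35 vs 2.0 after one read
one_hour = break_even_reads(2.0)      # 2.1 vs 2.0 after one read, 2.2 vs 3.0 after two
```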
Important constraints:
- Minimum prefix size. Sonnet/Haiku: 1,024 tokens. Opus: 4,096 tokens. (Minimums differ between model versions, so confirm against the current documentation.) Below the minimum, your cache_control directive is silently ignored. (Most production system prompts exceed this; very short prompts don't benefit from caching anyway.)
- Exact prefix match. The cache hits only when the cached prefix is byte-identical to a stored entry. One extra space, one different timestamp, one rotated model snapshot: cache miss.
- Up to 4 cache breakpoints per request. You can cache several segments independently if your prompt has multiple stable layers.
The minimal-correct API call
Two ways to use it, both small additions to the standard call:
# Automatic caching: simplest, recommended starting point
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # default 5-min TTL
    system=LARGE_SYSTEM_PROMPT,
    tools=TOOLS,
    messages=messages,
)
# Anthropic decides where to put the breakpoint (typically end of system+tools).

# Explicit breakpoints: fine-grained control
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {"type": "text", "text": LARGE_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}}
    ],
    tools=TOOLS_WITH_CACHE,  # cache_control on the last tool
    messages=messages,
)
The response includes usage.cache_creation_input_tokens and usage.cache_read_input_tokens — the numbers that tell you whether caching is actually working. Always log these.
The architecture that makes caching pay off
The mistake most teams make is sprinkling cache_control across their code and assuming caching is now happening. It might be. It's also possible nothing is being cached because the "stable" content shifts on every request. The discipline is to structure your prompts as a stacked hierarchy of stability:
Concretely: put your system prompt and tool definitions first, with a cache breakpoint right after them. Conversation history goes next, with another breakpoint after the second-most-recent turn (so the recent turn is fresh but everything older is cached). The current user message is last and never inside a cache region.
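One way to lay out a request along those lines is sketched below. It is a minimal illustration, not a definitive implementation: build_cached_request and its helpers are hypothetical names, and it assumes each history message has non-empty content. It places one breakpoint after the tool definitions, one after the system prompt, and one at the end of the prior history, leaving the new user message outside any cache region.

```python
import copy


def _breakpoint():
    """A fresh cache_control marker (5-minute TTL by default)."""
    return {"type": "ephemeral"}


def _as_blocks(content):
    """Normalize message content (str or list of blocks) to a list of blocks."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return copy.deepcopy(content)


def build_cached_request(system_prompt, tools, history, user_message):
    """Lay out a request as: stable prefix, cached history, fresh user message."""
    tools = copy.deepcopy(tools)
    if tools:
        tools[-1]["cache_control"] = _breakpoint()  # breakpoint: end of tool definitions

    system = [{"type": "text", "text": system_prompt,
               "cache_control": _breakpoint()}]     # breakpoint: end of system prompt

    messages = [{"role": m["role"], "content": _as_blocks(m["content"])} for m in history]
    if messages:
        # Breakpoint: end of prior history, so everything older than the
        # new user message can be read from cache on the next call.
        messages[-1]["content"][-1]["cache_control"] = _breakpoint()

    # The current user message goes last and carries no cache marker.
    messages.append({"role": "user", "content": [{"type": "text", "text": user_message}]})
    return {"system": system, "tools": tools, "messages": messages}


request = build_cached_request(
    system_prompt="You are a Postgres support agent.",
    tools=[{"name": "search_docs", "description": "Search the docs.",
            "input_schema": {"type": "object", "properties": {}}}],
    history=[{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}],
    user_message="How do I tune autovacuum?",
)
```

The returned dict is shaped so it could be passed as keyword arguments alongside model and max_tokens. It uses three of the four available breakpoints, which leaves one spare for a large retrieved document.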
This structure typically achieves 60–80% cache hit rates on production agent workloads, which translates to roughly 50–70% reduction in total input token costs after accounting for cache write overhead.
The five anti-patterns that quietly break caching
The reason most "caching is enabled" implementations don't deliver the 90% savings: one of these five patterns is silently invalidating the cache on every request.
Anti-pattern 1: timestamps in the prefix. A line like "Current time: 2026-04-17T14:32:15Z" at the start of your system prompt changes on every request. Cache key changes → cache miss on every call. The fix: either drop the timestamp (does the model actually need second-precision?), truncate it to the day ("Date: 2026-04-17" only changes daily), or move it out of the cached region into the user message.
Anti-pattern 2: user-specific content in the cached prefix. Putting "You are helping {user.name} who works at {user.company}" in the system prompt makes every user get their own cache entry. Fine if you have hundreds of requests per user; expensive if you have thousands of users with few requests each. The fix: move user-specific content to the user message; keep the cached prefix user-agnostic.
Anti-pattern 3: inconsistent whitespace. Your prompt builder strips trailing whitespace inconsistently — sometimes there's a newline at the end, sometimes not. The cache hash doesn't care that the difference is invisible; it sees a different prefix. The fix: aggressively normalize. Always strip, always join with a single style of separator. A 30-line prompt-canonicalization function pays for itself within a day.
Anti-pattern 4: model snapshot rotations. When Anthropic releases a new minor snapshot of Sonnet, your cached entries on the old snapshot become useless. Cache writes cost extra; a snapshot rotation creates a temporary cost spike while caches warm back up. The fix: pin model snapshots in production (claude-sonnet-4-5-20250929 not claude-sonnet-4-5) so rotations are explicit choices, and plan cache warmup alongside any model upgrade.
Anti-pattern 5: prefix shorter than the minimum. You add cache_control to a 600-token system prompt and expect caching. The minimum is 1,024 tokens; nothing is cached. The API doesn't error; it just processes at full price. Always verify with the response's cache_creation_input_tokens and cache_read_input_tokens fields: the first is non-zero when a cache entry is written, the second when one is read. If both stay at zero across requests, nothing is being cached.
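The fixes for anti-patterns 1 and 3 are small enough to sketch. The two helpers below are hypothetical (the names and the exact normalization rules are assumptions, not a standard): one makes invisible whitespace differences disappear, the other produces a date line that changes once a day instead of on every request.

```python
import re
from datetime import datetime, timezone


def canonicalize_prompt(text: str) -> str:
    """Normalize invisible differences so equal prompts are byte-identical."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # one newline style
    lines = [line.rstrip() for line in text.split("\n")]    # drop trailing spaces and tabs
    text = "\n".join(lines)
    text = re.sub(r"\n{3,}", "\n\n", text)                  # collapse runs of blank lines
    return text.strip() + "\n"                              # exactly one trailing newline


def day_stamp(now: datetime) -> str:
    """Day-precision date line in UTC; expects a timezone-aware datetime."""
    return "Date: " + now.astimezone(timezone.utc).strftime("%Y-%m-%d")


a = canonicalize_prompt("You are helpful.  \r\n\r\n\r\n\r\nBe brief.\n\n")
b = canonicalize_prompt("You are helpful.\n\nBe brief.")
stamp = day_stamp(datetime(2026, 4, 17, 14, 32, 15, tzinfo=timezone.utc))
```

Collapsing blank lines is safe for instruction text but may not be for prompts that embed code or other whitespace-sensitive content; in that case, limit the normalization to the parts you control.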
Measuring whether caching is working
The metric to track in your observability dashboard:
cache_hit_ratio = cache_read_input_tokens / (cache_read_input_tokens + standard_input_tokens)

Healthy production workloads: 50–80%
Below 20%: prefix design is broken; audit for anti-patterns
Above 90%: probably leaving even more on the table; investigate if user-message tokens could be reduced too
The number alone isn't enough — track it over time and check for drops. A sudden cache hit ratio drop usually means an anti-pattern just crept in (someone added a timestamp, a model snapshot rolled, a deploy changed prompt formatting).
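A minimal sketch of the metric, computed over a batch of logged usage records. It assumes each record is a dict with the field names logged earlier in this chapter and that input_tokens counts only the uncached portion; the function names and the wording of the labels are illustrative.

```python
def cache_hit_ratio(usages):
    """cache_read / (cache_read + standard input), summed over usage records."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    standard = sum(u.get("input_tokens", 0) for u in usages)  # assumed uncached-only
    total = read + standard
    return read / total if total else 0.0


def diagnose(ratio):
    """Map a ratio onto the bands described above."""
    if ratio < 0.20:
        return "broken prefix: audit for anti-patterns"
    if ratio > 0.90:
        return "very high: check whether user-message tokens can shrink"
    if 0.50 <= ratio <= 0.80:
        return "healthy"
    return "between bands: watch the trend"


usages = [
    {"input_tokens": 3000, "cache_read_input_tokens": 7000},
    {"input_tokens": 1000, "cache_read_input_tokens": 9000},
]
ratio = cache_hit_ratio(usages)
```

Computing the ratio per hour or per deploy, rather than as one lifetime number, is what makes the sudden drops visible.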
The 5-minute vs 1-hour TTL decision
Anthropic's default cache TTL is 5 minutes. There's also a 1-hour TTL at 2× write cost. When to use which:
5-minute TTL for interactive workloads where you expect rapid follow-up requests. A chat agent gets many requests within minutes of each other; a 5-minute window captures most of them. If the user goes away and comes back 20 minutes later, you pay the cache write again — but that's rare and the 5-minute writes are cheap (1.25×).
1-hour TTL for batch workloads where the same context is used across many requests spaced minutes apart. A document-processing pipeline that takes 45 minutes to chew through 10,000 documents benefits from 1-hour TTL because the system prompt stays warm the entire run. Worth the 2× write because you'll get 10,000 reads at 0.1×.
Don't reach for 1-hour TTL by default. The 2× write penalty stings if your workload isn't actually long-lived; if you misjudge, 5-minute would have been cheaper.
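To make the trade concrete, the sketch below replays a list of request timestamps against each TTL and totals the prefix cost in units of one uncached pass. It is a simplified model under two assumptions: the multipliers are the ones quoted in this chapter, and every request (hit or write) restarts the TTL clock. The function name and the sample workloads are hypothetical.

```python
def prefix_cost(request_times_s, ttl_s, write_multiplier, read_multiplier=0.1):
    """Total cost of the cached prefix across requests, in uncached-pass units.

    Assumes each request, whether a hit or a write, restarts the TTL.
    """
    cost = 0.0
    expires_at = None
    for t in sorted(request_times_s):
        if expires_at is not None and t < expires_at:
            cost += read_multiplier   # cache still warm: cheap read
        else:
            cost += write_multiplier  # cold or expired: pay for a write
        expires_at = t + ttl_s
    return cost


# Interactive burst: four requests one minute apart.
chat = [0, 60, 120, 180]
chat_5min = prefix_cost(chat, ttl_s=300, write_multiplier=1.25)
chat_1hour = prefix_cost(chat, ttl_s=3600, write_multiplier=2.0)

# Sparse batch: six requests ten minutes apart.
batch = [i * 600 for i in range(6)]
batch_5min = prefix_cost(batch, ttl_s=300, write_multiplier=1.25)
batch_1hour = prefix_cost(batch, ttl_s=3600, write_multiplier=2.0)
```

In this model the burst is cheaper on the 5-minute TTL, while the sparse batch is far cheaper on the 1-hour TTL because the 5-minute cache expires between every pair of requests. The gap between requests, not the length of the run, is what decides it.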
The real numbers, from a realistic workload
To anchor expectations: a customer-support agent handling 1,000 conversations a day, each averaging 4 turns, with a 2,000-token system prompt and accumulated context growing from 2K to 8K across turns. Doing the math at Sonnet pricing:
Without caching:
  System prompt (2K) sent on every call:  2K × 4 turns × 1000 conv = 8M tokens
  Other input (~5K avg per call):                                  = 20M tokens
  Total input:                            28M tokens × $3/M        = $84/day
  Output (~500 tokens × 4 calls × 1000):  2M × $15/M               = $30/day
                                                                     ────────
  Total:                                  $114/day, $42K/year

With 5-minute caching (assuming 70% cache hit ratio):
  Cached input:    0.70 × 28M = 19.6M tokens × $0.30/M = $5.88/day
  Standard input:  0.30 × 28M = 8.4M × $3/M            = $25.20/day
  Cache writes:    ~2.8M tokens × $3.75/M              = $10.50/day
  Output:          same                                = $30/day
                                                         ────────
  Total:           $71.58/day, $26K/year
  Savings:         $42K - $26K = $16K/year, ~38% reduction

With model cascade on top (next step):
  ... ~$45/day, ~$16K/year (further 37% reduction)
The cache savings alone are real and meaningful. Stacked with model cascade (Step 3), production agents typically end up at 20–30% of their unoptimized cost.
If you implement one thing from this chapter this week, implement prompt caching on your largest stable prompt prefix. It's a single API parameter, the discount is real (60–80% on input), and the only thing standing between you and the discount is not auditing your prefix for the five anti-patterns. Set aside an hour, do the audit, deploy. Most teams see meaningful bill reduction within 24 hours.
OpenAI introduced prompt caching in late 2024 with similar mechanics — a discount (50% as of early 2026, not 90%) on cached input tokens, with automatic detection of repeated prefixes. The architecture decision is the same (stable prefix, volatile suffix); the discount magnitude is smaller. The same anti-pattern audit applies.
Across providers, the principle is becoming standardized: structure your prompts to maximize prefix stability and you get a discount. The specific multipliers vary; the discipline doesn't.
Caching can raise your bill in two scenarios. First, if your workload genuinely has no repeated prefix (every request is unique end-to-end), you pay the 1.25× write multiplier on cache writes that never get read. Cost goes up by 25% on the input side. Unlikely in real agent workloads (system prompts and tool definitions almost always repeat), but possible if you've architected unusually.
Second, if you have hundreds of distinct cache prefixes that each get written once and read once or twice. The write overhead dominates and the math just barely works out. The fix: consolidate prefixes (fewer unique system prompts) or accept that this particular workload doesn't benefit from caching.
If you want to be defensive: instrument the metric, and if your cache hit ratio is below 20% for two weeks, turn caching off until you've fixed the prefix design.
Tool results are cached only indirectly. They become part of the message history, which then becomes input on the next call. If your message history sits inside a cache region (with a breakpoint after it), older tool results stay cached and get the 90% discount. If your last breakpoint comes before the history, for example right after the system prompt and tools, the tool results are billed at full price on every call.
The pattern: cache breakpoint after message history, before the new user message. That way each new turn pays full price only on the new user message and any tool calls that turn, while everything that came before is on the cache discount.
The model cascade: smaller models for the easy steps.
Most agents use one model for everything. They call claude-sonnet-4-5 for the planning step, the tool-decision step, the synthesis step, and the formatting step. This is the easy way to build, and for low-volume agents it's fine. For production agents it leaves serious money on the table.
The insight is simple: different steps of an agent loop have different difficulty. Deciding whether to search the docs is easy. Synthesizing a final answer from retrieved chunks is hard. Using the same model for both means you're paying Sonnet prices for tasks Haiku could handle equally well.
The model lineup, with prices and speeds
The current Anthropic and OpenAI lineups, with rough characterizations:
The 3-5× price differential between Haiku and Sonnet is the lever. If half your model calls in an agent loop can move to Haiku without quality loss, you save substantial money without changing anything else.
Which steps can move to a smaller model?
The framework that works in practice: match the model to what the step actually needs to do. Three categories:
Routing / dispatch / classification steps → smallest model that works. "Should we search docs, search the web, or answer directly?" "Is this question about billing, technical support, or general?" "Does this query require multiple sources?" These steps have low-cardinality outputs (3-5 options) and don't require deep reasoning. Haiku handles them at quality indistinguishable from Sonnet, and 3× cheaper.
Tool-decision steps → mid-tier model (or smallest that works). "Given this user message and these tools, which tool should I call with what arguments?" Slightly harder than pure classification because argument formation requires understanding the question. Haiku usually works; check on your eval set. If Haiku misroutes more than 5-10% of cases, move to Sonnet.
Synthesis / final-answer steps → biggest model that fits your budget. Writing the final answer that the user sees. This is where quality is most visible, where bad outputs hurt user satisfaction, and where the marginal cost of a better model is most justified. Default to Sonnet here. Reach for Opus when synthesis is genuinely hard (multi-document reasoning, code review).
The concrete pattern in code
# agent/loop.py: model cascade
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

ROUTING_MODEL = "claude-haiku-4-5-20251001"  # cheap, fast
SYNTHESIS_MODEL = "claude-sonnet-4-5"        # quality

# The prompts, TOOLS, parse_route, build_messages, build_synthesis_messages
# and dispatch_tools are application-specific and defined elsewhere.

async def run_agent(user_message: str):
    # Step 1: route. Which path does this question need?
    # Haiku is enough: this is a classification.
    route = await client.messages.create(
        model=ROUTING_MODEL,
        max_tokens=50,
        system=ROUTING_PROMPT,
        messages=[{"role": "user", "content": user_message}],
    )
    path = parse_route(route)  # "search" | "calculate" | "direct"

    # Step 2: tool-using loop. Use Haiku; it can call simple tools fine.
    tool_results = []
    for step in range(5):
        response = await client.messages.create(
            model=ROUTING_MODEL,  # still Haiku for tool decisions
            max_tokens=1024,
            tools=TOOLS,
            system=TOOL_PROMPT,
            messages=build_messages(user_message, tool_results),
        )
        if response.stop_reason != "tool_use":
            break
        # dispatch tools, collect results
        tool_results.extend(await dispatch_tools(response))

    # Step 3: synthesis. Write the final answer. Use Sonnet.
    # This is where quality matters most.
    final = await client.messages.create(
        model=SYNTHESIS_MODEL,
        max_tokens=2048,
        system=SYNTHESIS_PROMPT,
        messages=build_synthesis_messages(user_message, tool_results),
    )
    return final.content[0].text
Three kinds of step, two model tiers. The routing and tool-use steps run on Haiku; only the user-visible synthesis runs on Sonnet. For an agent that previously ran everything on Sonnet, this is typically a 50–60% cost reduction with no quality regression on a well-designed eval suite.
How to validate the cascade on your evals
The right way to roll out a model cascade: keep Sonnet as baseline, swap in Haiku for one step at a time, run your eval suite (chapter 3.1), measure the delta. If the metric you care about (typically trajectory_pass_rate or your equivalent) doesn't move, the swap is safe. If it drops, the step needs the bigger model.
The order to try, lowest risk first:
- Routing/classification steps (lowest risk). Haiku handles these with negligible quality loss.
- Simple tool calls with structured output (search_docs, get_user, etc.). Haiku usually works.
- Multi-step tool sequences where the model needs to reason about prior tool results. Haiku sometimes degrades; measure.
- Synthesis steps with simple inputs. Maybe Haiku works for short factual answers.
- Synthesis steps with complex inputs. Probably needs Sonnet. Eval will tell you.
The discipline: every cascade decision is an eval question, not an intuition question. "Can Haiku handle this step?" gets answered by running the eval suite with Haiku on that step and comparing to the Sonnet baseline. If the answer is "yes, score didn't move," ship the cheaper version. If it's "no, score dropped 2 points," keep Sonnet.
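The ship-or-keep decision can be written down as a small gate. The sketch below is illustrative only: EvalResult and swap_is_safe are hypothetical names, and the one-point tolerance is an assumed default that you would replace with your own suite's noise floor.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def swap_is_safe(baseline: EvalResult, candidate: EvalResult, max_drop: float = 0.01) -> bool:
    """True when the candidate's pass rate is at most `max_drop` below baseline."""
    return baseline.pass_rate - candidate.pass_rate <= max_drop


sonnet_baseline = EvalResult(passed=182, total=200)     # 91.0%
haiku_on_routing = EvalResult(passed=181, total=200)    # 90.5%: within tolerance
haiku_on_synthesis = EvalResult(passed=176, total=200)  # 88.0%: three points down

ship_routing = swap_is_safe(sonnet_baseline, haiku_on_routing)
ship_synthesis = swap_is_safe(sonnet_baseline, haiku_on_synthesis)
```

Running the gate once per step, with only that step swapped, keeps each result attributable to a single change.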
The reasoning-model exception
One additional axis: reasoning-enabled models (Opus 4.7 with extended thinking, GPT-5 reasoning mode). These are slower and pricier even than their non-reasoning counterparts because they generate hidden "thinking" tokens before responding. Worth using only for steps where the extra reasoning actually changes the answer — typically complex planning or multi-step logical inference.
The mental model from chapter 0.1 applies: extended thinking helps on hard reasoning problems, hurts on simple tasks. In the cascade, this means:
- Routing/classification: never reasoning mode. Haiku or non-reasoning Sonnet.
- Tool decisions: rarely reasoning mode. Only if your agent does complex multi-step planning.
- Synthesis: depends on complexity. Reasoning for code review, math, multi-doc analysis. Non-reasoning for everything else.
A pragmatic default for new projects
If you're starting a new agent and don't know yet which steps need which model, here's a defensible starting point that you'll tune over time:
Routing / classification:  Haiku 4.5
Tool decisions:            Haiku 4.5 (start here; upgrade if eval fails)
Synthesis:                 Sonnet 4.5

If you have an Opus budget, use Opus only for:
- Synthesis on the highest-stakes paths
- Multi-document analysis or code-review style tasks

Never default to Opus across the board. The cost compounds badly in agent loops, and the quality gap vs Sonnet rarely justifies it on non-frontier tasks.
The compound effect with caching
Cascade and caching stack multiplicatively. A workload that costs $100/day with no optimization, $60/day with caching alone, and $40/day with cascade alone, costs roughly $25/day with both, because the cascade reduces the model-tier multiplier and caching reduces the input-token multiplier simultaneously. Caching removes about 40% and the cascade about 60%; together they leave 0.6 × 0.4 = 0.24 of the original bill, a reduction of about 75%.
The dollar amounts depend on your specific workload, but the multiplicative structure is general. If you've implemented one and not the other, you're leaving money on the table that gets recovered by implementing the second.
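The multiplicative structure is easy to verify. The sketch below uses the illustrative figures from this section ($100/day baseline, $60 with caching alone, $40 with cascade alone) and assumes the two reductions are independent, which is an approximation; stacked_cost is a hypothetical helper name.

```python
def stacked_cost(base: float, *reductions: float) -> float:
    """Apply independent fractional reductions one after another."""
    cost = base
    for reduction in reductions:
        cost *= 1.0 - reduction
    return cost


caching_reduction = 1 - 60 / 100  # caching alone: $100 -> $60
cascade_reduction = 1 - 40 / 100  # cascade alone: $100 -> $40

both = stacked_cost(100.0, caching_reduction, cascade_reduction)
combined_reduction = 1 - both / 100.0
```

In practice the two overlap somewhat (the cascade changes which tokens are billed at which rate), so treat the product as an estimate to confirm against your own bill.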
You can't reliably tell whether a cheaper model is good enough without running an eval. Spot-checking 5 examples doesn't tell you whether the 5th percentile case regressed. The discipline from chapter 3.1 applies: every model swap is a hypothesis with a prediction ("swapping routing to Haiku should not change trajectory_pass_rate by more than 1 point") and a verdict ("ran the eval; pass rate moved 0.3 points, well inside the noise floor of 0.044, i.e. 4.4 points → ship it").
If your eval suite is too slow to run on every cascade decision, you have a different problem to solve first (chapter 3.1). The fast subset of the eval (10–15 examples that run in 2–3 minutes) is enough for go/no-go on most cascade decisions.
Mixing providers inside one agent loop is generally not worth it, with one specific exception. The reasons not to: different families have subtly different behavior on tool use, structured output, and instruction following. Mixing them inside a loop means debugging two failure modes instead of one, and your tool-call patterns may need provider-specific tweaks.
The exception: when you have a strong reason that one family is meaningfully better at a specific subtask (e.g., GPT-5-mini at very-cheap classification, Claude Haiku at instruction-following with specific output formats). For most teams this isn't worth the complexity. Pick one provider, cascade within their lineup.
Letting the model choose its own tier is something some providers are exploring (Anthropic's hybrid reasoning modes, where the model decides how much thinking to do). For now, the decision of "which model handles this step" is something you make in code, based on your evals. The model can't self-route at the per-step level; your dispatch code does it.
The future direction is interesting: a single API endpoint that decides per-request how much compute to spend. As of mid-2026 this is partially available (extended thinking with auto-budget) but not yet at the level of "pick between Haiku and Sonnet for this turn." Keep an eye on it.
Latency: streaming, parallelism, and the perceived/actual split.
Cost and latency overlap in their levers but not perfectly. Two patterns matter for latency specifically: streaming (which doesn't reduce total time but improves perceived time dramatically) and parallel tool execution (which does reduce total time, sometimes by a lot). Both deserve their own treatment.
The perceived/actual latency split
The first conceptual move: users care about perceived latency, not actual latency. Perceived latency is how long the user feels they're waiting, which is dominated by the time before they see anything. Actual latency is how long the agent takes from start to finish. These can be very different numbers.
An agent that takes 12 seconds to produce a complete answer, with nothing on screen for the first 11 seconds, feels slow. The same agent that takes 12 seconds to produce a complete answer but starts showing tokens at 0.8 seconds feels fast — even though the actual latency is identical. The user is reading while the agent is generating; their wait time is the time until they have something to read, not the time until generation completes.
This is the fundamental case for streaming, and it's why streaming is non-negotiable in any user-facing agent.
Streaming, from the wire up
Both Anthropic and OpenAI support streaming via Server-Sent Events. The model generates one token at a time and emits each one as it's produced. The client (browser, mobile app, terminal) consumes the stream and displays tokens as they arrive.
# Streaming with the Anthropic SDK
from anthropic import AsyncAnthropic

anthropic_client = AsyncAnthropic()

async def stream_anthropic_answer(messages):
    async with anthropic_client.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        messages=messages,
    ) as stream:
        async for text in stream.text_stream:
            # Yield each text chunk to the client as it arrives.
            yield {"type": "token", "text": text}
        # Read the final message while the stream is still open.
        final = await stream.get_final_message()
        # final.usage has the token counts; useful for cost tracking

# Streaming with the OpenAI SDK (Responses API)
from openai import OpenAI

openai_client = OpenAI()

def stream_openai_answer(message):
    stream = openai_client.responses.stream(
        model="gpt-5.5",
        input=[{"role": "user", "content": message}],
    )
    with stream as s:
        for event in s:
            if event.type == "response.output_text.delta":
                yield {"type": "token", "text": event.delta}
        # final response, with full output and usage
        final = s.get_final_response()
Two metrics to track on streaming endpoints:
- Time to first token (TTFT): how long until the first piece of output appears. This is the number that defines perceived latency. Healthy: under 1s. Concerning: over 2s. Indicates input-processing speed.
- Tokens per second (TPS): generation speed once tokens start flowing. Sonnet typically runs at 30-80 TPS; Haiku faster. This affects how long the user reads while waiting for the rest. Less critical than TTFT but worth monitoring.
TTFT is the metric to optimize. If your TTFT is 3 seconds, it doesn't matter that your TPS is fast — the user is staring at a blank screen for three seconds and concluding that your product is slow. Reduce TTFT first.
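Both metrics can be captured by wrapping whatever stream you already consume. The sketch below is provider-agnostic and uses a fake stream as a stand-in for a model; measure_stream and fake_stream are hypothetical names. It counts chunks rather than tokens, since a streamed chunk may carry more than one token; for true tokens per second, divide the output token count from the final usage record by the generation time instead.

```python
import asyncio
import time


async def measure_stream(token_stream):
    """Consume an async iterator of text chunks.

    Returns (text, ttft_seconds, chunks_per_second); the last two are None
    when there is not enough data to compute them.
    """
    start = time.perf_counter()
    first_at = None
    chunks = []
    async for chunk in token_stream:
        if first_at is None:
            first_at = time.perf_counter()  # first output has arrived
        chunks.append(chunk)
    end = time.perf_counter()

    ttft = (first_at - start) if first_at is not None else None
    generation_time = (end - first_at) if first_at is not None else 0.0
    rate = None
    if generation_time > 0 and len(chunks) > 1:
        rate = (len(chunks) - 1) / generation_time
    return "".join(chunks), ttft, rate


async def fake_stream():
    """Stand-in for a model: 50 ms before the first chunk, then 10 ms per chunk."""
    await asyncio.sleep(0.05)
    for word in ["Tune ", "autovacuum ", "per ", "table."]:
        yield word
        await asyncio.sleep(0.01)


text, ttft, rate = asyncio.run(measure_stream(fake_stream()))
```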
What pushes TTFT down
The largest contributor to TTFT is input token count. The model has to process every input token before it can produce the first output token. For a 10K input prompt, this typically takes 500-1500ms on Sonnet, even before generation starts. Cut that to 3K input via better context discipline, and TTFT drops to 200-500ms.
Other factors:
- Cache hits reduce TTFT substantially because cached tokens skip processing. A request hitting cache typically has TTFT 30–50% lower than the same request as a cold cache write.
- Model size: Haiku's TTFT is generally 30-50% of Sonnet's at equivalent input size.
- Server-side load: provider-side latency varies with how busy their infrastructure is. Less under your control, but the request ID in the response headers (request-id on Anthropic, x-request-id on OpenAI) lets you correlate with their dashboards if something looks anomalous.
Parallel tool calls
The second latency lever is concurrent execution. From chapter 0.3, you know the agent can fire multiple tool calls in a single turn. The question for this chapter is what to do with them: serialize or parallelize?
The naive dispatcher iterates tool calls sequentially:
# Naive: sequential, wastes latency (runs inside an async function)
results = []
for block in tool_use_blocks:
    result = await HANDLERS[block.name](**block.input)
    results.append(result)
# 3 tools × 400ms each = 1.2s
The production dispatcher runs them concurrently:
# Production: parallel, bounded latency = max(individual)
import asyncio

async def run_one(block):
    try:
        result = await HANDLERS[block.name](**block.input)
        return tool_result_block(block.id, result)
    except Exception as e:
        # Report the failure to the model instead of crashing the whole turn.
        return tool_result_block(block.id, f"Error: {e}", is_error=True)

# Inside an async function:
results = await asyncio.gather(*[run_one(b) for b in tool_use_blocks])
# 3 tools × 400ms each ≈ 450ms (longest + overhead)
For a research agent that fires 5 retrieval calls in a single turn, this is the difference between 2 seconds and 400 milliseconds. Free latency improvement — no quality trade.
The caveat: state-changing tools (writes, sends, deletes) shouldn't be parallelized blindly. Two concurrent delete_record calls might race. The safe pattern: separate tools into read-only and state-changing categories; parallelize the former, sequentialize the latter.
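A minimal sketch of that split, using stand-in handlers so it runs on its own. The tool names, the READ_ONLY_TOOLS set, and the (call_id, name, kwargs) tuple shape are all assumptions for illustration. Note that this version runs every read before any write, which is only appropriate when the reads in a turn don't depend on that turn's writes.

```python
import asyncio

# Hypothetical classification: tools with no side effects.
READ_ONLY_TOOLS = {"search_docs", "fetch_doc"}


async def dispatch(tool_calls, handlers):
    """Run read-only calls concurrently, then state-changing calls one at a time.

    `tool_calls` is a list of (call_id, tool_name, kwargs); returns {call_id: result}.
    """
    reads = [c for c in tool_calls if c[1] in READ_ONLY_TOOLS]
    writes = [c for c in tool_calls if c[1] not in READ_ONLY_TOOLS]

    results = {}
    read_results = await asyncio.gather(
        *(handlers[name](**kwargs) for _, name, kwargs in reads)
    )
    for (call_id, _, _), result in zip(reads, read_results):
        results[call_id] = result

    for call_id, name, kwargs in writes:  # sequential, in the order emitted
        results[call_id] = await handlers[name](**kwargs)
    return results


# Demo with stand-in handlers.
write_log = []


async def search_docs(query):
    await asyncio.sleep(0.01)
    return f"docs:{query}"


async def delete_record(record_id):
    write_log.append(("start", record_id))
    await asyncio.sleep(0.01)
    write_log.append(("end", record_id))
    return "deleted"


HANDLERS = {"search_docs": search_docs, "delete_record": delete_record}
CALLS = [
    ("c1", "search_docs", {"query": "vacuum"}),
    ("c2", "delete_record", {"record_id": 1}),
    ("c3", "search_docs", {"query": "bloat"}),
    ("c4", "delete_record", {"record_id": 2}),
]
results = asyncio.run(dispatch(CALLS, HANDLERS))
```

The write log shows each delete finishing before the next one starts, which is the property the sequential branch exists to guarantee.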
The latency budget
Set a per-turn latency budget and enforce it. For interactive agents, 3 seconds is a reasonable ceiling on perceived latency (TTFT under 1s, total response under 3s for short answers; longer is OK if the agent shows progress along the way).
The budget framework:
For a multi-turn agent, the budget at each step:
─ Pre-flight (auth, validation, cache lookup): < 50ms
─ Initial model call (routing/tool decision): < 1500ms
─ Tool execution (in parallel): < 1000ms (max of any individual)
─ Synthesis call: < 2500ms (with TTFT < 800ms)
─ Network/streaming overhead: < 200ms
Total budget: ~5s, with TTFT under 1s.
For agents that legitimately need more time (deep research, long synthesis), use Pattern B from chapter 2.4: submit/poll with streaming events. The latency budget then applies per-event, not per-end-to-end-response.
The discipline: instrument every component of the budget. When the agent feels slow, look at the span tree (chapter 2.1) and find the component that exceeded its budget. Usually it's the model call (input too large, no cache hit) or sequential tool execution. Both are fixable.
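Finding the component that exceeded its budget can be automated once span durations are available. A minimal sketch, assuming each turn yields a dict of component name to measured milliseconds; the component keys and the `budget_breaches` helper are hypothetical names, while the limits are the ones from the budget framework:

```python
# Budgets in milliseconds, taken from the budget framework in this section.
BUDGET_MS = {
    "preflight": 50,
    "initial_model_call": 1500,
    "tool_execution": 1000,
    "synthesis_call": 2500,
    "network_overhead": 200,
}


def budget_breaches(spans_ms: dict) -> list:
    """Return (component, measured_ms, budget_ms) for each breach, worst overage first."""
    breaches = [
        (name, ms, BUDGET_MS[name])
        for name, ms in spans_ms.items()
        if name in BUDGET_MS and ms > BUDGET_MS[name]
    ]
    return sorted(breaches, key=lambda b: b[1] - b[2], reverse=True)


turn = {
    "preflight": 12,
    "initial_model_call": 2100,
    "tool_execution": 1300,
    "synthesis_call": 2400,
    "network_overhead": 90,
}
print(budget_breaches(turn))
# [('initial_model_call', 2100, 1500), ('tool_execution', 1300, 1000)]
```

Sorting by overage puts the component most worth fixing at the top of the alert.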
Speculative execution: when latency really matters
For high-stakes latency situations, an advanced technique: start the next step's work before the previous step finishes, on a guess about what the next step needs.
Example: a research agent's routing step decides whether to search docs, search web, or answer directly. Three options, with roughly equal probability. Instead of waiting for the routing decision before starting any work, you could fire all three searches in parallel — pre-warming them — and then use only the result from the path the routing step actually chose.
This costs 3× the tool execution money but cuts perceived latency to the maximum of (routing call time, tool call time) instead of their sum. Only worth it when latency is genuinely critical and the speculative cost is small (a tool call is way cheaper than a model call, so the trade often works).
For most agents this is over-engineering. For chatbots competing on responsiveness or for trading/real-time agents where every 100ms matters, it's the right move. Don't reach for it preemptively; reach for it when your perceived latency is the limit on user satisfaction and you've exhausted simpler levers.
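The routing example can be sketched with asyncio tasks: start every candidate tool call, await the routing decision, keep the chosen result, and cancel the rest. This is a sketch under assumptions: `speculative_turn`, `route`, and the `tools` mapping are hypothetical, and it presumes the speculated tools are read-only, since cancelled work may already have reached the backend and been billed.

```python
import asyncio


async def speculative_turn(route, tools, query):
    """route: async fn returning a key of `tools`, or any other value for "answer directly".
    tools: {name: async fn(query)}. All tools start before routing finishes."""
    # create_task schedules each tool immediately, so they run while we await route().
    pending = {name: asyncio.create_task(fn(query)) for name, fn in tools.items()}
    try:
        choice = await route(query)
        result = await pending[choice] if choice in pending else None
    finally:
        for task in pending.values():
            if not task.done():
                task.cancel()  # discard the guesses we did not need
        # Wait for cancellations to settle so no task is left dangling.
        await asyncio.gather(*pending.values(), return_exceptions=True)
    return choice, result
```

Total time is roughly the maximum of the routing call and the chosen tool call, matching the trade described above; the price is the tool calls that get thrown away.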
What the right shape looks like
Pulling it all together, a well-optimized production agent looks like this:
This is the shape that lets you serve real users sustainably. None of these numbers are achievable on day one — you build to them iteratively, using the levers in this chapter, with the measurements from chapter 2.1's observability work to know whether you're moving the right metrics.
The right order to apply the techniques in this chapter, on a fresh agent: (1) instrument the four cost/latency metrics; (2) implement prompt caching on the largest stable prefix; (3) roll out model cascade one step at a time, validated against your eval suite; (4) audit tool dispatcher for sequential-where-it-should-be-parallel; (5) enable streaming if you haven't; (6) set a latency budget per step and alert on breaches. Each step is a one-day investment that pays back within a week of production traffic.
Four-second TTFT is high; expected is under 1s on a cached prefix and under 2s on uncached for typical input sizes. Diagnostic order:
- Check input token count. If you're sending 50K tokens to Sonnet on every request, 2-4s TTFT is the expected price of admission. Reduce input via context trimming or cache hits.
- Check cache hit ratio. If it's near zero, you're paying full processing cost every call. See Step 2 for anti-patterns.
- Check model tier. Opus has substantially higher TTFT than Sonnet, which has higher than Haiku. If you're on Opus, downgrade where you can.
- Check provider-side weather. Occasionally inference clusters are under load. Look at provider status pages and the request-id in response headers.
Streaming can complicate the code: if you have two code paths for "produce text" vs "stream text," they tend to diverge. The cleanest pattern: always stream internally, and have the non-streaming code path be a thin wrapper that consumes the stream into a single string. Then the agent loop is single-shape (async generator), and "non-streaming" is just a different way of consuming the output.
For tool calls where you don't need streaming, the streaming overhead is negligible (a few ms). The unified-shape benefit far outweighs it.
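The "always stream internally" pattern is small enough to show in full. A minimal sketch, with `generate_stream` as a stand-in for a real streaming model call (an assumption, not a real SDK function) and `generate` as the thin non-streaming wrapper:

```python
import asyncio
from typing import AsyncIterator


async def generate_stream(prompt: str) -> AsyncIterator[str]:
    """The single internal code path. Stand-in for a streaming model call."""
    for word in f"echo: {prompt}".split(" "):
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield word + " "


async def generate(prompt: str) -> str:
    """Non-streaming API: drains the same stream into one string."""
    return "".join([chunk async for chunk in generate_stream(prompt)]).rstrip()
```

Because `generate` contains no model logic of its own, a fix made to the streaming path reaches both callers automatically.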
Batch API gives 50% off both input and output tokens, with a 24-hour processing window. It stacks with prompt caching, so cache reads inside a batch cost 0.5 × 0.1 = 0.05× the base input price. The catch: it's async and slow. You submit a batch, you check back later for results.
For interactive agents, batch is irrelevant — your latency budget is seconds, not hours. For offline workloads (eval suite runs, batch document processing, training data generation, periodic reports), batch can be the right answer. Don't try to retrofit batch into a real-time path; that's not what it's for.
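The stacking arithmetic is easy to check with a few lines. A simplified sketch: the multipliers are the ones quoted above, the $3 per million input tokens is an assumed illustrative base price, `input_cost` is a hypothetical helper, and the premium charged for cache writes is ignored.

```python
# Multipliers quoted in the text.
BATCH_DISCOUNT = 0.5  # batch: 50% off input and output
CACHE_READ = 0.1      # cache reads billed at 0.1x base input


def input_cost(tokens: int, base_per_mtok: float, cached_fraction: float, batch: bool) -> float:
    """Dollar cost of input tokens, given the share served from cache.
    Simplification: ignores the extra cost of writing to the cache."""
    cached = tokens * cached_fraction * CACHE_READ
    uncached = tokens * (1 - cached_fraction)
    cost = (cached + uncached) * base_per_mtok / 1_000_000
    return cost * BATCH_DISCOUNT if batch else cost


# 1M input tokens at an assumed $3/MTok base price:
standard = input_cost(1_000_000, 3.0, cached_fraction=0.0, batch=False)  # about $3.00
stacked = input_cost(1_000_000, 3.0, cached_fraction=1.0, batch=True)    # about $0.15
```

The fully cached batch case comes out at 0.05× the standard price; real workloads land between the two figures, depending on how much of each request hits the cache.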
Deliverable
A cost-and-latency-disciplined agent: instrumented metrics on every model call, prompt caching active on the stable prefix with 60-80% hit ratio, model cascade matching tier to task difficulty, parallel tool dispatch where independent, streaming for interactive paths. A clear understanding of where the bill comes from (input tokens, accumulated across turns) and where the seconds come from (model calls, mostly serial). The mental model that turns "optimize cost" and "optimize latency" from vague aspirations into specific levers you know how to pull. The economic substrate that lets the agent serve real users at a real cost basis.
- Token-level instrumentation: input, output, cache_creation, cache_read, latency per model call
- Prompt caching enabled on the largest stable prefix; minimum 1,024 (Sonnet) / 4,096 (Opus) tokens
- Cache hit ratio dashboard with alerts for drops below threshold
- Anti-pattern audit: no timestamps, no user-specific data, normalized whitespace in cached region
- Model cascade: Haiku for routing, Sonnet for synthesis, Opus only when measurably needed
- Each cascade decision validated against the eval suite with predicted-and-measured deltas
- Parallel tool dispatch via asyncio.gather; state-changing tools sequential
- SSE streaming on interactive endpoints; non-streaming as thin wrapper
- TTFT measured and tracked; target under 1s on cached prefix
- Per-turn latency budget defined and enforced; alerts on breaches
- Batch API for offline workloads (evals, document processing); standard for interactive