Eval-Driven Dev — The Agentic AI Field Guide

3.1

Part III / Evaluate · The discipline that compounds

Eval-driven development.

You built your first eval suite in chapter 1.4. This chapter turns that one-off into a daily practice. Score deltas as the unit of progress. Regression budgets that let you trade subscore movements honestly. A PR-comment workflow that puts eval results in front of every code review. The cadence that keeps you from optimizing on vibes. By the end you'll have an agent codebase where no change ships without numbers, and the numbers will compound in your favor over months instead of drifting backwards.

STEP 1

Deltas, not scores.

The first thing to internalize about eval-driven development is that the unit of progress is not the score — it's the score delta across a specific change. If you remember nothing else from this chapter, remember this. The number 73.4% means nothing on its own. The number "+1.8 vs main" means something useful. And the number "+1.8 on retrieval recall but -3.1 on grounding faithfulness" tells you exactly what to do next.

Teams new to evals routinely make the same mistake. They run the suite once, read the headline number, and either celebrate ("we're at 78%, that's pretty good") or despair ("we're at 41%, we're not ready"). Both reactions miss the point. Without a comparison baseline, the score is just a number. Without a per-subscore breakdown, you can't decide whether a change was good. And without a reproducible baseline you can re-measure against, you can't even compute the delta reliably.

The minimum viable scoreboard

What you actually need is a small piece of infrastructure: a CSV (or table, or scoreboard.jsonl) that records, for every meaningful change, the full subscore breakdown alongside the commit SHA. From chapter 1.4 you already have scoreboard.csv — we're going to extend it.

# scoreboard.csv — what e-d-d depends on
commit,timestamp,branch,overall,retrieval_recall_at_5,retrieval_mrr,verifier_agreement,trajectory_pass_rate,trajectory_steps_avg,cost_per_run_usd,notes
b3a1f7e,2026-04-12T08:14:22Z,main,0.732,0.81,0.74,0.88,0.69,4.2,0.041,baseline
c9e2a44,2026-04-12T10:33:01Z,feat/rerank,0.768,0.86,0.79,0.87,0.71,4.4,0.044,add cross-encoder rerank
d1f5b88,2026-04-12T14:22:18Z,feat/judge-prompt,0.745,0.81,0.74,0.92,0.71,4.1,0.041,tighter verifier prompt

This is everything. Three commits, three rows, every subscore captured. The two changes don't both move the same dial — the rerank change improves retrieval (good) at the cost of slightly more steps and slightly higher cost (bad-ish). The judge-prompt change leaves retrieval alone but raises verifier agreement substantially (good). These are different stories. A single weighted average would have hidden both.

The diff command

Now write the smallest possible delta computer. It takes a commit SHA and shows you what changed since main. Five minutes of code, used a hundred times a week:

# scripts/eval_delta.py
import argparse, csv, sys
from pathlib import Path

p = argparse.ArgumentParser()
p.add_argument("--against", default="main")
p.add_argument("--commit", required=True)
args = p.parse_args()

# read the whole scoreboard; the context manager closes the file for us
with open("scoreboard.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# most recent row on baseline branch
baseline = next(r for r in reversed(rows) if r["branch"] == args.against)
# row for the candidate commit
candidate = next(r for r in rows if r["commit"] == args.commit)

print(f"baseline:  {baseline['commit']} ({baseline['branch']})")
print(f"candidate: {candidate['commit']} ({candidate['branch']})")
print()

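METRICS = ["overall", "retrieval_recall_at_5", "retrieval_mrr",
           "verifier_agreement", "trajectory_pass_rate",
           "trajectory_steps_avg", "cost_per_run_usd"]
# metrics where an increase is a regression
LOWER_IS_BETTER = {"trajectory_steps_avg", "cost_per_run_usd"}

for m in METRICS:
    b = float(baseline[m])
    c = float(candidate[m])
    delta = c - b
    # the arrow shows quality, not sign: ↑ better, ↓ worse, · unchanged
    worse = delta > 0 if m in LOWER_IS_BETTER else delta < 0
    arrow = "·" if delta == 0 else ("↓" if worse else "↑")
    # spell out regressions with a positive delta, since "+" reads as good
    note = " (worse)" if worse and delta > 0 else ""
    # the "+" format flag already prints the sign, so no manual prefix
    print(f"  {m:32s}  {b:6.3f} → {c:6.3f}  {delta:+.3f} {arrow}{note}")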

$ python scripts/eval_delta.py --commit c9e2a44

baseline:  b3a1f7e (main)
candidate: c9e2a44 (feat/rerank)

  overall                           0.732 → 0.768  +0.036 ↑
  retrieval_recall_at_5             0.810 → 0.860  +0.050 ↑
  retrieval_mrr                     0.740 → 0.790  +0.050 ↑
  verifier_agreement                0.880 → 0.870  -0.010 ↓
  trajectory_pass_rate              0.690 → 0.710  +0.020 ↑
  trajectory_steps_avg              4.200 → 4.400  +0.200 ↓ (worse)
  cost_per_run_usd                  0.041 → 0.044  +0.003 ↓ (worse)

You can ship this change. Retrieval got better in two dimensions, verifier agreement barely dropped (within noise), trajectory pass rate improved, cost rose by ~7%. The trade looks favorable if your cost ceiling has headroom. If your cost ceiling is binding, this is a debate worth having on the PR — which is exactly where this conversation should happen.

Noise floor: how big is "real"?

Here's the next question, and it's the one teams skip: a delta of +0.036 looks real, but is it real, or is it within run-to-run noise? The honest answer is you don't know until you've measured the noise floor of your suite.

Measure it once. Run the same eval against the same commit, 5 times. Look at the variance of each metric.

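# scripts/noise_floor.py
import csv, statistics, subprocess

# same metric list as scripts/eval_delta.py
METRICS = ["overall", "retrieval_recall_at_5", "retrieval_mrr",
           "verifier_agreement", "trajectory_pass_rate",
           "trajectory_steps_avg", "cost_per_run_usd"]

runs = []
for i in range(5):
    subprocess.run(["make", "eval-full"], check=True)
    # reads the row just appended for this commit
    with open("scoreboard.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    runs.append(rows[-1])

for metric in METRICS:
    vals = [float(r[metric]) for r in runs]
    # sample standard deviation across the 5 runs; the noise band is 2σ
    mean, stdev = statistics.mean(vals), statistics.stdev(vals)
    print(f"{metric:32s}  μ={mean:.3f}  σ={stdev:.3f}  noise=±{2*stdev:.3f}")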

$ python scripts/noise_floor.py

overall                           μ=0.732  σ=0.008  noise=±0.016
retrieval_recall_at_5             μ=0.810  σ=0.003  noise=±0.006  ← stable
retrieval_mrr                     μ=0.740  σ=0.005  noise=±0.010
verifier_agreement                μ=0.880  σ=0.018  noise=±0.036  ← noisy!
trajectory_pass_rate              μ=0.690  σ=0.022  noise=±0.044  ← noisy!
trajectory_steps_avg              μ=4.200  σ=0.150  noise=±0.300
cost_per_run_usd                  μ=0.041  σ=0.001  noise=±0.002

What this tells you

Retrieval metrics are stable — they're computed against fixed labels, so they vary only with retrieval randomness (which you can seed). They're trustworthy as small-delta signals. Verifier agreement and trajectory pass rate are far noisier because they involve LLM-as-judge calls and stochastic agent paths. A +0.02 change on trajectory pass rate is well within noise; you cannot interpret it as a real improvement from a single run.

The fix isn't to despair, it's to require multi-run measurement for noisy metrics. Run the suite 3 times for any candidate that targets a noisy metric. Report the mean. Reject single-run "improvements" smaller than 2σ.
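The multi-run rule can be sketched in a few lines. This is an illustration, not part of the chapter's tooling: `mean_delta`, `is_real`, and the run values are hypothetical, and `sigma` is the single-run standard deviation reported by noise_floor.py. It keeps the 2σ cutoff even for averaged values, which is conservative, since averaging n runs shrinks the noise by roughly √n.

```python
import statistics

def mean_delta(baseline_runs, candidate_runs):
    """Difference of means across repeated runs of one metric."""
    return statistics.mean(candidate_runs) - statistics.mean(baseline_runs)

def is_real(baseline_runs, candidate_runs, sigma, k=2.0):
    """True when the mean delta clears the k-sigma noise band.

    sigma is the single-run standard deviation from noise_floor.py.
    """
    return abs(mean_delta(baseline_runs, candidate_runs)) > k * sigma

# Hypothetical trajectory_pass_rate values from 3 runs each (sigma = 0.022).
base = [0.69, 0.67, 0.71]
cand = [0.75, 0.77, 0.76]
print(f"{mean_delta(base, cand):+.3f}", is_real(base, cand, sigma=0.022))
```

A single-run pair such as 0.69 vs 0.71 would fail the same check, which is the point: the +0.02 from the rerank PR stays unproven until more runs are in.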

Encode the noise floor in your delta tool

Update eval_delta.py to mark each delta as "real" or "within noise" using the measured σ. The version you'll actually use:

NOISE = {
    "overall":                 0.016,
    "retrieval_recall_at_5":   0.006,
    "retrieval_mrr":           0.010,
    "verifier_agreement":      0.036,
    "trajectory_pass_rate":    0.044,
    "trajectory_steps_avg":    0.300,
    "cost_per_run_usd":        0.002,
}

for m in METRICS:
    b, c = float(baseline[m]), float(candidate[m])
    delta = c - b
    real = abs(delta) > NOISE[m]
    flag = "REAL" if real else "noise"
    print(f"  {m:32s}  {b:.3f} → {c:.3f}  {delta:+.3f}  [{flag}]")

$ python scripts/eval_delta.py --commit c9e2a44

  overall                           0.732 → 0.768  +0.036  [REAL]
  retrieval_recall_at_5             0.810 → 0.860  +0.050  [REAL]
  retrieval_mrr                     0.740 → 0.790  +0.050  [REAL]
  verifier_agreement                0.880 → 0.870  -0.010  [noise]
  trajectory_pass_rate              0.690 → 0.710  +0.020  [noise]
  trajectory_steps_avg              4.200 → 4.400  +0.200  [noise]
  cost_per_run_usd                  0.041 → 0.044  +0.003  [REAL]

Suddenly the picture is sharper. The rerank change really improved retrieval, and really raised cost slightly. Everything else is within noise — the apparent verifier drop is meaningless, the apparent trajectory improvement is also meaningless. You'd want a multi-run measurement on those before concluding anything.

This is the discipline. Without it, teams spend weeks "improving" things that were never actually different from the baseline. With it, your eval gets quieter — you stop reading patterns into noise — and the real signals stand out clearly.

Question

Two-sigma is a 95% confidence threshold. Isn't this just stats?

Yes — and saying it plainly: this is just hypothesis testing dressed up for engineers. The reason it's worth labeling rather than assuming everyone knows is that ML teams routinely report single-run improvements as if they were definitive, and the discipline of saying [noise] next to a delta forces the conversation to happen. The math is undergraduate stats; the practice is rare.

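You can do this more rigorously (proper t-tests, bootstrap confidence intervals) and people who care should. One caveat worth knowing: assuming independent, roughly normal run-to-run noise, the difference between two single runs has standard deviation √2·σ, so a 2σ cutoff on a delta is closer to an 84% threshold than a 95% one. Treat the 2σ rule as the minimum that catches the most common mistake, not as a guarantee.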
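For readers who want the more rigorous route, here is a minimal sketch of a paired percentile bootstrap. It assumes you keep per-case pass/fail results (1 or 0) for the baseline and the candidate on the same eval cases; `bootstrap_delta_ci` and the sample data are hypothetical. Note that this captures uncertainty from which cases are in the suite, not run-to-run stochasticity, so it complements the noise floor and does not replace it.

```python
import random

def bootstrap_delta_ci(base, cand, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-case delta (paired cases)."""
    assert len(base) == len(cand), "scores must be paired by eval case"
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(base, cand)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # resample cases with replacement, keeping each pair together
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-case results for 50 eval cases.
base = [1] * 34 + [0] * 16            # 0.68 pass rate
cand = [1] * 34 + [1] * 6 + [0] * 10  # 0.80: six failures fixed, none broken
lo, hi = bootstrap_delta_ci(base, cand)
print(f"95% CI for delta: [{lo:+.3f}, {hi:+.3f}]")
```

If the interval excludes zero, the delta is unlikely to be an artifact of which cases happened to pass.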

Question

My eval suite takes 20 minutes to run. Running it 3× per candidate isn't feasible.

Three answers, in order of effort:

Split the suite. Most teams have a fast subset (cheap unit/retrieval checks, run on every commit) and a slow subset (full trajectories with judge calls, run nightly or on PR-to-main). Chapter 1.4 already touched this. Fast suite runs once; slow suite runs three times overnight, mean reported.
Parallelize. Eval cases are embarrassingly parallel. With asyncio.gather and a concurrency cap you can run 50 trajectories in roughly the time it takes to run the slowest one, rate limits permitting. If you can wait for asynchronous results, Anthropic's Message Batches API is the other option. Suddenly 3× becomes 3× something fast.
Lower the bar. Use 1.5σ instead of 2σ if you accept more false positives. Use single-run for cheap metrics (retrieval) and multi-run only for the LLM-judge ones. Engineering trade-off, not a moral failing.
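The "parallelize" answer can be sketched as below. `run_case` is a hypothetical stand-in for one agent trajectory plus its judge call, simulated here with a short sleep so the example runs anywhere; the semaphore cap of 10 is an assumed value you would tune to your provider's rate limits.

```python
import asyncio

async def run_case(case_id: int) -> dict:
    """Stand-in for one agent trajectory plus its judge call (hypothetical)."""
    await asyncio.sleep(0.01)  # simulates waiting on the model API
    return {"case": case_id, "passed": case_id % 3 != 0}

async def run_suite(case_ids, max_concurrency: int = 10) -> list:
    """Run all cases concurrently, capped to stay under API rate limits."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(case_id):
        async with sem:
            return await run_case(case_id)

    # gather preserves input order, so results line up with case_ids
    return await asyncio.gather(*(guarded(c) for c in case_ids))

results = asyncio.run(run_suite(range(50)))
pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"{len(results)} cases, pass rate {pass_rate:.2f}")
```

Because the work is I/O-bound, total wall time is governed by the slowest case and the concurrency cap, not by the number of cases.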

STEP 2

Regression budgets: making the trade-off explicit.

Step 1 gave you a way to read a delta honestly. Step 2 is about what to do when the delta is mixed — when a change makes some metrics go up and others go down, and you have to decide whether to ship it.

The naive answer is "compute a weighted average and look at the overall score." The reason this is wrong: weighted averages hide tradeoffs you'd rather have explicit conversations about. A change that boosts retrieval by 5 points and tanks faithfulness by 4 points might net out to +1 on a weighted overall, but you almost certainly do not want to ship "your agent retrieves better but makes things up more." Faithfulness is non-negotiable for most products; retrieval is something to optimize within that constraint.
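To make the arithmetic concrete, here is the retrieval-versus-faithfulness example with hypothetical 60/40 weights (the weights are an assumption for illustration; the deltas are the ones from the paragraph above, in points).

```python
# Hypothetical weights; deltas in points, taken from the example above.
weights = {"retrieval_recall": 0.6, "faithfulness": 0.4}
deltas = {"retrieval_recall": +5.0, "faithfulness": -4.0}

# The weighted overall nets out positive...
overall = sum(weights[m] * deltas[m] for m in weights)
# ...while the per-metric view shows what the average hides.
regressions = {m: d for m, d in deltas.items() if d < 0}

print(f"overall: {overall:+.1f} points")
print(f"hidden regressions: {regressions}")
```

The overall number is positive, yet faithfulness dropped by 4 points. Only the per-metric breakdown surfaces that.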

The better mental model is regression budgets. Per metric, decide in advance: this metric must never drop by more than X points; this metric can drop slightly if others rise enough; this metric is free to fluctuate. Encode the budgets as a config. Let the diff tool tell you whether a change fits within them.

Define the budgets

# evals/budgets.yaml
# Per-metric regression budget. Negative numbers are allowed drops.
# "hard" means the merge gate fails if exceeded.
# "soft" means warn but allow.

budgets:
  retrieval_recall_at_5:
    direction: maximize
    hard_floor: -0.01     # never drop by more than 1 point
    soft_floor: -0.005

  verifier_agreement:
    direction: maximize
    hard_floor: -0.005    # faithfulness is non-negotiable
    soft_floor: 0.0       # even noise-level drops get flagged

  trajectory_pass_rate:
    direction: maximize
    hard_floor: -0.02
    soft_floor: -0.01

  trajectory_steps_avg:
    direction: minimize
    hard_ceiling: +0.5    # never run >0.5 more steps on average
    soft_ceiling: +0.2

  cost_per_run_usd:
    direction: minimize
    hard_ceiling: +0.010  # never raise per-run cost by >$0.01
    soft_ceiling: +0.003

The labels matter. Hard floors/ceilings block the merge. CI runs the eval, computes deltas, and if any metric breaks a hard threshold the PR can't ship without an explicit override. Soft floors/ceilings warn. CI annotates the PR with "verifier dropped 0.4 points (within budget but worth a look)" but doesn't block. The distinction matters because some metrics deserve veto power and others don't.

The budget-aware diff

Extend the delta tool one more time:

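# scripts/eval_delta.py (with budgets)
import sys
import yaml

with open("evals/budgets.yaml") as f:
    budgets = yaml.safe_load(f)["budgets"]

def verdict(metric, delta):
    b = budgets.get(metric)
    if not b: return "✓"   # tracked only: no budget to violate
    # round away float artifacts: 0.044 - 0.041 is not exactly 0.003
    d = round(delta, 6)
    # express the regression as a positive number in either direction
    if b["direction"] == "maximize":
        regress, hard, soft = -d, -b["hard_floor"], -b["soft_floor"]
    else:  # minimize
        regress, hard, soft = d, b["hard_ceiling"], b["soft_ceiling"]
    if regress > hard:  return "❌ HARD"                 # fails only when exceeded
    if regress > 0 and regress >= soft:  return "⚠ SOFT"  # warns on reaching it
    return "✓"

violations_hard = 0
for m in METRICS:
    if m == "overall": continue   # headline number; budgets apply per subscore
    b, c = float(baseline[m]), float(candidate[m])
    delta = c - b
    v = verdict(m, delta)
    if "HARD" in v: violations_hard += 1
    print(f"  {m:32s}  {delta:+.3f}  {v}")

if violations_hard:
    plural = "s" if violations_hard != 1 else ""
    print(f"\n❌ {violations_hard} hard violation{plural}. PR cannot merge until resolved.")
sys.exit(1 if violations_hard > 0 else 0)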

$ python scripts/eval_delta.py --commit c9e2a44

  retrieval_recall_at_5             +0.050  ✓
  retrieval_mrr                     +0.050  ✓
  verifier_agreement                -0.010  ❌ HARD
  trajectory_pass_rate              +0.020  ✓
  trajectory_steps_avg              +0.200  ⚠ SOFT
  cost_per_run_usd                  +0.003  ⚠ SOFT

❌ 1 hard violation. PR cannot merge until resolved.

Now the PR conversation has structure. The CI tells the author: your change broke a non-negotiable invariant on faithfulness — discuss with the team before merging, or improve the change so it doesn't regress that metric. Two outcomes are healthy: the author fixes the regression, or the team explicitly decides the budget is wrong and updates it (which is itself a separate PR, separately reviewed). What does not happen is silently shipping a change that drops faithfulness.

The Pareto frontier in your head

Budgets handle the simple cases. The harder cases are when a change crosses a soft threshold but you want to ship it anyway because the gain on another metric is large. This is where you start thinking about Pareto frontiers explicitly.

[Figure: Pareto frontier. Faithfulness on the y-axis (0.70 to 1.0) against retrieval recall on the x-axis (0.6 to 1.0). Labeled points: A: rerank (current), B: rerank + tighter judge, C: bigger model (cost+++).]

On the frontier: A, B, C. No other option dominates them. Off the frontier: every other point, each strictly worse than something on it.

For two metrics you can literally draw this. For five metrics you can't, but the principle still applies: a change is Pareto-improving if at least one metric improves and none gets worse beyond noise. Pareto-improving changes are unambiguous: ship them. A change that improves some metrics and worsens others is a Pareto trade-off. Neither version dominates the other, so shipping it means moving along the frontier in a direction the team has chosen.

The discipline this enforces: every non-Pareto-improving merge should require a one-line justification in the PR description. Not a long memo. One line. "Shipping despite the +0.003 cost increase because the +0.05 retrieval gain unblocks the Q3 use cases." If you can't write that line, the change isn't ready.
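The Pareto test is mechanical enough to sketch. The function below is a hypothetical helper, not part of the chapter's eval_delta.py: it reuses the 2σ noise bands measured in Step 1, treats anything within noise as unchanged, and labels the change accordingly.

```python
NOISE = {  # 2-sigma bands measured in Step 1
    "retrieval_recall_at_5": 0.006,
    "verifier_agreement": 0.036,
    "trajectory_pass_rate": 0.044,
    "trajectory_steps_avg": 0.300,
    "cost_per_run_usd": 0.002,
}
LOWER_IS_BETTER = {"trajectory_steps_avg", "cost_per_run_usd"}

def classify(deltas: dict) -> str:
    """Label a change by its real (beyond-noise) movements only."""
    better, worse = [], []
    for metric, delta in deltas.items():
        if abs(delta) <= NOISE[metric]:
            continue  # within noise: counts as unchanged
        improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
        (better if improved else worse).append(metric)
    if better and not worse:
        return "pareto-improving"
    if better and worse:
        return "trade-off: needs a one-line justification"
    if worse:
        return "regression"
    return "no real change"

# Deltas from the rerank PR (c9e2a44 vs main).
rerank = {
    "retrieval_recall_at_5": +0.050,
    "verifier_agreement": -0.010,
    "trajectory_pass_rate": +0.020,
    "trajectory_steps_avg": +0.200,
    "cost_per_run_usd": +0.003,
}
print(classify(rerank))
```

The rerank PR comes out as a trade-off: retrieval really improved and cost really rose, which is exactly the case the one-line justification is for.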

Question

How do I pick the initial budgets? Won't I just set them around the current values?

That's exactly what you do — and that's fine. The budgets are not aspirational targets; they're "don't slide backwards from here without consciously deciding to." Set hard floors at "current value minus a small margin that comfortably exceeds noise" for everything non-negotiable. Set soft floors slightly tighter.

Once a budget gets violated and the team decides to relax it, that's a signal: either the metric is genuinely less important than thought, or the team is rationalizing a regression. The friction of editing the budget config makes the second case visible.

Question

What if I have 12 metrics and budgets feel overwhelming?

You probably don't need 12 budgets. Pick the 3–5 metrics that genuinely matter to your product. The rest can be soft-tracked (recorded, not budgeted). A common starting set:

One end-to-end task success metric (trajectory_pass_rate or similar). Hard floor.
One faithfulness/correctness metric (verifier_agreement, grounding rate). Hard floor.
One cost ceiling. Hard.
One latency ceiling. Hard.
One retrieval quality metric. Soft.

Five budgets, four of them hard. That's enough discipline to prevent silent regression without drowning every PR in red ❌ marks.

Question

Doesn't this just slow down iteration? Some of my best changes had a small regression somewhere.

Probably the opposite. Without budgets, teams discover after weeks of "improvements" that some metric they weren't watching dropped by 8 points and they don't know which change caused it. Bisecting that mess is the time sink. Catching the regression in the PR that caused it, with the author still in context, takes seconds.

And: the budget overrides are not bureaucratic. A one-line PR comment is enough. "Override: +0.003 cost is within Q3 plan." Done. The friction is calibrated to the cost of the decision.

STEP 3

The PR-comment workflow.

Step 1 gave you honest deltas. Step 2 gave you budgets. Both are useless unless they show up where decisions actually get made — which is the pull request, not your terminal. This step wires the eval into your version control workflow so every code review includes an eval review, automatically.

The shape: a developer opens a PR. CI fires up, runs the eval against the PR branch, gets the baseline for main (a fresh run, or the most recent main row in the scoreboard), computes the delta with budget verdicts, and posts the result as a comment on the PR. The comment updates on every push so reviewers always see the current state. Hard violations show up as a failed status check; soft violations show up as a warning in the comment. Code review and eval review happen in the same conversation.

The CI workflow

Concretely, with GitHub Actions. The same shape works on GitLab, CircleCI, Buildkite — the only differences are syntactic.

# .github/workflows/eval-pr.yml
name: eval-pr
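on:
  pull_request:
    branches: [main]
    # "labeled" re-runs the workflow when eval-full is added after opening
    types: [opened, synchronize, reopened, labeled]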

jobs:
  eval:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write   # lets the sticky-comment step post on the PR
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # need history for baseline lookup

      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }

      - name: Install deps
        run: pip install -r requirements.txt

      - name: Run fast eval suite (always)
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: make eval-fast

      - name: Run full suite if labeled 'eval-full'
        if: contains(github.event.pull_request.labels.*.name, 'eval-full')
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: make eval-full

      - name: Compute delta vs main
        id: delta
        run: |
          python scripts/eval_delta.py \
            --against main \
            --commit ${{ github.event.pull_request.head.sha }} \
            --format markdown \
            > delta.md

      - name: Post PR comment
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: delta.md
          header: eval-results   # sticky: replaces previous comment

      - name: Fail on hard violations
        # block scalar: a backslash continuation only works inside "run: |"
        run: |
          python scripts/eval_delta.py --check-hard \
            --commit ${{ github.event.pull_request.head.sha }}

Two things in this YAML earn their keep. First, the label-gated full suite: most PRs only run the fast suite (retrieval + unit checks, <2 minutes); PRs that touch the agent loop, prompts, or model selection get the eval-full label and run the slow trajectory suite too. The label-as-trigger is much better than running everything always (slow, expensive) or only on merge (catches regressions too late).

Second, the sticky comment. Every push updates the same comment instead of creating a new one. Reviewers see only the current state, not a history of "the eval got better, then worse, then better as the author iterated." That history is in the scoreboard.csv; the PR comment is a snapshot.

The comment format

This is what a reviewer should see. Tuned over many iterations — the goal is "glanceable in 3 seconds, drill-down for the curious":

## 📊 eval-results

vs main (b3a1f7e)
**overall: 0.732 → 0.768 (+0.036) ↑ [REAL]**

|  metric                   |  base  |   pr   |  delta  | verdict |
| ------------------------- | ------ | ------ | ------- | ------- |
|  retrieval_recall_at_5    | 0.810  | 0.860  | +0.050  |   ✓     |
|  retrieval_mrr            | 0.740  | 0.790  | +0.050  |   ✓     |
|  verifier_agreement       | 0.880  | 0.870  | -0.010  | ❌ HARD |
|  trajectory_pass_rate     | 0.690  | 0.710  | +0.020  |   ✓     |
|  trajectory_steps_avg     | 4.200  | 4.400  | +0.200  | ⚠ SOFT  |
|  cost_per_run_usd         | 0.041  | 0.044  | +0.003  | ⚠ SOFT  |

cost: $1.84 (full suite, 50 trajectories)
runtime: 14m 22s

— [scoreboard.csv](link) · [full results](link)

What's worth noting about this format. The headline metric is at the top, where the biggest signal belongs. Every metric gets a table row with the verdict column on the right, which is the column reviewers scan. Soft violations appear as ⚠ without blocking anything; hard violations appear as ❌ in the verdict column and also fire the failing status check at the top of the PR.

The runtime and cost lines are there because they're metadata the reviewer might want, and one of them (cost) is itself a budgeted metric. Don't bury them; they belong next to the deltas.
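The workflow passes `--format markdown` to eval_delta.py, a flag the earlier listings don't define. One way that output might be produced is sketched below; `render_markdown` and its row tuples are hypothetical, and the table is trimmed to the columns shown in the comment above.

```python
def render_markdown(base_sha: str, rows: list) -> str:
    """Build the PR comment body from (metric, base, pr, verdict) tuples."""
    lines = [
        "## 📊 eval-results",
        "",
        f"vs main ({base_sha})",
        "",
        "| metric | base | pr | delta | verdict |",
        "| --- | --- | --- | --- | --- |",
    ]
    for metric, base, pr, verdict in rows:
        lines.append(
            f"| {metric} | {base:.3f} | {pr:.3f} | {pr - base:+.3f} | {verdict} |"
        )
    return "\n".join(lines)

rows = [
    ("retrieval_recall_at_5", 0.810, 0.860, "✓"),
    ("cost_per_run_usd", 0.041, 0.044, "⚠ SOFT"),
]
print(render_markdown("b3a1f7e", rows))
```

Markdown tables don't need aligned columns to render, so the function skips padding and leaves layout to GitHub.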

What this actually changes

Eval-driven development isn't a tool, it's a habit. The habit is: I will not merge a change to the agent without seeing its eval delta, and the team will not approve a change to the agent without reading the eval delta. The CI infrastructure exists to make this habit cheap. Without CI, the habit dies — people forget, get rushed, ship without measuring. With CI, the habit is automatic: the eval result is sitting right there in the PR every time.

The first time a teammate's clever-looking refactor gets rejected because evals dropped 4 points on faithfulness — that's the chapter earning its keep. The first time someone explores 5 prompt variations in a draft PR and you can see which one was actually better — that's the chapter earning its keep. The first time a junior engineer ships their first change with confidence because the numbers say it's an improvement — that's the chapter earning its keep.

If you do only one thing from this chapter: get the PR-comment workflow set up before anything else. Budgets, Pareto thinking, multi-run measurement — all of those are downstream of "do my teammates see eval results when reviewing my code." Without that, the rest doesn't happen.

Question

Running the full eval on every PR is going to cost real money. How do I budget for that?

Realistic numbers: a 50-question trajectory suite with judge calls runs $1–5 per execution depending on your agent's average tokens and the judge model. With label-gated full runs, you're paying that maybe 5 times a week instead of 50, so $5–25 a week, or at most about $100 a month. For most teams that's well inside the experimentation budget. For early-stage projects, a line item of $50 or so a month is invisible.

The numbers that do add up: running the full suite on every push of every PR (×10–20 pushes per PR), running on commits to feature branches that aren't ready, running on PRs that touch nothing relevant. The label gate is what keeps the bill sane.
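The budget arithmetic is simple enough to keep in a script. This sketch uses the per-run range and weekly run counts quoted above; `monthly_cost` and the four-weeks-per-month simplification are assumptions for illustration.

```python
def monthly_cost(cost_per_run: float, runs_per_week: int, weeks: float = 4.0) -> float:
    """Rough monthly spend for full-suite runs (assumes a 4-week month)."""
    return cost_per_run * runs_per_week * weeks

# $1 to $5 per full run, with and without the label gate.
for label, runs in [("label-gated", 5), ("every PR", 50)]:
    low, high = monthly_cost(1.0, runs), monthly_cost(5.0, runs)
    print(f"{label:12s} ${low:,.0f} to ${high:,.0f} per month")
```

The tenfold gap between the two rows is the whole case for the label gate.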

Question

My team doesn't use GitHub Actions / our CI is locked down / we can't easily run external eval workflows. What's the minimum viable version?

The minimum viable version is a script and a Slack/Discord webhook. Developer runs make eval-pr locally before pushing. The script posts the delta to a team channel. Reviewers read it before approving.

This is meaningfully worse than CI — the discipline rests on humans remembering — but it's meaningfully better than nothing. Most teams I've seen with this manual workflow eventually graduate to CI, but the workflow that matters is "every change has a visible delta," not "CI runs it." Start with what you can build today.
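A minimal sketch of the webhook half, using only the standard library. It assumes a Slack incoming webhook, which accepts a JSON body with a "text" field (Discord webhooks expect "content" instead); `build_payload`, `post_delta`, and the idea of wiring this into make eval-pr are hypothetical.

```python
import json
import urllib.request

def build_payload(delta_text: str, branch: str) -> bytes:
    """Slack incoming webhooks accept a JSON body with a "text" field."""
    message = f"eval delta for {branch}\n{delta_text}"
    return json.dumps({"text": message}).encode("utf-8")

def post_delta(webhook_url: str, delta_text: str, branch: str) -> int:
    """POST the delta to the team channel; returns the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=build_payload(delta_text, branch),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status

# Building the payload needs no network access, so it is easy to check locally.
payload = build_payload("overall  0.732 -> 0.768  +0.036", "feat/rerank")
print(payload.decode("utf-8"))
```

Keep the webhook URL in an environment variable or secret store, not in the repository.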

Question

How does this work for monorepos with non-agent code in the same PR?

Path-filter the workflow. Only run the eval if the PR touches files under agent/, prompts/, retrieval/, tools/, or evals/ itself. GitHub Actions supports paths: filters at the workflow trigger level. PRs that only touch the frontend or unrelated services skip the eval, saving cost and CI time. PRs that touch any agent code get the full treatment.

STEP 4

Versioning & cadence: what a day actually looks like.

The last piece is the one teams almost always learn the hard way: your eval score is meaningless if the underlying model changed under you. A score of 0.732 on Sonnet 4.5 is not comparable to a score of 0.741 on Sonnet 4.6 — but if you compare them anyway, you'll conclude that something you did improved the agent when in fact the model provider made it better while you slept. Worse, you might conclude that something you did regressed the agent when the provider quietly switched a checkpoint.

This is the reproducibility problem, and it has a partial answer: version everything that contributes to the score, alongside the score itself.

The four versioned axes

Every eval row needs to record, at minimum, these four things:

Code SHA — your agent code at the moment the eval ran. You already have this from git rev-parse HEAD.
Prompt version — your system prompts and tool descriptions hashed or version-tagged. Prompts change independently of code in fast iteration.
Model identifier — including any version suffix the provider exposes. claude-sonnet-4-5 is not enough; claude-sonnet-4-5-20250929 is.
Corpus version — the document set the agent retrieved from. A corpus refresh changes scores; pretending it didn't is the source of many "mysterious" regressions.

# Updated scoreboard schema with versioning
commit, timestamp, branch,
  prompt_hash,     # sha256 of prompts/ directory
  model_id,        # claude-sonnet-4-5-20250929 (full string)
  corpus_version,  # git tag of corpus repo, or DVC hash
  overall, retrieval_recall_at_5, ...

Computing the version stamps

# evals/version.py
import hashlib, subprocess
from pathlib import Path

def prompt_hash() -> str:
    """Hash every file in prompts/ (paths and contents), order-stable."""
    h = hashlib.sha256()
    for p in sorted(Path("prompts").rglob("*")):
        if p.is_file():
            # include the relative path so renames and moves change the hash
            h.update(p.relative_to("prompts").as_posix().encode())
            h.update(p.read_bytes())
    return h.hexdigest()[:12]

def model_id(response) -> str:
    """Pull the resolved model ID from a real API response.
    Providers may resolve aliases to a specific dated snapshot;
    record what we actually got, not what we asked for."""
    return response.model  # Anthropic/OpenAI both return this

def corpus_version() -> str:
    """Whatever your corpus uses for versioning."""
    # If you use DVC instead, one option is the md5 recorded in corpus.dvc
    # (the exact lookup depends on your DVC setup).
    # If your corpus is a git submodule:
    return subprocess.check_output(
        ["git", "-C", "corpus", "rev-parse", "--short", "HEAD"]
    ).decode().strip()

What this lets you do

Now the delta tool can be smarter. Before computing a delta, it checks the version stamps. If any non-code axis differs, it flags the comparison:

$ python scripts/eval_delta.py --commit d1f5b88

⚠ baseline and candidate differ on non-code axes:
  - model_id:   claude-sonnet-4-5-20250929 → claude-sonnet-4-5-20251015
  - corpus_v:   a3f2c1d → a3f2c1d  (same)
  - prompts:    7b2a... → c1e8...  (changed in this PR)

  This delta mixes prompt changes AND a model snapshot change.
  Re-run baseline on the new model_id before drawing conclusions.

  overall                           0.732 → 0.745  +0.013
  ...

This single warning saves you from the most common false conclusion in agent development. When a model snapshot ships under your feet, your old baseline is stale: you have to re-measure the baseline on the new snapshot before any candidate delta means anything.

The fix is mechanical: nightly, re-run the eval against main with the current model_id and add a fresh baseline row to the scoreboard. Then daytime PR comparisons always use a recent baseline on the same model snapshot. Cheap, automatic, eliminates the class of bug.
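The version-stamp check itself is small. In this sketch the column names follow the updated scoreboard schema; `axis_mismatches` and `stale_baseline` are hypothetical helpers, the model IDs are the ones from the example output above, and the prompt hashes are placeholder strings.

```python
NON_CODE_AXES = ["prompt_hash", "model_id", "corpus_version"]

def axis_mismatches(baseline: dict, candidate: dict) -> dict:
    """Return {axis: (baseline_value, candidate_value)} for axes that differ."""
    return {
        axis: (baseline[axis], candidate[axis])
        for axis in NON_CODE_AXES
        if baseline[axis] != candidate[axis]
    }

def stale_baseline(baseline: dict, candidate: dict) -> bool:
    """True when the model or corpus moved, so the baseline must be re-run.

    A prompt change alone is expected: it is part of the PR being measured.
    """
    diff = axis_mismatches(baseline, candidate)
    return "model_id" in diff or "corpus_version" in diff

base = {"prompt_hash": "prompt-old", "corpus_version": "a3f2c1d",
        "model_id": "claude-sonnet-4-5-20250929"}
cand = {"prompt_hash": "prompt-new", "corpus_version": "a3f2c1d",
        "model_id": "claude-sonnet-4-5-20251015"}

for axis, (old, new) in axis_mismatches(base, cand).items():
    print(f"  - {axis}: {old} → {new}")
print("re-run baseline first" if stale_baseline(base, cand) else "comparable")
```

Separating "what changed" from "is the baseline stale" keeps prompt-only PRs from triggering a warning on every run.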

The daily cadence

This is what eval-driven development actually looks like as a working practice. Day-shaped, written so you can compare to your current Tuesday.

─────────────────────────────────────────────────────────────────
 MORNING

 1. Check last night's nightly run
    → baseline refreshed? any silent drift?
 2. Open scoreboard.csv, scan last 7 days of main
    → any metric drifting in a direction you don't like?
 3. Pick today's work from the queue

─────────────────────────────────────────────────────────────────
 AFTERNOON: one change at a time

 1. Hypothesis: "I think doing X will improve metric Y"
    → write the hypothesis in the PR description NOW.

 2. Implement the smallest version of X.
    → not the cleanest, not the prettiest. The smallest
      possible change that lets you measure.

 3. Push. Wait ~5 min for fast eval, ~20 min for full.
    → if labeled eval-full.

 4. Read the delta.
    ─ Hypothesis confirmed AND no soft violations?
      → polish the code, get review, merge.
    ─ Hypothesis confirmed but with regressions?
      → decide: is the trade worth it? Discuss on PR.
    ─ Hypothesis not confirmed?
      → close the PR. Open a new one with hypothesis 2.
      → write what you learned in the closed PR's notes.

─────────────────────────────────────────────────────────────────
 END OF DAY

 1. Look at the day's PRs that got merged.
    → did the aggregate move overall in the right direction?
 2. If multiple PRs in a row had "hypothesis not confirmed"
    → pause. Your priors might be wrong. Read traces, talk
      to a colleague, change strategy before iterating more.
─────────────────────────────────────────────────────────────────

The most important part of this cadence is the bit that looks like a footnote: write the hypothesis in the PR description before implementing. This is the discipline that distinguishes eval-driven from eval-decorated development. If you write the hypothesis after you see the result, you'll rationalize whatever you find. If you write it before, you have an honest record of what you predicted and whether you were right. Over time, this is how you learn what your priors are good at and what they're bad at.

One last anti-pattern to call out

The most common failure mode in teams adopting EDD: tweaking the eval suite to make the score go up. This happens innocuously — someone notices that case 17 is a known weird edge case, removes it from the suite, the overall score goes up, everyone feels good. Multiply by a few months and your eval is measuring nothing useful because every hard case got removed.

The rule: the eval suite is append-only by default. You can add cases freely. You cannot remove or modify cases without a separate PR, separately reviewed, with an explicit justification. Removing a hard case is a serious decision; making it into a separate PR with its own review surfaces the decision so it gets the scrutiny it deserves.

Watch for this pattern: a team's overall eval score climbs steadily for three months, everyone celebrates. Then they ship a major release and users complain about exactly the failure modes the team was no longer testing. The score went up; the agent didn't. Append-only eval discipline is what prevents this.
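The append-only rule can be checked mechanically in CI. This sketch assumes each eval case has a stable ID and that you can build a map of case ID to content hash for main and for the PR branch; the function names and the sample IDs are hypothetical.

```python
def suite_changes(main_cases: dict, pr_cases: dict) -> dict:
    """Compare {case_id: content_hash} maps from main and the PR branch."""
    removed = sorted(set(main_cases) - set(pr_cases))
    added = sorted(set(pr_cases) - set(main_cases))
    modified = sorted(
        cid for cid in set(main_cases) & set(pr_cases)
        if main_cases[cid] != pr_cases[cid]
    )
    return {"added": added, "removed": removed, "modified": modified}

def violates_append_only(changes: dict) -> bool:
    """Adding cases is free; removing or editing one needs its own PR."""
    return bool(changes["removed"] or changes["modified"])

# Hypothetical suites: the PR drops case-17 and adds case-51.
main_cases = {"case-16": "h16", "case-17": "h17", "case-18": "h18"}
pr_cases = {"case-16": "h16", "case-18": "h18", "case-51": "h51"}

changes = suite_changes(main_cases, pr_cases)
print(changes, "BLOCK" if violates_append_only(changes) else "ok")
```

A check like this turns the quiet removal of case 17 into a visible, reviewable event.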

Question

What if my hypothesis is too vague to write down in advance?

That's a signal. If you can't articulate which metric should move in which direction, you don't have a hypothesis yet; you have a hunch. Hunches are fine, but they should be explored in a "spike": a throwaway branch where you poke around without expectation. Once you have a hypothesis you can write down ("if I increase retrieval top_k from 5 to 8, recall should improve by ≥0.02 while cost per run rises by at most $0.003"), open the real PR.

The discipline of writing it down forces you to know what you're trying to do before you do it. That's the entire point.
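If the team wants the habit enforced rather than remembered, a small lint on the PR description can refuse drafts that lack a written hypothesis. This is a minimal sketch under assumptions: it supposes the PR template uses upper-case `HYPOTHESIS`, `PREDICTION` and `REJECTION` headers at the start of a line, and `missing_sections` is a hypothetical helper name.

```python
import re

REQUIRED_SECTIONS = ("HYPOTHESIS", "PREDICTION", "REJECTION")


def missing_sections(pr_description):
    """Return required section names that have no 'NAME ...:' header line."""
    missing = []
    for name in REQUIRED_SECTIONS:
        # Matches "REJECTION:" as well as "REJECTION CRITERION:" at line start.
        pattern = rf"^{name}[A-Z ]*:"
        if not re.search(pattern, pr_description, flags=re.MULTILINE):
            missing.append(name)
    return missing


# Hypothetical draft that states a prediction but no rejection criterion.
draft = """Increase retrieval top_k from 5 to 8.

HYPOTHESIS: a larger candidate set should raise recall.

PREDICTION:
- recall: +0.02 or more
"""
print(missing_sections(draft))
```

The check only proves the sections exist, not that they are any good. It is still useful as a forcing function: a PR with no rejection criterion cannot be judged mechanically later.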

Question

My team is one person (me). Is all this overkill?

The CI infrastructure is overkill for one person. The hypothesis-first discipline is not. Even solo, write the hypothesis in the commit message. Even solo, refuse to convince yourself that a regression is "fine." Even solo, look at deltas before deciding a change worked. The reason: you're going to forget what you did and why in three months. The notes are for future-you, not for a teammate.

The lightweight solo version: scoreboard.csv + a script that prints deltas + the hypothesis-in-commit-message habit. Maybe 100 lines of code total. Pays for itself in two weeks.

Question

How do I deal with model drift across providers if I'm benchmarking Anthropic vs OpenAI?

Treat them as different products. They are. Don't try to compute "this change is +0.02 on Anthropic and -0.01 on OpenAI, net positive" — that calculation is incoherent. Instead, maintain two scoreboard files (or two columns), report deltas for each independently, and let the team decide separately whether each direction is acceptable. Most teams converge on a primary provider for production and use the other as a probe for "is our agent design biased to one provider's behaviors?" Different question, different answer.

WORKED EXAMPLE

Three PRs from 73% to 81%, with the cadence visible.

The cadence diagram in Step 4 describes the rhythm. Here's a worked sequence — three consecutive days on the same agent — showing how it plays out as PRs land. The agent is the research assistant from Part I. Starting point: overall = 0.732 on the 50-question eval set. Goal: get above 0.80 within the week.

Monday: hypothesis-driven, confirmed

Morning scoreboard scan. The agent's trajectory_pass_rate has been flat at 0.69 for two weeks. The retrieval_recall_at_5 is at 0.81 — decent but with room. The hunch: retrieval is the bottleneck. If the model can't see the right chunks, it can't synthesize a good answer.

The PR description, written before any code:

"""
Add cross-encoder reranking to retrieval.

HYPOTHESIS: Initial BM25 + embedding fusion gets the right document
into the top 20 results in ~95% of cases (measured), but the right
chunk only makes top-5 in ~81%. A cross-encoder rerank on the
top-20 should push more of those right chunks into the top 5.

PREDICTION:
- retrieval_recall_at_5: +0.04 to +0.06  (real, above 2σ noise)
- retrieval_mrr:         +0.04 to +0.05
- trajectory_pass_rate:  +0.02 (downstream of retrieval improvement)
- cost_per_run_usd:      +0.002 to +0.004 (one extra small-model call)
- trajectory_steps_avg:  no change

REJECTION CRITERION: if retrieval_recall_at_5 doesn't beat baseline
by at least 2σ (≥0.012), the hypothesis is wrong and I close this PR.
"""

Afternoon: implement, push, wait. 45 minutes to add a Cohere rerank call on the top-20 candidates. Push. The fast eval suite runs in CI (3 minutes); the full suite is gated behind the eval-full label, which the PR template auto-applied because the diff touches retrieval/. Full suite takes 18 minutes.

The sticky PR comment that lands:

## 📊 eval-results

vs main (b3a1f7e)
**overall: 0.732 → 0.768 (+0.036) ↑ [REAL]**

|  metric                   |  base  |   pr   |  delta  | verdict |
| ------------------------- | ------ | ------ | ------- | ------- |
|  retrieval_recall_at_5    | 0.810  | 0.860  | +0.050  | ✓ REAL  |
|  retrieval_mrr            | 0.740  | 0.790  | +0.050  | ✓ REAL  |
|  verifier_agreement       | 0.880  | 0.870  | -0.010  | noise   |
|  trajectory_pass_rate     | 0.690  | 0.710  | +0.020  | noise   |
|  trajectory_steps_avg     | 4.200  | 4.400  | +0.200  | noise   |
|  cost_per_run_usd         | 0.041  | 0.044  | +0.003  | ⚠ SOFT  |

cost: $1.84 (full suite)  ·  runtime: 18m 12s

Reading this

The results landed inside the predicted bands. Retrieval recall and MRR moved REAL and substantially. The trajectory and verifier moves are within noise; a single run can't separate them from zero, and a 3× run would be needed to say more. Cost rose by $0.003 per run, the middle of the predicted +0.002 to +0.004 band, which trips the soft ceiling but not the hard one. The PR description's rejection criterion was ≥0.012 on retrieval recall; the actual move is +0.050, comfortably above. The hypothesis held.
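The REAL and noise verdicts in the table follow from comparing each delta against that metric's 2σ noise floor. A rough sketch of that labeling, using the three floors quoted in this chapter's PR descriptions (0.012, 0.036, 0.044); the `NOISE_FLOOR` table and `label_delta` function are illustrative names, not the actual contents of eval_delta.py:

```python
# 2σ noise floors as quoted in the PR descriptions in this chapter.
NOISE_FLOOR = {
    "retrieval_recall_at_5": 0.012,
    "verifier_agreement": 0.036,
    "trajectory_pass_rate": 0.044,
}


def label_delta(metric, base, pr):
    """Return (delta, verdict); verdict is 'REAL' when |delta| >= the floor."""
    delta = round(pr - base, 3)
    verdict = "REAL" if abs(delta) >= NOISE_FLOOR[metric] else "noise"
    return delta, verdict


# Base and PR values from Monday's results table.
monday = {
    "retrieval_recall_at_5": (0.810, 0.860),
    "verifier_agreement": (0.880, 0.870),
    "trajectory_pass_rate": (0.690, 0.710),
}
for metric, (base, pr) in monday.items():
    delta, verdict = label_delta(metric, base, pr)
    print(f"{metric:<24} {delta:+.3f}  {verdict}")
```

Keeping a separate floor per metric matters here: the same +0.020 that would count as REAL on retrieval recall is well inside the noise on trajectory_pass_rate.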

The soft cost violation prompts a quick PR conversation: a reviewer asks whether $0.003/run × ~3000 runs/day = ~$270/month is worth +0.05 retrieval recall. The author writes a one-line justification: "Yes — this unblocks Q3 product use cases that needed the recall lift." Reviewer approves. Merge.
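The reviewer's arithmetic is worth keeping as a one-liner so every soft cost violation gets projected the same way. A small sketch, assuming a 30-day month (which is what the $270 figure implies); `monthly_cost_delta` is a hypothetical helper:

```python
def monthly_cost_delta(delta_per_run_usd, runs_per_day, days_per_month=30):
    """Project a per-run cost change to a monthly figure in USD."""
    return delta_per_run_usd * runs_per_day * days_per_month


# Monday's PR: +$0.003 per run at roughly 3000 runs per day.
print(f"${monthly_cost_delta(0.003, 3000):.2f}/month")
```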

End-of-day note in the project journal: "Retrieval rerank: confirmed hypothesis. +0.036 overall. The trajectory and verifier noise floors made it impossible to tell if there were downstream effects from one run — should re-measure after 2-3 more changes land or with a 3× multi-run."

Tuesday: hypothesis-driven, rejected

Morning. Yesterday's win is now main. Today's hypothesis: the verifier prompt is too lenient — it's marking claims as "SUPPORT" when the evidence is only tangential. A tighter prompt should raise verifier_agreement with the hand-labeled set.

PR description:

"""
Tighten verifier prompt to require explicit evidence.

HYPOTHESIS: Current prompt asks "is the claim supported?" which
the judge interprets loosely. Asking "does the source contain a
sentence that explicitly states the claim, or from which the
claim follows by a single inference step?" should reduce
false-positive SUPPORT verdicts.

PREDICTION:
- verifier_agreement:    +0.03 to +0.05 (REAL)
- trajectory_pass_rate:  no change (verifier is downstream)
- retrieval_recall_at_5: no change (orthogonal)

REJECTION: if verifier_agreement does not beat baseline by at
least the noise floor (0.036), the new prompt isn't actually
tighter, just different. Close.
"""

Result:

## 📊 eval-results

vs main (c9e2a44)
**overall: 0.768 → 0.766 (-0.002) · noise**

|  metric                   |  base  |   pr   |  delta  | verdict |
| ------------------------- | ------ | ------ | ------- | ------- |
|  verifier_agreement       | 0.870  | 0.882  | +0.012  | noise   |
|  retrieval_recall_at_5    | 0.860  | 0.858  | -0.002  | noise   |
|  trajectory_pass_rate     | 0.710  | 0.705  | -0.005  | noise   |
|  cost_per_run_usd         | 0.044  | 0.044  | +0.000  | ✓       |

The hypothesis predicted +0.03 to +0.05 on verifier_agreement; actual is +0.012, which is below the noise floor (0.036). By the PR's own rejection criterion, the hypothesis is not confirmed. The author has a choice:

Option A: Run the eval 3× to get a tighter measurement, since verifier_agreement is one of the noisy metrics. Possibly the +0.012 is the bottom of a real +0.04 effect that didn't surface in one run.
Option B: Close the PR. Hypothesis was specific; it failed by its own rejection criterion. Don't move the goalpost.

The disciplined answer is B unless there's a specific reason to suspect this run is unrepresentative. The author writes a closing comment: "Hypothesis not confirmed (+0.012, below 0.036 noise). The new prompt may be slightly better, may be noise. Closing to avoid hindsight rationalization. The lesson: verifier prompt changes don't show up reliably in single-run evals — next time go straight to 3× measurement." Close.

Why this matters

Tuesday is the day eval-driven development really earns its keep. Without the rejection criterion in the PR description, the natural pull is to look at +0.012 and think "the new prompt is a little better, let's ship it." The criterion makes the decision mechanical: did the prediction hold or not? Not "is the new code defensible?" — engineers can defend almost anything. Did the prediction hold?

And the closing comment is doing real work: it's a note to future-self ("3× measurement is needed for noisy metrics"). Over months, these notes are how priors get sharper. Without them, you make the same mistake again in three weeks.
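One way to keep the verdict mechanical is to encode the prediction and its rejection criterion as data before the run, then apply them without judgment afterwards. The sketch below is an illustration only; the `Prediction` class, its field names and the verdict strings are assumptions, while the numbers come from Tuesday's PR description and result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    metric: str
    low: float           # lower edge of the predicted delta band
    high: float          # upper edge of the predicted delta band
    reject_below: float  # rejection criterion from the PR description


def verdict(prediction: Prediction, observed_delta: float) -> str:
    """Apply the pre-registered criterion; no judgment calls after the fact."""
    if observed_delta < prediction.reject_below:
        return "not confirmed"
    if prediction.low <= observed_delta <= prediction.high:
        return "confirmed, inside predicted band"
    return "confirmed, outside predicted band"


# Tuesday: predicted +0.03 to +0.05, reject below 0.036, observed +0.012.
tuesday = Prediction("verifier_agreement", low=0.03, high=0.05, reject_below=0.036)
print(verdict(tuesday, 0.012))
```

The frozen dataclass is a deliberate choice in this sketch: once the prediction is written, nothing downstream can quietly adjust the band to fit the result.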

Wednesday: planned multi-run + small Pareto trade-off

Morning. Yesterday taught the lesson. Today's hypothesis is more careful — and the PR explicitly plans for 3× measurement:

"""
Use Sonnet for synthesis step instead of Haiku.

HYPOTHESIS: The synthesis step (final answer generation) currently
runs on Haiku for cost reasons. Several hand-reviewed failures
trace to the synthesis dropping or distorting facts from retrieved
chunks. Upgrading just this step (not retrieval ranking or
verifier) should raise trajectory_pass_rate.

PREDICTION:
- trajectory_pass_rate:  +0.04 to +0.06 (REAL, above 0.044 noise)
- cost_per_run_usd:      +0.008 to +0.012 (Sonnet vs Haiku, 1 call)
- verifier_agreement:    +0.01 to +0.02 (downstream — Sonnet
                         produces more faithful synthesis)

MEASUREMENT PLAN: 3× run on trajectory_pass_rate (it's noisy), 1×
on the deterministic metrics. Report mean and σ.

PARETO TRADE: cost will rise, possibly hitting soft ceiling. If
trajectory delta lands >+0.04, the trade is favorable. If
<+0.02, close.
"""

The CI runs 3× as requested (the eval-full label triggers a multi-run on the noisy metrics). 55 minutes total.

## 📊 eval-results  (n=3 for noisy metrics)

vs main (c9e2a44)
**overall: 0.768 → 0.812 (+0.044) ↑ [REAL]**

|  metric                 |  base  |  pr (μ ± σ)        |  delta  | verdict |
| ----------------------- | ------ | ------------------ | ------- | ------- |
|  trajectory_pass_rate   | 0.710  | 0.764 ± 0.018      | +0.054  | ✓ REAL  |
|  verifier_agreement     | 0.870  | 0.891 ± 0.014      | +0.021  | ✓ REAL  |
|  retrieval_recall_at_5  | 0.860  | 0.860              | +0.000  | ✓       |
|  trajectory_steps_avg   | 4.400  | 4.350              | -0.050  | ✓       |
|  cost_per_run_usd       | 0.044  | 0.053              | +0.009  | ⚠ SOFT  |

cost: $4.42 (3× full suite)  ·  runtime: 55m

The trajectory and cost predictions landed inside their bands; verifier_agreement came in at +0.021, just above its predicted +0.01 to +0.02. The Pareto trade is favorable: +0.054 on the highest-leverage metric (trajectory pass rate), +0.021 on faithfulness as a bonus, at a cost of ~$0.009/run extra. The reviewer asks whether $0.009 × 3000 runs/day ≈ $810/month is justified. Author: "Yes. The synthesis-failure rate was the biggest contributor to user-reported quality issues last sprint. The Pareto frontier moved in exactly the right direction." Approved. Merge.
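The μ ± σ column in the table is a summary of the three runs. A minimal sketch of that aggregation follows. The per-run values are hypothetical, chosen only so the summary reproduces the trajectory_pass_rate row; the chapter does not list individual runs, and whether σ is the sample or population standard deviation is also an assumption (sample is used here).

```python
import statistics


def summarize_runs(base, runs):
    """Return (mean, sample σ, delta of mean vs base), rounded to 3 places."""
    mean = statistics.mean(runs)
    sigma = statistics.stdev(runs)  # sample standard deviation, needs n >= 2
    return round(mean, 3), round(sigma, 3), round(mean - base, 3)


# Hypothetical per-run values, chosen so the summary matches the table row.
runs = [0.746, 0.764, 0.782]
mean, sigma, delta = summarize_runs(0.710, runs)
print(f"trajectory_pass_rate  {mean:.3f} ± {sigma:.3f}  ({delta:+.3f})")
```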

End-of-week scoreboard

Three PRs: confirmed, rejected, confirmed. Aggregate movement on overall: 0.732 → 0.812 (+0.080). Above the 0.80 goal in three days. The story the scoreboard tells:

commit, timestamp, branch, overall, retrieval, verifier, trajectory, cost
b3a1f7e, Mon  8:14, main,      0.732,  0.810, 0.880, 0.690, 0.041  # baseline
c9e2a44, Mon 14:33, main,      0.768,  0.860, 0.870, 0.710, 0.044  # +rerank
# tue: hypothesis closed (verifier prompt) — no merge, no scoreboard row
e7b3c20, Wed 16:20, main,      0.812,  0.860, 0.891, 0.764, 0.053  # +Sonnet syn

The cadence visible in the data: two real merges, one explicit close. Costs visible. Subscore movements visible. Score deltas were the unit of decision throughout. Predictions were written before code each time, and verdicts were mechanical.
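Reading deltas off the scoreboard takes only a few lines. The sketch below embeds the three merged rows from above as strict CSV (alignment spaces and trailing comments removed) and prints each row's movement against the row before it. `row_deltas` is an illustrative name, and the abbreviated column headers are the ones used in this summary table rather than the full scoreboard.csv schema.

```python
import csv
import io

SCOREBOARD = """\
commit,timestamp,branch,overall,retrieval,verifier,trajectory,cost
b3a1f7e,Mon 8:14,main,0.732,0.810,0.880,0.690,0.041
c9e2a44,Mon 14:33,main,0.768,0.860,0.870,0.710,0.044
e7b3c20,Wed 16:20,main,0.812,0.860,0.891,0.764,0.053
"""

METRICS = ("overall", "retrieval", "verifier", "trajectory", "cost")


def row_deltas(csv_text):
    """Yield (commit, {metric: delta}) for each row vs the row before it."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for prev, cur in zip(rows, rows[1:]):
        deltas = {m: round(float(cur[m]) - float(prev[m]), 3) for m in METRICS}
        yield cur["commit"], deltas


for commit, deltas in row_deltas(SCOREBOARD):
    print(commit, " ".join(f"{m}={d:+.3f}" for m, d in deltas.items()))
```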

The thing that's not visible in the scoreboard but is the most important part of the practice: Tuesday's closed PR is just as much progress as Monday's and Wednesday's merges. A team that ships only confirmed-hypothesis PRs learns faster than a team that ships every PR regardless of result. The closed PR built knowledge that informed Wednesday's measurement plan. That's the compounding effect.

End of chapter 3.1

Deliverable

An agent codebase where no change merges without a visible eval delta — the discipline that lets your numbers compound upward instead of drifting. Multi-run measurement for noisy metrics; budgets that gate merges on the metrics that matter; a sticky PR comment that reviewers actually read; version stamps that catch silent model drift; a daily cadence built around hypothesis-first development. Chapter 3.2 splits these evals into the three layers (unit, integration, end-to-end); chapter 3.3 hardens the LLM-as-judge that produces some of the trickiest metrics; chapter 3.4 ties it all into CI and external benchmarks.

scoreboard.csv with subscore breakdown, version stamps on every row
eval_delta.py with noise-floor labels (REAL vs noise based on 2σ)
budgets.yaml with hard/soft thresholds for the 3–5 metrics that matter
Multi-run protocol for noisy metrics (3 runs, mean ± σ reported)
CI workflow posting sticky PR comments with delta tables
Label-gated full suite ('eval-full' triggers the slow run)
Nightly baseline re-run against current model snapshot
Hypothesis-first PR template, with a "what I expected" field
Append-only eval suite policy in CONTRIBUTING.md