First Eval Suite — The Agentic AI Field Guide

1.4

Part I / Build · Week 4

Your first eval suite.

The thing that separates "built one" from "ships them." Three layers of evals: components, trajectories, end-to-end. You can't improve what you can't measure, and you can't measure agents by reading traces — you measure them with datasets, harnesses, and judges. This chapter builds the first version end-to-end; Part III · Evaluate extends the practice into a continuous discipline.

STEP 1

Build a 50-question dataset.

The hardest part of evals isn't the harness — it's the dataset. A bad dataset gives you confidence in a bad agent. A good dataset surfaces real failure modes. Spend a day on this. Don't outsource it. Don't generate the whole thing with an LLM and trust it.

The four tiers

Stratify the 50 questions into four tiers so the eval distinguishes capabilities, not just averages.

Tier 1 (15 questions): Lookups. Single fact, one document. "What port does Postgres use by default?" If the agent fails these, retrieval is broken.
Tier 2 (15 questions): Multi-hop. Need 2-3 docs synthesized. "How does PgBouncer pooling interact with prepared statements?"
Tier 3 (10 questions): Comparative. Comparing options with trade-offs. "When to use partitioning vs sharding."
Tier 4 (10 questions): Negative cases. Things not in the corpus. "How do I configure PgBouncer with Redis?" The agent should say "not in corpus," not invent an answer.

The schema

Provider-agnostic; just JSON.

# evals/dataset.jsonl: one JSON object per line in the real file (pretty-printed here for readability)
{
  "id": "q001",
  "tier": "lookup",
  "question": "What port does PostgreSQL listen on by default?",
  "expected_answer_contains": ["5432"],
  "expected_chunks": ["runtime-config-connection::0"],
  "forbidden_phrases": ["I don't know", "not sure"],
  "notes": "basic fact, must succeed on every run"
}
{
  "id": "q024",
  "tier": "multi_hop",
  "question": "Why do prepared statements break in PgBouncer transaction mode?",
  "expected_answer_contains": ["session", "transaction"],
  "expected_chunks": ["pgbouncer-modes::1", "sql-prepare::0"],
  "requires_synthesis": true
}
{
  "id": "q045",
  "tier": "negative",
  "question": "How does PostgreSQL integrate with Redis Streams?",
  "expected_behavior": "refuse",
  "expected_answer_contains": ["not in", "corpus"],
  "forbidden_phrases": ["you can use", "the integration"]
}
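
Because the harness reads this file one record per line, a small validation pass can catch malformed records and a skewed tier mix before any eval runs. The sketch below is one way to do it, under stated assumptions: `validate_dataset`, `REQUIRED_FIELDS`, and `EXPECTED_TIER_COUNTS` are illustrative names, and the `comparative` tier label is a guess (the examples above only show `lookup`, `multi_hop`, and `negative`).

```python
import json
from collections import Counter

# Hypothetical target mix, taken from the four tiers described above.
# The "comparative" label is an assumption, not shown in the schema examples.
EXPECTED_TIER_COUNTS = {"lookup": 15, "multi_hop": 15, "comparative": 10, "negative": 10}
REQUIRED_FIELDS = ("id", "tier", "question")


def validate_dataset(lines):
    """Return a list of human-readable problems found in JSONL lines."""
    problems = []
    seen_ids = set()
    tiers = Counter()
    for lineno, line in enumerate(lines, 1):
        if not line.strip():
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            problems.append(f"line {lineno}: missing fields {missing}")
            continue
        if record["id"] in seen_ids:
            problems.append(f"line {lineno}: duplicate id {record['id']}")
        seen_ids.add(record["id"])
        tiers[record["tier"]] += 1
    # Compare the observed tier mix against the intended stratification.
    for tier, want in EXPECTED_TIER_COUNTS.items():
        if tiers[tier] != want:
            problems.append(f"tier {tier}: expected {want}, found {tiers[tier]}")
    return problems
```

Running `validate_dataset(open("evals/dataset.jsonl"))` before each eval and failing on a non-empty result keeps a hand-edited dataset honest.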

How to build the questions

Three sources, in order:

Your trace log from Phases 1-3. Every question your agent answered wrong is a perfect eval question. Failing real cases > inventing cases.
Browsing the docs. Open the corpus, find sections you find non-trivial, write a natural-language question whose answer is in that section.
LLM-generated, then human-filtered. Have a model propose 100 questions from the corpus; you keep the 30 that are actually realistic and discard the rest. This is the only step where LLMs help, and only with a human gate.

Do not let the agent itself generate the eval dataset. You will end up measuring "questions the agent finds easy" and never see the questions it can't handle.

Question

Why only 50 questions? Bigger is better, right?

Bigger is better asymptotically, but the first 50 are the highest-leverage questions you'll ever write. They're where you fix the most bugs per dollar of API spend. Spend the time on quality: each question should test something specific and have unambiguous pass/fail criteria.

Grow the dataset later, after you know what failure modes are common in production. 50 → 200 → 500 as you scale, not 5000 upfront.

STEP 2

Evaluate components in isolation.

End-to-end evals are noisy. They mix retrieval failures, planning failures, synthesis failures, and verifier failures into one number. To debug, you need to know which component is broken. So evaluate components separately.

Retrieval eval: recall@k and MRR

For each question in the dataset that has expected_chunks, run retrieval and ask: did we surface the expected chunk in the top k results?

# evals/eval_retrieval.py
import json
from retrieval import retrieve
from retrieval.bm25_index import BM25Index
from retrieval.dense_index import DenseIndex
from retrieval.hybrid import HybridSearch

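def load_dataset(path="evals/dataset.jsonl"):
    # Close the file when done and skip blank lines between records.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]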

def eval_retrieval(retriever, dataset, k=5):
    hits, mrr_sum, n = 0, 0.0, 0
    for q in dataset:
        if not q.get("expected_chunks"): continue
        n += 1
        results = retriever(q["question"], top_k=k)
        result_ids = [c.chunk_id for c in results]
        # Recall: at least one expected chunk in top-k
        if any(eid in result_ids
               for eid in q["expected_chunks"]):
            hits += 1
        # MRR: 1 / rank of first expected chunk
        for rank, rid in enumerate(result_ids, 1):
            if rid in q["expected_chunks"]:
                mrr_sum += 1.0 / rank
                break
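    if n == 0:
        # Avoid a ZeroDivisionError when no question lists expected_chunks.
        raise ValueError("no questions with expected_chunks in dataset")
    return {"recall@k": hits / n, "mrr": mrr_sum / n, "n": n}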

What it tells you

Run the same eval with three different retrievers — BM25 only, dense only, hybrid — and compare:

              recall@5   recall@20    MRR    n
BM25 only       0.62       0.81      0.41   40
Dense only      0.68       0.85      0.47   40
Hybrid (RRF)    0.82       0.94      0.58   40
Hybrid+rerank   0.89       0.94      0.71   40

How to read this

Recall@5 from 0.62 to 0.89 is what justified all of Phase 2. Without numbers like these, you're guessing whether the complexity was worth it.

recall@20 plateaus at 0.94 for both hybrid variants. That means the right chunk is in the top 20 most of the time — the reranker isn't finding better chunks, it's ordering them better. MRR (0.58 → 0.71) confirms this — the right chunk is moving up the list.

The 6% missing from recall@20 is questions where retrieval simply can't find the right chunk. That's a chunking problem (chunk doesn't contain the answer phrase) or a corpus problem (answer requires multi-doc synthesis you can't get from retrieval alone).
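
To see why a reranker can lift MRR while recall stays flat, it helps to work the two metrics by hand on a toy case. The sketch below uses made-up chunk ids and rankings (none of them come from the real corpus), and `recall_and_mrr` is an illustrative helper that mirrors the logic of `eval_retrieval` without needing a retriever.

```python
def recall_and_mrr(ranked_ids_per_question, expected_per_question, k):
    """Hit-rate style recall@k and MRR, both computed within the top-k results."""
    hits, rr_sum = 0, 0.0
    n = len(ranked_ids_per_question)
    for ranked, expected in zip(ranked_ids_per_question, expected_per_question):
        top_k = ranked[:k]
        # 1-based ranks of every expected chunk that made the top k.
        ranks = [r for r, cid in enumerate(top_k, 1) if cid in expected]
        if ranks:
            hits += 1
            rr_sum += 1.0 / ranks[0]
    return hits / n, rr_sum / n


# Toy data: the same candidate chunks, before and after reordering.
expected = [{"x"}, {"y"}]
before_rerank = [["a", "b", "x"], ["c", "y", "d"]]
after_rerank = [["x", "a", "b"], ["y", "c", "d"]]

print("before:", recall_and_mrr(before_rerank, expected, k=3))
print("after: ", recall_and_mrr(after_rerank, expected, k=3))
```

In both runs every expected chunk is somewhere in the top 3, so recall@3 is identical; only the reciprocal ranks change, which is the same pattern as the two hybrid rows in the table.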

Verifier eval: agreement with hand labels

Hand-label 30 (claim, sources, verdict) triples — yourself, not an LLM — then run the verifier and check agreement.

# evals/verifier_gold.jsonl — hand-labeled
{"claim": "Transaction pooling breaks prepared statements",
 "sources": ["pgbouncer-modes::1"],
 "human_verdict": "SUPPORT"}
{"claim": "This is the most common migration failure",
 "sources": ["pgbouncer-modes::1"],
 "human_verdict": "NOT_SUPPORTED"}

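# verify() and load_chunks() are the verifier and chunk loader from your
# own project; import them from wherever they live in your layout.
def eval_verifier(gold):
    if not gold:
        raise ValueError("gold set is empty")
    correct = 0
    for ex in gold:
        result = verify(ex["claim"],
                        load_chunks(ex["sources"]))
        # Each gold example holds a single claim, so read the first verdict.
        v = result["claims"][0]["verdict"]
        if v == ex["human_verdict"]:
            correct += 1
    return {"agreement": correct / len(gold)}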

verifier agreement with humans: 0.87  (26 of 30)

inspecting disagreements:
  3 cases where verifier said PARTIAL, human said SUPPORT
    → verifier is slightly too strict on phrasing
  1 case where verifier said SUPPORT, human said PARTIAL
    → genuine bug; source says "X under condition Y"
      but verifier missed the condition

What 87% agreement means

Of 30 hand-labeled claims, the verifier got 26 right. The 4 disagreements split: 3 false-PARTIAL (overly strict — not a safety problem) and 1 false-SUPPORT (genuine bug — a hallucination would slip through). Acceptable for now; the false-SUPPORT case becomes a regression test for the next verifier-prompt iteration.

Below 80% agreement, the verifier is adding noise; tune until you cross 80% before relying on it.
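
A single agreement number hides which direction the errors go, and a false SUPPORT is the one that lets a hallucination through. One way to make that visible is to report precision and recall for the SUPPORT verdict alongside agreement. This is a sketch under assumptions: `verdict_report` is an illustrative helper, and the label pairs below are invented toy data, not the 30 gold examples.

```python
from collections import Counter


def verdict_report(pairs, positive="SUPPORT"):
    """Summarize (human_verdict, verifier_verdict) pairs for one verdict class."""
    pairs = list(pairs)
    confusion = Counter(pairs)
    tp = confusion[(positive, positive)]
    # Verifier said `positive` but the human did not: the risky direction.
    fp = sum(c for (h, v), c in confusion.items() if v == positive and h != positive)
    # Human said `positive` but the verifier was stricter.
    fn = sum(c for (h, v), c in confusion.items() if h == positive and v != positive)
    agreement = sum(c for (h, v), c in confusion.items() if h == v) / len(pairs)
    return {
        "agreement": agreement,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_count": fp,
    }


# Toy labels for illustration only.
toy_pairs = (
    [("SUPPORT", "SUPPORT")] * 4
    + [("SUPPORT", "PARTIAL")]          # verifier too strict
    + [("PARTIAL", "SUPPORT")]          # verifier too lenient
    + [("NOT_SUPPORTED", "NOT_SUPPORTED")] * 4
)
print(verdict_report(toy_pairs))
```

Low recall on SUPPORT means the verifier is overly strict; low precision means unsupported claims are being waved through, which deserves attention first.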

The same work, in RAGAS vocabulary

What we just built has standard names that practitioners now expect you to use. RAGAS is the common framework that packages these metrics (LLM-as-judge under the hood, the same idea as our verifier and judge):

Faithfulness — is the answer grounded in the retrieved context? This is exactly our grounding verifier: every claim must trace to a cited chunk.
Answer (response) relevancy — does the answer actually address the question, not just stay on-topic?
Context precision — of the chunks we retrieved, how many are actually relevant (signal vs noise in the top-k)?
Context recall — did retrieval surface everything needed to answer? This is our recall@k by another name.

Why the mapping matters

You don't need a new tool: recall@k approximates context recall (it counts a hit as soon as any one expected chunk shows up in the top k), and the verifier is a faithfulness check. But when a teammate or a vendor says "our faithfulness is 0.91 but context precision is low," you should hear "answers are grounded but retrieval drags in junk" and know which knob to turn. Learn the vocabulary; the work is already done.

STEP 3

Evaluate trajectories with LLM-as-judge.

Component evals tell you whether each piece works. Trajectory evals tell you whether the agent as a whole succeeds — and that's not just "is the final answer right." It's also: did it use a reasonable number of steps? Did it cite the right sources? Did it avoid forbidden behaviors?

Hard checks first (cheap, deterministic)

# evals/eval_trajectory.py
def hard_checks(q: dict, result: dict) -> dict:
    answer = result.get("answer", "").lower()
    checks = {}

    # Did the answer mention expected terms?
    expected = q.get("expected_answer_contains", [])
    checks["contains_expected"] = all(
        term.lower() in answer for term in expected
    )

    # Avoid forbidden phrases (especially for negatives)
    forbidden = q.get("forbidden_phrases", [])
    checks["no_forbidden"] = not any(
        p.lower() in answer for p in forbidden
    )

    # Citation overlap with expected chunks
    cited = set(result.get("citations", []))
    expected_chunks = set(q.get("expected_chunks", []))
    if expected_chunks:
        checks["citation_overlap"] = bool(cited & expected_chunks)

    # Step efficiency: didn't burn budget
    checks["under_budget"] = result.get("steps_used", 99) <= 8

    # For negative cases: agent must have refused
    if q.get("expected_behavior") == "refuse":
        checks["refused"] = (
            "not in" in answer or
            "corpus" in answer or
            "don't have" in answer
        )

    return checks
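
The chapter does not spell out how the individual checks roll up into the "hard pass" counts reported in the full run. A reasonable reading, assumed here, is that a question passes only when every check that applied to it is true. `hard_pass`, `failed_checks`, and `tier_summary` are illustrative names.

```python
from collections import defaultdict


def hard_pass(checks: dict) -> bool:
    """A question passes only if every applicable check came back True."""
    return all(checks.values())


def failed_checks(checks: dict) -> list:
    """Names of the checks that failed, for the failure report."""
    return sorted(name for name, ok in checks.items() if not ok)


def tier_summary(rows):
    """rows: iterable of (tier, checks) pairs -> {tier: (passed, total)}."""
    totals = defaultdict(lambda: [0, 0])
    for tier, checks in rows:
        totals[tier][1] += 1
        if hard_pass(checks):
            totals[tier][0] += 1
    return {tier: tuple(counts) for tier, counts in totals.items()}
```

Keeping the per-check booleans around, and not just the final pass/fail, is what lets the failure report say why a question failed.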

Then soft checks via LLM-judge

For "is the answer correct and well-formed" — questions that hard checks can't answer — use a judge model. Use a different model than the agent (cross-model judging reduces self-confirmation bias). If the agent runs on Claude, judge with GPT, and vice versa.

# evals/judge.py — judge with GPT
import json

from openai import OpenAI
judge_client = OpenAI()

JUDGE_PROMPT = """Rate the agent's answer on three axes.

QUESTION: {question}
EXPECTED ANSWER NOTES: {expected_notes}
AGENT'S ANSWER: {answer}
AGENT'S CITATIONS: {citations}

Score each from 1 (terrible) to 5 (excellent):
- correctness: does it answer the question accurately?
- completeness: are key aspects covered?
- groundedness: are claims supported by citations?

Output JSON: {{"correctness": N, "completeness": N,
"groundedness": N, "reasoning": "one sentence"}}"""

def judge(q, result):
    response = judge_client.responses.create(
        model="gpt-5.5",
        input=JUDGE_PROMPT.format(
            question=q["question"],
            expected_notes=q.get("notes", ""),
            answer=result.get("answer", ""),
            citations=result.get("citations", []),
        ),
        text={"format": {"type": "json_object"}},
    )
    return json.loads(response.output_text)

# evals/judge.py — judge with Claude
import json

from anthropic import Anthropic
judge_client = Anthropic()

JUDGE_PROMPT = """Rate the agent's answer on three axes.

QUESTION: {question}
EXPECTED ANSWER NOTES: {expected_notes}
AGENT'S ANSWER: {answer}
AGENT'S CITATIONS: {citations}

Score each from 1 (terrible) to 5 (excellent):
- correctness: does it answer the question accurately?
- completeness: are key aspects covered?
- groundedness: are claims supported by citations?

Output JSON only: {{"correctness": N, "completeness": N,
"groundedness": N, "reasoning": "one sentence"}}"""

def judge(q, result):
    response = judge_client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(
                       question=q["question"],
                       expected_notes=q.get("notes", ""),
                       answer=result.get("answer", ""),
                       citations=result.get("citations", []),
                   )}],
    )
    return json.loads(response.content[0].text)
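
Both judges pass the raw reply straight to `json.loads`, which can fail if the model wraps the JSON in a Markdown fence or adds a sentence around it, and which accepts out-of-range scores silently. A defensive parser is one way to handle that. This is a sketch, not part of either SDK: `parse_judge_output` and `SCORE_KEYS` are illustrative names.

```python
import json
import re

SCORE_KEYS = ("correctness", "completeness", "groundedness")


def parse_judge_output(text: str) -> dict:
    """Extract and validate the judge's JSON scores from raw model text."""
    # Take the span from the first "{" to the last "}" so that Markdown
    # fences or stray prose around the object are ignored.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    scores = json.loads(match.group(0))
    for key in SCORE_KEYS:
        value = scores.get(key)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {value!r}")
    return scores
```

In either `judge` function you could return `parse_judge_output(...)` on the reply text in place of the bare `json.loads(...)`, and treat a `ValueError` as a failed judgment to retry.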

Run the judge 3 times per question

LLM judges are noisy. Their scores vary across runs even with temperature=0. To get a reliable signal, run each judgment 3 times and take the median. The variance itself is interesting data: if a judge gives scores [3, 5, 4] for the same example, your eval signal is weak for that question.

def judge_with_variance(q, result, n=3):
    # Use an odd n so the middle of the sorted scores is the true median.
    scores = [judge(q, result) for _ in range(n)]
    def median(key):
        return sorted(s[key] for s in scores)[n // 2]
    correctness = [s["correctness"] for s in scores]
    return {
        "correctness": median("correctness"),
        "completeness": median("completeness"),
        "groundedness": median("groundedness"),
        # Despite the name, this is the range (max - min) of the
        # correctness scores, not a statistical variance.
        "variance": max(correctness) - min(correctness),
    }
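
To turn that spread into the "high-variance questions" list, you need a threshold. The sketch below aggregates already-collected judge outputs, reports the spread on every axis, and flags a question when any axis spreads by 2 or more. The threshold and the name `aggregate_judgments` are assumptions, chosen so that a run like [3, 5, 4] gets flagged.

```python
import statistics

AXES = ("correctness", "completeness", "groundedness")


def aggregate_judgments(judgments, spread_threshold=2):
    """Median per axis plus max-min spread, from repeated judge outputs."""
    summary = {}
    for axis in AXES:
        values = [j[axis] for j in judgments]
        summary[axis] = statistics.median(values)
        summary[f"{axis}_spread"] = max(values) - min(values)
    # Flag the question if the judge disagrees with itself on any axis.
    summary["high_variance"] = any(
        summary[f"{axis}_spread"] >= spread_threshold for axis in AXES
    )
    return summary


runs = [
    {"correctness": 3, "completeness": 4, "groundedness": 4},
    {"correctness": 5, "completeness": 4, "groundedness": 4},
    {"correctness": 4, "completeness": 4, "groundedness": 4},
]
print(aggregate_judgments(runs))
```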

The full eval run

$ python scripts/eval.py --full

evaluating 50 questions across 4 tiers...

Tier 1 (lookup):     15/15 hard pass    avg judge: 4.7 / 5.0
Tier 2 (multi-hop):  13/15 hard pass    avg judge: 4.2 / 5.0
Tier 3 (comparative): 8/10 hard pass    avg judge: 3.9 / 5.0
Tier 4 (negative):    9/10 hard pass    avg judge: 4.5 / 5.0

OVERALL hard-pass rate: 45/50 = 90.0%
OVERALL judge correctness (median): 4.3 / 5.0
OVERALL judge groundedness (median): 4.6 / 5.0

high-variance questions (judge disagrees with itself):
  q017: scores [3, 5, 4] — answer is partially correct,
        judge unsure which way to score
  q031: scores [2, 5, 3] — answer is verbose; judge
        differs on whether length helps or hurts

failing questions:
  q024: missed expected chunk pgbouncer-modes::1
        (retrieval ranked it at 7)
  q037: cited a chunk that doesn't support the claim
        (verifier should have caught this — bug)
  q045: invented Redis integration details
        (negative case failed)

This is what shipping looks like

90% hard-pass is a real, defensible number you can graph over time. Specific failing questions become specific bugs to fix. The high-variance ones tell you where your eval signal is weakest — those are candidates for rewriting (more precise question, clearer expected answer).

You now have something you can show a teammate: "our agent passes 90% of the test set, here's what fails and why, here's the trajectory of the number across the last 20 commits."

Question

Isn't LLM-as-judge unreliable? I've read criticism of it.

Yes, used naively. The known failure modes are: position bias (preferring the first answer shown), length or verbosity bias (preferring longer answers, most visibly on subjective questions), and self-affinity (a model rating its own outputs higher). All real.

Mitigations that work: cross-model judging (Claude evaluating GPT or vice versa, never self-rating), running 3+ times and using medians, hard checks for objective facts before any LLM judgment, and hand-auditing ~10% of judgments monthly to detect drift.

With these guardrails, LLM-as-judge is the best tool we have for scoring open-ended outputs. Without them, it's a random-number generator that feels rigorous.

Question

Why median instead of mean across 3 runs?

Robustness to outliers. If the judge gives [3, 4, 5], mean and median both say 4 — fine. If it gives [4, 4, 1], the mean says 3 (an outlier dragged it down) while the median says 4 (correctly ignores the outlier). LLM judges occasionally produce wildly off scores; median is more stable.

With n=3, you're really getting a "majority vote with outlier tolerance." With n=5+, mean becomes safer. We use 3 because the cost-per-judge matters at 50 questions × 3 runs = 150 judge calls per eval run.
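
The arithmetic is quick to check with the standard library. This snippet only reproduces the two score triples from the answer above; `mean_and_median` is an illustrative helper.

```python
import statistics


def mean_and_median(scores):
    """Return (mean, median) for a list of judge scores."""
    return statistics.mean(scores), statistics.median(scores)


for scores in ([3, 4, 5], [4, 4, 1]):
    mean, median = mean_and_median(scores)
    print(f"{scores}: mean={mean}, median={median}")
```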

STEP 4

Wire evals into your dev loop.

An eval suite that you run once a week is research; an eval suite that runs on every commit is engineering. Make it cheap enough that you'll actually run it, fast enough to use during development, with clear pass/fail signals.

Two tiers: fast and full

The full eval takes minutes; you don't want to wait for it after every code change. Split into a fast tier (10 questions, runs in ~30 seconds) for iteration, and the full tier (50 questions, 3-5 minutes) for confidence.

# Makefile (recipe lines must be indented with a tab, not spaces)
.PHONY: eval-fast eval-full eval-watch ci

eval-fast:
	python scripts/eval.py --tier fast --n 10

eval-full:
	python scripts/eval.py --full --judge --variance 3

# Re-run the fast eval whenever agent/ or retrieval/ changes
eval-watch:
	fswatch -o agent/ retrieval/ | xargs -n1 -I{} $(MAKE) eval-fast

# Full eval first, then the regression gate
ci: eval-full
	python scripts/eval.py --regression-check
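
How the 10 fast-tier questions get chosen is left open. One option, sketched here as an assumption and not as what `scripts/eval.py` actually does, is a seeded, proportional sample per tier, so the fast run keeps the same mix as the full dataset and stays comparable between runs. `pick_fast_subset` is an illustrative name.

```python
import random
from collections import defaultdict


def pick_fast_subset(dataset, n=10, seed=0):
    """Pick about n questions, proportionally per tier, reproducibly."""
    by_tier = defaultdict(list)
    for q in dataset:
        by_tier[q["tier"]].append(q)
    rng = random.Random(seed)  # fixed seed: same subset on every run
    subset = []
    for tier in sorted(by_tier):
        questions = by_tier[tier]
        # Proportional share, but always at least one question per tier.
        share = max(1, round(n * len(questions) / len(dataset)))
        subset.extend(rng.sample(questions, min(share, len(questions))))
    return subset
```

With the 15/15/10/10 split this yields 3/3/2/2. For other mixes, rounding can return slightly more or fewer than `n` questions.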

Track scores over time

The single most useful artifact: a CSV of every eval run with date, git commit, and scores. Plot it.

# evals/scoreboard.csv: append one row after every run
date,commit,hard_pass,judge_correct,judge_grounded,steps_p50,note
2026-05-08,a1b2c3d,0.72,3.8,4.1,5.2,
2026-05-09,e4f5g6h,0.78,4.0,4.3,4.8,added reranker
2026-05-10,i7j8k9l,0.84,4.2,4.5,4.5,tuned chunk size
2026-05-12,m0n1o2p,0.88,4.3,4.6,4.2,added planner
2026-05-15,q3r4s5t,0.90,4.3,4.6,3.8,subagent context isolation

Now you have a story. "Adding the reranker bought us 6 points of hard-pass. Subagents didn't move correctness but cut median steps from 4.2 to 3.8." That's the kind of thing you ship in a PR description and present at a review.
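
Appending to the scoreboard is easy to automate at the end of an eval run. The sketch below uses only the standard library; `append_score` and `current_commit` are illustrative names, and the free-text `note` column is an assumption for the kind of annotation shown next to each row.

```python
import csv
import datetime
import os
import subprocess

FIELDS = ["date", "commit", "hard_pass", "judge_correct",
          "judge_grounded", "steps_p50", "note"]


def current_commit():
    """Short hash of HEAD; only works inside a git checkout."""
    out = subprocess.run(["git", "rev-parse", "--short", "HEAD"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()


def append_score(path, scores, commit=None, date=None, note=""):
    """Append one eval run, writing the header first if the file is new."""
    row = {
        "date": date or datetime.date.today().isoformat(),
        "commit": commit or current_commit(),
        "note": note,
        **scores,
    }
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

A call such as `append_score("evals/scoreboard.csv", {"hard_pass": 0.90, "judge_correct": 4.3, "judge_grounded": 4.6, "steps_p50": 3.8}, note="subagent context isolation")` adds one row; the `csv` module quotes the note if it contains a comma.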

Regression checks

A simple gate: any question that passed last time and fails now is a regression. Block the merge.

import sys

def regression_check(current, baseline):
    regressed = []
    for qid, c in current.items():
        b = baseline.get(qid)
        if b and b["passed"] and not c["passed"]:
            regressed.append(qid)
    if regressed:
        print(f"REGRESSION: {regressed}")
        sys.exit(1)
    print("no regressions")

The first time a teammate refuses to merge your PR because make ci shows a regression on q024, you'll know the eval suite is actually doing its job. That's the goal.
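
The gate relies on per-question results from the previous run, keyed by question id with a `passed` flag (that shape is inferred from `regression_check`; how you store it is up to you). A side-effect-free variant that returns lists is easier to test and also lets you report fixes alongside regressions. `find_regressions` and `find_fixes` are illustrative names, and the results below are toy data.

```python
def find_regressions(current, baseline):
    """Question ids that passed in the baseline run but fail now."""
    return sorted(
        qid for qid, c in current.items()
        if baseline.get(qid, {}).get("passed") and not c["passed"]
    )


def find_fixes(current, baseline):
    """Question ids that failed in the baseline run but pass now."""
    return sorted(
        qid for qid, c in current.items()
        if qid in baseline and not baseline[qid]["passed"] and c["passed"]
    )


# Toy results for illustration only.
baseline = {"q001": {"passed": True}, "q024": {"passed": True},
            "q045": {"passed": False}}
current = {"q001": {"passed": True}, "q024": {"passed": False},
           "q045": {"passed": True}, "q051": {"passed": False}}

print("regressions:", find_regressions(current, baseline))
print("fixes:", find_fixes(current, baseline))
```

As in `regression_check`, a question that is new in the current run (`q051` here) is never counted as a regression, so adding hard questions to the dataset does not block a merge.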

Question

My eval suite agrees with itself but disagrees with what users think is good. Why?

Your dataset doesn't reflect real usage. Common causes:

Question distribution wrong. Your test set has 30% comparatives but real users ask 80% lookups. Optimize for the lookups.
Phrasing too clean. Real users misspell, abbreviate, use jargon. Your eval should mirror that — pull real questions from logs once you have them.
Missing failure modes. Users complain about formatting or tone; your eval only checks correctness. Add tone/format checks.

The fix is dataset iteration. Treat the dataset as living software — review and update it monthly based on what users actually need.
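
A quick way to spot the first cause is to compare the tier mix of the eval set with a hand-labeled sample of production questions. This sketch assumes you have already assigned a tier to each sampled production question; `tier_shares` and `distribution_gap` are illustrative names, and the production counts are invented.

```python
from collections import Counter


def tier_shares(tiers):
    """Fraction of questions in each tier."""
    counts = Counter(tiers)
    total = sum(counts.values())
    return {tier: count / total for tier, count in counts.items()}


def distribution_gap(eval_tiers, prod_tiers):
    """Per-tier difference: production share minus eval share."""
    e, p = tier_shares(eval_tiers), tier_shares(prod_tiers)
    return {t: p.get(t, 0.0) - e.get(t, 0.0) for t in sorted(set(e) | set(p))}


eval_tiers = (["lookup"] * 15 + ["multi_hop"] * 15
              + ["comparative"] * 10 + ["negative"] * 10)
prod_tiers = ["lookup"] * 80 + ["comparative"] * 20  # toy sample

for tier, gap in distribution_gap(eval_tiers, prod_tiers).items():
    print(f"{tier:12s} {gap:+.2f}")
```

A large positive gap marks a tier that users hit far more often than the eval set does, which is where new questions should go first.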

Question

When should I move from this DIY harness to a vendor (Braintrust, Langfuse, Weights & Biases)?

When you have multiple people running evals, want to share results across a team, or need fancier features like A/B comparisons across runs and richer dashboards. The patterns we built here — dataset of expected outputs, hard checks + LLM-judge with variance, per-component evals, regression gates — are the same ideas those tools wrap in nicer UIs.

Build it yourself first so you understand the abstractions. Adopt a vendor when scale or team coordination demands it. The Python-and-Makefile approach is genuinely production-grade for a single team — don't feel obligated to pay for sophistication you don't yet need.

End of week 4

Deliverable

An eval suite that runs on every commit, tracks scores over time, and blocks regressions. You can now make changes to the agent and know, with numbers, whether they helped, hurt, or were neutral. This is the difference between hobby and product. Part III · Evaluate extends this: eval-driven development as ongoing practice, calibrating LLM-as-judge, reading public benchmarks, CI integration.

50-question stratified dataset (4 tiers, JSONL)
Retrieval eval: recall@5, recall@20, MRR
Verifier eval: agreement with 30 hand-labels
Trajectory eval: hard checks + LLM-judge × 3 with variance
Makefile with eval-fast / eval-full / CI gate
scoreboard.csv tracking scores over commits