Evals, in plain words

An eval is a small, trusted scoreboard you run against your own task. This entry explains why public benchmarks aren't enough and what a useful eval set looks like. It is the complement of reading benchmarks critically: that one taught you to be skeptical of public numbers; this one teaches you what to build in their place.

STEP 1

What an eval actually is.

Strip away the jargon and an eval is three things in a list, repeated:

  1. An input — a prompt, a document, a task description, a tool-call request.
  2. An expected behavior — what counts as a good response: the right answer, the right action, the right shape.
  3. A scoring rule — how you decide, programmatically or by a human, whether what the model produced matched the expectation.

That's it. An eval set is many such tuples, run through the same model (or agent) and aggregated into a score. The score is your scoreboard. The set is the bench. The whole thing is the practice of writing down, in a form you can re-run, what "good" means for your task. If you can describe success precisely enough to check it, in code or with a written rubric, you can write an eval. If you can't describe it at all, you don't have a defined task yet, and you have a different problem.

Evals work for the same reason unit tests work: a small, repeatable, automatable signal beats a large, lossy, manual one. The model gets better or worse; your scoreboard tells you which.
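
The three-part structure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference harness: the names EvalCase, exact_match, run_eval, and toy_model are hypothetical, and toy_model stands in for a real model call so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One eval tuple: input, expected behavior, scoring rule."""
    prompt: str                          # the input
    expected: str                        # what a good response looks like
    score: Callable[[str, str], bool]    # scoring rule: (output, expected) -> pass/fail


def exact_match(output: str, expected: str) -> bool:
    """Simplest scoring rule: case-insensitive equality after trimming whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the same model and aggregate into one pass rate."""
    passed = sum(case.score(model(case.prompt), case.expected) for case in cases)
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "refund" if "money back" in prompt else "other"


cases = [
    EvalCase("I want my money back", "refund", exact_match),
    EvalCase("Where is my parcel?", "shipping", exact_match),
]
score = run_eval(toy_model, cases)
print(f"pass rate: {score:.2f}")  # one of the two cases passes
```

Swapping toy_model for a real client call and re-running the same cases after every model or prompt change is what turns the list into a scoreboard.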

STEP 2

Why public benchmarks aren't enough.

It's tempting to pick the model topping MMLU or LMSYS Arena and call it done. That fails for four reasons, all of which compound:

  • They measure proxy tasks. MMLU tests multiple-choice knowledge questions across academic and professional subjects; the arena ranks models by which answer users prefer in head-to-head votes. Neither is your customer-support classifier, your contract-clause extractor, or your code-review agent. A model can ace the proxy and still be wrong for you.
  • Saturation. Once a benchmark is effectively solved, frontier models cluster near the ceiling (around 95%, say) and the differences between them vanish into noise. You can't pick between models that all answer "yes" to the same question.
  • Contamination. Famous benchmarks leak into training data. A high score may reflect memorization, not capability. The fresher and more obscure the benchmark, the cleaner the signal — which is exactly the property of your private set.
  • Selection bias. Vendors choose the prompts, the harness, and the comparison points in their own benchmark tables. Useful as a claim, useless as an audit.
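
To make the saturation point concrete, here is a rough back-of-the-envelope check of when a gap between two scores is smaller than sampling noise. It is a sketch under simplifying assumptions: each case is treated as an independent pass/fail draw, the two scores are treated as independent samples, and the numbers are illustrative, not real benchmark results. The function names are hypothetical.

```python
import math


def pass_rate_stderr(pass_rate: float, n_cases: int) -> float:
    """Standard error of a pass rate, treating cases as independent pass/fail draws."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_cases)


def gap_is_within_noise(rate_a: float, rate_b: float, n_cases: int, z: float = 1.96) -> bool:
    """True if the gap between two pass rates is smaller than roughly 95% sampling noise."""
    se_gap = math.sqrt(
        pass_rate_stderr(rate_a, n_cases) ** 2 + pass_rate_stderr(rate_b, n_cases) ** 2
    )
    return abs(rate_a - rate_b) < z * se_gap


# Illustrative numbers only.
print(gap_is_within_noise(0.95, 0.96, n_cases=1000))  # True: a 1-point gap near the ceiling
print(gap_is_within_noise(0.60, 0.70, n_cases=1000))  # False: a 10-point gap mid-range
```

Under these assumptions, a one-point lead at the top of a saturated leaderboard is not evidence that one model is better.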

None of this means public benchmarks are worthless — they're great for shortlisting. But the moment the decision is "which model do I actually ship," the only number that counts is the one your own set produces. The full version of this argument is in reading benchmarks critically; the takeaway here is just: don't outsource the decision to a leaderboard.

STEP 3

What a useful eval set looks like.

The single biggest mistake first-time eval builders make is going for size. They scrape 10,000 examples, write a generic LLM-as-judge prompt, and wonder why the resulting score moves all over the place and tells them nothing. A useful eval set is the other way around:

  • Small. 10 to 100 cases is plenty to start. You'll learn more from 20 specific cases you understand individually than from 2,000 you've never read.
  • Specific to your task. Built from your real traffic, your real customers, your real edge cases. Generic eval suites measure generic things.
  • Scored cheaply, or at least repeatably. Best: a code-checkable rule (exact match, regex, schema validation, tool-call assertion). Second-best: an LLM judge with a tight rubric and pinned model version. Worst-but-sometimes-fine: human review, capped to a sample.
  • At least one easy case. A sanity case that should pass trivially — if it doesn't, your harness is broken, not your model.
  • At least one hard case. A known-bad example you've fixed before, kept as a regression target. The day a model release silently re-breaks it, your scoreboard will tell you.
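
The code-checkable scoring rules named in the list above (exact match, regex, schema validation, tool-call assertion) might look like the following. This is a sketch using only the standard library; the function names and the tool-call dictionary shape (a "name" plus an "arguments" mapping) are assumptions for illustration, not any particular vendor's format.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Pass if the output equals the expected string, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()


def regex_match(output: str, pattern: str) -> bool:
    """Pass if the pattern occurs anywhere in the output."""
    return re.search(pattern, output) is not None


def has_schema(output: str, required: dict[str, type]) -> bool:
    """Pass if the output is a JSON object with the required keys and value types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


def tool_call_matches(call: dict, name: str, required_args: dict) -> bool:
    """Pass if the call names the right tool and carries the required argument values."""
    if call.get("name") != name:
        return False
    args = call.get("arguments", {})
    return all(args.get(key) == value for key, value in required_args.items())


print(exact_match(" refund \n", "refund"))
print(regex_match("Order ID: AB-1234", r"\b[A-Z]{2}-\d{4}\b"))
print(has_schema('{"label": "refund", "confidence": 0.9}', {"label": str, "confidence": float}))
print(tool_call_matches(
    {"name": "issue_refund", "arguments": {"order_id": "AB-1234", "amount": 20}},
    "issue_refund",
    {"order_id": "AB-1234"},
))
```

All four checks are deterministic and cost nothing to re-run, which is why they sit at the top of the preference order.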

Ten specific cases beat a thousand generic ones. If you can't justify why a case is in your set, it shouldn't be there. Quality of cases dominates quantity: a noisy 1,000-case scoreboard leads to worse decisions than a sharp 20-case one.
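
One way to use the easy and hard cases from the list above is to read them before the headline score. The sketch below assumes each case carries a hypothetical tag ("sanity", "regression", or "normal"); the tag names, CaseResult, and interpret are illustrative, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    kind: str      # hypothetical tags: "sanity", "regression", or "normal"
    passed: bool


def interpret(results: list[CaseResult]) -> str:
    """Read a run in order: harness health first, then regressions, then the score."""
    failed_sanity = [r.case_id for r in results if r.kind == "sanity" and not r.passed]
    if failed_sanity:
        # An easy case failing points at the harness, not the model.
        return f"harness broken: sanity cases failed {failed_sanity}"
    regressed = [r.case_id for r in results if r.kind == "regression" and not r.passed]
    if regressed:
        # A previously fixed case failing again is a regression.
        return f"regression: previously fixed cases failed {regressed}"
    rate = sum(r.passed for r in results) / len(results)
    return f"ok: pass rate {rate:.2f}"


results = [
    CaseResult("greeting", "sanity", True),
    CaseResult("double-refund-bug", "regression", False),
    CaseResult("late-parcel", "normal", True),
]
print(interpret(results))
```

With a small set, each failing case ID is something you can open and read, which is the point of keeping the set small.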

STEP 4

Online vs offline, and what to read next.

One distinction worth carrying out of this entry: offline evals run against a curated set in a CI-like loop — you control the inputs, you know the expected outputs, you run them before shipping. Online evals run against real production traffic — you don't pre-know the right answer, so you measure proxies (resolution rate, user thumbs, downstream conversion, override rate). You need both. Offline evals catch regressions before they ship; online evals tell you whether the system actually works for users once it has shipped. Neither replaces the other.
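
The difference shows up in what each kind of eval can compute. In the sketch below, the offline side gates a release on a pass rate because expected outputs are known, while the online side can only summarize proxy signals from logged traffic. The 0.9 threshold, the event field names ("resolved", "overridden", "thumb"), and both function names are assumptions for illustration.

```python
def offline_gate(passed: int, total: int, threshold: float = 0.9) -> bool:
    """Offline: expected outputs are known, so a pass rate can block a release."""
    return total > 0 and passed / total >= threshold


def online_proxies(events: list[dict]) -> dict[str, float]:
    """Online: no gold answers, so summarize proxy signals from logged traffic."""
    n = len(events)
    if n == 0:
        return {}
    # Only some users leave a thumbs rating; compute that rate over rated events.
    rated = [e for e in events if e.get("thumb") is not None]
    thumbs_up = sum(e["thumb"] == "up" for e in rated) / len(rated) if rated else float("nan")
    return {
        "resolution_rate": sum(e["resolved"] for e in events) / n,
        "override_rate": sum(e["overridden"] for e in events) / n,
        "thumbs_up_rate": thumbs_up,
    }


# Hypothetical logged events.
events = [
    {"resolved": True, "overridden": False, "thumb": "up"},
    {"resolved": True, "overridden": False, "thumb": None},
    {"resolved": False, "overridden": True, "thumb": "down"},
    {"resolved": True, "overridden": False, "thumb": "up"},
]
metrics = online_proxies(events)
print(offline_gate(passed=19, total=20))
print(metrics)
```

The offline function returns a decision; the online one returns numbers to watch over time, since a proxy moving is a prompt to investigate rather than a verdict.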

The deep version of why evaluating agents specifically is hard — non-determinism, multi-step error compounding, no single gold answer, path-dependence, eval cost, dataset rot — lives at Operations · Why evaluating agents is hard. The practice of building eval sets into your development loop (CI gates, golden trajectories, the production-to-eval flywheel) is eval-driven agent development. Read them in that order.