Evals, in plain words.
An eval is a small, trusted scoreboard you run against your own task. This entry covers why public benchmarks aren't enough and what a useful eval set looks like. It is the complement of reading benchmarks critically: that one taught you to be skeptical of public numbers; this one teaches you what to build in their place.
What an eval actually is.
Strip away the jargon and an eval is a list of cases, each made of three things:
- An input — a prompt, a document, a task description, a tool-call request.
- An expected behavior — what counts as a good response: the right answer, the right action, the right shape.
- A scoring rule — how you decide, programmatically or by a human, whether what the model produced matched the expectation.
That's it. An eval set is many such triples, run through the same model (or agent) and aggregated into a score. The score is your scoreboard. The set is the bench. The whole thing is the practice of writing down, in a form you can re-run, what "good" means for your task. If you can describe success in code, or in a rubric a reviewer can apply consistently, you can write an eval. If you can't describe it at all, you don't have a defined task yet, and you have a different problem.
Evals work for the same reason unit tests work: a small, repeatable, automatable signal beats a large, lossy, manual one. The model gets better or worse; your scoreboard tells you which.
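The three-part structure above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended harness: EvalCase, run_eval, exact_match, and toy_model are hypothetical names invented for this example, and the "model" is a stand-in keyword classifier so the block runs without any API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input_text: str                      # the input given to the model
    expected: str                        # what a good response looks like
    score: Callable[[str, str], bool]    # scoring rule: (output, expected) -> pass/fail


def exact_match(output: str, expected: str) -> bool:
    """Simplest code-checkable rule: normalized string equality."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the same model and return the pass rate."""
    passed = sum(case.score(model(case.input_text), case.expected) for case in cases)
    return passed / len(cases)


# Hypothetical stand-in for a real model call: a keyword-based ticket classifier.
def toy_model(prompt: str) -> str:
    return "billing" if "invoice" in prompt.lower() else "other"


cases = [
    EvalCase("My invoice is wrong", "billing", exact_match),
    EvalCase("The app crashes on login", "technical", exact_match),
]

score = run_eval(toy_model, cases)
print(f"pass rate: {score:.2f}")  # prints "pass rate: 0.50"
```

Swapping toy_model for a real model call leaves the cases and the scoring untouched, which is what makes the score comparable across models and releases.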
Why public benchmarks aren't enough.
It's tempting to pick the model topping MMLU or LMSYS Arena and call it done. That fails for four reasons, all of which compound:
- They measure proxy tasks. MMLU tests multiple-choice academic and professional knowledge; the arena ranks models by which answer users prefer in head-to-head votes. Neither is your customer-support classifier, your contract-clause extractor, or your code-review agent. A model can ace the proxy and still be wrong for you.
- Saturation. Once a benchmark is effectively solved, frontier models cluster near the ceiling (say, ~95%) and the differences between them vanish into noise. You can't pick between models that all answer "yes" to the same question.
- Contamination. Famous benchmarks leak into training data. A high score may reflect memorization, not capability. The fresher and more obscure the benchmark, the cleaner the signal — which is exactly the property of your private set.
- Selection bias. In a vendor's benchmark table, the vendor chose the prompts, the harness, and the comparison points. Useful as a claim, useless as an audit.
None of this means public benchmarks are worthless — they're great for shortlisting. But the moment the decision is "which model do I actually ship," the only number that counts is the one your own set produces. The full version of this argument is in reading benchmarks critically; the takeaway here is just: don't outsource the decision to a leaderboard.
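To make the saturation point concrete, here is a rough back-of-the-envelope check of whether a one-point gap near the ceiling is distinguishable from noise. It is a sketch under simplifying assumptions: each benchmark item is treated as an independent pass/fail trial, the two scores are treated as independent samples, and the normal approximation is used. The function name score_gap_z and the example numbers are illustrative, not taken from any real leaderboard.

```python
import math


def score_gap_z(acc_a: float, acc_b: float, n: int) -> float:
    """z-statistic for the gap between two accuracies, each measured on n items.

    Assumes independent items and independent samples (a simplification:
    when both models run on the same items, a paired test is more appropriate).
    """
    se_a = math.sqrt(acc_a * (1 - acc_a) / n)   # binomial standard error, model A
    se_b = math.sqrt(acc_b * (1 - acc_b) / n)   # binomial standard error, model B
    return (acc_b - acc_a) / math.sqrt(se_a**2 + se_b**2)


# Illustrative numbers: 95% vs 96% on a 1,000-item benchmark.
z = score_gap_z(0.95, 0.96, n=1000)
print(f"z = {z:.2f}")  # about 1.08, under the conventional 1.96 cutoff
```

Under these assumptions, a 95% versus 96% result on 1,000 items does not clear the usual two-sided 95% threshold, so the ranking between the two models could plausibly flip on a rerun with different items.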
What a useful eval set looks like.
The single biggest mistake first-time eval builders make is going for size. They scrape 10,000 examples, write a generic LLM-as-judge prompt, and wonder why the resulting score moves all over the place and tells them nothing. A useful eval set is the other way around:
- Small. 10 to 100 cases is plenty to start. You'll learn more from 20 specific cases you understand individually than from 2,000 you've never read.
- Specific to your task. Built from your real traffic, your real customers, your real edge cases. Generic eval suites measure generic things.
- Scored cheaply, or at least repeatably. Best: a code-checkable rule (exact match, regex, schema validation, tool-call assertion). Second-best: an LLM judge with a tight rubric and pinned model version. Worst-but-sometimes-fine: human review, capped to a sample.
- At least one easy case. A sanity case that should pass trivially — if it doesn't, your harness is broken, not your model.
- At least one hard case. A known-bad example you've fixed before, kept as a regression target. The day a model release silently re-breaks it, your scoreboard will tell you.
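The code-checkable rules named in the list (exact match, regex, schema validation, tool-call assertion) might look like the following. This is a sketch using only the standard library: the function names, the example strings, and the tool-call dictionary shape ("name" plus "arguments") are assumptions made for illustration, not the format of any particular API.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()


def regex_match(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None


def has_schema(output: str, required: dict[str, type]) -> bool:
    """Output must parse as a JSON object with the required keys and value types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


def tool_call_matches(call: dict, name: str, required_args: dict) -> bool:
    """The call must name the expected tool and carry the expected argument values."""
    if call.get("name") != name:
        return False
    args = call.get("arguments", {})
    return all(args.get(key) == value for key, value in required_args.items())


# Hypothetical model outputs, including a trivial sanity case and a regression case.
checks = {
    "sanity: exact label": exact_match("refund\n", "refund"),
    "regex: order id present": regex_match(
        "Your order ORD-10492 has shipped.", r"\bORD-\d{5}\b"
    ),
    "schema: clause extraction": has_schema(
        '{"clause": "termination", "page": 4}', {"clause": str, "page": int}
    ),
    "regression: refund tool call": tool_call_matches(
        {"name": "issue_refund", "arguments": {"order_id": "ORD-10492", "amount": 20}},
        "issue_refund",
        {"order_id": "ORD-10492"},
    ),
}

for name, ok in checks.items():
    print(f"{'PASS' if ok else 'FAIL'}  {name}")
```

If the sanity line ever prints FAIL, suspect the harness before the model; if the regression line does, a previously fixed behavior has come back.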
Ten specific cases beat a thousand generic ones. If you can't justify why each case is in your set, it shouldn't be in your set. Quality of cases dominates quantity; a noisy 1,000-case scoreboard leads to worse decisions than a sharp 20-case one.
Online vs offline, and what to read next.
One distinction worth carrying out of this entry: offline evals run against a curated set in a CI-like loop. You control the inputs, you know the expected outputs, and you run them before shipping. Online evals run against real production traffic. You don't know the right answer in advance, so you measure proxies (resolution rate, user thumbs, downstream conversion, override rate). You need both. Offline evals catch regressions before they ship; online evals tell you whether the system actually works for users once it has shipped. Neither replaces the other.
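The difference shows up in the shape of the code. Below is a minimal sketch, with hypothetical names throughout: offline_gate turns known pass/fail results into a ship or don't-ship decision, while online_proxies can only aggregate signals logged from traffic. The event fields ("resolved", "overridden", "thumbs") are an assumed log format, not a standard one.

```python
def offline_gate(results: list[bool], min_pass_rate: float) -> bool:
    """Offline: results come from a curated set where the expected output is known."""
    return sum(results) / len(results) >= min_pass_rate


def online_proxies(events: list[dict]) -> dict[str, float]:
    """Online: no gold answer exists, so aggregate proxy signals from logged traffic."""
    n = len(events)
    rated = [e for e in events if e.get("thumbs") is not None]
    thumbs_up = sum(e["thumbs"] == "up" for e in rated)
    return {
        "resolution_rate": sum(e["resolved"] for e in events) / n,
        "override_rate": sum(e["overridden"] for e in events) / n,
        # Only users who left a rating count toward the thumbs-up rate.
        "thumbs_up_rate": thumbs_up / len(rated) if rated else float("nan"),
    }


# Offline: 3 of 4 curated cases pass, and the (illustrative) bar is 75%.
ship = offline_gate([True, True, True, False], min_pass_rate=0.75)

# Online: a handful of made-up production events.
events = [
    {"resolved": True, "overridden": False, "thumbs": "up"},
    {"resolved": True, "overridden": False, "thumbs": None},
    {"resolved": False, "overridden": True, "thumbs": "down"},
    {"resolved": True, "overridden": False, "thumbs": "up"},
]
metrics = online_proxies(events)

print("ship:", ship)
print(metrics)
```

The offline function returns a decision; the online one returns numbers to watch over time, since without a gold answer there is nothing to compare each response against.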
The deep version of why evaluating agents specifically is hard — non-determinism, multi-step error compounding, no single gold answer, path-dependence, eval cost, dataset rot — lives at Operations · Why evaluating agents is hard. The practice of building eval sets into your development loop (CI gates, golden trajectories, the production-to-eval flywheel) is eval-driven agent development. Read them in that order.