Online vs offline evals: each catches what the other can't see.
Offline evals catch regressions before deploy on a fixed, repeatable set; online evals catch the user behavior you couldn't fake. Either alone is a half-measurement, and both have characteristic lies. This essay is about what each kind of eval is good at, where each one quietly misleads you, the bridges between them (shadow mode, replay, dark launches), and how to budget the two against each other.
Offline: deterministic, repeatable, and a lie about the user.
An offline eval runs your agent against a fixed dataset of tasks, in a controlled environment, before any real user sees the change. The strengths are real: you can re-run the same set against any candidate (model bump, prompt edit, tool swap) and read the delta; you can gate CI on it; you can drive variance down by sampling each task k times; you can produce a number a release engineer trusts enough to flip a flag.
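The "sample each task k times, read the delta, gate CI" loop can be sketched in a few lines. This is a minimal illustration, not a reference harness: the `Agent` callable, the stub agent, the task ids, and the `max_drop` tolerance are all hypothetical stand-ins for whatever your own runner and grader provide.

```python
import random
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical agent interface: takes a task id, returns True if the run passed.
Agent = Callable[[str], bool]


def make_stub_agent(p_success: float, seed: int) -> Agent:
    """Stand-in for a real agent: each run passes with probability p_success."""
    rng = random.Random(seed)
    return lambda task: rng.random() < p_success


def pass_rates(agent: Agent, tasks: List[str], k: int) -> Dict[str, float]:
    """Run every task k times and return the per-task pass rate."""
    return {t: mean(1.0 if agent(t) else 0.0 for _ in range(k)) for t in tasks}


def gate(baseline: Dict[str, float], candidate: Dict[str, float],
         max_drop: float = 0.02) -> bool:
    """Green if the candidate's mean pass rate is at most max_drop below baseline."""
    delta = mean(candidate.values()) - mean(baseline.values())
    return delta >= -max_drop


tasks = ["refund-flow", "password-reset", "invoice-lookup"]  # hypothetical task ids
baseline = pass_rates(make_stub_agent(0.90, seed=1), tasks, k=20)
candidate = pass_rates(make_stub_agent(0.85, seed=2), tasks, k=20)
print("promote" if gate(baseline, candidate) else "block")
```

Raising k narrows the noise on each per-task rate at a linear cost in machine time, which is the trade the gate is making.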
The lies offline evals tell you are well-catalogued. Selection bias: the tasks in the set are the ones someone thought to write, not the distribution your real users generate. Distribution drift: the set is a snapshot of last quarter's users, and product changes have moved the median request since then. Contamination: a public benchmark you trust has bled into pretraining; a frontier model may have memorized the answer key. Eval-set rot, the inverse of distribution drift, where the decay is on your side rather than the users': tasks reference APIs that have changed, web pages that were redesigned, tickets that were closed. The bar is set out in why-agent-eval-is-hard; offline eval clears most of it well, except the parts that depend on the real-user distribution.
Online: real users, real distribution, and a lie about what they wanted.
An online eval observes the live system: real users, real prompts, real outcomes, scored either by an automatic predicate (the user accepted the suggestion, the task closed, the support ticket did not re-open) or by sampling for human review. The strengths are the inverse of offline's lies: the distribution is the real one; novel user behavior shows up; the kinds of bug that only manifest on long-tail prompts surface naturally.
The lies online evals tell you are subtler and harder to catch. The Hawthorne effect: users who know they are being observed behave differently. Novelty: new features get a curiosity bump that fades within weeks; an early online win is often just first-week interest, not durable value. Satisficing vs success: a user accepting the agent's suggestion does not mean the suggestion was good; it means it was good enough that they didn't bother to push back. Survivorship: the users who hated the agent the most have already left, and they are not in your online numbers. Most dangerously, an online eval cannot tell you about regressions you have not yet shipped — you can only score what is in front of users.
The cleanest online metric is one a user cannot game by ignoring the agent: did the underlying task complete, did the ticket stay closed, did the user not re-ask the same thing within the session. "Thumbs up" and "did not retry" are weak signals; "the outcome the agent was supposed to produce actually happened" is the real one.
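One way to keep the weak signals from leaking into the score is to record them but build the predicate only from outcomes. The sketch below assumes a hypothetical `SessionOutcome` record; the field names are illustrative and would map onto whatever your product actually logs.

```python
from dataclasses import dataclass


@dataclass
class SessionOutcome:
    """Hypothetical per-session record assembled from product logs."""
    task_completed: bool       # the underlying task actually finished
    ticket_reopened: bool      # the ticket came back after being closed
    repeated_question: bool    # the user re-asked the same thing in-session
    thumbs_up: bool = False    # weak signal: recorded, deliberately not scored


def outcome_success(s: SessionOutcome) -> bool:
    """Score on what happened to the task, not on how the user reacted."""
    return s.task_completed and not s.ticket_reopened and not s.repeated_question


liked_but_failed = SessionOutcome(task_completed=False, ticket_reopened=False,
                                  repeated_question=False, thumbs_up=True)
print(outcome_success(liked_but_failed))
```

A session with a thumbs up and no completed task scores as a failure here, which is the point of the distinction.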
You need both, because neither alone covers the failure surface.
Offline catches regressions before they ship; online catches what offline could not have known to test. The two together form the gate-and-monitor pair that lets you actually move:
- Offline gates promotion: without a green offline run, a new release triple (in the sense of rollout-and-versioning) is not promoted beyond shadow.
- Online watches behavior after promotion: anomaly-detect on the live signal, and escalate to a flag flip (or a kill switch) if a metric regresses on real users.
- The feedback loop closes: production traces that surfaced a novel failure mode become tomorrow's offline tasks; this is the "production-to-eval flywheel" from eval-driven-agent-development.
The team that skips offline because "we'll catch it online" learns about every regression through a customer ticket. The team that skips online because "the offline suite passes" ships a change that scores +3 in CI and −15 in real usage, and finds out a week later.
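The monitoring half of the pair can be as simple as a rolling success rate compared against the rate the offline gate promised. The sketch below is one possible shape under stated assumptions: `OnlineMonitor`, its thresholds, and the idea of returning a boolean "trip" signal are all hypothetical, and a production detector would usually be more careful about seasonality and sample size.

```python
from collections import deque


class OnlineMonitor:
    """Rolling-window check of a live success metric against a baseline rate."""

    def __init__(self, baseline_rate: float, max_drop: float = 0.05,
                 window: int = 200, min_samples: int = 50) -> None:
        self.baseline_rate = baseline_rate
        self.max_drop = max_drop
        self.min_samples = min_samples
        self.outcomes = deque(maxlen=window)  # oldest outcomes fall off the left

    def record(self, success: bool) -> bool:
        """Record one scored outcome; return True if the kill switch should trip."""
        self.outcomes.append(success)
        if len(self.outcomes) < self.min_samples:
            return False  # too little data to call a regression
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate < self.baseline_rate - self.max_drop


monitor = OnlineMonitor(baseline_rate=0.90, window=10, min_samples=5)
for ok in [True, True, False, False, False, False]:
    if monitor.record(ok):
        print("regression on live traffic: flip the flag")
        break
```

The `min_samples` floor is a design choice: without it, a single early failure would trip the switch on a release that is actually fine.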
The bridges: shadow mode, replay, dark launches.
The two modes are not as far apart as the labels suggest. Three patterns sit on the bridge between them and are usually higher-leverage than either pure mode:
- Shadow mode: the candidate runs in parallel with the live system on real traffic, but only the live system's output reaches the user. You compare the two offline, on the real distribution, with zero user risk. It is the best first gate for a model snapshot bump.
- Replay: captured production traces are re-run against a candidate as if they were offline tasks. The distribution is real; the cost is paid once; and you can stamp every replay with its original release triple to debug behavior drift.
- Dark launches: the new behavior runs live on a tiny slice of internal traffic, where its output is scored but never acted on. Use this to validate a scoring rubric or a judge before you bet a real release on it.
All three give you "online distribution at offline risk" — the part of the eval surface where most of the real work gets done.
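Shadow mode in particular is easy to picture as code. The following is a rough sketch under simplifying assumptions: `Handler`, `serve_with_shadow`, and the in-memory log are hypothetical, the candidate is called synchronously for clarity (a real deployment would normally run it off the request path so it adds no latency), and exact string inequality stands in for a real comparison or judge.

```python
from typing import Callable, List, Optional, Tuple

Handler = Callable[[str], str]                 # hypothetical request -> response
ShadowLog = List[Tuple[str, str, Optional[str]]]


def serve_with_shadow(request: str, live: Handler, candidate: Handler,
                      log: ShadowLog) -> str:
    """Answer from the live system; run the candidate only to record its output."""
    live_out = live(request)
    try:
        shadow_out: Optional[str] = candidate(request)
    except Exception:
        shadow_out = None  # a crashing candidate must never affect the user
    log.append((request, live_out, shadow_out))
    return live_out  # the user only ever sees the live output


def disagreement_rate(log: ShadowLog) -> float:
    """Fraction of logged requests where the candidate differed from live."""
    if not log:
        return 0.0
    return sum(1 for _, live_out, shadow_out in log if live_out != shadow_out) / len(log)


log: ShadowLog = []
for req in ["reset my password", "cancel order"]:
    serve_with_shadow(req, live=str.upper, candidate=str.title, log=log)
print(disagreement_rate(log))
```

Replay has the same shape with the loop fed from captured traces instead of incoming traffic, and with no live response to return.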
The cost asymmetry — and how to budget it.
The two modes burn different things. Offline eval burns engineering hours to author and maintain the task set, machine time per eval run, and the cost of grading (judge or human). Online eval burns real tokens on real users — every "scored" request costs the same as a normal one — plus the operational cost of the scoring pipeline and the trust cost of any wrong action that landed before the score caught it.
The budget that holds up has four tiers: a fast offline gate on every push (cheap, narrow, catches obvious regressions); the full offline suite nightly or per release candidate (broader, decisive on promotion); continuous online scoring on a sampled fraction of production (1–10%, sized to the cost); and full online review, with a human in the loop, of any cohort flagged by an online anomaly. The mistake to avoid is paying the price of "online for everything, always": most regressions can be caught for an order of magnitude less with the offline+sample combination, and the savings buy you the human review that actually catches the subtle ones.
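For the sampled online tier, a deterministic hash of the request id is one common way to pick the scored fraction, so that the same request is always in or out of the sample no matter which service asks. The function name and the id format below are assumptions for illustration.

```python
import hashlib


def in_scoring_sample(request_id: str, fraction: float) -> bool:
    """Deterministically place a request in the scored sample.

    Hashes the id to a 64-bit bucket and compares it against the fraction,
    so roughly `fraction` of all ids land in the sample.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    return bucket < int(fraction * 2**64)


# Hypothetical ids; score about 5% of production traffic.
sampled = [rid for rid in (f"req-{i}" for i in range(1000))
           if in_scoring_sample(rid, 0.05)]
print(len(sampled))
```

Because the decision depends only on the id, raising the fraction from 1% to 10% keeps every previously sampled request in the sample, which makes before-and-after comparisons cleaner than re-rolling a random number per request.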