Evaluating memory quality — and the pitfalls it catches.
Memory systems fail silently. The agent still answers, still acts, still looks fine in a demo — while quietly recalling a stale fact, obeying a poisoned memory, or having forgotten a constraint during compaction. You cannot tune what you do not measure. This essay defines memory-specific metrics and the failure modes they exist to catch.
Why generic eval misses memory bugs.
End-to-end task-success eval will eventually catch a broken memory system, but late, noisily, and without telling you which layer broke. Memory failures are diffuse: a wrong recall on turn 12 surfaces as a bad action on turn 30. You need eval that isolates the memory stack from the agent's reasoning, with probes targeted at each operation: write, recall, compaction, update.
Principle: every memory operation that can lose or distort information gets its own dataset and its own metric. Aggregate task success is the integration test; these are the unit tests.
The core memory metrics.
- Recall@k (memory) — with a known fact written N turns ago and a query that should surface it, is it in the top k? The single most important number. Track it as a function of store size and turn-distance — a system that is fine at 100 memories and broken at 50,000 is the normal failure trajectory.
- Recall precision / distractor rate — of the memories injected into context, what fraction were actually relevant? High recall with low precision poisons reasoning with confident-looking junk.
- Constraint survival — after a compaction, do all hard constraints still hold in the resulting state? Measured as a strict pass/fail per constraint per compaction.
- Staleness rate — when a fact has been updated, what fraction of subsequent recalls return the old value? Directly measures the update path.
- Write precision — of everything written to long-term memory, what fraction was actually durable-worthy? Low write precision is the leading indicator of future recall collapse.
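
    # eval/memory_eval.py

    def recall_at_k(mem, probes, k=5) -> float:
        """Fraction of probes whose planted fact appears in the top-k recall."""
        # probe = (fact_id, query, turns_ago) injected earlier
        if not probes:
            raise ValueError("recall_at_k needs at least one probe")
        hit = 0
        for p in probes:
            got = mem.recall(p.query, k=k)
            hit += int(p.fact_id in {g.id for g in got})
        return hit / len(probes)

    def staleness_rate(mem, updates) -> float:
        """Fraction of post-update recalls that still return the old value."""
        # update = (key, old_val, new_val); recall after update
        if not updates:
            raise ValueError("staleness_rate needs at least one update")
        stale = 0
        for u in updates:
            got = mem.recall(u.key, k=1)
            if not got:
                continue  # an empty recall is a miss, not a stale hit
            v = got[0].text
            stale += int(u.old_val in v and u.new_val not in v)
        return stale / len(updates)
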
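The remaining three metrics from the list can be sketched in the same style. This is a minimal sketch under assumptions, not part of the harness above: the function names, the relevance and durability labels, and the verbatim substring check for constraints are all hypothetical choices made for illustration.

```python
from typing import Iterable, Sequence


def recall_precision(injected_ids: Sequence[str], relevant_ids: set[str]) -> float:
    """Fraction of injected memories that were labelled relevant for the query."""
    if not injected_ids:
        return 1.0  # convention chosen here: nothing injected, nothing distracting
    return sum(1 for i in injected_ids if i in relevant_ids) / len(injected_ids)


def constraint_survival(constraints: Iterable[str], compacted_state: str) -> dict[str, bool]:
    """Strict pass/fail per constraint after one compaction.

    Assumes each constraint is a canonical string that must survive verbatim;
    a real check would probably need a structured or semantic comparison.
    """
    return {c: c in compacted_state for c in constraints}


def write_precision(written: Sequence[dict]) -> float:
    """Fraction of long-term writes a reviewer labelled durable-worthy.

    Each record is assumed to carry a boolean "durable" label.
    """
    if not written:
        return 1.0  # same convention as above for the empty case
    return sum(1 for w in written if w["durable"]) / len(written)
```

Constraint survival returns one boolean per constraint rather than a ratio, so a single dropped constraint stays visible instead of being averaged away.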
Run these against a synthetic long-horizon trajectory (hundreds of turns, planted facts, planted updates, planted constraints) on every change to the memory stack. This is the harness against which the retrieval weights in retrieval-augmented-memory and the triggers in context-compaction are tuned.
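A synthetic trajectory of this kind can be generated deterministically. The sketch below is one possible shape, assuming filler turns with probes planted at random positions; `Plant`, `build_trajectory`, and the text templates are hypothetical names and formats, not part of any existing harness.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Plant:
    turn: int   # turn index at which the text is spoken
    kind: str   # "fact", "update_old", "update_new", or "constraint"
    key: str    # identifier a probe can refer back to
    text: str   # the planted utterance


def build_trajectory(n_turns=300, n_facts=20, n_updates=5, n_constraints=5, seed=0):
    """Return (turns, plants): filler turns with planted probes at distinct positions."""
    rng = random.Random(seed)  # seeded so every harness run sees the same data
    turns = [f"filler turn {t}" for t in range(n_turns)]
    # Each update needs two slots: one for the old value, one for the new.
    slots = rng.sample(range(n_turns), n_facts + 2 * n_updates + n_constraints)
    plants = []
    for i in range(n_facts):
        code = rng.randint(1000, 9999)
        plants.append(Plant(slots.pop(), "fact", f"fact-{i}", f"The code for item {i} is {code}."))
    for i in range(n_updates):
        old_t, new_t = sorted((slots.pop(), slots.pop()))  # old value always comes first
        plants.append(Plant(old_t, "update_old", f"pref-{i}", f"Setting {i} is old-{i}."))
        plants.append(Plant(new_t, "update_new", f"pref-{i}", f"Setting {i} is now new-{i}."))
    for i in range(n_constraints):
        plants.append(Plant(slots.pop(), "constraint", f"rule-{i}", f"Hard constraint {i}: never perform action {i}."))
    for p in plants:
        turns[p.turn] = p.text
    return turns, sorted(plants, key=lambda p: p.turn)
```

Fixing the seed keeps runs comparable across changes to the memory stack, so a moved number reflects the change and not the data.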
Pitfall: context poisoning.
A wrong, hallucinated, or adversarial statement gets written to long-term memory, is later recalled with the same authority as a true memory, and propagates — the agent reasons from it, produces output consistent with it, which may itself get written back. A self-reinforcing falsehood.
- Gate the write path: do not persist model-generated claims as semantic fact without grounding. An inference is stored as inferred, not as fact (the provenance tagging from retrieval-augmented-memory).
- Never put recalled memory in the authoritative system block: it is evidence, not policy. Poisoning a hint is recoverable; poisoning an instruction is obedience.
- Eval probe: inject a known-false memory, then measure how often it surfaces in answers and whether the agent ever writes a derived falsehood back. Poisoning that cannot propagate back to the store is contained.
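The three defenses above can be sketched together. This is an illustrative sketch only: `MemoryRecord`, `gate_write`, `render_recalled`, and `poison_surface_rate` are hypothetical names, the two provenance labels are an assumed scheme, and the probe uses a plain substring match where a real eval would likely need a semantic judge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemoryRecord:
    text: str
    provenance: str                 # "observed" (grounded) or "inferred" (model-generated)
    source: Optional[str] = None    # where the claim was grounded, if anywhere


def gate_write(text: str, source: Optional[str]) -> MemoryRecord:
    """Store a claim as fact only when it carries a grounding source."""
    if source:
        return MemoryRecord(text, "observed", source)
    return MemoryRecord(text, "inferred")


def render_recalled(records: list[MemoryRecord]) -> str:
    """Format recalled memories as labelled evidence, meant for the context, not the system block."""
    lines = ["Recalled memory (evidence, may be wrong):"]
    for r in records:
        lines.append(f"- [{r.provenance}] {r.text}")
    return "\n".join(lines)


def poison_surface_rate(answers: list[str], false_claim: str) -> float:
    """Share of answers that repeat the planted false claim (case-insensitive substring)."""
    if not answers:
        return 0.0
    return sum(false_claim.lower() in a.lower() for a in answers) / len(answers)
```

The same substring check can be run over the store after the probe to see whether a derived falsehood was written back.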
Pitfall: stale memory.
A fact was true when written and is now wrong: the user changed teams, the API version bumped, the policy was revised. The memory system confidently recalls the obsolete value. An append-only vector store guarantees this failure, because old values accumulate next to new ones; the in-place key-value update (memory-stores) prevents it.
- Update in place for semantic memory: one key, one current value. New observation overwrites, it does not accumulate.
- Timestamp and prefer-recent on conflict: when two memories disagree, let recency break the tie, but surface the disagreement rather than silently picking one.
- Eval probe: the staleness-rate metric above. A non-zero staleness rate means the update path is broken — a P0, because the agent is now confidently wrong.
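A minimal sketch of the update path described above, assuming a plain in-process dictionary stands in for the key-value store. `SemanticStore`, `Entry`, and the `superseded` field are hypothetical; the point is only that a newer observation overwrites, an older one is ignored, and a disagreement is reported and not hidden.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    value: str
    written_at: float                  # timestamp of the observation
    superseded: Optional[str] = None   # previous value, kept so the conflict stays visible


class SemanticStore:
    """One key, one current value; newer observations overwrite older ones."""

    def __init__(self) -> None:
        self._data: dict[str, Entry] = {}

    def put(self, key: str, value: str, written_at: float) -> bool:
        """Write an observation. Returns True when it replaced a different value."""
        cur = self._data.get(key)
        if cur is not None and written_at < cur.written_at:
            return False  # older than what is stored: recency wins, keep current
        conflict = cur is not None and cur.value != value
        self._data[key] = Entry(value, written_at, cur.value if conflict else None)
        return conflict

    def get(self, key: str) -> Optional[Entry]:
        return self._data.get(key)
```

A caller can log or escalate whenever `put` returns True, which is one way to surface the disagreement while still serving the most recent value.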
Pitfall: retrieval drift and compaction amnesia.
Two failures that share a signature — the system looks healthy because it still returns plausible results — and so are invisible without targeted probes.
- Retrieval drift: a cue built once at task start keeps recalling early, now-irrelevant memories while the task has moved on. Caught by measuring recall@k as a function of turn-distance into the task, not just at turn 1. A curve that decays mid-task is drift. Fix: rebuild the cue from current state every turn.
- Compaction amnesia: a constraint or open loop is dropped during summarization and the agent proceeds as if it never existed. Caught by the constraint-survival and open-loop-survival checks from context-compaction. Fix: a pinned, never-compacted block for hard constraints and active loops.
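Both probes are small. The sketch below buckets recall hits by the turn at which the probe ran, flags a decaying curve, and shows the pinned-block fix in its simplest form. The names, the bucket width, and the `max_drop` threshold are assumptions for illustration, and `summarize` is whatever summarizer the caller supplies.

```python
from collections import defaultdict


def recall_by_distance(results, bucket=50):
    """results: iterable of (turn, hit) pairs, hit True when the fact was in the top k.

    Returns {bucket_start: recall}, so mid-task decay shows up as a falling curve.
    """
    hits = defaultdict(int)
    total = defaultdict(int)
    for turn, hit in results:
        b = (turn // bucket) * bucket
        total[b] += 1
        hits[b] += int(hit)
    return {b: hits[b] / total[b] for b in sorted(total)}


def drifted(curve, max_drop=0.15):
    """Flag drift when any later bucket falls more than max_drop below the first."""
    if not curve:
        return False
    values = [curve[b] for b in sorted(curve)]
    return any(values[0] - v > max_drop for v in values[1:])


def compact(pinned, history, summarize):
    """Summarize history only; the pinned block is copied through untouched."""
    return list(pinned) + [summarize(history)]
```

Because `compact` never passes the pinned entries to the summarizer, a constraint-survival check on them holds by construction; the check still matters for anything that was not pinned.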
memory eval report (synthetic 300-turn trajectory)
  recall@5 @ store=1k         0.91                ok
  recall@5 @ store=50k        0.58                FAIL  ← redundancy collapse
  recall@5 by turn-distance:  t1 0.94  t150 0.61  FAIL  ← drift
  staleness rate              0.00                ok    (kv in-place update)
  constraint survival         18/20               FAIL  ← compaction dropped 2
  write precision             0.41                WARN  ← tighten write gate
This report is the deliverable. A memory system is not "done" when it works in a ten-turn demo — it is done when these numbers hold on a synthetic long-horizon trajectory at the store size and task length you will actually run in production. Measure first; every other essay in this section is a knob you turn against this dashboard.
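One way to make "these numbers hold" enforceable is a gate that runs on every change to the memory stack. The sketch below is an assumption about how such a gate might look; `gate` and the metric keys are hypothetical, and the bounds are supplied by the caller because this essay does not prescribe a recall threshold.

```python
def gate(report: dict[str, float],
         floors: dict[str, float],
         ceilings: dict[str, float]) -> list[str]:
    """Return the names of metrics that violate their bound; an empty list means pass."""
    failures = []
    for name, floor in floors.items():
        if report[name] < floor:        # higher-is-better metrics
            failures.append(name)
    for name, ceiling in ceilings.items():
        if report[name] > ceiling:      # lower-is-better metrics
            failures.append(name)
    return failures


# Values taken from the sample report above; the 0.85 recall floor is a placeholder.
report = {"recall_at_5_store_50k": 0.58, "constraint_survival": 18 / 20, "staleness_rate": 0.0}
failures = gate(
    report,
    floors={"recall_at_5_store_50k": 0.85, "constraint_survival": 1.0},
    ceilings={"staleness_rate": 0.0},
)
print(failures)
```

Zero tolerance on staleness and a floor of 1.0 on constraint survival follow from the strict wording used for those metrics earlier; the recall floor has to come from your own production requirements.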