Evaluating Memory Quality

Deep Dive · Memory & Context Engineering

Evaluating memory quality — and the pitfalls it catches.

Memory systems fail silently. The agent still answers, still acts, still looks fine in a demo — while quietly recalling a stale fact, obeying a poisoned memory, or having forgotten a constraint during compaction. You cannot tune what you do not measure. This essay defines memory-specific metrics and the failure modes they exist to catch.

STEP 1

Why generic eval misses memory bugs.

End-to-end task-success eval will eventually catch a broken memory system, but late, noisily, and without telling you which layer broke. Memory failures are diffuse: a wrong recall on turn 12 surfaces as a bad action on turn 30. You need eval that isolates the memory stack from the agent's reasoning, with probes targeted at each operation: write, recall, compaction, update.

Principle: every memory operation that can lose or distort information gets its own dataset and its own metric. Aggregate task success is the integration test; these are the unit tests.

STEP 2

The core memory metrics.

Recall@k (memory): given a known fact written N turns ago and a query that should surface it, is the fact in the top k? This is the single most important number. Track it as a function of store size and turn-distance; a system that is fine at 100 memories and broken at 50,000 is the normal failure trajectory.
Recall precision / distractor rate: of the memories injected into context, what fraction were actually relevant? High recall with low precision poisons reasoning with confident-looking junk.
Constraint survival: after a compaction, do all hard constraints still hold in the resulting state? Measured as a strict pass/fail per constraint per compaction.
Staleness rate: when a fact has been updated, what fraction of subsequent recalls return the old value? This directly measures the update path.
Write precision: of everything written to long-term memory, what fraction was actually worth keeping long-term? Low write precision is the leading indicator of future recall collapse.

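# eval/memory_eval.py
def recall_at_k(mem, probes, k=5) -> float:
    # probe = (fact_id, query, turns_ago), planted earlier in the trajectory
    if not probes:
        raise ValueError("recall_at_k needs at least one probe")
    hit = 0
    for p in probes:
        got = mem.recall(p.query, k=k)
        hit += int(p.fact_id in {g.id for g in got})
    return hit / len(probes)

def staleness_rate(mem, updates) -> float:
    # update = (key, old_val, new_val); each recall runs after the update was applied
    if not updates:
        raise ValueError("staleness_rate needs at least one update")
    stale = 0
    for u in updates:
        got = mem.recall(u.key, k=1)
        if not got:
            continue  # nothing recalled: a recall miss, not a stale read
        v = got[0].text
        stale += int(u.old_val in v and u.new_val not in v)
    return stale / len(updates)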

Run these against a synthetic long-horizon trajectory (hundreds of turns, planted facts, planted updates, planted constraints) on every change to the memory stack. This is the harness against which the retrieval weights in retrieval-augmented-memory and the triggers in context-compaction are tuned.
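The three metrics from the list above that the snippet does not cover (recall precision, constraint survival, write precision) can be sketched in the same spirit. This is a minimal illustration, not the harness itself: the function names, the id-based labels, and the idea of expressing each constraint as a predicate over the compacted state are all assumptions made for the example.

```python
from typing import Callable, Dict, Sequence, Set


def recall_precision(injected_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Fraction of memories injected into context that were labeled relevant."""
    if not injected_ids:
        # Convention chosen for this sketch: nothing injected means nothing distracted.
        return 1.0
    hits = sum(1 for mem_id in injected_ids if mem_id in relevant_ids)
    return hits / len(injected_ids)


def distractor_rate(injected_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Complement of recall precision: the share of injected memories that were junk."""
    return 1.0 - recall_precision(injected_ids, relevant_ids)


def constraint_survival(
    compacted_state: str,
    checks: Dict[str, Callable[[str], bool]],
) -> Dict[str, bool]:
    """Strict pass/fail per constraint for one compaction.

    Each check is a hypothetical predicate that returns True when the
    constraint is still present in the compacted state.
    """
    return {name: bool(check(compacted_state)) for name, check in checks.items()}


def write_precision(written_ids: Sequence[str], durable_ids: Set[str]) -> float:
    """Fraction of long-term writes that a labeler marked as worth keeping."""
    if not written_ids:
        raise ValueError("write_precision needs at least one write")
    kept = sum(1 for mem_id in written_ids if mem_id in durable_ids)
    return kept / len(written_ids)
```

Relevance and durability labels have to come from somewhere; on a synthetic trajectory they fall out of the planting step, since you know which facts you planted and which turns were filler.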

STEP 3

Pitfall: context poisoning.

A wrong, hallucinated, or adversarial statement gets written to long-term memory, is later recalled with the same authority as a true memory, and propagates — the agent reasons from it, produces output consistent with it, which may itself get written back. A self-reinforcing falsehood.

Gate the write path: do not persist model-generated claims as semantic fact without grounding. An inference is stored as inferred, not as fact (the provenance tagging from retrieval-augmented-memory).
Never put recalled memory in the authoritative system block: it is evidence, not policy. Poisoning a hint is recoverable; poisoning an instruction is obedience.
Eval probe: inject a known-false memory, then measure how often it surfaces in answers and whether the agent ever writes a derived falsehood back. Poisoning that cannot propagate back to the store is contained.
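The eval probe described above can be sketched as follows. Everything here is hypothetical scaffolding: `ToyStore` is a word-overlap stand-in for a real memory store, and `naive_agent` / `gated_agent` are toy agents that differ only in whether they write their own output back. The point is the shape of the measurement, not the retrieval logic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    text: str
    provenance: str  # e.g. "observed", "inferred", "planted" (labels assumed for this sketch)


@dataclass
class ToyStore:
    items: List[Memory] = field(default_factory=list)

    def write(self, text: str, provenance: str) -> None:
        self.items.append(Memory(text, provenance))

    def recall(self, query: str, k: int = 5) -> List[Memory]:
        # Rank by shared-word count; a stand-in for real retrieval.
        words = set(query.lower().split())
        scored = [(len(words & set(m.text.lower().split())), m) for m in self.items]
        ranked = sorted((pair for pair in scored if pair[0] > 0), key=lambda pair: -pair[0])
        return [m for _, m in ranked[:k]]


def naive_agent(store: ToyStore, question: str) -> str:
    hits = store.recall(question, k=1)
    answer = hits[0].text if hits else "unknown"
    store.write(answer, provenance="observed")  # ungated write-back: the bug under test
    return answer


def gated_agent(store: ToyStore, question: str) -> str:
    hits = store.recall(question, k=1)
    return hits[0].text if hits else "unknown"  # model output is never persisted


def poisoning_probe(
    store: ToyStore,
    agent: Callable[[ToyStore, str], str],
    false_text: str,
    false_marker: str,
    questions: List[str],
) -> Dict[str, float]:
    """Plant a known-false memory, then count surfacing and write-backs."""
    if not questions:
        raise ValueError("poisoning_probe needs at least one question")
    store.write(false_text, provenance="planted")
    size_after_plant = len(store.items)
    marker = false_marker.lower()
    surfaced = sum(int(marker in agent(store, q).lower()) for q in questions)
    write_backs = sum(int(marker in m.text.lower()) for m in store.items[size_after_plant:])
    return {"surface_rate": surfaced / len(questions), "write_backs": write_backs}
```

With these toy agents, both surface the planted falsehood, but only `naive_agent` produces a non-zero `write_backs` count. That second number is the containment check: a surface rate you can work down over time, whereas any write-back means the falsehood is now reinforcing itself.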

STEP 4

Pitfall: stale memory.

A fact was true when written and is now wrong: the user changed teams, the API version bumped, the policy was revised. The memory system confidently recalls the obsolete value. An append-only vector store guarantees this failure, because the old value stays retrievable alongside the new one; the in-place key-value update (memory-stores) prevents it.

Update in place for semantic memory: one key, one current value. New observation overwrites, it does not accumulate.
Timestamp and prefer-recent on conflict: when two memories disagree, let recency break the tie, and surface the disagreement rather than silently picking one.
Eval probe: the staleness-rate metric above. A non-zero staleness rate means the update path is broken — a P0, because the agent is now confidently wrong.
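The first two rules above (one key, one current value; recency wins, but the disagreement is surfaced) fit in a few lines. This is a sketch under assumptions: `SemanticKV`, its `conflicts` log, and the caller-supplied `written_at` timestamp are illustrative names, not the API of the memory-stores essay.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Fact:
    value: str
    written_at: float  # caller-supplied timestamp, e.g. seconds since epoch


class SemanticKV:
    """One key, one current value. Conflicts are logged, never silently resolved."""

    def __init__(self) -> None:
        self._facts: Dict[str, Fact] = {}
        # Each entry is (key, kept_value, dropped_value).
        self.conflicts: List[Tuple[str, str, str]] = []

    def put(self, key: str, value: str, written_at: float) -> None:
        current = self._facts.get(key)
        if current is None:
            self._facts[key] = Fact(value, written_at)
            return
        incoming_is_newer = written_at >= current.written_at
        if current.value != value:
            kept, dropped = (value, current.value) if incoming_is_newer else (current.value, value)
            self.conflicts.append((key, kept, dropped))
        if incoming_is_newer:
            # Overwrite in place: the old value is no longer retrievable.
            self._facts[key] = Fact(value, written_at)

    def get(self, key: str) -> Optional[str]:
        fact = self._facts.get(key)
        return fact.value if fact is not None else None
```

Because `get` can only ever return the current value, the staleness rate for facts stored this way is zero by construction. The `conflicts` log is what you would surface to the agent or a reviewer, including the case where an older observation arrives late and is dropped.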

STEP 5

Pitfall: retrieval drift and compaction amnesia.

Two failures that share a signature — the system looks healthy because it still returns plausible results — and so are invisible without targeted probes.

Retrieval drift: a cue built once at task start keeps recalling early, now-irrelevant memories while the task has moved on. Caught by measuring recall@k as a function of turn-distance into the task, not just at turn 1. A curve that decays mid-task is drift. Fix: rebuild the cue from current state every turn.
Compaction amnesia: a constraint or open loop is dropped during summarization and the agent proceeds as if it never existed. Caught by the constraint-survival and open-loop-survival checks from context-compaction. Fix: a pinned, never-compacted block for hard constraints and active loops.
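Both checks can be made concrete with very little code. The sketch below buckets probe results by the turn at which the probe was issued, flags a curve that drops from its first bucket, and shows a compaction step that passes a pinned block through verbatim. The names, the bucket size, the `max_drop` threshold, and the `summarize` callback are all assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProbeResult:
    turn: int  # turn at which the probe query was issued
    hit: bool  # whether the planted fact was in the top k


def recall_by_turn(results: List[ProbeResult], bucket_size: int = 50) -> Dict[int, float]:
    """Recall@k per turn bucket, keyed by the first turn of each bucket."""
    totals = defaultdict(lambda: [0, 0])  # bucket start -> [hits, probes]
    for r in results:
        start = (r.turn // bucket_size) * bucket_size
        totals[start][0] += int(r.hit)
        totals[start][1] += 1
    return {start: hits / n for start, (hits, n) in sorted(totals.items())}


def drift_detected(curve: Dict[int, float], max_drop: float = 0.15) -> bool:
    """True when any later bucket falls more than max_drop below the first one."""
    if not curve:
        return False
    values = list(curve.values())
    return values[0] - min(values) > max_drop


def compact(
    pinned: List[str],
    history: List[str],
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Summarize history only; pinned constraints and open loops are kept verbatim."""
    return list(pinned) + [summarize(history)]
```

The design choice in `compact` is that the summarizer never sees the pinned block, so it cannot paraphrase or drop it. Constraint survival then only has to be checked for constraints that were never pinned.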

memory eval report  (synthetic 300-turn trajectory)
  recall@5 @ store=1k        0.91   ok
  recall@5 @ store=50k       0.58   FAIL  ← redundancy collapse
  recall@5 by turn-distance: t1 0.94  t150 0.61  FAIL ← drift
  staleness rate             0.00   ok    (kv in-place update)
  constraint survival        18/20  FAIL  ← compaction dropped 2
  write precision            0.41   WARN  ← tighten write gate

This report is the deliverable. A memory system is not "done" when it works in a ten-turn demo — it is done when these numbers hold on a synthetic long-horizon trajectory at the store size and task length you will actually run in production. Measure first; every other essay in this section is a knob you turn against this dashboard.
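One way to make "these numbers hold" enforceable is to turn the report into a gate that runs on every change. The sketch below is an assumption-heavy illustration: the `Gate` type and every threshold in `GATES` are hypothetical, chosen only so that the statuses match the sample report above. Real thresholds depend on your workload.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Gate:
    warn: float
    fail: float
    higher_is_better: bool = True

    def status(self, value: float) -> str:
        if self.higher_is_better:
            if value < self.fail:
                return "FAIL"
            if value < self.warn:
                return "WARN"
        else:
            if value > self.fail:
                return "FAIL"
            if value > self.warn:
                return "WARN"
        return "ok"


# Hypothetical thresholds, not recommendations.
GATES: Dict[str, Gate] = {
    "recall": Gate(warn=0.85, fail=0.75),
    "staleness": Gate(warn=0.0, fail=0.0, higher_is_better=False),  # any stale read fails
    "constraints": Gate(warn=1.0, fail=1.0),  # strict: every constraint must survive
    "write_precision": Gate(warn=0.60, fail=0.30),
}

# (label, gate name, value), with values copied from the sample report.
REPORT: List[Tuple[str, str, float]] = [
    ("recall@5 @ store=1k", "recall", 0.91),
    ("recall@5 @ store=50k", "recall", 0.58),
    ("recall@5 @ t150", "recall", 0.61),
    ("staleness rate", "staleness", 0.00),
    ("constraint survival", "constraints", 18 / 20),
    ("write precision", "write_precision", 0.41),
]


def evaluate(report: List[Tuple[str, str, float]], gates: Dict[str, Gate]) -> Dict[str, str]:
    """Map each report row to ok / WARN / FAIL using its gate."""
    return {label: gates[gate_name].status(value) for label, gate_name, value in report}


def should_block(statuses: Dict[str, str]) -> bool:
    """A single FAIL is enough to block a change to the memory stack."""
    return any(status == "FAIL" for status in statuses.values())
```

Calling `should_block(evaluate(REPORT, GATES))` on the sample report returns True, since three rows fail.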