Memory Stores: Vector, KV, Graph & Eviction

Deep Dive · Memory & Context Engineering

Memory stores: vector, key-value, and graph tradeoffs — plus eviction.

The backend you persist memory in determines what you can ask of it. Vector stores answer "what is similar," key-value stores answer "what is exactly this," graph stores answer "what is connected." Most production memory systems need two of the three, routed by memory kind. And whatever the backend, an unbounded store degrades retrieval — eviction is part of the design, not an afterthought.

STEP 1

Three access patterns, three backends.

Vector store (Chroma, Qdrant, pgvector, …) — approximate nearest-neighbour over embeddings. Strength: fuzzy, semantic recall — "find memories like this situation." Weakness: no exact lookup, no relational structure, recall quality degrades as the index fills with near-duplicates.
Key-value store (Redis, a table, even a dict) — exact retrieval by a known key. Strength: user:tz → "Europe/Berlin" is O(1), unambiguous, trivially updatable in place. Weakness: you must know the key; no similarity, no discovery.
Graph store (Neo4j, a triple store) — entities and typed relations. Strength: multi-hop — "who reported the bug that the migration fixed." Weakness: extraction is expensive and brittle, and most agent memory is not naturally relational.

Map backend to memory kind, not to fashion. Semantic facts → key-value (exact, updatable in place). Episodic memory → vector (recall by situation similarity). Procedural → key-value keyed by task type. Reach for a graph only when the queries are genuinely multi-hop relational — it is the highest-cost option and the one most often adopted prematurely.

STEP 2

A unified memory interface over heterogeneous backends.

The agent should not know or care which backend a memory lives in. Route on kind behind one interface; this also lets you swap backends without touching agent code.

# memory/store.py
class MemoryStore:
    def __init__(self, vec, kv, embed):
        self.vec = vec      # episodic: similarity recall
        self.kv = kv        # semantic/procedural: exact key
        self.embed = embed

    def write(self, m: Memory) -> None:
        if m.kind is Kind.EPISODIC:
            self.vec.upsert(self.embed(m.text), m.__dict__)
        else:                            # SEMANTIC / PROCEDURAL
            # Key by stable identity → update in place,
            # never accumulate contradictory copies.
            self.kv.put(m.key(), m.__dict__)

    def recall(self, cue: str, kind: Kind, k=5):
        if kind is Kind.EPISODIC:
            return self.vec.search(self.embed(cue), k=k)
        return self.kv.get_prefix(kind.value)   # scoped exact

The kv.put(m.key(), ...) path is the quiet hero. Semantic memory written by key is updated in place when a fact changes — the user's timezone is one entry, not a growing pile of contradictory observations the retriever has to disambiguate. This single design choice eliminates an entire class of "the agent believes two conflicting things" bugs.

STEP 3

The unbounded-store failure, and why eviction is mandatory.

Every memory system that "just keeps everything" degrades. The mechanism is concrete: as a vector index fills with semantically clustered near-duplicates (forty slightly different phrasings of the same event), nearest-neighbour search returns a tight cluster of redundant low-value hits and crowds out the one genuinely useful memory. Retrieval recall@k falls as the store grows. More memory makes the agent dumber.

"Storage is cheap so keep everything" is true for storage and false for retrieval. The cost of an unbounded memory store is not disk — it is collapsing recall precision and an agent that gets measurably worse the longer it runs. Bounded, curated memory beats exhaustive memory at every size.

STEP 4

Eviction and decay policies.

Borrow from cache design, with a memory-specific twist: relevance, not just recency or frequency.

TTL / age decay — episodic memory loses salience over time; below a floor it is evicted (or first summarized, then evicted). Semantic memory does not age — it is superseded by update, not by time.
LRU-by-access — the last_used refresh from retrieval-augmented-memory means never-recalled memories cool and become eviction targets. Frequently useful memories stay resident.
Salience-weighted — explicitly important memories (user said "always remember") resist eviction regardless of age or access. Salience is a floor on lifetime.
Redundancy collapse — the most memory-specific policy: periodically cluster near-duplicates and merge each cluster into one canonical memory. This is reflection (memory-types) doing double duty as garbage collection.

# memory/evict.py
def decayed_salience(m, now) -> float:
    if m.kind is not Kind.EPISODIC:
        return m.salience                 # semantic: no time decay
    age = now - m.last_used
    return m.salience * 0.5 ** (age / (14 * 86400))

def sweep(store, now, floor=0.15, max_n=50_000):
    items = store.all()
    for m in items:
        if decayed_salience(m, now) < floor:
            summarize_then_drop(store, m)   # distil, not delete
    if store.count() > max_n:                # hard cap backstop
        evict_lowest_score(store, store.count() - max_n)

Eviction should almost always be summarize-then-drop, not delete. Distil the evicted memory's durable residue into semantic memory (or a coarse archive summary) first. Hard delete only true noise — pleasantries, exact duplicates, superseded values. You are compressing the past, not erasing it.

STEP 5

Choosing, in practice.

A pragmatic default that covers the large majority of agents: key-value for semantic and procedural memory, vector for episodic memory, eviction-by-decay on the vector tier, in-place update on the key-value tier. Add a graph only when you have concrete queries that are inherently multi-hop relational and you have measured that flattening them into vector or KV loses real accuracy.

Do not start with a graph because it sounds principled, and do not start with a single giant vector store because it sounds simple — the first is over-engineered, the second silently rots. The store is infrastructure; route by memory kind, bound every tier, and let evaluating-memory tell you when a backend choice is actually costing you accuracy.