Short-term vs long-term memory: the working context and the store behind it.
Agents need two distinct memory systems with different lifetimes, costs, and failure modes. Short-term memory is the in-prompt working context — fast, fully attended, ephemeral. Long-term memory is an external store — durable, large, but only useful when the right slice is retrieved back in. Conflating the two is the most common memory-architecture mistake.
Two systems, drawn from cognitive analogy — used carefully.
The human analogy is a useful starting frame, not a blueprint. Working memory holds a handful of items under active manipulation; long-term memory is vast, durable, and accessed by cued recall. For agents the mapping is concrete:
- Short-term memory = the context window. The current system prompt, recent turns, fresh tool results, the scratchpad. It is read at full fidelity by attention on every token. It vanishes the moment the request returns — nothing persists unless you write it somewhere.
- Long-term memory = an external store. A vector index, key-value store, graph, or plain files. It survives across turns, sessions, and process restarts. The model cannot "see" it; it must be queried and the result spliced into short-term memory to have any effect.
The decisive difference is not size or durability — it is attention. Short-term memory is what the model actually computes over this turn. Long-term memory is inert until retrieval promotes a slice of it into short-term memory.
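The point about attention can be made concrete with a small sketch. The `Agent` container and its `build_prompt` and `promote` methods are hypothetical names used only for illustration. The sketch assumes the prompt is built purely from the working context, so a stored fact has no effect until it is copied in.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    # Hypothetical container; names are illustrative, not from any library.
    working_context: list[str] = field(default_factory=list)  # short-term
    long_term: dict[str, str] = field(default_factory=dict)   # external store

    def build_prompt(self) -> str:
        # Only the working context reaches the model; long_term is never read here.
        return "\n".join(self.working_context)

    def promote(self, key: str) -> None:
        # Retrieval: copy one stored item into the working context.
        if key in self.long_term:
            self.working_context.append(self.long_term[key])


agent = Agent(working_context=["User: when can I deploy?"])
agent.long_term["deploy_rule"] = "Fact: deploys are gated on the staging check."

before = agent.build_prompt()  # stored fact is absent from the prompt
agent.promote("deploy_rule")
after = agent.build_prompt()   # stored fact is now visible to the model
```

Here `before` contains only the user turn, while `after` also contains the stored fact. Nothing about the store changed in between; only promotion made the fact available to the model.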
Short-term memory: what earns a slot.
Working context is the scarcest resource you have (see context-budgeting). Treat admission as a decision with a cost. A practical policy: content earns a working-set slot only if it is needed for the next decision, not merely related to the task.
Concretely, the working set on a typical turn should hold:
- The task / current user goal — always.
- The last k turns verbatim — recency dominates relevance for in-flight reasoning.
- Open loops: pending tool calls, unfinished sub-goals, unresolved questions.
- A small, explicitly maintained scratchpad of decisions and facts the agent itself flagged as durable.
Everything else — resolved sub-tasks, stale tool output, turns older than k — is a candidate for eviction to long-term memory.
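One way to express this admission policy is as a partition of candidate items into "keep" and "evict". The sketch below is a rough illustration, not a prescribed design: the `Item` fields, the `kind` labels, and the rule that fresh tool output is treated like a recent turn are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    kind: str          # "goal", "turn", "open_loop", "scratchpad", "tool_output"
    turn_index: int    # turn on which the item was produced
    resolved: bool = False


def partition(items: list[Item], current_turn: int, k: int = 3):
    """Split items into (keep, evict) following the admission policy above."""
    keep, evict = [], []
    for item in items:
        recent = current_turn - item.turn_index < k
        if item.kind in ("goal", "scratchpad"):
            keep.append(item)    # always held
        elif item.kind == "open_loop" and not item.resolved:
            keep.append(item)    # still pending
        elif item.kind in ("turn", "tool_output") and recent:
            keep.append(item)    # within the last k turns
        else:
            evict.append(item)   # candidate for long-term memory
    return keep, evict


items = [
    Item("Ship the release", "goal", 0),
    Item("User: start the build", "turn", 1),
    Item("build log: 4,000 lines", "tool_output", 2),
    Item("Awaiting staging check", "open_loop", 3),
    Item("User: any update?", "turn", 9),
]
keep, evict = partition(items, current_turn=10, k=3)
```

With these inputs the goal, the open loop, and the latest user turn stay in the working set. The old turn and the stale build log become eviction candidates.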
```python
# memory/working_set.py
from collections import deque


class WorkingSet:
    def __init__(self, max_turns: int = 12):
        self.turns: deque = deque(maxlen=max_turns)
        self.scratchpad: list[str] = []  # durable, agent-curated

    def add_turn(self, turn: dict) -> dict | None:
        evicted = self.turns[0] if len(self.turns) == self.turns.maxlen else None
        self.turns.append(turn)
        return evicted  # caller persists it to long-term memory

    def note(self, fact: str) -> None:
        # Agent explicitly promotes a fact it must not forget.
        self.scratchpad.append(fact)
```
The eviction path is the load-bearing detail. Dropping a turn off the end of a deque without persisting it is amnesia; dropping it after writing a durable trace to long-term memory is forgetting-with-recall, which is what you want.
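The difference between the two outcomes can be sketched with a bare `deque` and a list standing in for the external store. The names `turns`, `long_term`, and `add_turn` are illustrative. For brevity the sketch persists every evicted turn, whereas a real write path would also apply a durability gate.

```python
from collections import deque

turns: deque = deque(maxlen=3)  # short-term: last 3 turns only
long_term: list[dict] = []      # stand-in for a real external store


def add_turn(turn: dict) -> None:
    # Capture the turn about to fall out of the window *before* appending,
    # because deque(maxlen=...) discards it silently.
    if len(turns) == turns.maxlen:
        long_term.append(turns[0])  # write the trace first
    turns.append(turn)              # then let the deque drop it


for i in range(5):
    add_turn({"id": i, "text": f"turn {i}"})
```

After five turns the window holds turns 2 to 4, and turns 0 and 1 sit in `long_term` rather than being lost.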
Long-term memory: when to write, when to recall.
These are two independent decisions, and most bad memory systems get both wrong by writing everything and recalling everything.
When to write
Do not persist raw transcripts indiscriminately — that is a slow leak that pollutes future retrieval. Write when content has durable value beyond this turn:
- Stable facts about the user, environment, or task ("deploys are gated on the staging check").
- Outcomes and their causes — what was tried, what worked, what failed and why (episodic).
- Reusable procedures the agent derived (procedural).
- Explicit user instructions to remember.
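A write gate along these lines might look like the sketch below. It is only a starting point under stated assumptions: the `kind` labels and the phrase list for detecting "remember this" requests are invented for illustration, and a production gate would likely use a classifier or the model itself.

```python
import re

DURABLE_KINDS = {"fact", "episodic", "procedural"}  # illustrative labels
REMEMBER = re.compile(r"\b(remember|don't forget|note that)\b", re.IGNORECASE)


def is_durable(text: str, kind: str) -> bool:
    """Return True if the content should be persisted to long-term memory."""
    if not text.strip():
        return False
    if REMEMBER.search(text):
        return True               # explicit user instruction to remember
    return kind in DURABLE_KINDS  # stable facts, outcomes, procedures


checks = [
    is_durable("deploys are gated on the staging check", "fact"),
    is_durable("Please remember I prefer tabs", "chitchat"),
    is_durable("ok thanks", "chitchat"),
]
```

The first two calls pass the gate (a stable fact and an explicit request to remember). The third is ephemera and is skipped.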
When to recall
Recall is triggered by the current need, not by a schedule. Before a turn, form a retrieval query from the active goal and pull only the top few items that clear a relevance threshold. Recalling marginally relevant memories is not free: it spends working-set budget and adds distractors that can degrade reasoning.
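```python
# memory/longterm.py
import time

# Placeholder set of durable kinds (an assumption; tune to your own taxonomy).
DURABLE_KINDS = {"fact", "episodic", "procedural", "instruction"}


class LongTermMemory:
    def __init__(self, store, embedder):
        self.store = store  # vector / kv / graph backend
        self.embed = embedder  # callable: text -> vector

    def _is_durable(self, text: str, kind: str) -> bool:
        # Write gate. Placeholder policy: non-empty text of a durable kind.
        return bool(text.strip()) and kind in DURABLE_KINDS

    def write(self, text: str, kind: str, meta: dict) -> None:
        if not self._is_durable(text, kind):
            return  # write gate: skip ephemera
        self.store.upsert(
            vector=self.embed(text),
            payload={"text": text, "kind": kind, **meta, "ts": time.time()},
        )

    def recall(self, goal: str, k: int = 5, min_score: float = 0.35) -> list:
        # Over-fetch, then filter. Assumes search() returns hits sorted by
        # descending score, each exposing a .score attribute.
        hits = self.store.search(self.embed(goal), k=k * 3)
        hits = [h for h in hits if h.score >= min_score]
        return hits[:k]  # threshold THEN truncate
```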
The single most common long-term-memory failure: an unbounded write path. Every turn is appended, the store fills with near-duplicate transcript noise, retrieval recall@k collapses, and the agent gets worse the longer it runs. The write gate is not optional.
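The crowding effect can be reproduced in a toy setting. In the sketch below, word-set overlap stands in for embedding similarity and the strings are invented. It illustrates the mechanism only and is not a benchmark.

```python
def score(query: str, text: str) -> float:
    # Toy relevance: Jaccard overlap of lowercase word sets (stand-in for embeddings).
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t)


def top_k(store: list[str], query: str, k: int) -> list[str]:
    return sorted(store, key=lambda text: score(query, text), reverse=True)[:k]


fact = "deploy only after the staging check passes"
noise = [f"user asked when can we deploy today {i}" for i in range(10)]

ungated = [fact] + noise  # every turn appended
gated = [fact]            # only durable content written

query = "when can we deploy"
fact_found_ungated = fact in top_k(ungated, query, k=3)
fact_found_gated = fact in top_k(gated, query, k=3)
```

In the ungated store, near-duplicate transcript lines echo the query wording and outrank the one useful fact, so it falls out of the top 3. In the gated store the same query retrieves it.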
The promotion / demotion cycle.
Healthy agents continuously move information between the two systems. The cycle, once per turn:
1. RECALL: query long-term memory with the current goal → candidate memories
2. PROMOTE: splice the top-k (above threshold) into the working set
3. REASON: the model acts over the working set and produces a new turn
4. DEMOTE: evict the oldest / resolved working-set items
5. WRITE: persist demoted items that pass the durability gate
→ loop
This is the same fault-in / page-out loop from context-budgeting, now with an explicit durability gate on the write side and a relevance threshold on the read side. Get those two filters right and the rest of the memory stack — types, stores, compaction — is tuning. Get them wrong and no vector database will save you.
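The five steps can be sketched as a single function. Everything here is an assumption made for illustration: `run_turn`, its parameters, the word-overlap `overlap` scorer, and the `"fact:"` prefix used as a durability gate. A real agent would call a model in `reason` and a real store in place of the list.

```python
from collections import deque


def run_turn(goal, working, long_term, reason, relevance, is_durable,
             k=2, min_score=0.3):
    """One pass of recall -> promote -> reason -> demote -> write."""
    # 1. RECALL: score stored memories against the current goal.
    scored = sorted(((relevance(goal, m), m) for m in long_term), reverse=True)
    # 2. PROMOTE: keep only items above the threshold, then take the top k.
    promoted = [m for s, m in scored if s >= min_score][:k]
    # 3. REASON: the model sees promoted memories plus recent turns.
    new_turn = reason(goal, promoted + list(working))
    # 4. DEMOTE: the oldest turn leaves the window if it is full.
    demoted = working[0] if len(working) == working.maxlen else None
    working.append(new_turn)
    # 5. WRITE: persist the demoted item only if it passes the durability gate.
    if demoted is not None and is_durable(demoted):
        long_term.append(demoted)
    return new_turn


def overlap(goal, memory):
    # Toy relevance in [0, 1]: Jaccard overlap of word sets.
    g, m = set(goal.lower().split()), set(memory.lower().split())
    return len(g & m) / len(g | m)


seen = []  # records what the "model" was shown on each turn


def reason(goal, context):
    seen.append(list(context))
    return f"reply to: {goal}"


working = deque(["fact: deploys need the staging check"], maxlen=1)
long_term = []
gate = lambda text: text.startswith("fact:")

run_turn("deploy the release", working, long_term, reason, overlap, gate)
run_turn("check staging before deploys", working, long_term, reason, overlap, gate)
```

On the first turn the fact is demoted out of the one-slot window and written, because it passes the gate. On the second turn it is recalled and promoted back in front of the model, while the first reply is demoted and dropped as ephemera.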
Test the two systems independently. For short-term: does the agent keep the right things in-window across a long task? For long-term: given a known fact written 50 turns ago, is it recalled when relevant? Different metrics, different fixes — covered in evaluating-memory.
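A long-term recall check of this kind might be sketched as follows. `ToyMemory`, `overlap`, and `recalled_after` are hypothetical test helpers, and word overlap again stands in for a real similarity score. The shape of the test is the point: write a known fact, add 50 unrelated writes, then query.

```python
def overlap(query: str, text: str) -> float:
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t)


class ToyMemory:
    """In-memory stand-in for a long-term store (hypothetical, for testing only)."""

    def __init__(self):
        self.items: list[str] = []

    def write(self, text: str) -> None:
        self.items.append(text)

    def recall(self, query: str, k: int = 3, min_score: float = 0.2) -> list[str]:
        scored = [(overlap(query, t), t) for t in self.items]
        kept = sorted((p for p in scored if p[0] >= min_score), reverse=True)
        return [t for _, t in kept[:k]]


def recalled_after(memory: ToyMemory, fact: str, query: str, filler_turns: int) -> bool:
    memory.write(fact)
    for i in range(filler_turns):
        memory.write(f"filler turn {i} about unrelated topics")
    return fact in memory.recall(query)


hit = recalled_after(
    ToyMemory(),
    fact="deploys are gated on the staging check",
    query="what gates deploys staging check",
    filler_turns=50,
)
```

`hit` is `True` here because the filler shares no words with the query and is filtered out by the threshold. Against a real store the same harness would report whether the fact survives realistic intervening writes.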