Memory Types: Episodic, Semantic, Procedural

Memory types: episodic, semantic, procedural — and the scratchpad.

"Long-term memory" is not one thing. Effective agents keep at least three functionally distinct kinds of memory plus a within-task scratchpad, each written, indexed, and retrieved differently. Storing them in one undifferentiated blob is why so many memory systems retrieve confidently irrelevant junk.

STEP 1

Three durable kinds, one transient one.

The taxonomy is borrowed from cognitive science but earns its place operationally: each kind answers a different question and therefore needs a different retrieval cue.

  • Episodic — "what happened." Time-stamped records of specific events: this user asked X, the agent ran tool Y, it returned error Z, the fix was W. Answers "have I seen this situation before, and what happened?"
  • Semantic — "what is true." Distilled facts that hold until something explicitly changes them: the user's timezone, the database version (Postgres 16), the deploy policy. Answers "what do I know about the world / user / system?"
  • Procedural — "how to do it." Reusable skills and routines the agent has acquired or been given: the steps to roll back a migration, a checklist for triaging an incident. Answers "how do I accomplish this kind of task?"
  • Scratchpad (working) — transient reasoning state for the current task only: a plan, a running tally, intermediate results. Not persisted by default; it dies with the task unless something in it is promoted to one of the three durable kinds.

The test for which kind a piece of information is: ask what query should retrieve it. "When did we last deploy?" → episodic. "What is the deploy policy?" → semantic. "How do I deploy?" → procedural. If the same string would be the answer to all three, you have not distilled it yet.
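As an illustration of the test, the sketch below takes one undistilled note and splits it into three records, each answering exactly one of the three questions. The note, the version number, the date, and the `make deploy` command are all invented for the example.

```python
# Hypothetical raw note that would answer all three questions at once.
raw = ("Deployed v2.3 on 2024-05-14 after the staging check passed, "
       "as policy requires, by running `make deploy`.")

# Distilled form: one record per retrieval question (all values are made up).
distilled = [
    ("episodic",   "When did we last deploy?",   "Deployed v2.3 on 2024-05-14."),
    ("semantic",   "What is the deploy policy?", "Prod deploys require a passing staging check."),
    ("procedural", "How do I deploy?",           "Run the staging check, then `make deploy`."),
]

for kind, question, answer in distilled:
    print(f"{kind:<10} {question:<27} -> {answer}")
```

Each distilled record is shorter than the raw note and is the answer to only one of the three questions, which is the signal that the distillation is done.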

STEP 2

Type the memory at write time.

Tag every memory with its kind on write. This single field changes how it is indexed, scored, decayed, and retrieved — downstream code branches on it.

# memory/types.py
from enum import Enum
from dataclasses import dataclass

class Kind(Enum):
    EPISODIC   = "episodic"
    SEMANTIC   = "semantic"
    PROCEDURAL = "procedural"

@dataclass
class Memory:
    text: str
    kind: Kind
    ts: float            # event time (episodic) / write time
    salience: float      # importance, drives decay & eviction
    last_used: float     # for recency-aware scoring
    source: str          # turn id / user / derived

Why kind drives behavior:

  • Episodic ages: recency matters, and old episodes are decayed or summarized. Retrieval is similarity plus a recency term.
  • Semantic should not age out by time — a stable fact stays true. But it must be updated in place when it changes (the user moved timezones), not appended, or you accumulate contradictions.
  • Procedural is best keyed by task type rather than retrieved by semantic similarity to the goal: retrieve by "what kind of task am I doing," not "what words are in the goal."
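The three behaviors above could be sketched roughly as follows. Everything named here is an assumption made for illustration: the 7-day half-life, the `recency_weight` of 0.3, the string keys such as `user.timezone`, and the `SemanticStore` / `ProceduralStore` classes. The `similarity` argument stands in for whatever embedding score the real store produces.

```python
import time
from typing import Optional

HALF_LIFE_S = 7 * 24 * 3600  # assumed: an episode's recency term halves every 7 days

def episodic_score(similarity: float, ts: float, now: float,
                   recency_weight: float = 0.3) -> float:
    """Blend similarity with an exponentially decaying recency term."""
    age = max(0.0, now - ts)
    recency = 0.5 ** (age / HALF_LIFE_S)   # 1.0 when fresh, 0.5 after one half-life
    return (1 - recency_weight) * similarity + recency_weight * recency

class SemanticStore:
    """Facts keyed by subject, so an update overwrites instead of appending."""
    def __init__(self) -> None:
        self._facts: dict[str, str] = {}

    def upsert(self, key: str, text: str) -> None:
        self._facts[key] = text            # the old value is replaced, not kept alongside

    def get(self, key: str) -> Optional[str]:
        return self._facts.get(key)

class ProceduralStore:
    """Routines keyed by task type rather than by goal wording."""
    def __init__(self) -> None:
        self._by_task: dict[str, list[str]] = {}

    def register(self, task_type: str, steps: list[str]) -> None:
        self._by_task[task_type] = steps

    def lookup(self, task_type: str) -> list[str]:
        return self._by_task.get(task_type, [])

# Same similarity, different age: the fresh episode outranks the stale one.
now = time.time()
fresh = episodic_score(0.6, ts=now, now=now)
stale = episodic_score(0.6, ts=now - 4 * HALF_LIFE_S, now=now)

facts = SemanticStore()
facts.upsert("user.timezone", "Europe/Berlin")
facts.upsert("user.timezone", "America/New_York")   # the user moved; one fact remains

skills = ProceduralStore()
skills.register("rollback_migration",
                ["stop writers", "run down migration", "verify schema"])  # made-up steps
```

Note that only the episodic path involves time at all. The semantic store has no decay, and the procedural store never looks at the goal text.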

STEP 3

Reflection: turning episodes into semantics.

Raw episodes accumulate fast and individually carry little signal. The technique from the Generative Agents line of work (Park et al., 2023) is reflection: periodically run a model pass over recent episodes and synthesize higher-level, durable conclusions — promoting episodic memory into semantic memory.

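# memory/reflect.py
import time

from .types import Kind, Memory  # assumes types.py sits in the same package

REFLECT = """Here are recent events the agent observed:
{episodes}

Infer up to 3 durable, general conclusions that will
still be true and useful in future, unrelated tasks.
Output one per line. No speculation, no restatement
of a single event."""

def reflect(episodes: list[Memory], llm) -> list[Memory]:
    """Distill recent episodes into semantic memories with one model pass.

    `llm` is any callable that takes a prompt string and returns text.
    """
    joined = "\n".join(f"- {e.text}" for e in episodes)
    out = llm(REFLECT.format(episodes=joined))
    t = time.time()  # one timestamp shared by every memory in this batch
    # Drop an optional leading bullet; skip lines that end up empty.
    facts = (line.strip().lstrip("-").strip() for line in out.splitlines())
    return [
        Memory(text=fact, kind=Kind.SEMANTIC,
               ts=t, salience=0.7, last_used=t,
               source="reflection")
        for fact in facts if fact
    ]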

Reflection is the engine that keeps long-term memory small and high-signal: dozens of low-value episodes collapse into a handful of high-value semantic facts. The original episodes can then be aggressively decayed because their durable content has been extracted. Without reflection, episodic memory grows linearly forever and retrieval quality decays with it.

Run reflection on a cadence tied to volume, not wall-clock: e.g. every N new episodes, or when episodic store size crosses a threshold. Tie it to the same trigger as compaction so the two operations stay coherent.
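A volume-based trigger might look like the sketch below. The class name `ReflectionTrigger` and the batch sizes are illustrative assumptions. `reflect_fn` stands in for a function like `reflect` from memory/reflect.py with the model already bound.

```python
from typing import Callable

class ReflectionTrigger:
    """Fire reflection after every `every_n` new episodes (volume, not wall-clock)."""

    def __init__(self, reflect_fn: Callable[[list], list], every_n: int = 20) -> None:
        self.reflect_fn = reflect_fn
        self.every_n = every_n
        self.pending: list = []        # episodes not yet reflected on

    def add(self, episode) -> list:
        """Record one episode; return new semantic memories when a batch fills up."""
        self.pending.append(episode)
        if len(self.pending) < self.every_n:
            return []
        batch, self.pending = self.pending, []   # hand off the batch, start a new one
        return self.reflect_fn(batch)

# Usage with a stand-in reflect function that only reports the batch size.
trigger = ReflectionTrigger(lambda batch: [f"summary of {len(batch)} episodes"], every_n=3)
results = [trigger.add(f"event {i}") for i in range(7)]
```

With `every_n=3` and seven events, reflection fires on the third and sixth events and one episode is left pending. A quiet day produces no reflection passes at all, which is the point of tying the cadence to volume.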

STEP 4

The scratchpad: structured working state, not a chat log.

The scratchpad is short-term, but it deserves its own treatment because the failure mode is specific: agents that "think out loud" into the message history bury the actual plan under reasoning prose, and on later turns cannot find their own decisions.

Keep the scratchpad as a small, structured, explicitly rewritten object — not an ever-growing append log:

SCRATCHPAD (rewritten each turn, ~300 tok cap)
  goal:     migrate users table to add `tier` column
  plan:     [x] write migration  [x] test on staging
            [ ] get approval     [ ] apply to prod
  facts:    prod deploys gated on staging check (semantic)
            migration is reversible (verified turn 4)
  open:     awaiting approver response

Rewriting beats appending: the model regenerates the scratchpad each turn, which forces it to drop resolved items and surface open ones. At task end, a single pass extracts anything durable from facts into semantic memory and the outcome into episodic memory; the rest is discarded.
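One way to hold the scratchpad as a structured object is sketched below. The `Scratchpad` class, its field names, and the per-fact `durable` flag are assumptions for illustration. In practice the durable/non-durable decision would more likely come from a model pass at task end than from a hand-set flag.

```python
from dataclasses import dataclass, field

@dataclass
class Scratchpad:
    goal: str
    plan: list[tuple[str, bool]] = field(default_factory=list)    # (step, done)
    facts: list[tuple[str, bool]] = field(default_factory=list)   # (text, durable)
    open_items: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Serialize to the compact text form that enters the prompt."""
        steps = "  ".join(f"[{'x' if done else ' '}] {step}" for step, done in self.plan)
        facts = "; ".join(text for text, _durable in self.facts)
        return "\n".join([
            "SCRATCHPAD",
            f"  goal:  {self.goal}",
            f"  plan:  {steps}",
            f"  facts: {facts}",
            f"  open:  {'; '.join(self.open_items)}",
        ])

def end_of_task(pad: Scratchpad, outcome: str) -> list[tuple[str, str]]:
    """Promote durable content as (kind, text) pairs; everything else is dropped."""
    promoted = [("semantic", text) for text, durable in pad.facts if durable]
    promoted.append(("episodic", f"{pad.goal}: {outcome}"))
    return promoted

# Each turn builds a fresh object instead of appending to the previous one.
pad = Scratchpad(
    goal="migrate users table to add `tier` column",
    plan=[("write migration", True), ("test on staging", True),
          ("get approval", False), ("apply to prod", False)],
    facts=[("prod deploys gated on staging check", True),
           ("migration is reversible", False)],
    open_items=["awaiting approver response"],
)
promoted = end_of_task(pad, outcome="applied to prod")
```

Here the staging-gate fact is promoted to semantic memory, the outcome becomes one episodic record, and the task-specific note about reversibility is discarded with the scratchpad.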

Do not let the scratchpad and the message history both try to be the working memory. Pick one as authoritative. A common, robust choice: history is an immutable event log; the scratchpad is the single mutable "current understanding," and only the scratchpad plus the last few raw turns enter the working set.
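A minimal sketch of that arrangement, assuming turns are plain strings and a hypothetical `last_k` parameter controls how many raw turns are kept:

```python
def working_set(history: list[str], scratchpad: str, last_k: int = 3) -> list[str]:
    """Scratchpad first, then only the most recent raw turns; history is never mutated."""
    recent = history[-last_k:] if last_k > 0 else []   # guard: history[-0:] would be everything
    return [scratchpad, *recent]

history = [f"turn {i}" for i in range(1, 9)]   # append-only event log, turns 1..8
ctx = working_set(history, "SCRATCHPAD goal: migrate users table", last_k=3)
```

The event log keeps all eight turns, while the model sees the scratchpad plus turns 6 through 8. Older turns stay available for audit or later reflection without competing for context space.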

STEP 5

How the four interact.

A turn in a mature agent touches all four:

  • Procedural memory, retrieved by task type, tells the agent how to proceed.
  • Semantic memory, retrieved by goal, tells it the constraints and facts that apply.
  • Episodic memory, retrieved by situation similarity, tells it what happened the last time something like this was attempted.
  • Scratchpad holds the live plan and is rewritten as the task progresses.
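The four lookups in a single turn can be sketched as below. This is a toy under stated assumptions: `overlap` is a word-overlap stand-in for real embedding similarity, the stores are plain Python containers, and the function name `assemble` and all sample data are invented for the example.

```python
def overlap(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of lowercase word sets (stand-in for embeddings)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def top_k(items: list[str], cue: str, k: int) -> list[str]:
    """The k items most similar to the cue."""
    return sorted(items, key=lambda m: overlap(m, cue), reverse=True)[:k]

def assemble(task_type: str, goal: str, situation: str,
             procedures: dict[str, str], facts: list[str],
             episodes: list[str], scratchpad: str, k: int = 1) -> dict[str, list[str]]:
    """Pull each memory kind with its own cue and return the pieces by kind."""
    return {
        "procedural": [procedures[task_type]] if task_type in procedures else [],
        "semantic":   top_k(facts, goal, k),          # cue: the goal
        "episodic":   top_k(episodes, situation, k),  # cue: the current situation
        "scratchpad": [scratchpad],                   # always included as-is
    }

procedures = {"schema_migration": "write migration; test on staging; get approval; apply to prod"}
facts = ["prod deploys are gated on the staging check", "the user's timezone is UTC"]
episodes = ["migration on orders table failed: lock timeout", "user asked for a weekly report"]

ctx = assemble(
    task_type="schema_migration",
    goal="apply migration to prod",
    situation="migration on users table",
    procedures=procedures, facts=facts, episodes=episodes,
    scratchpad="SCRATCHPAD goal: migrate users table",
)
```

Each kind is fetched with a different cue: an exact key for procedural, the goal for semantic, the situation for episodic, and no lookup at all for the scratchpad.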

Separating them is not bureaucratic neatness — it is what lets each be retrieved by the right cue and decayed on the right schedule. The next essays cover how retrieval actually surfaces these (retrieval-augmented-memory), how the stores are kept bounded (context-compaction), and which backends fit which kind (memory-stores).