Engineering the Context Window

Deep Dive · Memory & Context Engineering

Engineering the context window: budgeting, ordering, and position effects.

The context window is the only memory an LLM truly has at inference time. Everything else — vector stores, scratchpads, databases — exists to decide what goes into that window. Treat the window as a scarce, engineered resource with an explicit budget, not as a bag you keep stuffing until it overflows.

STEP 1

The context problem, stated precisely.

At inference time an LLM is, apart from sampling randomness, a pure function of its input tokens. It has no hidden state that survives between calls. A long-running agent, one that browses, calls tools, and reasons over many turns, generates an unbounded trajectory: messages, tool results, observations, intermediate reasoning. The window is finite, whether it holds 200K tokens or 1M. The entire discipline of context engineering is the management of that mismatch.

Three forces act against you as the trajectory grows:

Hard limit: past the window size, the request fails or silently truncates.
Cost & latency: input tokens are typically billed linearly, and self-attention compute in a standard transformer grows quadratically with sequence length; a 150K-token prompt is slow and expensive on every single turn, not once.
Quality decay: even well within the limit, models attend unevenly. Relevant facts buried in the middle of a huge context are recalled much less reliably than the same facts placed near the start or end. This is the "lost in the middle" effect.
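The "every single turn" part of the cost force is easy to underestimate. Here is a minimal sketch of the arithmetic, assuming the whole prompt is resent each turn with no prompt caching and that the trajectory grows by a fixed number of tokens per turn. The figures are illustrative assumptions, not measurements.

```python
def cumulative_input_tokens(base: int, growth_per_turn: int, turns: int) -> int:
    """Total input tokens sent across `turns` calls when the full prompt
    is resent every turn and grows by `growth_per_turn` after each one."""
    total = 0
    prompt = base
    for _ in range(turns):
        total += prompt            # this turn pays for the whole prompt
        prompt += growth_per_turn  # tool results and messages accumulate
    return total


# Hypothetical agent: 5K-token starting prompt, 3K new tokens per turn, 50 turns.
unbounded = cumulative_input_tokens(base=5_000, growth_per_turn=3_000, turns=50)

# The same 50 turns with the prompt held at a flat 40K-token budget.
capped = 40_000 * 50
```

In this made-up run the unbounded agent ends with a 152K-token prompt and sends 3,925,000 input tokens in total, against 2,000,000 for the capped one. The gap widens with every extra turn because the uncapped total grows quadratically in the number of turns.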

A bigger context window does not solve the context problem. It raises the ceiling and can make the quality-decay problem worse, because now you can stuff 800K tokens of mostly irrelevant history into every turn and watch accuracy quietly fall.

STEP 2

Give every turn an explicit token budget.

Before any clever retrieval or summarization, you need a number. Decide, per turn, how many tokens each category of content is allowed. Everything else gets compacted, dropped, or moved to external memory.

# context/budget.py
from dataclasses import dataclass

@dataclass
class ContextBudget:
    total: int            # model window minus a safety margin
    reserve_output: int   # tokens kept free for the response
    system: int           # instructions, persona, policies
    tools: int            # tool schemas / definitions
    long_term: int        # retrieved memories & documents
    working: int          # recent turns / scratchpad

    def input_budget(self) -> int:
        # Tokens available for the prompt once the response is reserved.
        return self.total - self.reserve_output

    def check(self) -> None:
        # Fail loudly if the category ceilings cannot all fit together.
        used = self.system + self.tools + self.long_term + self.working
        if used > self.input_budget():
            raise ValueError(
                f"over budget by {used - self.input_budget()} tok")

# Example: 200K window, conservative split. The ceilings sum to 187,000
# of the 192,000 input tokens, leaving 5,000 of headroom for the task turn.
BUDGET = ContextBudget(
    total=200_000, reserve_output=8_000,
    system=3_000, tools=4_000,
    long_term=60_000, working=120_000,
)
BUDGET.check()  # validate the split once, at import time

The exact numbers are workload-specific and you will tune them with evals. The discipline is what matters: every category has a ceiling, and the assembler is forced to make a choice when a category overflows instead of letting the prompt grow without bound.
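One way to make that forced choice concrete is a function that takes a category's items and its ceiling and returns both what stays and what was evicted. This is a sketch only: `enforce_ceiling` and the whitespace-based `count_tokens` are hypothetical names, and a real system would count with the model's own tokenizer.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Assumption: whitespace split as a crude stand-in for a real tokenizer.
    return len(text.split())


def enforce_ceiling(
    items: list[str],
    ceiling: int,
    count: Callable[[str], int] = count_tokens,
) -> tuple[list[str], list[str]]:
    """Keep the newest items that fit under `ceiling`; return (kept, evicted).

    `items` is ordered oldest to newest. Walking backwards keeps the most
    recent content and stops at the first item that does not fit, so the
    kept list is always a contiguous tail of the input.
    """
    kept: list[str] = []
    used = 0
    for item in reversed(items):
        cost = count(item)
        if used + cost > ceiling:
            break
        kept.append(item)
        used += cost
    kept.reverse()
    evicted = items[: len(items) - len(kept)]
    return kept, evicted


turns = ["one two three", "four five", "six seven eight nine", "ten"]
kept, evicted = enforce_ceiling(turns, ceiling=6)
```

Returning the evicted list is the point of the design: it is the material that gets compacted, summarized, or written to external memory instead of silently disappearing.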

STEP 3

Order content for position, not for convenience.

Transformers do not weight all positions equally. Empirically, content at the start and end of the context tends to be recalled more reliably than content in the middle, though the size of the effect varies by model. Three practical consequences:

Stable, authoritative content goes at the front: system instructions, the task definition, tool contracts. These also benefit from prompt caching because they do not change between turns.
The most decision-relevant content goes at the end: the current question, the freshest tool results, the top retrieved chunks. This is the last thing the model reads before it generates.
Bulk, lower-confidence material goes in the middle — and you accept it may be under-attended, so it should never be the sole carrier of a critical fact.
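For retrieved material, the ordering rule reduces to a sort. Here is a small sketch, assuming each chunk arrives with a relevance score from the retriever. The `Chunk` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float  # assumption: higher means more relevant


def order_best_last(chunks: list[Chunk], top_k: int) -> list[Chunk]:
    """Keep the `top_k` highest-scoring chunks and emit them in ascending
    score order, so the strongest chunk sits closest to the end of the prompt."""
    best = sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]
    return sorted(best, key=lambda c: c.score)


retrieved = [Chunk("a", 0.91), Chunk("b", 0.40), Chunk("c", 0.75), Chunk("d", 0.10)]
ordered = order_best_last(retrieved, top_k=3)
```

With these scores the lowest-ranked chunk `d` is dropped and the remaining three come out as `b`, `c`, `a`, which places the best match right before whatever follows the retrieved block.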

A useful assembly template, front to back:

[ system + policies ]          ← stable, cached, authoritative
[ tool definitions ]           ← stable, cached
[ retrieved long-term memory ] ← bulk; ranked best-last
[ compacted older turns ]      ← summary, not raw
[ recent working turns ]       ← verbatim, high fidelity
[ current user / task turn ]   ← last thing the model reads

If a fact is critical — a constraint the agent must never violate — do not rely on it surviving in the middle of a 100K-token prompt. Restate it in the system block and immediately before the action that depends on it.
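A sketch of that restatement, assuming the prompt is assembled from plain strings. The function name, the "Hard constraints" label, and the "REMINDER" wording are illustrative choices, not a prescribed format.

```python
def build_prompt(system: str, body: str, task: str, constraints: list[str]) -> str:
    """Place critical constraints in the system block and again just before the task."""
    if not constraints:
        return "\n\n".join([system, body, task])
    bullet_list = "\n".join(f"- {c}" for c in constraints)
    front = f"{system}\n\nHard constraints:\n{bullet_list}"
    reminder = f"REMINDER before acting:\n{bullet_list}"
    return "\n\n".join([front, body, reminder, task])


prompt = build_prompt(
    system="You are a deployment agent.",
    body="<bulk logs and retrieved documents go here>",
    task="Decide whether to roll out the pending build.",
    constraints=["Never deploy to production without an approved change ticket."],
)
```

The constraint costs a few dozen tokens twice over, which is cheap insurance compared with a rule that only appears once, tens of thousands of tokens away from the decision.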

STEP 4

Measure utilization, not just whether it fit.

"It didn't error" is not success. Instrument the assembler so every turn records how the budget was spent: how many tokens each category received, and how much of each ceiling was actually filled.

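# context/assemble.py
# ntok, fit, compact, and log_metrics are helpers assumed to live elsewhere:
# ntok counts tokens in a string, fit returns the items that fit under a
# ceiling as one string, compact summarizes whatever did not fit.
def assemble(budget, system, tools, memories, history, task):
    parts, used = [], {}

    parts.append(system);           used["system"] = ntok(system)
    parts.append(tools);            used["tools"]  = ntok(tools)

    # Fill long-term up to its ceiling, best-ranked LAST.
    # memories must arrive sorted worst-to-best so keep="tail" keeps the best.
    mem = fit(memories, budget.long_term, keep="tail")
    parts.append(mem);              used["long_term"] = ntok(mem)

    # Working set: verbatim recent, compact the overflow.
    work = fit(history, budget.working, keep="tail",
               overflow=compact)
    parts.append(work);             used["working"] = ntok(work)

    parts.append(task)              # always last, never dropped
    used["task"] = ntok(task)       # no ceiling, but it still costs tokens

    log_metrics(used, budget)       # for offline analysis
    return "\n\n".join(parts)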
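The assembler above leans on `ntok`, `fit`, `compact`, and `log_metrics`, none of which are shown. One plausible shape for them is sketched here under loud assumptions: token counts are approximated by whitespace splitting, `compact` truncates instead of calling a summarizer, items are plain strings, and metrics go to the standard `logging` module.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("context")


def ntok(text: str) -> int:
    # Assumption: whitespace split stands in for the model's tokenizer.
    return len(text.split())


def compact(items: list[str], max_words: int = 12) -> str:
    # Assumption: a real system would summarize with a model call here.
    # This placeholder keeps only the first few words of each dropped item.
    clipped = [" ".join(item.split()[:max_words]) for item in items]
    return "[summary of earlier content] " + " | ".join(clipped)


def fit(
    items: list[str],
    limit: int,
    keep: str = "tail",
    overflow: Optional[Callable[[list[str]], str]] = None,
) -> str:
    """Join as many items as fit under `limit` tokens, from the kept end.

    keep="tail" retains the last items, keep="head" the first. Items that
    do not fit are passed to `overflow` (if given); its result is placed on
    the side the dropped items came from, but only if the total still fits.
    """
    ordered = list(reversed(items)) if keep == "tail" else list(items)
    kept: list[str] = []
    used = 0
    for item in ordered:
        cost = ntok(item)
        if used + cost > limit:
            break
        kept.append(item)
        used += cost
    dropped = ordered[len(kept):]
    if keep == "tail":
        kept.reverse()      # restore original (oldest-first) order
        dropped.reverse()
    if overflow is not None and dropped:
        summary = overflow(dropped)
        if used + ntok(summary) <= limit:
            if keep == "tail":
                kept.insert(0, summary)
            else:
                kept.append(summary)
    return "\n".join(kept)


def log_metrics(used: dict[str, int], budget: object) -> None:
    # `budget` is any object exposing per-category ceilings as attributes;
    # categories without a ceiling (such as the task turn) log ceiling=None.
    for name, tokens in used.items():
        ceiling = getattr(budget, name, None)
        log.info("category=%s used=%s ceiling=%s", name, tokens, ceiling)
```

Note the deliberate choice in `fit`: if even the summary would push the category over its ceiling, the summary is dropped too. The ceiling wins over completeness, which is the behavior the budget exists to enforce.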

Track these over a representative eval set:

Fill ratio per category — is long_term always saturated while working is half empty? Rebalance.
Eviction rate — how often is content dropped, and does task success correlate with what got dropped?
Position of the answer-bearing content: if the chunk that answered the question sat at relative position 0.5 (the dead middle of the prompt), expect failures and reorder.
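These three measurements are cheap to compute from per-turn logs. Here is a sketch, assuming each logged turn records tokens used per category, how many items were evicted, and the token offset of the answer-bearing chunk. The `TurnLog` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnLog:
    used: dict[str, int]            # tokens actually placed, per category
    ceilings: dict[str, int]        # budget ceiling, per category
    evicted_items: int              # how many items were dropped this turn
    answer_offset: Optional[int]    # token offset of the answer-bearing chunk
    prompt_tokens: int              # total prompt length in tokens


def fill_ratio(turn: TurnLog) -> dict[str, float]:
    """Fraction of each category's ceiling that was filled."""
    return {name: turn.used.get(name, 0) / ceiling
            for name, ceiling in turn.ceilings.items() if ceiling > 0}


def eviction_rate(turns: list[TurnLog]) -> float:
    """Share of turns on which at least one item was dropped."""
    if not turns:
        return 0.0
    return sum(t.evicted_items > 0 for t in turns) / len(turns)


def answer_position(turn: TurnLog) -> Optional[float]:
    """Relative position of the answer-bearing chunk: 0.0 is the very start
    of the prompt, 1.0 the very end, 0.5 the dead middle."""
    if turn.answer_offset is None or turn.prompt_tokens == 0:
        return None
    return turn.answer_offset / turn.prompt_tokens


turn = TurnLog(
    used={"long_term": 60_000, "working": 45_000},
    ceilings={"long_term": 60_000, "working": 120_000},
    evicted_items=3,
    answer_offset=52_000,
    prompt_tokens=104_000,
)
```

For this invented turn, `fill_ratio` reports `long_term` saturated at 1.0 while `working` sits at 0.375, and `answer_position` returns 0.5. Both are signals to rebalance the split and reorder the retrieved block.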

STEP 5

The mental model: context as a working set.

Borrow the term from operating systems. A process has a huge virtual address space, but the machine has far less physical RAM; the OS keeps the working set (the pages actually needed right now) resident and pages the rest to disk. An agent has an unbounded trajectory but a small context window; context engineering keeps the working set in-window and pages the rest to external memory (vector store, KV store, files), faulting it back in via retrieval when needed.
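The analogy maps onto a small data structure. The sketch below is deliberately simplistic: `ContextPager` is a hypothetical name, token counts use whitespace splitting, and "retrieval" is a keyword match over a Python dict standing in for a vector or KV store.

```python
class ContextPager:
    """Keeps a bounded working set resident and pages the rest out."""

    def __init__(self, capacity_tokens: int) -> None:
        self.capacity = capacity_tokens
        self.resident: list[str] = []         # in-window, oldest first
        self.external: dict[int, str] = {}    # stand-in for external memory
        self._next_id = 0

    @staticmethod
    def _ntok(text: str) -> int:
        return len(text.split())              # assumption: crude token count

    def _resident_tokens(self) -> int:
        return sum(self._ntok(t) for t in self.resident)

    def add(self, item: str) -> None:
        """Append a new item, paging out the oldest until the set fits."""
        self.resident.append(item)
        while self._resident_tokens() > self.capacity and len(self.resident) > 1:
            self._page_out(self.resident.pop(0))

    def _page_out(self, item: str) -> None:
        self.external[self._next_id] = item
        self._next_id += 1

    def fault_in(self, query: str, limit: int = 1) -> list[str]:
        """Return up to `limit` paged-out items sharing a word with `query`."""
        words = set(query.lower().split())
        hits = [t for t in self.external.values()
                if words & set(t.lower().split())]
        return hits[:limit]


pager = ContextPager(capacity_tokens=8)
pager.add("user prefers metric units")
pager.add("tool returned forecast for Oslo")
pager.add("user asked about packing list")
recalled = pager.fault_in("which units does the user want")
```

After the three `add` calls only the newest item is resident and the first two have been paged out. The query about units faults the stored preference back in, which is the retrieval step a real system would do with embeddings instead of word overlap.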

Everything in the rest of this section — short- vs long-term memory, memory types, retrieval-augmented memory, compaction, memory stores, evaluation — is a strategy for one of three operations: what to keep resident, what to page out, and how to fault the right thing back in. Budgeting is the accounting layer that makes those decisions explicit and measurable.

Rule of thumb: if you cannot say, in tokens, how much of your context is system vs tools vs memory vs working set on a typical turn, you are not engineering context — you are hoping. Start with the budget.