Context windows explained.
The context window is the model's entire working memory for a single request. Everything the model "knows" in that moment — system prompt, history, retrieved docs, tool outputs, the question — lives there, and it is finite. This entry explains what counts against it, what happens when you exceed it, why a large window is not free space, and the design discipline of treating context as a managed budget.
One window, everything inside it.
A model has no memory between API calls. Each request is processed from scratch over exactly the tokens you send. The context window is the maximum length of that token sequence — input plus the output the model is generating. A "200K context window" means input and output together must fit in roughly 200,000 tokens.
Critically, the window is a single shared budget. Everything competes for the same space:
CONTEXT WINDOW (e.g. 200,000 tokens total)
+--------------------------------------------------------------+
| system prompt           ~1,500 tok                           |
| tool definitions        ~2,000 tok                           |
| conversation history   ~40,000 tok  (grows every turn)       |
| retrieved documents     ~8,000 tok                           |
| current user message      ~500 tok                           |
| ............ remaining budget for the model's reply ........ |
+--------------------------------------------------------------+
The reply must fit in whatever is left. A long history can starve the model of room to answer — a failure that looks like "the model got cut off" but is really "you spent the budget on input."
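The arithmetic behind the diagram can be sketched in a few lines. This is a minimal illustration using the example figures above; the function name `reply_budget` and the dictionary keys are hypothetical, not part of any provider API.

```python
def reply_budget(window: int, parts: dict[str, int]) -> int:
    """Return the tokens left for the reply after all input parts are counted.

    A negative result means the input alone already exceeds the window.
    """
    return window - sum(parts.values())


# Figures taken from the example diagram (all values are token counts).
parts = {
    "system_prompt": 1_500,
    "tool_definitions": 2_000,
    "conversation_history": 40_000,
    "retrieved_documents": 8_000,
    "current_user_message": 500,
}

remaining = reply_budget(200_000, parts)
print(remaining)  # 148000
```

Because conversation history grows every turn, the same calculation gives a smaller answer on each request, which is why it is worth recomputing per call rather than once at design time.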
It is measured in tokens, not words or characters.
Budgeting in characters or words will mislead you. Roughly: ~4 characters per token for typical English, so ~750 words per 1,000 tokens — but code, JSON, non-English text, and whitespace all change the ratio (non-English often costs 2–4× more per word; see the LLM mental model). Never guess; count. Providers expose token-counting endpoints and tokenizer libraries so you can size inputs before sending them, which matters when you must guarantee room for the output.
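A pre-flight size check might look like the sketch below. It assumes you pass in a `count_tokens` callable wired to your provider's real tokenizer or counting endpoint; the `toy_count` function here is only a stand-in so the example runs, and it does not match any real tokenizer.

```python
import re
from typing import Callable


def fits_with_reserve(
    text: str,
    count_tokens: Callable[[str], int],
    window: int,
    reserve_for_output: int,
) -> bool:
    """True if the input plus the reserved output room fits in the window."""
    return count_tokens(text) + reserve_for_output <= window


def toy_count(text: str) -> int:
    """Stand-in counter (words and punctuation marks). NOT a real tokenizer;
    replace it with the provider's counting call in practice."""
    return len(re.findall(r"\w+|[^\w\s]", text))


# With the toy counter, "a b c" counts as 3 units.
ok = fits_with_reserve("a b c", toy_count, window=10, reserve_for_output=7)
too_big = fits_with_reserve("a b c d", toy_count, window=10, reserve_for_output=7)
print(ok, too_big)  # True False
```

Injecting the counter keeps the check independent of any one provider and makes it easy to test.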
What happens at the limit.
Three distinct failure shapes, often confused:
- Hard rejection. If input alone exceeds the window, the API returns an error before generating anything. Clean to detect, annoying in production if unhandled.
- Output truncation. Input fits but leaves little room; the model stops mid-sentence (or mid-JSON) when the combined length hits the limit or your max_tokens cap. The classic symptom: malformed JSON because generation was cut before the closing brace.
- Silent eviction. Many chat frameworks auto-trim old messages to make a long conversation fit. The request succeeds, but the model genuinely no longer sees the dropped turns. "The bot forgot what I said ten minutes ago" is usually this, not a model defect.
The dangerous one is silent eviction, because nothing errors. The model answers confidently using only the surviving context. If your framework auto-trims, know its policy — what it drops, in what order — or you will ship "amnesia" bugs you cannot reproduce.
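One way to make eviction visible is to own the trimming policy yourself and report what was dropped. The sketch below assumes a simple "drop oldest first" policy and pre-computed per-turn token counts; the `Turn` type and `trim_oldest` function are hypothetical names, not taken from any particular framework.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    text: str
    tokens: int  # token count, assumed to come from a real tokenizer


def trim_oldest(history: list[Turn], budget: int) -> tuple[list[Turn], list[Turn]]:
    """Keep the most recent turns that fit in `budget` tokens.

    Returns (kept, dropped) so the caller can log or summarize the evicted turns.
    """
    kept: list[Turn] = []
    used = 0
    # Walk from newest to oldest and stop at the first turn that does not fit.
    for turn in reversed(history):
        if used + turn.tokens > budget:
            break
        kept.append(turn)
        used += turn.tokens
    kept.reverse()  # restore chronological order
    dropped = history[: len(history) - len(kept)]
    return kept, dropped


history = [
    Turn("user", "first question", 50),
    Turn("assistant", "first answer", 30),
    Turn("user", "follow-up", 40),
]
kept, dropped = trim_oldest(history, budget=80)
print(f"evicted {len(dropped)} turn(s), {sum(t.tokens for t in dropped)} tokens")
# evicted 1 turn(s), 50 tokens
```

Returning the dropped turns, rather than discarding them, gives you something to log, alert on, or feed into a summary.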
A big window is not free space.
"Just use the 1M-token model and dump everything in" fails on three axes:
- Cost. You pay for input tokens on every turn. A 100K-token context replayed across a 20-turn conversation is 2M input tokens billed for the same material. Context is usually the most expensive resource an agent burns; prompt caching softens this for stable prefixes but does not eliminate it.
- Latency. More input tokens means more compute before the first output token. Large contexts can add seconds to time-to-first-token.
- Quality — "lost in the middle." Models attend most reliably to the start and end of context; information buried in the middle of a long context is recalled less reliably. Newer models mitigate but do not erase this. More context can therefore reduce answer quality even when it technically fits.
The maximum window size is a ceiling, not a target. Effective context — the amount the model actually uses well — is smaller than the nominal limit.
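The cost arithmetic from the first bullet can be worked through directly. This sketch assumes the full context is resent on every turn and ignores prompt caching and output tokens; `total_input_tokens` and its parameters are illustrative names.

```python
def total_input_tokens(initial_context: int, growth_per_turn: int, turns: int) -> int:
    """Total input tokens billed over a conversation.

    Turn i (0-based) sends initial_context + growth_per_turn * i tokens,
    because the whole context is replayed on each request.
    """
    return sum(initial_context + growth_per_turn * i for i in range(turns))


# The example from the text: a fixed 100K-token context over 20 turns.
print(total_input_tokens(100_000, 0, 20))      # 2000000

# If each turn also adds 1,000 tokens of history, the total grows further.
print(total_input_tokens(100_000, 1_000, 20))  # 2190000
```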
Managing the budget.
Treat context like memory in an embedded system: scarce, deliberately allocated, reclaimed when stale.
- Retrieve, do not dump. Instead of pasting a whole 80-page manual, retrieve the few relevant chunks per query. This is exactly the motivation for RAG, covered next.
- Summarize history. Replace 30 old turns with a compact running summary. Pay tokens for the gist, not the transcript.
- Position deliberately. Put the system contract and the most important documents at the boundaries; restate any non-negotiable instruction immediately before generation, where recency weighting is strongest.
- Reserve output room. Decide the answer's max size first, subtract it from the window, and only then fill the rest with input. Do not discover the deficit via a truncated reply.
- Isolate sub-tasks. A focused sub-agent with a clean, minimal window often outperforms one carrying an enormous accumulated history — less to attend to, less to get lost in.
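The "reserve output room" and "retrieve, do not dump" steps can be combined into one packing routine. The following is a minimal sketch under stated assumptions: chunks arrive with a relevance score and a token count already attached, selection is a simple greedy pass by relevance, and `pack_chunks` is a hypothetical helper rather than an established API.

```python
def pack_chunks(
    window: int,
    reserved_output: int,
    fixed_input: int,
    chunks: list[tuple[float, int, str]],
) -> tuple[list[str], int]:
    """Choose retrieved chunks that fit after output room is reserved.

    chunks: (relevance_score, token_count, label) tuples.
    Returns (labels of chosen chunks, tokens left unused).
    """
    # Reserve the reply first, then subtract input that must always be sent
    # (system prompt, tool definitions, current message).
    budget = window - reserved_output - fixed_input
    if budget < 0:
        raise ValueError("fixed input plus reserved output exceeds the window")

    chosen: list[str] = []
    # Greedy: most relevant first, skipping anything that no longer fits.
    for _score, tokens, label in sorted(chunks, key=lambda c: c[0], reverse=True):
        if tokens <= budget:
            chosen.append(label)
            budget -= tokens
    return chosen, budget


chunks = [(0.9, 3_000, "A"), (0.8, 2_500, "B"), (0.5, 1_500, "C")]
chosen, leftover = pack_chunks(
    window=10_000, reserved_output=2_000, fixed_input=3_000, chunks=chunks
)
print(chosen, leftover)  # ['A', 'C'] 500
```

Greedy selection is only one possible policy. The point of the sketch is the ordering of the subtraction: the reply's allowance comes off the top before any optional input is admitted.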
Deliverable.
You see the context window as one finite token budget shared by system prompt, history, retrieved content, tools, the question, and the answer. You count tokens rather than guess from words. You can name the three limit failures — hard rejection, output truncation, silent eviction — and you fear the silent one most. You know a large window is bounded by cost, latency, and lost-in-the-middle, so effective context is smaller than the headline number. And you manage the budget actively: retrieve instead of dump, summarize history, position by importance, and reserve room for the reply.