Transformers, at a High Level

Nearly every modern LLM is a "transformer": the "T" in GPT. You do not need the mathematics to understand why this one architecture displaced what came before it. This entry explains the single idea that makes transformers work (attention: letting every word look at every other word), why that idea unlocked massive scale, and what it predicts about how these models behave.

STEP 1

The problem the transformer solved.

Before 2017, the leading networks for language (recurrent networks such as LSTMs) read text the way you read a sentence aloud: strictly left to right, one word at a time, carrying a running summary in a small memory. This had two crippling weaknesses. First, long-range memory was poor: by the end of a long paragraph, the influence of the first sentence had largely faded, so the model struggled to connect a pronoun to a noun mentioned far earlier. Second, the strictly sequential processing could not be parallelised across the sequence (word 100 could not be computed until word 99 was done), which capped how large these models could get and how fast they could be trained.

The 2017 paper "Attention Is All You Need" introduced the transformer, which removed the sequential bottleneck from processing a sequence during training and, almost as a side effect, made models far better at long-range connections. Within a few years it had replaced the previous approaches across language, and later across much of vision and audio too.

STEP 2

The core idea: attention.

The central mechanism is self-attention. Instead of forcing information through a narrow left-to-right memory, the transformer lets every token look directly at every other token in the input (in GPT-style generators, at every token that came before it) and decide, for itself, which ones are relevant right now.

The standard intuition is the sentence: "The animal didn't cross the street because it was too tired." To represent "it", the model must know whether "it" refers to the animal or the street. Attention lets the token "it" look at all the other tokens and assign each an importance weight; here the token places most of its weight on "animal", building a representation that effectively means "the animal." Change "tired" to "wide" and attention shifts its weight to "street" instead. Every token does this, in parallel, at every layer, repeatedly mixing information so that each position's representation is informed by the whole context, not just its neighbours.

"The animal didn't cross the street because it was too tired."
                                            ^^
   "it" attends most strongly to ->  "animal"   (not "street")
   each token: build a query, compare to every other token,
   take a weighted blend of the ones that matter

You can ignore the exact math (the "query/key/value" machinery) and keep the essence: attention is each token gathering information from the most relevant other tokens, with the relevance learned, not hand-coded.
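
For readers who do want to see the machinery once, the following is a minimal sketch of single-head scaled dot-product attention in plain Python. The three-token sequence and all of its vectors are hand-picked assumptions, not learned values: the query for "it" is deliberately made to resemble the key for "animal" so that the weighting described above is visible. A real model learns these vectors and uses far larger dimensions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product: a simple similarity score between two vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over a short sequence."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's query with every token's key.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Blend the value vectors using those weights.
        blended = [sum(w * v[i] for w, v in zip(weights, values))
                   for i in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy vectors (hand-picked, not learned).
tokens = ["animal", "street", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
queries = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]  # "it" points toward "animal"
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = self_attention(queries, keys, values)
it_weights = dict(zip(tokens, weights[2]))
print(it_weights)  # the largest weight belongs to "animal"
```

The output for "it" is a blend of all three value vectors, dominated by the one for "animal". That blend is the "representation that effectively means the animal" from the example sentence.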

STEP 3

Why this unlocked scale.

Attention is not just more accurate; it is the reason today's models are as large as they are. Because every token attends to every token in one shot, the heavy computation is matrix multiplication that runs in parallel across the whole sequence — exactly the workload modern GPUs are built for. The old strictly-sequential models could not be parallelised this way; transformers can, so they could be trained on far more data with far more parameters in feasible time.
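
The difference in dependency structure can be sketched in a few lines of plain Python. This is an illustration only, not how a GPU executes anything: the function names, the `decay` parameter, and the toy vectors are all assumptions made for the example. The recurrent loop needs the previous state at each step, while each cell of the attention score table can be filled in any order because no cell reads another.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recurrent_summary(vectors, decay=0.5):
    """Old style: a running summary where each step needs the previous state."""
    state = [0.0] * len(vectors[0])
    for vec in vectors:  # step t cannot start until step t-1 has finished
        state = [decay * s + x for s, x in zip(state, vec)]
    return state

def attention_scores(vectors, order):
    """Transformer style: fill the score table one cell at a time, in any order."""
    n = len(vectors)
    table = [[None] * n for _ in range(n)]
    for i, j in order:  # no cell depends on any other cell
        table[i][j] = dot(vectors[i], vectors[j])
    return table

# Hypothetical toy inputs.
vectors = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [2.0, 2.0]]
cells = [(i, j) for i in range(4) for j in range(4)]
shuffled = cells[:]
random.Random(0).shuffle(shuffled)

final_state = recurrent_summary(vectors)
same_table = attention_scores(vectors, cells) == attention_scores(vectors, shuffled)
print(final_state, same_table)
```

Because the cells are independent, hardware is free to compute them all at once as a single matrix multiplication. The recurrent loop offers no such freedom.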

That parallelisability is the hidden hinge of the whole field. The scaling story from the LLM entry — bigger models, more data, surprising emergent abilities — was only practical because the transformer turned language modelling into a workload that scales cleanly on parallel hardware. A few more pieces complete the architecture: many attention layers stacked deep; a "feed-forward" sublayer after each attention step to further process the result; and a position signal added at the input so the model still knows word order despite looking at everything at once. The detail to keep is just the shape — a deep stack of attention-plus-processing blocks.
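
Why the position signal is needed can be checked directly. The sketch below assumes a stripped-down attention with no learned weights (queries, keys, and values are all just the inputs) and an arbitrary sine-based position signal chosen only for illustration; real models use learned projections and other position schemes. Without the signal, reordering the words merely reorders the outputs, so the layer cannot tell one word order from another.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors):
    """Self-attention with queries = keys = values = the inputs (an assumption)."""
    scale = math.sqrt(len(vectors[0]))
    out = []
    for q in vectors:
        w = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in vectors])
        out.append([sum(wi * v[d] for wi, v in zip(w, vectors))
                    for d in range(len(q))])
    return out

def add_position(vectors):
    """Add an arbitrary, illustrative position signal to each input vector."""
    return [[x + math.sin(pos + d) for d, x in enumerate(vec)]
            for pos, vec in enumerate(vectors)]

def close(rows_a, rows_b):
    """True if two lists of vectors match up to floating-point rounding."""
    return all(math.isclose(x, y, abs_tol=1e-9)
               for ra, rb in zip(rows_a, rows_b) for x, y in zip(ra, rb))

# Hypothetical toy "word" vectors.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reordered = words[::-1]

# Without positions: reversing the input just reverses the output.
order_blind = close(attend(reordered), attend(words)[::-1])

# With positions: the same words in a new order produce different outputs.
order_aware = not close(attend(add_position(reordered)),
                        attend(add_position(words))[::-1])
print(order_blind, order_aware)
```

Both flags come out true: plain attention is blind to order, and adding a position signal at the input is enough to make order matter.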

"GPT" = Generative Pre-trained Transformer. "Transformer" is this architecture; "pre-trained" is the next-token training objective from the training-vs-inference entry; "generative" is the one-token-at-a-time loop from the LLM entry. The buzzword is just three concepts you have already met, stacked together.

STEP 4

What the architecture predicts about behaviour.

Knowing the model is a transformer explains several day-to-day behaviours, so they stop feeling arbitrary:

Strong use of in-prompt context. Because attention lets generation reach back to any earlier token, models follow instructions, examples, and pasted documents in the prompt remarkably well. This is the architectural reason prompting, retrieval, and tool use work at all.
Cost grows steeply with context length. "Every token attends to every token" means the attention work grows roughly with the square of the sequence length. Doubling the input roughly quadruples the attention cost, which is a core reason long contexts are slower and more expensive, and why context engineering matters.
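Position effects. Because order is added as an extra signal rather than enforced by sequential processing, where information sits in the context can measurably affect how reliably the model uses it. This is one ingredient of the "lost in the middle" effect, in which material in the middle of a long context tends to be used less reliably than material near the start or end.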
Generality across modalities. Nothing in attention is specific to words. Feed it image patches or audio chunks and the same machinery applies, which is why one architecture now underlies text, vision, and multimodal models alike.
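
The arithmetic behind the cost point above is simple enough to sketch. The function below counts only token-to-token comparisons for one full attention pass; `attention_pairs` is a hypothetical helper, and the count ignores layers, heads, the feed-forward work, and any optimisations a real system applies, so read it as a rough scaling picture rather than a cost model.

```python
def attention_pairs(num_tokens):
    """Number of token-to-token comparisons in one full attention pass."""
    return num_tokens * num_tokens

baseline = attention_pairs(1_000)
for n in (1_000, 2_000, 4_000, 8_000):
    pairs = attention_pairs(n)
    print(f"{n:>6} tokens -> {pairs:>12,} pairs ({pairs / baseline:.0f}x baseline)")
```

Each doubling of the input multiplies the pair count by four, which is why trimming irrelevant material from a long prompt can pay off disproportionately.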

The durable summary: a transformer is a deep stack of layers in which every position can attend to every other position, with relevance learned from data. That one idea greatly improved long-range memory, unlocked GPU-scale training, and is the structural reason LLMs are both so capable and so sensitive to what you put in their context and where you put it.