Generation & Sampling: Temperature

Generation and sampling: temperature explained.

"Why did I get a different answer to the exact same prompt?" is the question that reveals the most important shift in how to think about LLMs: the model does not return an answer, it returns a probability distribution, and a separate sampling step turns that into the text you see. This entry explains that step, demystifies the temperature dial you have seen in every API, and gives you a task-driven rule for setting it.

STEP 1

The model outputs a distribution, not a word.

At each generation step, the model does not choose the next token. It produces a probability distribution over every token in its vocabulary: one probability for each of the roughly 100,000 or more candidates for what could come next. Only after that does a separate sampling step pick one token from that distribution. The chosen token is then appended to the text and the whole process repeats for the next token.

Prompt: "The capital of France is"
Model's distribution over the next token:
  " Paris"    -> 0.85
  " the"      -> 0.04
  " a"        -> 0.02
  " located"  -> 0.02
  " home"     -> 0.01
  ... ~100,000 more tokens, almost all near zero ...

Sampling picks ONE. Then repeat for the token after that.

This is the mental model that dissolves a whole class of confusion: the model's actual output is the entire ranked distribution. "Generation" is a long sequence of distribution-then-pick steps. Whether the same prompt gives the same text depends mostly on how that pick is made.
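The distribution-then-pick step can be sketched in a few lines. This is a toy illustration, not how any particular model is implemented: the probabilities come from the example above, the "<other>" bucket is an assumption that lumps the remaining ~100,000 tokens together, and the function names are hypothetical.

```python
import random

# Toy next-token distribution from the example above. "<other>" is an
# assumed bucket holding the probability mass of all remaining tokens.
next_token_probs = {
    " Paris": 0.85,
    " the": 0.04,
    " a": 0.02,
    " located": 0.02,
    " home": 0.01,
    "<other>": 0.06,
}

def greedy_token(probs):
    """Always return the single most probable token."""
    return max(probs, key=probs.get)

def sample_token(probs, rng):
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
picks = [sample_token(next_token_probs, rng) for _ in range(1000)]
paris_share = picks.count(" Paris") / len(picks)

print("greedy pick:", repr(greedy_token(next_token_probs)))
print("share of sampled picks that were ' Paris':", paris_share)
```

The greedy pick is the same on every call, while the sampled picks land on " Paris" most of the time but not always. That difference is the whole story of "same prompt, different answer."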

STEP 2

Temperature: how sharply to favour the top choice.

Temperature is a single number (typically 0 to about 2) that reshapes the distribution before sampling, controlling how strongly the most probable tokens are favoured over the rest. It does not change what the model "thinks": the scores the model computes for each token are the same at every temperature. It only changes how decisively sampling commits to the front-runner.

  • Temperature 0: always take the single highest-probability token. The distribution is effectively collapsed to its peak. Most repeatable, most predictable, least varied. Sometimes called "greedy."
  • Temperature ≈ 1: sample from the model's distribution as-is (at exactly 1 it is unchanged). This is its natural level of variability: usually the likely token, but plausible alternatives genuinely surface.
  • Temperature ≈ 2: flatten the distribution. Unlikely tokens get a real chance. Output becomes more surprising and more diverse, but also less coherent and more error-prone.
  temperature 0          temperature 1          temperature 2
  |#                     |#                     |#
  |#                     |##                    |###
  |#                     |####                  |#####
  |#  . . . . .          |#######  . .          |#########
  one peak, always       likely wins but        nearly flat,
  picks the top          variety is real        anything can appear

  Same underlying distribution; temperature only reshapes it.

The practical mental image: temperature is a "play it safe ↔ take risks" dial applied to text. Low temperature hugs the most-likely path; high temperature wanders off it on purpose.
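To make the reshaping concrete, here is a minimal sketch of the common formulation, in which the model's raw scores (logits) are divided by the temperature before a softmax turns them into probabilities. The logit values and the function name are made up for illustration, and treating temperature 0 as "pick the top token" is an assumption about how implementations usually handle that special case.

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities at a given temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Dividing by zero is undefined, so treat 0 as greedy:
        # all probability goes to the highest-scoring token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens, best first.
logits = [4.0, 2.0, 1.0, 0.0]

for t in (0, 0.5, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

The logits never change between runs of the loop; only the divisor does. Lower temperatures concentrate probability on the first token, and higher ones spread it across the rest.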

STEP 3

Choosing temperature by task, not by taste.

The setting should follow the task, not personal preference. A reliable guide:

  • Classification, extraction, structured output: 0 – 0.2. There is a correct answer; you want the highest-probability one. Variety is pure downside here.
  • Tool-calling / decisions in an agent: 0 – 0.3. You want predictable behaviour on the same input. Randomness in which tool gets called is a bug, not a feature.
  • Code generation: ~0.2 – 0.4. Low, but not zero — a little flexibility for novel problems while staying close to well-trodden patterns.
  • Summarising, rewriting: ~0.3 – 0.7. Variety in phrasing is genuinely desirable; the content is constrained anyway.
  • Brainstorming, creative writing: ~0.7 – 1.0+. You want surprise. The single most-likely token is often the most generic and forgettable one.
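One way to keep these choices out of individual call sites is a small lookup table. The sketch below copies the ranges from the list above; the task keys, the helper name, and the choice of the midpoint as a starting value are all assumptions made for illustration, not part of the guide.

```python
# Suggested (low, high) temperature ranges, copied from the list above.
# The dictionary keys are hypothetical task labels.
TEMPERATURE_RANGES = {
    "classification": (0.0, 0.2),
    "extraction": (0.0, 0.2),
    "structured_output": (0.0, 0.2),
    "tool_calling": (0.0, 0.3),
    "code_generation": (0.2, 0.4),
    "summarising": (0.3, 0.7),
    "rewriting": (0.3, 0.7),
    "brainstorming": (0.7, 1.0),      # the guide allows going above 1.0
    "creative_writing": (0.7, 1.0),   # same: 1.0 is not a hard ceiling
}

def suggested_temperature(task):
    """Return the midpoint of the suggested range as a starting value."""
    low, high = TEMPERATURE_RANGES[task]
    return round((low + high) / 2, 2)

for task in ("classification", "code_generation", "summarising"):
    print(task, suggested_temperature(task))
```

An unknown task raises a KeyError, which is deliberate here: a missing entry is a prompt to decide what kind of task it is rather than fall back to a default.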

You may also meet top_p (nucleus sampling), a related dial that truncates the distribution instead of reshaping it: sample only from the smallest set of top-ranked tokens whose probabilities sum to at least top_p (e.g. 0.9), and discard the rest. Intuitively it cuts off the long tail of improbable tokens. Most applications tune temperature alone and leave top_p at its default; reach for it only when you have a specific failure to mitigate.
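The truncation step can be sketched as follows. This is a simplified illustration over a plain dictionary with made-up tokens and probabilities, and the function name is hypothetical; real inference code applies the same idea to arrays of logits.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p,
    then renormalise so the kept probabilities sum to 1."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete; drop the remaining tail
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Hypothetical distribution over four candidate tokens.
probs = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.05}
nucleus = top_p_filter(probs, 0.9)
print(nucleus)
```

With top_p set to 0.9, the first two tokens cover only 0.8 of the mass, so the third is included as well and the least likely token is removed before sampling.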

STEP 4

The trap: "temperature 0 is deterministic."

The most consequential misconception. Temperature 0 makes output much more consistent, but it is not a guarantee of identical results, for reasons that have nothing to do with sampling randomness:

  • Floating-point non-determinism. Inference servers batch many requests together for efficiency. The composition of the batch can change the order in which tiny numerical additions are performed, and floating-point addition is not associative: (a + b) + c can differ from a + (b + c) in the last bits. The difference is usually invisible, but occasionally it is enough to flip which token is ranked first, especially when the top two are nearly tied.
  • Model snapshot updates. The same API model name can point to slightly updated weights over time. Same call, slightly different distribution.
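  • Server-side variation. A request may be served from a cache, routed to a different replica or hardware type, or handled by a fallback path, and each of these can produce slightly different numbers even at temperature 0.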
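The floating-point point is easy to verify directly. The sketch below is a toy: real inference involves vastly more additions, and the token names and the rival score of 0.6 are invented to show how a last-bit difference can change which of two nearly tied candidates wins.

```python
# The same three numbers, added in two different orders.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # the two sums differ in the last bit
print(repr(left), repr(right))

def top_token(scores):
    """Return the highest-scoring token; ties go to the first listed."""
    return max(scores, key=lambda pair: pair[1])[0]

# A hypothetical near-tie: the rival scores 0.6, and the candidate's
# score is the sum above, computed in one order or the other.
pick_order_1 = top_token([("rival", 0.6), ("candidate", left)])
pick_order_2 = top_token([("rival", 0.6), ("candidate", right)])
print(pick_order_1, pick_order_2)
```

Nothing random happens in this code, yet the winner depends on the order of addition. That is the mechanism by which greedy decoding can still produce different text across requests.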

So treat temperature 0 as "low variance," never "no variance." If you need genuine reproducibility for tests, combine temperature 0 with a fixed seed where supported and ideally a pinned model version — and still expect rare drift, since providers document seeds as best-effort, not contractual.

The real upgrade: stop calling the model "wrong" when it gives a different answer to the same prompt. It is sampling from a distribution. The useful question is not "why isn't it deterministic?" but "is the distribution centred on the right answer with appropriate confidence?" — and that is measurable. A model that is right 80% of the time is not broken when one run lands in the other 20%; that is sampling, working as designed. Use a low temperature when you want consistency, a higher one when you want range, and judge quality over many runs, not one.
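Judging over many runs can be as simple as counting. In the sketch below, fake_model is a hypothetical stand-in that returns the right answer 80% of the time; in practice you would replace it with a real model call and a real correctness check.

```python
import random

def fake_model(rng, p_correct=0.8):
    """Hypothetical stand-in for a model call: right 80% of the time."""
    return "Paris" if rng.random() < p_correct else "Lyon"

def pass_rate(n_runs, seed=0):
    """Run the same prompt n_runs times and return the share of correct answers."""
    rng = random.Random(seed)  # seeded so the evaluation is repeatable
    hits = sum(fake_model(rng) == "Paris" for _ in range(n_runs))
    return hits / n_runs

rate = pass_rate(1000)
print(f"pass rate over 1000 runs: {rate:.2f}")
```

A single run tells you almost nothing about this model; the pass rate over many runs is the number worth tracking, and it is the one that should move when you change the prompt or the temperature.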