Reasoning vs non-reasoning models.
This entry explains the capability axis that emerged across the field from 2024 onward: models that spend extra inference-time compute "thinking" before they answer. You will leave able to say what a reasoning model actually does differently, when that helps, when it wastes money, and why the once-sharp line between "reasoning" and "chat" models is now a dial rather than a category.
What "reasoning" means mechanically.
A non-reasoning ("direct") model reads the prompt and immediately starts emitting the answer. A reasoning model first generates an internal chain of intermediate tokens — exploring approaches, doing sub-steps, checking itself — and only then emits the visible answer. That intermediate stream is variously called thinking, reasoning, or scratchpad tokens. It is usually hidden or summarized in the API, but it is real generation: it consumes tokens, time, and money.
The core idea is inference-time compute: spend more computation per question at answer time, not just more parameters or more training. Letting the model "work it out" before committing measurably improves accuracy on problems that decompose into steps — and does little or nothing for problems that do not.
This is conceptually the same effect as prompting an ordinary model to "think step by step," but trained in and far more capable. The model has learned to produce long, productive reasoning traces and to use them, rather than relying on a prompt trick.
When it helps, wastes, or hurts.
- Helps. Multi-step math, logic and planning, complex code generation and review, anything where the model benefits from exploring and rejecting options before committing. The thinking tokens let it catch its own mistakes mid-stream.
- Wastes money. Simple factual lookup, classification, extraction, short conversational replies. There is nothing to reason about — the model either knows it or does not — so the extra tokens are pure cost and latency with no accuracy gain.
- Can actively hurt. A few task types where a calibrated direct answer is best and extended thinking introduces overthinking or second-guessing. More thinking is not monotonically better.
The practical heuristic: reasoning models trade latency and cost for accuracy on hard problems. That trade is excellent when the problem is genuinely hard and the cost of a wrong answer is high; it is a bad trade for high-volume, easy, latency-sensitive traffic.
The category is collapsing into a dial.
Early on, providers shipped distinct "reasoning" model families separate from their "chat" models, and you chose one or the other. The clear trend by 2026 is hybrid models with an adjustable thinking budget: one model that can think a lot, a little, or not at all, controlled by a parameter on the request. The mental model shifts from "pick the reasoning model or the chat model" to "set how much the model is allowed to think for this request."
That is a better mental model anyway, because the right thinking budget is per-request, not per-model. The same model should think hard about a thorny planning step and answer instantly for a trivial classification in the very same agent.
The visible reasoning trace is not a faithful explanation of how the model reached its answer. It improves accuracy, but treating it as a trustworthy audit log of the model's "real" reasoning is a known mistake. Use it as a debugging signal and a quality lever, not as ground truth about the model's internal process.
How this lands in agent design.
Inside a multi-step agent, do not pick one thinking level for the whole loop. Match the budget to the step:
- Routing / tool selection: minimal or no thinking — you want fast, predictable dispatch.
- Hard planning or self-correction step: generous thinking budget — this is exactly where it pays for itself.
- Final answer synthesis: moderate or none, depending on whether the synthesis itself is hard.
The marker that thinking is wasted on a step: when you can inspect the trace and it is just paraphrasing the question or rambling rather than doing real sub-work. That is a signal to lower the budget or drop to a direct model for that step. Like everything in this section, the only reliable arbiter is your own evaluation set — measure accuracy and cost and latency with thinking on versus off, per step, on tasks that look like yours.