Reflection — verify, critique, revise.
Reflection adds a self-correction step: the agent generates, critiques its own output, and revises. It can meaningfully raise quality — but only on tasks with a usable correctness signal. This essay covers self-refine vs. Reflexion, where the signal comes from, and the well-documented failure where self-critique makes things worse.
Two flavors: self-refine and Reflexion.
Self-Refine (Madaan et al., 2023) is a single-episode loop: produce a draft, prompt the same model for actionable feedback on that draft, then produce a revision conditioned on the feedback. Repeat for a fixed budget or until feedback says "good enough." No external environment, no memory across tasks — purely intra-task polish.
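The Self-Refine loop can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `self_refine`, the three callables, and the `"GOOD ENOUGH"` stop marker are hypothetical names, and the toy functions stand in for model calls so the sketch runs offline.

```python
from typing import Callable


def self_refine(
    task: str,
    generate: Callable[[str], str],
    feedback: Callable[[str, str], str],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 3,
    stop_marker: str = "GOOD ENOUGH",  # hypothetical stop phrase
) -> str:
    """Single-episode draft -> feedback -> revise loop with no external check."""
    draft = generate(task)
    for _ in range(max_rounds):
        notes = feedback(task, draft)
        if stop_marker in notes:
            break  # the critic (the same model) declared the draft done
        draft = revise(task, draft, notes)
    return draft


# Toy stand-ins for model calls (assumptions for illustration only).
def toy_generate(task: str) -> str:
    return "the cat sat"


def toy_feedback(task: str, draft: str) -> str:
    if not draft[0].isupper():
        return "Capitalize the first letter."
    if not draft.endswith("."):
        return "End with a period."
    return "GOOD ENOUGH"


def toy_revise(task: str, draft: str, notes: str) -> str:
    if "Capitalize" in notes:
        return draft[0].upper() + draft[1:]
    if "period" in notes:
        return draft + "."
    return draft


result = self_refine("Write a sentence.", toy_generate, toy_feedback, toy_revise)
print(result)  # The cat sat.
```

Note that nothing in this loop consults anything outside the model: the stop condition is the critic's own opinion.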
Reflexion (Shinn et al., 2023) is broader: after a trajectory fails an external check (a unit test, a task reward, an environment signal), the model writes a natural-language post-mortem and stores it in episodic memory. The reflection is prepended to the context on the next attempt, so the agent learns from its own failure across episodes. Reflexion is verbal reinforcement learning: the policy is fixed, but the context carries a learned lesson.
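A rough sketch of the Reflexion pattern, assuming a simple list as the episodic memory. The function names (`reflexion`, `act`, `check`, `reflect`) are hypothetical, and the toy callables replace the model and the environment so the example runs on its own.

```python
from typing import Callable, List, Tuple


def reflexion(
    task: str,
    act: Callable[[str, List[str]], str],
    check: Callable[[str], Tuple[bool, str]],
    reflect: Callable[[str, str, str], str],
    max_episodes: int = 3,
) -> Tuple[str, List[str]]:
    """Retry a task, carrying natural-language lessons between episodes."""
    memory: List[str] = []  # episodic memory of post-mortems
    attempt = ""
    for _ in range(max_episodes):
        attempt = act(task, memory)  # lessons are part of the context
        ok, error = check(attempt)   # external signal, not the model
        if ok:
            break
        memory.append(reflect(task, attempt, error))
    return attempt, memory


# Toy stand-ins (assumptions for illustration only).
def toy_act(task: str, lessons: List[str]) -> str:
    # The "policy" is fixed; only the lessons in context change its output.
    return "5" if any("digits" in lesson for lesson in lessons) else "five"


def toy_check(attempt: str) -> Tuple[bool, str]:
    if attempt.isdigit():
        return True, ""
    return False, f"expected a number, got {attempt!r}"


def toy_reflect(task: str, attempt: str, error: str) -> str:
    return f"Previous attempt failed ({error}). Answer using digits only."


answer, lessons = reflexion("What is 2 + 3?", toy_act, toy_check, toy_reflect)
print(answer)        # 5
print(len(lessons))  # 1
```

In a longer-lived agent, `memory` would be persisted between runs rather than rebuilt per call.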
The crucial difference is the source of truth. Self-Refine's critic is the model judging itself. Reflexion's trigger is an external signal (a failed test). This distinction predicts when each works — and it is the single most important thing to internalize about reflection.
The loop.
    # Reflection with an external verifier (the version that reliably helps).
    # generator, critic, and verify are placeholders for your own components.
    MAX_REVISIONS = 3  # hard cap on revision rounds

    def reflect(task):
        draft = generator.produce(task)
        for _ in range(MAX_REVISIONS):
            verdict = verify(draft)  # tests / schema / tool, NOT the model
            if verdict.ok:
                return draft
            critique = critic.analyze(task, draft, verdict.failures)
            draft = generator.revise(task, draft, critique)
        # Budget exhausted: the latest draft is unverified, so flag it low-confidence.
        return draft
The verify() function is the entire ballgame. If it is a real external check — a compiler, a test suite, JSON-schema validation, a tool that succeeds or errors — reflection compounds toward correctness. If verify() is just another model call grading the draft, you are in self-refine territory, and the guarantees weaken sharply.
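As one illustration of a real external check, here is a sketch of a `verify()` for structured output using only the standard library. The `Verdict` shape mirrors the `ok` and `failures` attributes used in the loop; the `REQUIRED_FIELDS` contract is a made-up example, and a production system would more likely use a full JSON Schema validator.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Verdict:
    ok: bool
    failures: List[str] = field(default_factory=list)


# Hypothetical output contract: field name -> expected Python type.
REQUIRED_FIELDS: Dict[str, type] = {"name": str, "age": int}


def verify(draft: str) -> Verdict:
    """Check a draft against the contract without calling any model."""
    try:
        data = json.loads(draft)
    except json.JSONDecodeError as exc:
        return Verdict(ok=False, failures=[f"invalid JSON: {exc.msg}"])
    if not isinstance(data, dict):
        return Verdict(ok=False, failures=["top-level value must be an object"])

    failures: List[str] = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            failures.append(f"missing field: {key}")
        elif not isinstance(data[key], expected):
            failures.append(f"field {key} must be {expected.__name__}")
    return Verdict(ok=not failures, failures=failures)


print(verify('{"name": "Ada", "age": 36}').ok)   # True
print(verify('{"name": "Ada"}').failures)        # ['missing field: age']
```

The failure strings are concrete and specific, which is what makes them useful input for the critic.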
When reflection pays off.
- Checkable outputs. Code (does it compile and pass tests?), structured data (does it validate?), tool-call arguments (did the tool accept them?), constrained formats. Here the critic has ground truth and revisions converge.
- Tasks with clear, articulable quality rubrics. "Does this summary cover all five required sections?" is checkable enough that even a model critic adds value, because the rubric is concrete rather than a vibe.
- High-value, latency-tolerant work. Each reflection round adds at least a critique call and a revision call on top of the original draft, so expect latency and token cost to at least double. That is worth it for a contract draft or a migration script; it is not worth it for an autocomplete suggestion.
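A concrete rubric like the "required sections" example can often be checked without a model at all. The sketch below assumes each section appears as its own heading line; the five section names are invented for illustration.

```python
from typing import List, Sequence

# Hypothetical rubric: the five sections a summary must contain.
REQUIRED_SECTIONS = ("Background", "Method", "Results", "Risks", "Next steps")


def missing_sections(
    summary: str, required: Sequence[str] = REQUIRED_SECTIONS
) -> List[str]:
    """Return required section names that never appear as a heading line."""
    headings = {line.strip().rstrip(":").lower() for line in summary.splitlines()}
    return [name for name in required if name.lower() not in headings]


draft = """Background:
We migrated the billing service.
Method:
Blue-green deploy.
Results:
No downtime.
"""

print(missing_sections(draft))  # ['Risks', 'Next steps']
```

The returned list doubles as critic input: it tells the reviser exactly what to add.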
When reflection makes things WORSE.
This is the part most teams skip, and then they ship a regression. Self-critique without an external signal has documented pathologies:
- No reliable self-correction signal. On open-ended reasoning, studies of intrinsic self-correction (e.g., Huang et al., 2023, "Large Language Models Cannot Self-Correct Reasoning Yet") show that models often cannot tell whether their own answer is right. Asked to critique a correct answer, the model frequently "finds" a flaw and revises a right answer into a wrong one, so net quality can drop.
- Sycophantic or rubber-stamp critique. The critic says "looks good" regardless, adding latency and cost with zero quality gain. Common when generator and critic share the same model and prompt framing.
- Critique that drifts off-task. Open-ended feedback ("make it better") yields revisions that change style, not correctness, sometimes degrading a fine output.
- Convergence to confident wrong. Iterated self-reflection can entrench an early mistake, with the model rationalizing the error more fluently each round.
Rule of thumb: if you cannot point to the external signal the critic consults, do not ship a reflection loop. A model grading itself on a task it cannot verify is theater that costs 2x latency. The fix is almost always "build a verifier," not "improve the critic prompt."
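One way to find out whether a loop is helping or hurting is to audit it on a small labeled set, counting answers it fixed against answers it broke. This is an illustrative sketch: `audit_revisions` is a hypothetical helper and the sample data is invented.

```python
from typing import Dict, Iterable, Tuple


def audit_revisions(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, int]:
    """Tally (correct_before, correct_after) pairs from a labeled eval set."""
    counts = {"fixed": 0, "broken": 0, "unchanged": 0}
    for before_ok, after_ok in pairs:
        if not before_ok and after_ok:
            counts["fixed"] += 1      # wrong -> right
        elif before_ok and not after_ok:
            counts["broken"] += 1     # right -> wrong
        else:
            counts["unchanged"] += 1
    counts["net"] = counts["fixed"] - counts["broken"]
    return counts


# Invented sample: correctness before and after reflection for five items.
sample = [(True, True), (True, False), (False, True), (True, False), (False, False)]
report = audit_revisions(sample)
print(report)  # {'fixed': 1, 'broken': 2, 'unchanged': 2, 'net': -1}
```

A negative `net` means the loop is turning more right answers wrong than the reverse.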
Design guidance that works in production.
- Ground the critic in something external. Run the tests, validate the schema, execute the tool. Feed concrete failures, not "review this," into the critic.
- Separate generator and critic. Different prompt, ideally a different model or a strict structured rubric. A critic that shares the generator's blind spots cannot see them.
- Bound revisions hard (1 to 3 rounds). Gains are front-loaded; beyond a few rounds you mostly see drift and entrenchment. Stop as soon as the verifier passes or the budget runs out.
- Make "no change needed" a first-class verdict. The critic must be able to ratify a draft unchanged, but only when the external check passes.
- Persist Reflexion-style notes when episodes recur. If the same agent retries similar tasks, store the post-mortem and prepend it. This is where reflection's largest cross-episode gains live.
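Several of these guidelines can be combined in one small sketch: a structured verdict with an explicit ratify action, a hard revision cap, and ratification tied to the external check. All names here (`Action`, `Critique`, `bounded_reflect`) are hypothetical, and the toy check and reviser stand in for real tests and a real model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Action(Enum):
    RATIFY = "ratify"  # "no change needed" is an explicit outcome
    REVISE = "revise"


@dataclass
class Critique:
    action: Action
    notes: List[str]


def critique_from_check(failures: List[str]) -> Critique:
    """Ratify only when the external check reports no failures."""
    if not failures:
        return Critique(Action.RATIFY, [])
    return Critique(Action.REVISE, failures)


def bounded_reflect(
    draft: str,
    check: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_revisions: int = 2,
) -> Tuple[str, bool]:
    """Return (draft, verified) after at most max_revisions revisions."""
    for _ in range(max_revisions):
        critique = critique_from_check(check(draft))
        if critique.action is Action.RATIFY:
            return draft, True
        draft = revise(draft, critique.notes)
    return draft, not check(draft)  # verify the final draft before trusting it


# Toy external check and reviser (assumptions for illustration only).
def toy_check(draft: str) -> List[str]:
    failures = []
    if not draft[:1].isupper():
        failures.append("must start with a capital letter")
    if not draft.endswith("."):
        failures.append("must end with a period")
    return failures


def toy_revise(draft: str, notes: List[str]) -> str:
    for note in notes:
        if "capital" in note:
            draft = draft[:1].upper() + draft[1:]
        elif "period" in note:
            draft = draft + "."
    return draft


print(bounded_reflect("hello world", toy_check, toy_revise))  # ('Hello world.', True)
print(bounded_reflect("Fine.", toy_check, toy_revise))        # ('Fine.', True)
```

The boolean in the return value lets the caller flag unverified drafts as low-confidence instead of silently shipping them.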
The honest tradeoff: reflection buys higher quality on verifiable tasks at roughly multiplicative latency/cost, and it buys a quality regression on tasks you cannot verify. It is the pattern most often deployed in the wrong place because "let the model double-check itself" sounds free. It is not free, and on unverifiable tasks it is negative.