Structured outputs

A chatty paragraph is useless to the code that called the model. Structured output is the practice of making the model emit machine-parseable data — usually JSON conforming to a schema — reliably enough to feed straight into a program. This entry covers why "just ask for JSON" is not enough, the spectrum from prompting to schema-constrained decoding, and how structured output underpins tool calling and agent pipelines.

STEP 1

The problem: prose does not compose.

Models are trained to produce fluent natural language. That is exactly wrong when the consumer is software. If you classify a ticket and the model replies "This looks like a billing issue, probably high priority since they mentioned a failed payment," your code now has to parse intent out of prose — brittle, and it breaks the day the model phrases it differently. To compose a model into a pipeline you need outputs with a fixed, predictable shape:

UNUSABLE
"This looks like a billing issue, probably high priority."

USABLE
{ "category": "billing", "priority": "high" }
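A minimal Python sketch of why the second form composes: it parses with the standard library and maps straight onto a typed record, while the prose version fails at the first character. The Ticket class and the parse_ticket helper are hypothetical names chosen to mirror the example above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    """Hypothetical typed record mirroring the USABLE example."""
    category: str
    priority: str


def parse_ticket(raw: str) -> Ticket:
    """Parse the model's reply into a Ticket; raises if it is not the expected JSON."""
    data = json.loads(raw)  # json.JSONDecodeError if the reply is not valid JSON
    # KeyError if a required key is missing
    return Ticket(category=data["category"], priority=data["priority"])


usable = '{ "category": "billing", "priority": "high" }'
unusable = "This looks like a billing issue, probably high priority."

ticket = parse_ticket(usable)
print(ticket)  # Ticket(category='billing', priority='high')

try:
    parse_ticket(unusable)
except json.JSONDecodeError as err:
    print("prose does not parse:", err.msg)
```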

STEP 2

Why "just ask for JSON" is not enough.

Asking nicely in the prompt ("Respond only with JSON like {...}") works most of the time, and "most of the time" is the problem. The model still samples token by token, so it occasionally:

wraps the JSON in ```json fences or a "Here is the JSON:" preamble;
adds a trailing comment, emits a trailing comma, or uses single quotes;
invents a key, drops a required one, or returns a number as a string;
truncates mid-object when it hits max_tokens, producing unparseable fragments.

At a 1% failure rate across 100k daily calls, that is 1,000 broken records a day. Prompting for structure raises reliability; it does not guarantee it. You need either enforcement or defensive parsing, and ideally both.
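On the defensive-parsing side, here is a minimal sketch of a tolerant extractor. It handles the first failure mode in the list (fences and preambles) and refuses to guess when output looks truncated. The extract_json name and its brace-slicing heuristic are assumptions, not a standard API. It does not fix trailing commas or single quotes, and schema problems such as invented or missing keys still need a separate check.

```python
import json
import re

# A fenced block: three backticks, an optional "json" tag, then the body (captured).
FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)\s*`{3}", re.DOTALL)


def extract_json(reply: str) -> dict:
    """Best-effort recovery of one JSON object from a chatty reply (hypothetical helper)."""
    text = reply.strip()
    fenced = FENCE.search(text)
    if fenced:
        text = fenced.group(1)  # keep only the fenced body, dropping any preamble
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        # No closing brace: most likely cut off at max_tokens. Do not try to repair it.
        raise ValueError("no complete JSON object found (truncated output?)")
    # Still raises (JSONDecodeError is a ValueError) on trailing commas, single quotes,
    # or a nested object that was truncated after an inner closing brace.
    obj = json.loads(text[start:end + 1])
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    return obj


print(extract_json('Here is the JSON: {"category": "billing", "priority": "high"}'))
# {'category': 'billing', 'priority': 'high'}
```

Raising on truncation, rather than appending closing braces, is deliberate: a repaired fragment would parse but silently drop fields.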

STEP 3

The reliability spectrum.

WEAKEST  ->  STRONGEST

1. Prompt + few-shot
   "Reply only with JSON: {...}" plus example outputs.
   Easy; roughly 95-99% valid, varying by model and task;
   needs a parse-and-retry safety net.

2. JSON mode
   Provider flag forcing syntactically valid JSON.
   Valid JSON guaranteed -- correct SCHEMA is not.

3. Schema-constrained / structured outputs
   Pass a JSON Schema; decoding is constrained so only
   tokens that keep output schema-valid can be sampled.
   Output is valid JSON AND matches your schema.

4. Tool / function calling
   Define a function with a typed parameter schema; the
   model's "arguments" object is structured output by
   another name -- same machinery, different doorway.
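To make the gap between levels 2 and 3 concrete, the sketch below parses two replies that are both syntactically valid JSON, which is all JSON mode promises, and checks them against a schema. The schema_errors function is a hand-rolled stand-in for a real JSON Schema validator and supports only flat objects of required string enums. TICKET_SCHEMA and its extra categories ("technical", "account") are hypothetical.

```python
import json

# Hypothetical flat schema: every property is a required string drawn from an enum.
TICKET_SCHEMA = {
    "category": {"billing", "technical", "account"},
    "priority": {"low", "med", "high"},
}


def schema_errors(obj: dict, schema: dict) -> list[str]:
    """Return schema violations for a flat object (an empty list means it conforms)."""
    errors = []
    for key, allowed in schema.items():
        if key not in obj:
            errors.append(f"missing required key {key!r}")
        elif not isinstance(obj[key], str) or obj[key] not in allowed:
            errors.append(f"{key}={obj[key]!r} not in {sorted(allowed)}")
    for key in obj.keys() - schema.keys():
        errors.append(f"unexpected key {key!r}")
    return errors


# Both strings parse, so both would satisfy level 2.
ok = json.loads('{"category": "billing", "priority": "high"}')
drifted = json.loads('{"category": "Billing", "urgency": "kinda urgent"}')

print(schema_errors(ok, TICKET_SCHEMA))       # []
print(schema_errors(drifted, TICKET_SCHEMA))  # bad category, missing priority, extra key
```

Level 3 moves this check into decoding, so the drifted reply can never be produced in the first place.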

The key idea behind levels 3–4: constrained decoding. At each generation step the sampler is restricted to tokens that keep the output a valid prefix of something the schema allows. The model literally cannot emit a missing brace or a misspelled key — invalidity is excluded at the token level, not cleaned up afterward. This connects directly to the tool-calling entry: a tool call's input object is structured output aimed at a function instead of at your application.
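A toy sketch of the masking step, assuming a tiny hand-made vocabulary and a schema small enough to enumerate every legal output. Real implementations compile the schema into a grammar or automaton over the model's actual tokenizer, so treat this only as an illustration of the principle. The fake_logits scorer is a hypothetical stand-in for the model, and it deliberately prefers the out-of-schema token "urgent".

```python
# Hypothetical schema: {"priority": <one of three enum values>}, listed exhaustively.
ALLOWED = [f'{{"priority": "{p}"}}' for p in ("low", "med", "high")]

# Hypothetical vocabulary of multi-character tokens, as a tokenizer might produce.
VOCAB = ['{"', "priority", '": "', "low", "med", "high", "urgent", '"}', "'", ","]

PREFERENCES = {"urgent": 5.0, "high": 3.0}  # hypothetical model leanings


def is_valid_prefix(text: str) -> bool:
    """True if some schema-allowed output starts with text."""
    return any(full.startswith(text) for full in ALLOWED)


def fake_logits(text: str) -> dict[str, float]:
    """Stand-in for a model: ignores context, likes 'urgent' most, then 'high'."""
    return {tok: PREFERENCES.get(tok, 1.0) for tok in VOCAB}


def constrained_greedy_decode() -> str:
    text = ""
    while text not in ALLOWED:
        scores = fake_logits(text)
        # Mask: keep only tokens whose result is still a prefix of an allowed output.
        legal = {tok: s for tok, s in scores.items() if is_valid_prefix(text + tok)}
        if not legal:
            raise RuntimeError("schema admits no continuation")
        text += max(legal, key=legal.get)  # greedy choice among legal tokens only
    return text


print(constrained_greedy_decode())  # {"priority": "high"}
```

Unconstrained, the scorer would pick "urgent" at every step; with the mask, that token is simply never a candidate, which is the sense in which invalidity is excluded rather than cleaned up.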

STEP 4

Designing schemas the model can satisfy.

Constrained decoding guarantees the shape, never the truth. The model can return {"priority": "high"} for a trivial ticket — perfectly valid, semantically wrong. Schema design influences how often the content is right:

Enums over free strings. "priority": "low"|"med"|"high" beats a free-text priority. The model cannot drift to "kinda urgent," and you do not normalize downstream.
Make the model show its reasoning first. A reasoning string field before the answer field lets the model think before committing — order matters because it generates left to right, so a verdict placed before its justification is a guess.
Allow "unknown" explicitly. Without a nullable field or an "unknown" option, an enforced schema leaves the model no choice but to fabricate a value. Give it a legal escape hatch and confabulation drops.
Keep schemas shallow and named. Deeply nested, ambiguous schemas degrade content quality even when syntax is enforced. Clear field names act as inline instructions.
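As an illustration, here is one way to express all four guidelines with standard JSON Schema keywords (type, properties, required, enum, additionalProperties), written as a Python dict. The field names, descriptions, and enum values are assumptions. Whether a particular provider generates properties in declaration order, or supports every keyword, is something to confirm in that provider's documentation.

```python
import json

TICKET_JSON_SCHEMA = {
    "type": "object",
    "properties": {
        # Declared first so the model writes its justification before the verdict.
        "reasoning": {"type": "string", "description": "Brief evidence quoted from the ticket."},
        # Enums instead of free text; "unknown" is a legal escape hatch, not a forced guess.
        "category": {"type": "string", "enum": ["billing", "technical", "account", "unknown"]},
        "priority": {"type": "string", "enum": ["low", "med", "high", "unknown"]},
    },
    "required": ["reasoning", "category", "priority"],
    # Shallow and closed: one level of clearly named fields, no invented keys.
    "additionalProperties": False,
}

print(json.dumps(TICKET_JSON_SCHEMA, indent=2))
```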

Enforced JSON validity is not validation. The schema cannot check that an email is real, an ID exists, or a total adds up. Always validate semantics in code after parsing — schema conformance is necessary, never sufficient.
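A sketch of the semantic layer that runs after parsing, assuming a hypothetical invoice-extraction task where amounts arrive as decimal strings. The object below would pass a schema check, yet semantic_errors flags two problems: the customer ID is not in the hypothetical KNOWN_CUSTOMERS set, and the line items do not sum to the stated total.

```python
from decimal import Decimal

KNOWN_CUSTOMERS = {"C-1001", "C-1002"}  # hypothetical; in practice a database lookup


def semantic_errors(invoice: dict) -> list[str]:
    """Check a schema-conformant invoice for facts a JSON Schema cannot express."""
    errors = []
    if invoice["customer_id"] not in KNOWN_CUSTOMERS:
        errors.append(f"unknown customer_id {invoice['customer_id']!r}")
    # Decimal built from strings avoids float rounding when comparing money amounts.
    items_sum = sum((Decimal(item["amount"]) for item in invoice["line_items"]), Decimal("0"))
    if items_sum != Decimal(invoice["total"]):
        errors.append(f"line items sum to {items_sum}, but total is {invoice['total']}")
    return errors


# Well-formed and schema-valid, but wrong in two ways.
invoice = {
    "customer_id": "C-9999",
    "line_items": [{"amount": "19.99"}, {"amount": "5.00"}],
    "total": "25.00",
}
print(semantic_errors(invoice))
```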

STEP 5

Be defensive even with enforcement.

Even with strong enforcement, treat parsing as a point of failure:

Reserve output tokens. Truncation is one of the most common causes of broken structured output. Size max_tokens for the largest valid object and verify it fits the context budget.
Parse-and-retry. If parsing or validation fails, send the error back and ask the model to fix it. A single retry often recovers the failure.
Validate against the schema in code anyway. Belt and suspenders: never trust that the upstream constraint held in every edge case.
Lower temperature. Structured tasks rarely benefit from creativity; temperature 0–0.3 reduces format drift.

Takeaway

Deliverable

You know prose does not compose into software and that prompting for JSON raises but does not guarantee reliability. You can place a technique on the spectrum (prompt, JSON mode, schema-constrained decoding, tool calling), and you understand that constrained decoding excludes invalid tokens at generation time rather than cleaning up after. You design schemas that improve content (enums, reasoning-before-answer, explicit unknowns, shallow named fields), and you stay defensive regardless: reserve output room, validate semantics in code, and parse-and-retry. This is the same machinery as tool calling, pointed at your application instead of a function.