Cost, quality & latency


Model size and the cost / quality / latency triangle.

This entry teaches the single trade-off that dominates production model selection: you cannot maximize quality, minimize cost, and minimize latency at the same time. You will leave with a working mental model of what model "size" means, what frontier vs small actually buys you, and how to engineer around the triangle instead of pretending it does not exist.

STEP 1

"Size" is a proxy, and a leaky one.

Loosely, a bigger model (more parameters, more training compute) tends to be more capable, slower per token, and more expensive per token. But parameter count is an unreliable headline: architecture, training data quality, and training compute matter as much, and many models do not publish a parameter count at all. Providers instead expose tiers — typically a small/fast tier, a balanced mid tier, and a frontier tier — and those tier labels are a better practical handle than any number.

The durable intuition: think in tiers, not parameters. A current mid-tier model often matches a previous-generation frontier model at a fraction of the cost and latency. That generational drift is why "use the biggest model" is rarely the right default — last year's flagship capability is this year's cheap tier.

STEP 2

The triangle.

           QUALITY
 (accuracy, reasoning depth)
              /\
             /  \
            /    \
           /      \
          /        \
         /          \
  COST  /____________\  LATENCY
($ / token)          (time to answer)

Pick a point inside the triangle, not a corner.

Push toward QUALITY → bigger / reasoning model → more $, slower.
Push toward LOW COST → smaller model → quality risk.
Push toward LOW LATENCY → smaller / no-thinking → quality risk.

The job is to find the cheapest, fastest model that is STILL good enough on YOUR eval, not the best model.

The classic mistake is optimizing one corner in isolation: picking the highest-quality model and discovering the latency is unusable for an interactive product, or picking the cheapest and shipping a quality regression users notice. The right framing is a constraint-satisfaction problem: given my latency budget and cost ceiling, what is the highest quality I can get? Quality here is measured on a representative evaluation set, not a public benchmark.
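The constraint-satisfaction framing can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the `Candidate` fields, the tier names, and every number below are hypothetical placeholders for measurements you would take on your own eval set and traffic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str                # hypothetical tier label
    quality: float           # pass rate on your own eval set, 0.0 to 1.0
    p95_latency_s: float     # measured 95th-percentile latency, seconds
    cost_per_request: float  # measured average dollars per request


def best_within_budget(
    candidates: list[Candidate],
    latency_budget_s: float,
    cost_ceiling: float,
) -> Optional[Candidate]:
    """Return the highest-quality candidate that satisfies both constraints."""
    feasible = [
        c for c in candidates
        if c.p95_latency_s <= latency_budget_s
        and c.cost_per_request <= cost_ceiling
    ]
    if not feasible:
        return None
    # Break quality ties in favour of the cheaper model.
    return max(feasible, key=lambda c: (c.quality, -c.cost_per_request))


# Illustrative numbers only, not measurements of any real model.
CANDIDATES = [
    Candidate("small", quality=0.82, p95_latency_s=0.8, cost_per_request=0.001),
    Candidate("mid", quality=0.91, p95_latency_s=1.9, cost_per_request=0.006),
    Candidate("frontier", quality=0.95, p95_latency_s=6.5, cost_per_request=0.030),
]

choice = best_within_budget(CANDIDATES, latency_budget_s=2.5, cost_ceiling=0.01)
print(choice.name if choice else "no model fits the budget")
```

With these made-up figures the frontier tier is excluded by both constraints, so the mid tier wins even though it is not the highest-quality model in the list.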

STEP 3

Each axis, concretely.

Cost

Priced per token, usually with input cheaper than output; reasoning ("thinking") tokens are billed too, typically at the output rate even though they never appear in the final answer. Cost scales with model tier and with how much context you carry: a small model fed huge prompts every turn can cost more than a large model fed tight ones. Cost optimization is often a context-engineering problem as much as a model-choice problem.
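The arithmetic behind that claim is short enough to write out. The sketch below assumes per-million-token pricing and assumes reasoning tokens are billed at the output rate; the prices and token counts are invented for illustration and do not correspond to any provider's price list.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    reasoning_tokens: int = 0,
) -> float:
    """Dollar cost of one request under simple per-token pricing.

    Assumption: reasoning tokens are charged at the output rate.
    """
    billed_output = output_tokens + reasoning_tokens
    return (
        input_tokens * input_price_per_mtok
        + billed_output * output_price_per_mtok
    ) / 1_000_000


# Hypothetical prices in dollars per million tokens.
SMALL = {"input_price_per_mtok": 0.50, "output_price_per_mtok": 2.00}
LARGE = {"input_price_per_mtok": 5.00, "output_price_per_mtok": 20.00}

# Small model, but the whole knowledge base is pasted into every prompt.
small_bloated = request_cost(input_tokens=60_000, output_tokens=500, **SMALL)

# Large model with a tightly curated prompt.
large_tight = request_cost(input_tokens=3_000, output_tokens=500, **LARGE)

print(f"small model, bloated prompt: ${small_bloated:.4f}")
print(f"large model, tight prompt:   ${large_tight:.4f}")
```

Under these assumed prices the small model with the bloated prompt costs about $0.031 per request against about $0.025 for the large model with the tight prompt, which is the inversion the paragraph describes.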

Latency

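Two numbers matter: time-to-first-token (responsiveness) and total generation time (how long the full answer takes, which depends on output length and tokens-per-second throughput). Bigger models tend to increase both, and reasoning modes add thinking time before the visible answer starts. For interactive UX, time-to-first-token plus streaming usually matters more than raw total time; for batch jobs, aggregate throughput dominates and per-request latency barely matters.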
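Both numbers can be measured from any streaming response by timestamping the first chunk and the end of the stream. The sketch below is a minimal illustration: `fake_stream` is a hypothetical stand-in for a real streaming client, and its delays are arbitrary.

```python
import time
from typing import Iterable, Iterator, Optional


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume a stream and record time-to-first-token and total time."""
    start = time.perf_counter()
    first_at: Optional[float] = None
    count = 0
    for _ in chunks:
        if first_at is None:
            first_at = time.perf_counter()  # first chunk arrived
        count += 1
    end = time.perf_counter()
    return {
        "ttft_s": None if first_at is None else first_at - start,
        "total_s": end - start,
        "chunks": count,
    }


def fake_stream(n_chunks: int, first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(first_delay_s)  # simulated queueing and prompt processing
    for i in range(n_chunks):
        if i:
            time.sleep(gap_s)  # simulated per-chunk decode time
        yield "tok"


stats = measure_stream(fake_stream(n_chunks=5, first_delay_s=0.05, gap_s=0.01))
print(stats)
```

In practice you would pass the chunk iterator from your provider's streaming client to `measure_stream` and collect these figures over many requests, since a single sample says little about tail latency.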

Quality

Not one number. A small model can equal a frontier model on routine extraction or classification while falling apart on multi-step reasoning. "Good enough" is task-specific, which is exactly why a public leaderboard cannot answer it for you and your own eval set can.
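One way to see why a single score misleads is to break eval results down by task type. The sketch below assumes each eval record is a `(task_type, passed)` pair; the task names and outcomes are invented to illustrate the shape of the analysis, not results from any real model.

```python
from collections import defaultdict
from typing import Iterable


def pass_rate_by_task(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Group eval outcomes by task type and return the pass rate for each."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for task_type, passed in results:
        totals[task_type] += 1
        passes[task_type] += int(passed)
    return {task: passes[task] / totals[task] for task in totals}


# Hypothetical outcomes for one small model on a tiny eval set.
small_model_results = [
    ("extraction", True), ("extraction", True),
    ("extraction", True), ("extraction", True),
    ("multi_step_reasoning", True), ("multi_step_reasoning", False),
    ("multi_step_reasoning", False), ("multi_step_reasoning", False),
]

rates = pass_rate_by_task(small_model_results)
overall = sum(p for _, p in small_model_results) / len(small_model_results)
print(rates)
print(f"overall: {overall:.3f}")
```

The overall figure hides the split: in this made-up data the model is perfect on extraction and weak on multi-step reasoning, so whether it is "good enough" depends entirely on which task you are shipping.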

STEP 4

Engineering around the triangle.

You do not have to accept one global point. The strongest production pattern is routing / cascading. In a cascade, every request goes to a cheap, fast model first and escalates to a larger or reasoning model only when the cheap one is uncertain; in routing, a lightweight classifier detects hard tasks up front and sends them straight to the larger model. When most traffic is easy, most requests resolve on the cheap path and only the minority that need it pay the frontier cost and latency.
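A cascade reduces to a short control flow. The sketch below assumes each model call returns an answer together with a confidence score between 0 and 1; how you obtain that score (token log-probabilities, a self-check prompt, a separate verifier) is a design decision this example does not settle. `cheap_stub` and `strong_stub` are hypothetical stand-ins for real model calls.

```python
from typing import Callable, Tuple

# A model call returns (answer, confidence in [0, 1]). This is an assumed interface.
ModelFn = Callable[[str], Tuple[str, float]]


def cascade(
    prompt: str,
    cheap: ModelFn,
    strong: ModelFn,
    min_confidence: float = 0.8,
) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when it is not confident."""
    answer, confidence = cheap(prompt)
    if confidence >= min_confidence:
        return answer, "cheap"
    answer, _ = strong(prompt)
    return answer, "strong"


def cheap_stub(prompt: str) -> Tuple[str, float]:
    # Toy heuristic: pretend short prompts are easy and long ones are hard.
    confidence = 0.95 if len(prompt.split()) < 8 else 0.4
    return f"cheap answer to: {prompt}", confidence


def strong_stub(prompt: str) -> Tuple[str, float]:
    return f"strong answer to: {prompt}", 0.99


easy = cascade("What is 2 + 2?", cheap_stub, strong_stub)
hard = cascade(
    "Plan a three-stage migration of the billing database with zero downtime",
    cheap_stub,
    strong_stub,
)
print(easy[1], hard[1])
```

Note that an escalated request pays for both calls, so the threshold should be tuned on your eval set: set it too high and most traffic pays twice, set it too low and quality regressions slip through on the cheap path.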

Tiered by step. Inside an agent, use the cheapest model that works per step: a small model to classify or route, a reasoning model only for the genuinely hard planning step, a mid model to write the final answer.
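Cache aggressively. Where the provider supports it, prompt caching sharply cuts the cost of stable context (system prompts, tool definitions, reference documents that repeat across requests), often changing which tier is economical.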
Right-size, then escalate. Start from the smallest tier that passes your eval and move up only where it measurably fails — the opposite of starting at the frontier and trying to cut costs later.
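The right-size-then-escalate rule can be sketched as a loop over tiers ordered from cheapest to most expensive. This is an illustration under stated assumptions: `run_eval` is a hypothetical callable that returns a pass rate for a tier, and the tier names and scores are made up.

```python
from typing import Callable, Optional, Sequence


def smallest_passing_tier(
    tiers_cheapest_first: Sequence[str],
    run_eval: Callable[[str], float],
    threshold: float,
) -> Optional[str]:
    """Return the first (cheapest) tier whose eval pass rate meets the threshold."""
    for tier in tiers_cheapest_first:
        if run_eval(tier) >= threshold:
            return tier  # stop here: larger tiers are never evaluated
    return None


# Hypothetical eval scores standing in for a real evaluation run.
FAKE_SCORES = {"small": 0.84, "mid": 0.92, "frontier": 0.96}
evaluated: list[str] = []


def fake_run_eval(tier: str) -> float:
    evaluated.append(tier)  # record which tiers were actually evaluated
    return FAKE_SCORES[tier]


picked = smallest_passing_tier(["small", "mid", "frontier"], fake_run_eval, threshold=0.90)
print(picked, evaluated)
```

Because the loop stops at the first passing tier, the frontier model is never evaluated in this example. A `None` result means no tier meets the bar, which usually points to the prompt or the task definition as the thing to fix.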

The mature default is not "the best model" and not "the cheapest model" — it is the smallest model that still passes your evaluation, with a cheap escalation path for the cases it fails. That sentence is most of production model economics.