Choosing a model: a checklist

Concepts · The AI Model & Tooling Ecosystem

Choosing a model for a use case: a practical checklist.

This is the synthesis entry for the whole topic. It turns the preceding ideas — families, open vs closed, modalities, the cost/quality/latency triangle, reasoning, serving, and benchmark literacy — into a repeatable decision procedure you can apply to any use case and re-apply when the landscape moves.

STEP 1

Start from the task, never from the model.

The most common failure is picking a model first ("we'll use the best one") and retrofitting the use case to it. Invert it. Write down, before naming any model: what goes in and out (modalities), how hard the core task actually is, the latency users will tolerate, the per-request cost ceiling at expected volume, and where the data is allowed to go. These constraints, not a leaderboard, eliminate most of the candidate space before quality even enters.

STEP 2

The checklist, in order.

┌──────────────────────────────────────────────────────────────┐ │ MODEL SELECTION CHECKLIST (run top to bottom) │ │ │ │ 1. MODALITY What inputs/outputs are required? │ │ → eliminates models that physically can't. │ │ 2. DATA Where may inference data go? │ │ → may force open-weight / in-boundary host. │ │ 3. DIFFICULTY Does the core task need real reasoning? │ │ → decides reasoning vs direct, tier floor. │ │ 4. LATENCY Interactive or batch? TTFT budget? │ │ → caps model size / thinking budget. │ │ 5. COST $ ceiling per request at expected volume? │ │ → caps tier; informs self-host vs API. │ │ 6. SHORTLIST Use benchmarks ONLY to pick 2–4 candidates │ │ that survive constraints 1–5. │ │ 7. EVALUATE Run YOUR eval set on the shortlist; │ │ score quality + cost + latency together. │ │ 8. DECIDE Smallest/cheapest model that passes, │ │ with a cheap escalation path for failures. │ └──────────────────────────────────────────────────────────────┘

Steps 1–5 are hard constraints — they remove options. Step 6 is the only place public benchmarks belong: as a cheap filter, not a decision. Steps 7–8 are where the actual decision is made, on evidence from your own data.

STEP 3

Default to the smallest thing that works, then escalate.

The mature posture is the opposite of "start at the frontier and cut costs later." Start from the smallest, cheapest, fastest tier that plausibly fits, prove it passes your eval, and move up only where it measurably fails. Most real workloads are dominated by easy requests; sizing for the rare hard ones means overpaying on every easy one. Where hard cases exist, handle them with a routing/escalation path — cheap model first, frontier or reasoning model only on detected difficulty — rather than by upgrading the whole workload.

Per-step sizing inside agents. Different steps have different difficulty; do not pick one model for the whole loop. Cheap model to route, reasoning model only for the hard step, mid model to synthesize.
Abstract the model behind a seam. Make a model swap a config change, not a refactor. This is what keeps you able to act on the next release without a rewrite.
Re-run, do not re-feel. When you suspect a new model is better, re-run the eval. Vibes are not evidence; the noise band is real.

STEP 4

Designing for a field that will not hold still.

Every specific recommendation in this section is perishable; the procedure is not. Models will get cheaper and more capable, the open/closed gap will keep oscillating, modalities and reasoning controls will keep expanding, and today's frontier will be next year's cheap tier. A system designed well for that reality has three properties: the model is swappable behind a thin interface, a representative evaluation set is the arbiter of every change, and the team tracks the field at the level of structural shifts rather than individual launches.

If you remember one sentence from this entire topic: choose the smallest model that passes your own evaluation, keep it swappable behind a seam, and let your eval set — not a leaderboard or a launch announcement — decide every change. That discipline outlasts every model on the market today.