Few-shot prompting & examples

B3
Concepts · Core Building Blocks

Few-shot prompting & examples.

Some tasks are easier to show than to describe. Few-shot prompting puts a handful of input→output examples directly in the prompt so the model infers the pattern instead of being told it in prose. This entry covers when examples beat instructions, how to choose and order them, the failure modes that quietly poison results, and where few-shot stops paying off.

STEP 1

Zero-shot, one-shot, few-shot.

Zero-shot is instruction only — "Classify the sentiment." One-shot adds a single demonstration. Few-shot adds several. The model is not retrained by these examples; nothing about its weights changes. It conditions on them the same way it conditions on any other tokens — they shift the probability distribution for the next tokens toward outputs that match the demonstrated pattern. This is in-context learning: pattern inference at inference time, not training.

Few-shot classification prompt:

Review: "Arrived broken and support ignored me."   Sentiment: negative
Review: "Works fine, nothing special."             Sentiment: neutral
Review: "Best purchase this year, flawless."       Sentiment: positive
Review: "Cheap feel but it does the job."          Sentiment:  <- model continues here

The model has seen the exact shape of the answer three times. It will almost certainly emit one of negative / neutral / positive and then stop, because that is what the pattern demands. Compare that reliability to a zero-shot prompt where the model might reply "This seems somewhat positive overall." Examples pin the format and the label space simultaneously.

STEP 2

When examples beat instructions.

  • The output format is fiddly. A specific JSON shape, a particular citation style, a fixed table layout — far easier to demonstrate once than to specify in words.
  • The task is "I know it when I see it." Tone matching, classifying borderline cases, applying an implicit house style. The decision boundary lives in the examples, not in any rule you could write.
  • Edge-case behavior must be nailed down. Include an example of the tricky input and the exact handling you want, and the model generalizes from it.
  • You want a label space honored. Three labelled examples teach "only these labels exist" more reliably than the sentence "use only these labels."

Conversely, if a one-line instruction already works on your eval set, do not add examples. They cost tokens on every call, and unnecessary examples can narrow the model's behavior more than you intended.

STEP 3

Choosing and ordering examples.

Coverage over volume

Three well-chosen examples that span the decision space beat ten near-duplicates. If you have classes A/B/C, show all three. If there is a notorious confusable pair, include the example that disambiguates it. The examples are a tiny teaching set — design them like one.

Match the real distribution

If 90% of your production inputs are short and messy but all your examples are long and clean, the model is being taught the wrong distribution. Draw examples from inputs that look like what you will actually receive.

Order effects are real

Models are sensitive to example order, with the last example exerting noticeable pull (a recency effect). Two practical consequences: do not put all of one class first and the other class last, or you bias predictions toward the last class; and re-shuffle order when you A/B prompts so you are measuring the prompt, not an ordering artifact.

Label imbalance in the examples leaks into predictions. If your few-shot block is five "spam" and one "not-spam", the model will over-predict "spam" — it inferred a prior from your example mix. Balance the demonstrated labels unless the real prior genuinely is skewed and you want it reflected.

STEP 4

The quiet poison: a wrong example.

Few-shot's biggest risk is not too few examples — it is one mislabeled example. The model treats your demonstrations as ground truth. A single example with the wrong answer teaches the wrong pattern, and it does so invisibly: the prompt looks fine, the bug shows up only as a quality dip on a subset of inputs.

Audit example correctness harder than you audit instruction wording. A typo in the prose is harmless; a wrong label in a demonstration is a learned error injected into every single call that prompt makes.

A related subtlety from in-context-learning research: the format and label space of examples carry much of the benefit, sometimes more than perfectly correct content. This does not license sloppy labels — wrong labels still hurt — but it explains why even rough examples that show the right shape help a lot, and why the structure of your examples deserves as much care as their content.

STEP 5

Where few-shot stops paying off.

Few-shot is powerful but not free or unlimited:

  • Token cost is paid every call. A 12-example block prepended to every request is real recurring spend and latency. If it is static, prompt caching recovers most of the cost; if it changes per request, it does not.
  • Diminishing returns. Going from 0→2 examples is often a large jump; 8→16 rarely is. Measure where your curve flattens and stop there.
  • It does not add knowledge. Examples teach a pattern, not facts. If the task needs information the model lacks, that is a retrieval problem (see the RAG entries), not a few-shot problem. Trying to teach a knowledge base through examples is a category error.
  • Dynamic few-shot is the scaling move. Instead of a fixed block, retrieve the few examples most similar to the current input from a labelled pool. This keeps the prompt small and the examples maximally relevant — and it is literally retrieval applied to demonstrations, tying few-shot to the embeddings and vector-search building blocks.
Takeaway

Deliverable

You reach for few-shot when the task is easier to show than to state — fiddly formats, fuzzy boundaries, honored label spaces — and you stay zero-shot when an instruction already passes your evals. You choose examples for coverage and realistic distribution, balance their labels, randomize order when comparing prompts, and audit every demonstration for correctness because a wrong example is an invisible learned bug. And you know the ceiling: examples teach patterns, not knowledge, with diminishing returns — past that point the answer is retrieval, not more examples.