Retrieval — The Agentic AI Field Guide

1.2

Part I / Build · Week 2

Replace the toy search with a real stack.

Three layers, built in order: lexical baseline, dense semantic, rerank. Then a grounding verifier on top. By the end, every claim is traceable and verifiable.

STEP 1

Chunk the corpus. Boringly.

Before we can do anything smarter than substring matching, we need to split documents into searchable units. This is "chunking." It's tempting to overthink — semantic chunking, late chunking, recursive AST splitting — but for now, do the boring thing. We'll improve it only if evals say we should.

The rule of thumb

Target size: ~500 tokens per chunk (roughly 2000 characters of English).
Overlap: ~50 tokens between adjacent chunks. Catches cases where the answer straddles a chunk boundary.
Split on natural boundaries: paragraph breaks first, then sentences. Don't cut mid-sentence.
Preserve metadata: each chunk knows its doc_id and an in-document chunk_idx. We'll need these to cite back.

Note: this code is provider-agnostic. Chunking is just Python.

# retrieval/chunk.py
import tiktoken  # approximate counter; close enough for sizing chunks on either provider
from dataclasses import dataclass

enc = tiktoken.get_encoding("cl100k_base")

def n_tokens(s: str) -> int:
    return len(enc.encode(s))

@dataclass
class Chunk:
    chunk_id: str   # f"{doc_id}::{idx}"
    doc_id: str
    idx: int
    text: str
    n_tok: int

def chunk_doc(doc_id: str, text: str,
              target: int = 500,
              overlap: int = 50) -> list[Chunk]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buf, buf_tok = [], [], 0

    def flush():
        nonlocal buf, buf_tok
        if not buf: return
        joined = "\n\n".join(buf)
        chunks.append(Chunk(
            chunk_id=f"{doc_id}::{len(chunks)}",
            doc_id=doc_id,
            idx=len(chunks),
            text=joined,
            n_tok=n_tokens(joined),
        ))
        # Carry the last paragraph into the next chunk as overlap,
        # but only if it fits within the overlap budget.
        buf = [buf[-1]] if overlap and n_tokens(buf[-1]) <= overlap else []
        buf_tok = n_tokens(buf[0]) if buf else 0

    for p in paragraphs:
        t = n_tokens(p)
        if buf_tok + t > target and buf:
            flush()
        buf.append(p)
        buf_tok += t
    flush()
    return chunks

Run it and inspect

>>> from retrieval.chunk import chunk_doc
>>> from pathlib import Path
>>> text = Path("corpus/pgbouncer-modes.md").read_text()
>>> chunks = chunk_doc("pgbouncer-modes", text)
>>> for c in chunks[:3]:
...     print(c.chunk_id, c.n_tok, c.text[:50].replace("\n", "\\n"))

pgbouncer-modes::0 487 # PgBouncer Pool Modes\n\nPgBouncer supports
pgbouncer-modes::1 503 ## Session Pooling\n\nIn session pooling mode,
pgbouncer-modes::2 491 ## Transaction Pooling\n\nThis is the most c

The first three chunks are all near 500 tokens, and each starts at a section header. Good shape.
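
The chunker above splits only on paragraph breaks, so a single paragraph longer than the target passes through whole. A minimal sketch of the sentence-level fallback from the rule of thumb might look like this. The helper name split_long_paragraph, the regex, and the whitespace word counter standing in for n_tokens are all assumptions for illustration, not part of the guide's code.

```python
import re
from typing import Callable

def word_count(s: str) -> int:
    # Stand-in counter so the sketch runs without tiktoken;
    # pass the real n_tokens from retrieval/chunk.py in practice.
    return len(s.split())

def split_long_paragraph(p: str, target: int = 500,
                         count: Callable[[str], int] = word_count) -> list[str]:
    """Split one oversized paragraph into pieces of at most ~target units,
    cutting only at sentence ends (hypothetical helper)."""
    if count(p) <= target:
        return [p]
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", p.strip())
    pieces, buf = [], []
    for s in sentences:
        if buf and count(" ".join(buf + [s])) > target:
            pieces.append(" ".join(buf))
            buf = []
        buf.append(s)  # a single sentence over target is kept whole
    if buf:
        pieces.append(" ".join(buf))
    return pieces

print(split_long_paragraph("One two three. Four five six. Seven eight nine.", target=6))
```

Running each paragraph through a helper like this before the main loop in chunk_doc would keep every buffer entry near the target. The regex will mis-split abbreviations such as "e.g.", so treat it as a starting point.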

The contextual chunks trick

Before embedding chunks, prepend a one-sentence summary describing what each chunk is about within its parent document. This makes retrieval measurably better: Anthropic's contextual retrieval write-up reports that contextual embeddings alone cut top-20 retrieval failures by about 35%.

The cost is small: one cheap-model call per chunk, done once at index time. Use Haiku or gpt-5-mini, the models in the code that follows, and batch the calls.

# retrieval/contextualize.py (Anthropic version)
from anthropic import Anthropic
client = Anthropic()

CTX_PROMPT = """Document title: {doc_id}
Full document:
<document>{full_text}</document>

Here is a chunk from the document:
<chunk>{chunk_text}</chunk>

In one sentence, situate this chunk in the document
(what section, what topic, what role). No preamble.
Output just the sentence."""

def contextualize(chunk_text, doc_id, full_text):
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=100,
        messages=[{"role": "user",
                   "content": CTX_PROMPT.format(
                       doc_id=doc_id,
                       full_text=full_text,
                       chunk_text=chunk_text)}],
    )
    return response.content[0].text.strip()

# retrieval/contextualize.py (OpenAI version)
from openai import OpenAI
client = OpenAI()

CTX_PROMPT = """Document title: {doc_id}
Full document:
<document>{full_text}</document>

Here is a chunk from the document:
<chunk>{chunk_text}</chunk>

In one sentence, situate this chunk in the document
(what section, what topic, what role). No preamble.
Output just the sentence."""

def contextualize(chunk_text, doc_id, full_text):
    response = client.responses.create(
        model="gpt-5-mini",
        input=CTX_PROMPT.format(
            doc_id=doc_id,
            full_text=full_text,
            chunk_text=chunk_text),
    )
    return response.output_text.strip()

What does the output look like? For a chunk from the PgBouncer transaction pooling section:

"This chunk explains transaction pooling in PgBouncer,
describing how server connections are released after
each transaction commits — a constraint that breaks
features requiring session state."

That one sentence now travels with every chunk through retrieval. When a user asks "why does my prepared statement break in PgBouncer?", the embedding includes "constraint that breaks features requiring session state" — which is semantically close to the question even if the exact word "prepared" isn't in the chunk.
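
For the sentence to travel with the chunk, it has to be joined to the chunk text before indexing. A minimal sketch of that step, assuming the contextualized text is what both BM25 and the embedder see. The helper name with_context and the toy chunk body are hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass
class Chunk:
    # Mirrors the fields of the Chunk dataclass in retrieval/chunk.py.
    chunk_id: str
    doc_id: str
    idx: int
    text: str
    n_tok: int

def with_context(chunk: Chunk, context: str) -> Chunk:
    """Return a copy whose text starts with the situating sentence
    (hypothetical helper; n_tok still counts only the original body)."""
    return replace(chunk, text=f"{context.strip()}\n\n{chunk.text}")

# Toy chunk; the body text is a placeholder, not real corpus content.
c = Chunk("pgbouncer-modes::2", "pgbouncer-modes", 2,
          "## Transaction Pooling\n\n(toy body text)", 12)
indexed = with_context(c, "This chunk explains transaction pooling in PgBouncer. ")
print(indexed.text.splitlines()[0])
```

Returning a copy keeps the original chunk untouched, which may be useful if you would rather show the agent the raw text and reserve the contextualized version for the indexes.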

Two alternatives worth knowing

The contextual-chunk trick is not the only way to give a chunk its document context. Two mainstream alternatives solve the same problem with a different tradeoff:

Parent-document retrieval. Index and search on small chunks (precise retrieval), but when a small chunk matches, feed the LLM its larger surrounding parent section instead. Simple, no extra model calls — a good fit when your chunks are too small to reason over on their own.
Late chunking. Embed the whole document at the token level first, then pool token embeddings into chunk vectors afterward. Each chunk vector inherits document-wide context from the surrounding tokens — with no per-chunk LLM call at all. Cheaper at index time than contextual retrieval, which spends one small-model call per chunk.

Rough rule: contextual retrieval is the strongest on dense technical docs (it writes a real, query-shaped summary); late chunking gets most of that benefit far cheaper when index-time cost or corpus size matters; parent-document is the simplest and shines when chunks are individually too small to answer from. Pick with evals in Phase 4 — don't assume.
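
Of the two alternatives, parent-document retrieval is the easiest to picture in code. The sketch below assumes you already have a mapping from each small chunk to its parent section; parents_for_hits, the two dictionaries, and the IDs are hypothetical names for illustration.

```python
def parents_for_hits(hit_ids: list[str],
                     parent_of: dict[str, str],
                     parent_text: dict[str, str],
                     max_parents: int = 3) -> list[str]:
    """Map small-chunk hits to their parent sections, keeping rank order
    and dropping duplicates (several hits often share one parent)."""
    seen, out = set(), []
    for cid in hit_ids:
        pid = parent_of[cid]
        if pid in seen:
            continue
        seen.add(pid)
        out.append(parent_text[pid])
        if len(out) == max_parents:
            break
    return out

# Toy data: three small chunks, two of them from the same parent section.
parent_of = {"doc::0": "doc#intro", "doc::1": "doc#intro", "doc::2": "doc#usage"}
parent_text = {"doc#intro": "Intro section text", "doc#usage": "Usage section text"}
print(parents_for_hits(["doc::1", "doc::0", "doc::2"], parent_of, parent_text))
# ['Intro section text', 'Usage section text']
```

Search still runs on the small chunks; only the text handed to the model changes, which is why this variant needs no extra model calls.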

Question

Why ~500 tokens? Why not 1000 or 2000?

Two competing pressures. Larger chunks preserve more context — useful for the model when it actually reads them. Smaller chunks are more precise in retrieval — a 2000-token chunk can be the top match because of one paragraph that happens to be similar, while the other 90% is irrelevant.

500 tokens is a sweet spot for technical docs. For code repos you might go smaller (250) since signal is denser. For literary or narrative text, larger (800–1200) keeps narrative flow.

Don't guess — measure in Phase 4. Try 300, 500, 1000; pick the one with best recall@5 on your eval set.
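
Since recall@5 is the yardstick for this choice, here is a minimal sketch of how it could be computed. The function names and the toy eval data are assumptions; the guide's own eval harness arrives in Phase 4.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one per eval question."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)

runs = [
    (["a::0", "b::1", "c::2"], {"a::0"}),  # the relevant chunk was retrieved
    (["d::0", "e::1", "f::2"], {"z::9"}),  # the relevant chunk was missed
]
print(mean_recall_at_k(runs, k=5))  # 0.5
```

One caveat when comparing chunk sizes: chunk IDs change whenever the chunking changes, so relevance labels are safer kept at a level that survives re-chunking, such as doc_id or a gold answer passage.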

Question

How much does contextual chunking actually cost?

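One small-model call per chunk at index time. For a 200-document corpus producing ~3000 chunks: roughly $1–3 of API spend total, completed in ~10 minutes if you parallelize. Every call resends the whole document, so enable prompt caching for that shared prefix; without it the bill grows several-fold. This is purely an index-time cost: at query time, retrieval is the same speed as without it.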

The ROI is significant. If your evals show recall@5 going from, say, 0.62 to 0.84 (illustrative numbers; measure your own lift), you've turned about $2 of index-time spend into an agent that needs noticeably fewer follow-up searches, which saves model tokens on every future query.

STEP 2

Build hybrid search: BM25 + dense + RRF.

Now we replace the toy search_docs with something that actually understands what we're asking. Three components, built separately and then fused.

BM25 — the lexical baseline

BM25 is a statistical keyword-matching algorithm from the 1990s that, on technical docs, often beats fancy embedding-based search alone. Read that twice. People reach for embeddings reflexively and skip BM25 — that's a mistake. BM25 is your floor: it should never lose on queries where the user uses the same words the docs use.
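
To make "statistical keyword matching" concrete, here is a minimal pure-Python sketch of the classic Okapi BM25 score. It is for intuition only and is not how bm25s is implemented; the parameter values k1=1.5 and b=0.75 are common defaults assumed here, and the toy documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against the query with Okapi BM25,
    using an IDF variant that stays non-negative."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in set(query):
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Term frequency saturates, and long docs are penalized.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = [["vacuum", "full", "rewrites", "table"],
        ["autovacuum", "runs", "in", "background"],
        ["vacuum", "reclaims", "space"]]
print(bm25_scores(["vacuum", "full"], docs))
```

The first document wins because it contains both query terms and "full" is rare in the corpus; the second scores zero because it shares no exact token with the query, which is exactly the gap dense retrieval is meant to cover.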

pip install bm25s

# retrieval/bm25_index.py
import bm25s

class BM25Index:
    def __init__(self, chunks):
        self.chunks = chunks
        corpus = [c.text for c in chunks]
        self.retriever = bm25s.BM25()
        self.retriever.index(bm25s.tokenize(corpus, stopwords="en"))

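    def search(self, query: str, top_k: int = 50) -> list[str]:
        q_tok = bm25s.tokenize(query, stopwords="en")
        # Asking for more results than there are chunks is an error,
        # so cap k at the corpus size.
        k = min(top_k, len(self.chunks))
        results, scores = self.retriever.retrieve(q_tok, k=k)
        # Drop zero-score hits so unmatched chunks don't earn RRF credit.
        return [self.chunks[i].chunk_id
                for i, s in zip(results[0], scores[0]) if s > 0]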

Dense embeddings — the semantic layer

Dense retrieval embeds chunks and queries into the same vector space, then finds chunks closest to the query. Catches paraphrases ("how to vacuum a table" matches "running VACUUM on a relation") that BM25 misses.
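
"Closest to the query" usually means highest cosine similarity between vectors. A minimal sketch with toy 3-dimensional vectors standing in for real embeddings; the IDs and values are made up, and a vector database does the same ranking with an approximate index instead of a full sort.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(query_vec: list[float], chunk_vecs: dict[str, list[float]],
            top_k: int = 2) -> list[str]:
    """Rank chunk IDs by cosine similarity to the query vector."""
    ranked = sorted(chunk_vecs,
                    key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
                    reverse=True)
    return ranked[:top_k]

# Toy vectors; real embeddings have hundreds or thousands of dimensions.
chunk_vecs = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.0, 0.2, 0.9],
    "chunk-c": [0.7, 0.6, 0.1],
}
print(nearest([1.0, 0.2, 0.0], chunk_vecs))  # ['chunk-a', 'chunk-c']
```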

For embedding models, both providers offer good options. We'll use Voyage AI (Anthropic's recommended embedding partner) and OpenAI's text-embedding-3-large. The vector database is the same either way — we'll use ChromaDB locally.

pip install chromadb voyageai

# retrieval/dense_index.py — Voyage embeddings
import chromadb, voyageai
voyage = voyageai.Client()
chroma = chromadb.PersistentClient(path=".chroma")

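def embed(texts, input_type="document", batch_size=128):
    # Embed in batches: the API limits how many texts (and tokens)
    # a single request may carry, and a full corpus exceeds that.
    out = []
    for i in range(0, len(texts), batch_size):
        r = voyage.embed(texts[i:i + batch_size], model="voyage-3",
                         input_type=input_type)
        out.extend(r.embeddings)
    return out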

class DenseIndex:
    def __init__(self, chunks, name="corpus"):
        self.col = chroma.get_or_create_collection(name)
        if self.col.count() == 0:
            embeddings = embed([c.text for c in chunks])
            self.col.add(
                ids=[c.chunk_id for c in chunks],
                embeddings=embeddings,
                documents=[c.text for c in chunks],
            )

    def search(self, query, top_k=50):
        q_emb = embed([query], input_type="query")[0]
        r = self.col.query(query_embeddings=[q_emb],
                           n_results=top_k)
        return r["ids"][0]

# retrieval/dense_index.py — OpenAI embeddings
import chromadb
from openai import OpenAI
oai = OpenAI()
chroma = chromadb.PersistentClient(path=".chroma")

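def embed(texts, batch_size=128):
    # Embed in batches: one request can only carry a limited number
    # of inputs and tokens, and a full corpus exceeds that.
    out = []
    for i in range(0, len(texts), batch_size):
        r = oai.embeddings.create(
            model="text-embedding-3-large",
            input=texts[i:i + batch_size],
        )
        out.extend(e.embedding for e in r.data)
    return out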

class DenseIndex:
    def __init__(self, chunks, name="corpus"):
        self.col = chroma.get_or_create_collection(name)
        if self.col.count() == 0:
            embeddings = embed([c.text for c in chunks])
            self.col.add(
                ids=[c.chunk_id for c in chunks],
                embeddings=embeddings,
                documents=[c.text for c in chunks],
            )

    def search(self, query, top_k=50):
        q_emb = embed([query])[0]
        r = self.col.query(query_embeddings=[q_emb],
                           n_results=top_k)
        return r["ids"][0]

Reciprocal Rank Fusion — combining the two

BM25 and dense both return ranked lists of chunk IDs. RRF combines them into a single ranking using a simple formula: each chunk earns 1 / (k + rank) from every list it appears in, with rank starting at 1; sum those contributions per chunk and sort descending. k=60 is the standard default.

# retrieval/hybrid.py
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

class HybridSearch:
    def __init__(self, bm25: BM25Index, dense: DenseIndex):
        self.bm25 = bm25
        self.dense = dense

    def search(self, query: str, top_k: int = 20) -> list[str]:
        bm = self.bm25.search(query, top_k=50)
        ds = self.dense.search(query, top_k=50)
        return rrf([bm, ds])[:top_k]
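
To see the fusion arithmetic on small inputs, here is a sketch that applies the same scoring as rrf above but returns the scores for inspection. The name rrf_scores and the toy rankings are assumptions for illustration.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every list
    a chunk appears in, with 1-based ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    return scores

bm25_list = ["a", "b", "c"]   # toy lexical ranking
dense_list = ["a", "d", "b"]  # toy semantic ranking
scores = rrf_scores([bm25_list, dense_list])
for cid in sorted(scores, key=scores.get, reverse=True):
    print(cid, round(scores[cid], 4))
# a 0.0328   (1/61 + 1/61)
# b 0.032    (1/62 + 1/63)
# d 0.0161   (1/62)
# c 0.0159   (1/63)
```

Chunk "b" never ranks first in either list, yet it nearly ties "a" and scores about twice as high as "d", which sits at rank 2 in only one list. Agreement between retrievers counts for more than a single good rank.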

Quick comparison: what improves?

Run the same query — "when to use VACUUM versus VACUUM FULL" — through each retrieval method:

SUBSTRING (Phase 1):  found 0 docs ─ phrase doesn't appear literally

BM25 alone:
  1. routine-vacuuming::3     0.71   (contains "VACUUM FULL")
  2. sql-vacuum::0            0.65
  3. runtime-config-vacuum::1 0.41

DENSE alone:
  1. routine-vacuuming::3     0.89   (matches semantically)
  2. routine-vacuuming::4     0.86
  3. sql-vacuum::2            0.81

HYBRID (RRF):
  1. routine-vacuuming::3     0.0328   (1/61 + 1/61: rank 1 in both)
  2. routine-vacuuming::4     0.0161   (1/62: dense rank 2)
  3. sql-vacuum::0            0.0161   (1/62: BM25 rank 2)
  4. sql-vacuum::2            0.0159   (1/63: dense rank 3)

What to notice

BM25 finds routine-vacuuming::3 at rank 1 because "VACUUM FULL" appears literally. Dense also ranks it first, but for a different reason — semantic similarity to the question.

Both being right doesn't make this query a great test of hybrid value. A paraphrased query like "how do I reclaim disk space after deleting rows" is a better one: dense will surface the VACUUM FULL docs (the answer), while BM25 may rank them low because the query shares few exact tokens with them (without stemming, "reclaim" does not even match "reclaiming").

Hybrid catches both cases. The cost is two index lookups instead of one — negligible.

Question

Do I really need both BM25 and dense? Can't dense embeddings do everything?

No. Dense embeddings have a specific failure mode: they are weak on exact terms. Synonyms are their strength: if the user types "PostgreSQL" and your docs say "Postgres", BM25 misses (different tokens) and dense usually catches it (similar embeddings). But if the user types "voyage-3", the specific embedding model name, dense might surface chunks about "embedding models" generally, while BM25 nails the exact mention.

Production retrieval almost always wants both. The rule of thumb: BM25 when the user knows the right words; dense when they describe the thing in their own words; hybrid when you don't know which it is.

Question

Why k=60 in RRF?

It's the value from the original RRF paper (Cormack et al., 2009). The intuition: k controls how much rank position matters. Small k means top results dominate; large k means more uniform weighting. 60 has been the default for fifteen years because it works well across many datasets — not because it's optimal for any specific one.

You can tune it, but don't bother until evals tell you it matters.
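
The effect of k is easy to check with arithmetic. This sketch compares how much more a rank-1 result contributes than a rank-10 result for several values of k; the helper name is hypothetical.

```python
def rank_weight_ratio(k: int, top: int = 1, low: int = 10) -> float:
    """How many times more a result at rank `top` contributes than one
    at rank `low`, for a given RRF constant k (ranks are 1-based)."""
    return (1.0 / (k + top)) / (1.0 / (k + low))

for k in (1, 10, 60, 1000):
    print(k, round(rank_weight_ratio(k), 2))
# 1 5.5
# 10 1.82
# 60 1.15
# 1000 1.01
```

At k=60 the top result outweighs the tenth by only about 15%, so a chunk that both retrievers place moderately high can beat one that a single retriever ranks first.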

STEP 3

Add a reranker over the top 20.

Hybrid search gets you decent recall — the right chunk is usually in the top 20. But it might be at position 8 when you only have time to show the agent 5. A reranker fixes this by re-scoring the top 20 using a more expensive but more precise model.

The architecture is two-stage: hybrid search runs cheap and broad (top 20 candidates from across the corpus), then the reranker runs slow and precise (re-orders those 20 into the best top 5). This is a standard pattern called "retrieve-then-rerank."

How rerankers differ from embeddings

An embedding model encodes query and chunk separately into vectors, then compares — fast but loses cross-attention between them. A reranker (a "cross-encoder") encodes both together and produces a score — slower per pair but much more accurate, because it can see exactly how each chunk matches each part of the query.

Three options for the reranker — pick one:

Cohere Rerank — managed API, ~50ms per 20 chunks. Most popular choice in production.
Voyage rerank-2 — managed API, similar quality and price.
BGE-reranker-v2 — open-weights model you run locally. Slower to set up; free at inference time.

pip install cohere

# retrieval/rerank.py — Cohere
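import cohere
from retrieval.chunk import Chunk  # needed for the type hints below
co = cohere.Client()  # reads the API key from the CO_API_KEY env var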

def rerank(query: str, chunks: list[Chunk],
           top_k: int = 5) -> list[Chunk]:
    if not chunks: return []
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c.text for c in chunks],
        top_n=top_k,
    )
    return [chunks[r.index] for r in response.results]
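
The same retrieve-then-rerank shape works with any scorer that takes (query, text) pairs, which makes it easy to swap providers. A minimal sketch with the scorer injected as a function: rerank_with and overlap_score are hypothetical names, and the word-overlap scorer is only a stand-in so the block runs without a model. With the sentence-transformers package installed, a CrossEncoder's predict method accepts a list of such pairs and could be passed in as score_pairs.

```python
from typing import Callable, Sequence

Pair = tuple[str, str]

def rerank_with(score_pairs: Callable[[list[Pair]], Sequence[float]],
                query: str, texts: list[str], top_k: int = 5) -> list[int]:
    """Return indices of the top_k texts, best first, using any scorer
    that maps (query, text) pairs to relevance scores."""
    if not texts:
        return []
    scores = score_pairs([(query, t) for t in texts])
    order = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]

def overlap_score(pairs: list[Pair]) -> list[float]:
    # Stand-in scorer: number of shared lowercase words. A real
    # cross-encoder reads query and text jointly instead.
    return [float(len(set(q.lower().split()) & set(t.lower().split())))
            for q, t in pairs]

texts = ["autovacuum runs in the background",
         "vacuum full rewrites the whole table",
         "connection pooling modes"]
print(rerank_with(overlap_score, "vacuum full table", texts, top_k=2))  # [1, 0]
```

Returning indices rather than texts mirrors the Cohere response above, so the caller can map back to Chunk objects the same way.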

Putting it all together

The full retrieval pipeline, behind a single function the agent calls:

# retrieval/__init__.py
from retrieval.bm25_index import BM25Index
from retrieval.dense_index import DenseIndex
from retrieval.hybrid import HybridSearch
from retrieval.rerank import rerank
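from retrieval.chunk import Chunk, chunk_doc

# Built once at import time.
# load_or_build_chunks is a helper you supply: it runs chunk_doc over
# the corpus and returns the combined list[Chunk].
ALL_CHUNKS = load_or_build_chunks()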
BY_ID = {c.chunk_id: c for c in ALL_CHUNKS}
hybrid = HybridSearch(BM25Index(ALL_CHUNKS), DenseIndex(ALL_CHUNKS))

def retrieve(query: str, top_k: int = 5) -> list[Chunk]:
    candidate_ids = hybrid.search(query, top_k=20)
    candidates = [BY_ID[cid] for cid in candidate_ids]
    return rerank(query, candidates, top_k=top_k)
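
The module above calls load_or_build_chunks without showing it. One possible shape, as a sketch only: the signature, the corpus/ and .cache/ paths, and the JSON cache format are all assumptions, and the chunker is passed in so the block runs on its own.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable

@dataclass
class Chunk:  # same fields as retrieval/chunk.py
    chunk_id: str
    doc_id: str
    idx: int
    text: str
    n_tok: int

def load_or_build_chunks(chunker: Callable[[str, str], list[Chunk]],
                         corpus_dir: str = "corpus",
                         cache_path: str = ".cache/chunks.json") -> list[Chunk]:
    """Load chunks from a JSON cache, or chunk every .md file in
    corpus_dir and write the cache (hypothetical helper)."""
    cache = Path(cache_path)
    if cache.exists():
        return [Chunk(**d) for d in json.loads(cache.read_text())]
    chunks: list[Chunk] = []
    for path in sorted(Path(corpus_dir).glob("*.md")):
        # The file stem doubles as doc_id, e.g. "pgbouncer-modes".
        chunks.extend(chunker(path.stem, path.read_text()))
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(json.dumps([asdict(c) for c in chunks]))
    return chunks
```

In the guide's module you would bind chunker to chunk_doc so the call needs no arguments. The cache is never invalidated in this sketch, so delete it whenever the corpus or the chunking parameters change.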

Now wire it back into the agent

Replace the toy search_docs in agent/tools.py:

from retrieval import BY_ID, retrieve

def search_docs(query: str) -> list[dict]:
    chunks = retrieve(query, top_k=5)
    return [
        {
            "chunk_id": c.chunk_id,
            "doc_id": c.doc_id,
            "snippet": c.text[:300],
        }
        for c in chunks
    ]

def fetch_doc(chunk_id: str) -> dict:
    # Now we fetch a CHUNK, not a whole document.
    # Chunks are bounded ~500 tokens — safe for context.
    c = BY_ID.get(chunk_id)
    if not c: return {"error": "not found"}
    return {"chunk_id": chunk_id,
            "doc_id": c.doc_id,
            "text": c.text}

Note the change: fetch_doc now returns a single chunk instead of an entire document. We chose 500-token chunks for a reason — they're small enough to fit comfortably in context without dominating it. The agent can fetch several chunks across a multi-step investigation.

Now run the same VACUUM query from Phase 1

$ python scripts/run.py "When should I VACUUM versus VACUUM FULL?"

──────────────────── Step 0 ────────────────────
→ search_docs({'query': 'VACUUM versus VACUUM FULL'})
   returned: [
     {chunk_id: 'routine-vacuuming::3', snippet: 'VACUUM
       FULL rewrites the entire table and indexes,
       reclaiming disk space but requiring an ACCESS
       EXCLUSIVE lock. Regular VACUUM cannot...' },
     {chunk_id: 'routine-vacuuming::4', snippet: '...'},
     {chunk_id: 'sql-vacuum::0', snippet: '...'},
     ...
   ]

──────────────────── Step 1 ────────────────────
┌─ thinking ─────────────────────────────────────┐
│ The first snippet has the core distinction.    │
│ I'll fetch the full chunk for detail.          │
└────────────────────────────────────────────────┘
→ fetch_doc({'chunk_id': 'routine-vacuuming::3'})

──────────────────── Step 2 ────────────────────
→ submit_answer({...})

status: answered (3 steps)

Same question, dramatically different behavior

Phase 1: 5 steps, 3 redundant searches, lucky recovery. Phase 2: 3 steps, one search returned exactly the right chunk at rank 1.

The agent didn't get smarter. The retrieval got smarter, so the agent had less work to do. This is the right lesson: most "agent quality" improvements are actually retrieval improvements.

Question

Do I really need a reranker? Hybrid search is already pretty good.

On clean queries with obvious answers, hybrid alone is often enough. The reranker earns its keep in three cases:

Long-tail queries where the right chunk is at rank 8–15 in hybrid output. Reranker promotes it to top 3.
Ambiguous queries where multiple chunks look superficially relevant. Reranker is better at picking the one that actually answers the question vs ones that just share keywords.
Multi-aspect queries like "Performance impact of X on Y under condition Z" — embeddings score on overall similarity, rerankers can balance multiple aspects.

Measure with evals. If recall@5 stays flat when you add a reranker, your queries don't need one — skip it and save the latency.

STEP 4

Add a grounding verifier.

Retrieval gives the agent the right material to work with. But the agent might still hallucinate — claim something that's plausibly related to its citations but not actually supported by them. A grounding verifier catches this.

The verifier is a second model call that runs before submit_answer finalizes. It takes the answer's claims and the cited chunks, and returns a verdict: supported, partially supported, or not supported. Unsupported claims either get rewritten, get a citation correction, or trigger a retry.

The verifier prompt

Same prompt for both providers; only the API call differs.

VERIFIER_PROMPT = """You are a strict fact-checker.

Given an answer and the sources it cited, decide for each
factual claim in the answer whether the sources SUPPORT,
PARTIALLY support, or DO NOT SUPPORT the claim.

Output JSON only:
{
  "claims": [
    {
      "claim": "the exact claim from the answer",
      "verdict": "SUPPORT" | "PARTIAL" | "NOT_SUPPORTED",
      "evidence": "quote from a source, or empty",
      "reasoning": "one sentence"
    }
  ]
}

Be strict. If a claim is more specific than what
sources actually say, mark it PARTIAL. If a claim
isn't mentioned in sources, mark it NOT_SUPPORTED."""

# grounding/verify.py (Anthropic version; VERIFIER_PROMPT is the string above)
import json
from anthropic import Anthropic
client = Anthropic()

def verify(answer: str, sources: list[dict]) -> dict:
    sources_block = "\n\n".join(
        f"[{s['chunk_id']}]\n{s['text']}"
        for s in sources
    )
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=VERIFIER_PROMPT,
        messages=[{
            "role": "user",
            "content": f"ANSWER:\n{answer}\n\n"
                       f"SOURCES:\n{sources_block}",
        }],
    )
    return json.loads(response.content[0].text)

# grounding/verify.py (OpenAI version; VERIFIER_PROMPT is the string above)
import json
from openai import OpenAI
client = OpenAI()

def verify(answer: str, sources: list[dict]) -> dict:
    sources_block = "\n\n".join(
        f"[{s['chunk_id']}]\n{s['text']}"
        for s in sources
    )
    response = client.responses.create(
        model="gpt-5-mini",
        instructions=VERIFIER_PROMPT,
        input=f"ANSWER:\n{answer}\n\n"
              f"SOURCES:\n{sources_block}",
        text={"format": {"type": "json_object"}},
    )
    return json.loads(response.output_text)
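
Calling json.loads directly on model output fails if the model wraps the object in a code fence or adds a sentence around it. A more forgiving parser is sketched below; parse_verdict is a hypothetical helper, and the shape checks follow the schema in VERIFIER_PROMPT.

```python
import json

VALID = {"SUPPORT", "PARTIAL", "NOT_SUPPORTED"}

def parse_verdict(raw: str) -> dict:
    """Extract the JSON object from a model reply and check its shape.
    Tolerates code fences or stray text around the object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in verifier output")
    data = json.loads(raw[start:end + 1])
    claims = data.get("claims")
    if not isinstance(claims, list):
        raise ValueError("verifier output has no 'claims' list")
    for c in claims:
        if c.get("verdict") not in VALID:
            raise ValueError(f"unexpected verdict: {c.get('verdict')!r}")
    return data

raw = ('Here is the verdict:\n'
       '{"claims": [{"claim": "x", "verdict": "SUPPORT", '
       '"evidence": "", "reasoning": "y"}]}')
print(parse_verdict(raw)["claims"][0]["verdict"])  # SUPPORT
```

Raising on a malformed reply lets the loop decide whether to retry the verifier or return the answer unverified, rather than crashing mid-run.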

What the verifier returns

{
  "claims": [
    {
      "claim": "Transaction pooling rotates server
                connections between transactions",
      "verdict": "SUPPORT",
      "evidence": "transaction pooling releases server
                   connections back to the pool after
                   each transaction commits",
      "reasoning": "Source directly states this."
    },
    {
      "claim": "Prepared statements are scoped to a
                session",
      "verdict": "SUPPORT",
      "evidence": "PREPARE creates a prepared statement
                   for the current session only",
      "reasoning": "Source explicitly states session scope."
    },
    {
      "claim": "This conflict is the most common cause
                of PgBouncer migration failures",
      "verdict": "NOT_SUPPORTED",
      "evidence": "",
      "reasoning": "Sources discuss the conflict but
                    make no claim about migration
                    failures or their frequency."
    }
  ]
}

What just happened

The agent's answer had three claims. Two were directly supported by the cited chunks. One — "the most common cause of migration failures" — was an embellishment the model added on its own, not actually present in the sources.

Without the verifier, that hallucination would have shipped. With it, we can either drop the unsupported claim, mark it as speculative, or send the agent back to find a source.

Wiring it into the loop

Modify the loop to verify before returning a successful answer:

# in agent/loop.py, when submit_answer is called:
if tool_name == "submit_answer":
    sources = [BY_ID[cid] for cid in citations
               if cid in BY_ID]
    verdict = verify(answer, [{
        "chunk_id": s.chunk_id,
        "text": s.text,
    } for s in sources])

    unsupported = [c for c in verdict["claims"]
                   if c["verdict"] == "NOT_SUPPORTED"]

    return {
        "status": "answered",
        "answer": answer,
        "citations": citations,
        "verification": verdict,
        "unsupported_claims": len(unsupported),
    }

For now we just attach the verdict to the result — we can decide what to do with it (retry, warn, strip) based on what evals show in Phase 4.
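
When the time comes to act on the verdict, the three options could be expressed as a small policy function. This is a sketch under assumed thresholds: decide, strip_unsupported, and max_partial are hypothetical, and the right cut-offs should come from evals.

```python
def decide(verdict: dict, max_partial: int = 1) -> str:
    """Map claim-level verdicts to an action: 'accept', 'warn', or 'retry'.
    The thresholds are placeholders to tune against evals."""
    verdicts = [c["verdict"] for c in verdict["claims"]]
    if "NOT_SUPPORTED" in verdicts:
        return "retry"
    if verdicts.count("PARTIAL") > max_partial:
        return "warn"
    return "accept"

def strip_unsupported(verdict: dict) -> list[str]:
    """Return only the claims the sources back, for a 'strip' policy."""
    return [c["claim"] for c in verdict["claims"] if c["verdict"] == "SUPPORT"]

example = {"claims": [
    {"claim": "Prepared statements are scoped to a session", "verdict": "SUPPORT"},
    {"claim": "This is the most common failure cause", "verdict": "NOT_SUPPORTED"},
]}
print(decide(example))             # retry
print(strip_unsupported(example))  # ['Prepared statements are scoped to a session']
```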

The verifier is your first primitive, not a feature. You'll reuse it in Phase 3 (verifying subagent outputs) and Phase 4 (it's basically a per-claim eval). Build it well.

Question

Doesn't running a verifier on every answer double my API costs?

No. It adds one extra call per successful answer, with input bounded by (answer + cited chunks) ≈ 2–3k tokens. Using Haiku or gpt-5-mini, that's a fraction of a cent per query.

Compare to the cost of not verifying: a user trusts a hallucinated answer, makes a bad decision, and never trusts the agent again. The verifier is cheap insurance.

Question

What if the verifier itself hallucinates?

It can. The verifier is also an LLM. But its task is much narrower — does this claim appear in these sources? — and that's a task where models are surprisingly reliable, especially when prompted to quote evidence.

You'll measure the verifier itself in Phase 4 by hand-labeling 30 (claim, sources, verdict) triples and checking agreement. Below ~80% agreement with humans, the verifier is doing harm rather than good and you tune the prompt until it improves.
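
The agreement check is simple to compute once the labels exist. A minimal sketch, with made-up labels and hypothetical function names; the confusion count shows which kind of disagreement dominates, which is usually what tells you how to change the prompt.

```python
from collections import Counter

def agreement(human: list[str], model: list[str]) -> float:
    """Share of items where the verifier's verdict equals the human label."""
    if len(human) != len(model) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(h == m for h, m in zip(human, model)) / len(human)

def confusions(human: list[str], model: list[str]) -> Counter:
    """Count (human, model) pairs that disagree."""
    return Counter((h, m) for h, m in zip(human, model) if h != m)

# Toy labels for five claims; a real check would use about 30.
human = ["SUPPORT", "NOT_SUPPORTED", "PARTIAL", "SUPPORT", "NOT_SUPPORTED"]
model = ["SUPPORT", "NOT_SUPPORTED", "SUPPORT", "SUPPORT", "PARTIAL"]
print(agreement(human, model))  # 0.6
print(confusions(human, model).most_common())
```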

End of week 2

Deliverable

An agent whose answers cite specific chunks, with a verifier that confirms each citation actually supports its claim. Side-by-side comparison: same 10 questions from Phase 1, dramatically different behavior.

Chunker with contextual one-line summaries
BM25 + dense + RRF hybrid search
Cross-encoder reranker on top-20 → top-5
Grounding verifier with claim-level verdicts
A/B trace comparison: Phase 1 vs Phase 2