Hybrid Search & Reranking

R3
Deep Dive · Retrieval & RAG

Hybrid search and reranking: the production retrieval stack.

A single retriever is almost never enough. Dense embeddings handle paraphrase well and fail on rare tokens; lexical search nails exact strings and fails on synonyms; and asking one first stage to be both broad and precise is a contradiction. The 2025–2026 default for serious RAG systems is a two-stage stack: a cheap fused retrieve that maximizes recall, followed by an expensive cross-encoder rerank that maximizes precision. This entry covers why that shape wins, how to wire it, and which tuning levers actually move the score.

STEP 1

Why one retriever is structurally not enough.

Dense (embedding-based) retrieval and sparse (lexical, BM25) retrieval fail in opposite ways, and the failures are not noise — they are properties of the algorithm.

  • Dense retrieval: the query and each chunk become vectors, and similarity is dot product or cosine. It matches "how do I reset my password" to "account recovery steps" with no shared words. It also confidently mis-matches rare strings: a product SKU like A7-552-Q, an error code ECONNRESET, a person's name not in pretraining. The model has no distinct representation for such strings, so they land close together in embedding space and the right chunk does not surface. Dense retrievers are also distributionally fragile: queries that look unlike anything in the embedding model's training data drift in unpredictable directions.
  • Sparse retrieval (BM25): a TF·IDF-family scoring function with term-frequency saturation and length normalization. It is exact on tokens, robust on rare terms, and explainable. It also misses paraphrases: "cannot log in" and "authentication failure" share zero tokens, so BM25 sees no match. Learned-sparse models such as SPLADE soften this by expanding queries and documents with related terms, at the cost of running a neural model.
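
The lexical failure mode is easy to reproduce. The sketch below is a toy, assuming whitespace tokenization and a textbook Okapi BM25 form with k1=1.5 and b=0.75; the function name `bm25_scores` and the three-document corpus are made up for illustration, and production engines use their own tokenizers and BM25 variants.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with a textbook Okapi BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # document frequency: number of docs containing each term
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no shared token, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical mini-corpus
docs = [
    "authentication failure after password change",
    "error ECONNRESET when calling the billing api",
    "how to export invoices as csv",
]

paraphrase = bm25_scores("cannot log in", docs)   # no token overlap with any doc
exact = bm25_scores("ECONNRESET", docs)           # exact token match in the second doc
```

The paraphrase query scores zero against every document, including the one about authentication failure, while the rare error code scores only against the document that contains it. That is the complementary profile to the dense retriever described above.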

You can paper over either failure with more k, but you cannot fix the failure mode. The right move is to combine: run both, fuse the lists, then let a stronger second stage sort the union. This is the coarse-to-fine logic of any cascade, cheap-and-wide first and expensive-and-narrow second, applied to retrieval.

The fast diagnostic: take ten queries your system gets wrong and grep the corpus for the answer string. If the answer string is in the corpus but not in the top-50 retrieved set, your retriever is the bug, not your prompt. Then ask which retriever would have found it. Mostly synonyms → you need dense. Mostly exact strings, IDs, code → you need lexical. Mixed → you need both.
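
A minimal sketch of that diagnostic, assuming the corpus fits in a dict of chunk id to text and the retriever is exposed as a callable `retrieve(query, k)` returning chunk ids. Both the interface and the name `diagnose` are hypothetical, and a plain case-insensitive substring check stands in for grep.

```python
def diagnose(failures, corpus, retrieve, k=50):
    """Classify each failed query by where the answer string went missing.

    failures: list of (query, answer_string) pairs
    corpus:   dict of chunk_id -> chunk text
    retrieve: callable (query, k) -> list of chunk_ids (hypothetical interface)
    """
    report = {}
    for query, answer in failures:
        holders = {cid for cid, text in corpus.items()
                   if answer.lower() in text.lower()}
        if not holders:
            report[query] = "not_in_corpus"    # ingestion or coverage gap
        elif holders & set(retrieve(query, k)):
            report[query] = "retrieved"        # look at the prompt or generation
        else:
            report[query] = "retrieval_miss"   # the retriever is the bug
    return report

# Toy corpus and a stub retriever that always returns the same chunks
corpus = {
    "c1": "Reset the router if you see ECONNRESET in the logs.",
    "c2": "Account recovery steps: request a reset link by email.",
    "c3": "Invoices can be exported as CSV.",
}

def stub_retrieve(query, k):
    return ["c2", "c3"][:k]

report = diagnose(
    [("what does ECONNRESET mean", "ECONNRESET"),
     ("how do I recover my account", "reset link"),
     ("what is the refund window", "30 days")],
    corpus,
    stub_retrieve,
)
```

Queries labeled `retrieval_miss` are the ones to inspect by hand for the synonym-versus-exact-string split.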

STEP 2

Fusion: combining lexical and dense without tuning weights.

The clean way to merge two ranked lists is reciprocal rank fusion (RRF, Cormack et al., 2009). Each document's score depends only on its rank in each list, not on the raw retrieval score: a document at rank r in a list contributes 1 / (k + r), and contributions are summed across lists. That means you do not have to calibrate BM25 scores against dense cosines, which live on different scales and cannot be meaningfully added without normalization.

# reciprocal rank fusion across N ranked lists of doc ids (best first)
def rrf(ranked_lists, k=60):
    scores = {}
    for lst in ranked_lists:
        # ranks are 1-based, as in the original formula: 1 / (k + rank)
        for rank, doc_id in enumerate(lst, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # returns (doc_id, fused_score) pairs, highest score first
    return sorted(scores.items(), key=lambda x: -x[1])

# typical usage: top-100 from each retriever, fuse, take top-50 for rerank
fused = rrf([bm25_results[:100], dense_results[:100]])[:50]

The k=60 constant comes from the original paper and is conventional; it controls how quickly a document's contribution decays with rank. RRF is a robust default over linearly weighted fusion because it has no per-retriever weights or score normalization to calibrate: you do not retune when the BM25 score distribution shifts after a corpus change. A tuned linear combination can do better when you have labeled data to fit it, but it has to be refit as the corpus moves. For three or more sources (e.g., BM25 + dense + a learned-sparse retriever like SPLADE), the same formula scales by passing more lists.
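
A small worked example makes the rank-only behavior concrete. It is a sketch with made-up document ids, using 1-based ranks and a helper named `rrf_scores` that is introduced here for illustration only.

```python
def rrf_scores(ranked_lists, k=60):
    """Reciprocal rank fusion with 1-based ranks: score(d) = sum of 1 / (k + rank)."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

bm25_list = ["a", "x", "b"]    # "a" is rank 1 lexically and absent from dense
dense_list = ["y", "z", "b"]   # "b" is rank 3 in both lists

scores = rrf_scores([bm25_list, dense_list])
fused = sorted(scores, key=scores.get, reverse=True)
```

Here "b" scores 2/63 ≈ 0.0317 and "a" scores 1/61 ≈ 0.0164, so a document that both retrievers rank moderately beats one that a single retriever ranks first. With k=60 the per-list contribution only falls from 1/61 to 1/160 across the top 100 ranks, which is why agreement between lists carries so much weight.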

Fusion is a recall move. The fused top-50 reliably contains the right passage when either retriever would have found it alone — but it is still a noisy list with the right answer often sitting around rank 20–40. Sorting that noise is the next stage's job.

STEP 3

Reranking: cross-encoders, where the precision comes from.

First-stage dense retrievers compute the query embedding and each document embedding independently (a bi-encoder), which is the only way to precompute the corpus index and search millions of vectors in milliseconds. The cost of that speed is that the model never sees the query and a candidate together, so it cannot perform fine-grained matching ("does this specific clause actually answer this specific question").

A cross-encoder takes the query and one candidate as a single concatenated input, runs full self-attention across both, and outputs a relevance score. It is dramatically more accurate than a bi-encoder for the same model size, because every query token can attend to every document token. The cost is also dramatic: it must run once per (query, candidate) pair, so it is too slow to score the whole corpus. The pattern that wins is therefore two-stage:

# two-stage retrieval: cheap recall, then expensive precision
# assumes hybrid_retrieve() returns objects with a .text field and
# cross_encoder.predict() returns one relevance score per (query, text) pair
def retrieve_then_rerank(query, k_first=50, k_final=5):
    # stage 1: fused bi-encoder + BM25, optimized for recall
    candidates = hybrid_retrieve(query, k=k_first)

    # stage 2: cross-encoder rescores every (query, candidate) pair
    pairs  = [(query, c.text) for c in candidates]
    scores = cross_encoder.predict(pairs)        # one scored pair per candidate (~50)

    reranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in reranked[:k_final]]
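
The first stage called above is not defined in this entry. One way it could be wired, as a sketch under assumptions: each retriever is a callable `(query, n) -> ranked list of doc ids`, the two stubs stand in for a real BM25 index and a real vector index, and the extra `retrievers` argument is an illustration rather than the signature used above. It returns ids, so a caller would still map them to chunk objects before reranking.

```python
def hybrid_retrieve(query, k, retrievers, per_retriever_k=100, rrf_k=60):
    """Over-fetch from each retriever, fuse by reciprocal rank, keep the top k.

    retrievers: list of callables (query, n) -> ranked list of doc ids.
    """
    scores = {}
    for retrieve in retrievers:
        ranked = retrieve(query, per_retriever_k)
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:k]

# Stub retrievers standing in for a BM25 index and a vector index
def lexical_stub(query, n):
    return ["sku-doc", "faq-7", "faq-2"][:n]

def dense_stub(query, n):
    return ["faq-2", "sku-doc", "howto-1"][:n]

candidates = hybrid_retrieve(
    "A7-552-Q battery life", k=3, retrievers=[lexical_stub, dense_stub]
)
```

Over-fetching per retriever and truncating only after fusion keeps the recall benefit: a document that one retriever ranks low can still make the cut if the other ranks it high.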

Empirically, adding a cross-encoder rerank on top of a mediocre first stage usually beats every other "advanced RAG" trick measured in isolation. Reranking results on MS MARCO and BEIR have shown this consistently since BERT-based rerankers appeared around 2019, and the pattern has survived several embedding-model upgrade cycles. If a naive RAG system is underperforming and you have one upgrade to make, this is the one with the best ROI, ahead of HyDE, multi-query, fine-tuning embeddings, or graph indexing.

The cross-encoder is not free at runtime. A 50-candidate rerank with a ~100M-parameter cross-encoder is on the order of 50–200ms per query on a small GPU, and latency grows roughly linearly with candidate count. If you are latency-bound, the first lever to try is usually fewer candidates rather than a smaller model: cross-encoder quality can drop noticeably as the model shrinks, while shrinking k_first from 100 to 30 often costs only 1–2 recall points. Measure both on your own data before committing.
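
The budget arithmetic can be sketched as follows, assuming rerank cost is linear in the number of pairs (batching on a GPU makes this only approximately true). The helper name and every number are illustrative, not measurements.

```python
def max_candidates(budget_ms, first_stage_ms, per_pair_ms, overhead_ms=0.0):
    """Largest k_first whose rerank fits the budget, assuming linear cost per pair."""
    remaining = budget_ms - first_stage_ms - overhead_ms
    if remaining <= 0 or per_pair_ms <= 0:
        return 0
    return int(remaining // per_pair_ms)

# Illustrative numbers only: 50-200 ms for 50 candidates implies roughly
# 1-4 ms per pair, and 2 ms sits inside that range.
k_first = max_candidates(
    budget_ms=150, first_stage_ms=30, per_pair_ms=2.0, overhead_ms=10
)
```

With a 150 ms budget, 30 ms for the first stage, and 10 ms of overhead, 110 ms is left for the rerank, which fits 55 pairs at 2 ms each. Replace the per-pair figure with a number measured on your own hardware.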

STEP 4

Late interaction (ColBERT) when cross-encoders are too slow.

Cross-encoders run full attention per pair. Late-interaction retrievers, ColBERT (Khattab & Zaharia, 2020) and its successors ColBERTv2 and the PLAID search engine, split the difference: they encode the query and each document separately, keeping one contextual embedding per token (so the document index is precomputable), then at query time compute a MaxSim score that, for each query token, takes the maximum similarity over document tokens and sums those maxima.
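
The MaxSim operator itself is short. The sketch below uses plain Python lists and hand-picked 2-d toy vectors in place of real token embeddings (ColBERT's are typically 128-d and L2-normalized); the function name `maxsim` is chosen here for illustration.

```python
def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take the best dot
    product over all document tokens, then sum those maxima."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy token embeddings, one vector per token
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]   # a close match for each query token
doc_b = [[0.6, 0.8], [0.6, 0.8]]               # only a blended, partial match

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

`doc_a` scores higher because each query token finds its own best-matching document token. A single pooled vector per document would lose exactly that per-token alignment.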

The result is finer-grained than a bi-encoder (which collapses each side to one vector) and much faster than a cross-encoder (no per-pair forward pass). The cost is index size: storing one vector per token, not one per document, is roughly an order of magnitude larger on disk. ColBERTv2's residual compression and product-quantized variants narrow that gap, and PLAID cuts search latency over the compressed index.
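
A back-of-envelope check of the index-size claim, under assumptions that are purely illustrative: one million chunks, a 768-d float32 vector per chunk for the bi-encoder, and 200 tokens per chunk at 128-d float16 for the per-token index, with no compression and no index overhead counted.

```python
def index_bytes(n_docs, vectors_per_doc, dim, bytes_per_value):
    """Raw (uncompressed) vector storage, ignoring ids and index structures."""
    return n_docs * vectors_per_doc * dim * bytes_per_value

n_docs = 1_000_000
single_vector = index_bytes(n_docs, vectors_per_doc=1, dim=768, bytes_per_value=4)
per_token = index_bytes(n_docs, vectors_per_doc=200, dim=128, bytes_per_value=2)
ratio = per_token / single_vector
```

Under these assumptions the single-vector index is about 3.1 GB and the per-token index about 51.2 GB, a ratio near 17x. Longer chunks push the ratio up; compression pulls it down.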

When is late interaction worth it? When you need most of cross-encoder quality at retrieval-time latencies, typically on large corpora where you cannot afford to run a cross-encoder over even 50 candidates. For most teams shipping today, the right path is still BM25 + dense + cross-encoder rerank, and ColBERT-class retrievers are a serious option once you outgrow that pipeline's latency budget or want to drop the rerank stage entirely.

STEP 5

Tuning the stack: the knobs that actually move recall@k.

Most teams underestimate the first-stage recall budget and overestimate the rerank's ability to recover. The rerank can only reorder what was retrieved — if the right chunk is at rank 200, no reranker over the top-50 will find it. The hierarchy of levers, in order of usual impact:

  • First-stage k. Going from k_first=10 to k_first=50 typically lifts answer-quality more than swapping the embedding model. Aim for the smallest k_first at which recall@k_first plateaus, then rerank from there. Measure this directly — do not guess.
  • Adding BM25 to a dense-only system (or vice versa). On corpora with a lot of named entities, codes, or numeric identifiers, this single change is often worth 10–20 points of recall@10. It costs little: OpenSearch and Elasticsearch can serve lexical and vector search from one hybrid query, and Postgres can combine pgvector with its built-in full-text search (which ranks with ts_rank rather than BM25 unless you add an extension).
  • Reranker model choice. Cross-encoder quality varies widely; a strong open model (e.g., BGE-reranker, mxbai-rerank) or a hosted API (e.g., Cohere Rerank) is typically a real upgrade over a small generic one. Even so, the gap between "any rerank" and "no rerank" is bigger than the gap between rerankers.
  • Chunk size and overlap. Covered in chunking and vector search. Smaller chunks raise lexical precision but split answers; larger chunks improve coherence but blur similarity. Default 300–500 tokens with ~10% overlap is a defensible starting point, but it is a per-corpus tune.
  • Score thresholds and abstention. If the cross-encoder's top score is below a calibrated threshold, return "insufficient evidence" rather than the best of a bad list. This is one of the highest-leverage interventions for trust and is usually skipped.
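
The abstention lever from the last bullet can be sketched in a few lines. This assumes the reranker returns one score per candidate and that `threshold` has been calibrated on labeled queries for that specific model; raw cross-encoder outputs may be logits rather than probabilities, so a threshold does not transfer between models. The function name is hypothetical.

```python
def rerank_or_abstain(candidates, scores, threshold, k_final=5):
    """Return the top k_final candidates, or None when even the best score
    falls below the calibrated threshold."""
    if not candidates:
        return None
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    if ranked[0][1] < threshold:
        return None   # caller should answer "insufficient evidence"
    return [c for c, _ in ranked[:k_final]]

confident = rerank_or_abstain(["a", "b", "c"], [0.2, 0.9, 0.5], threshold=0.6, k_final=2)
abstained = rerank_or_abstain(["a", "b"], [0.1, 0.3], threshold=0.6)
```

Returning `None` instead of an empty list makes the abstain path explicit, so the generation step cannot silently run on a bad context.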

STEP 6

When the two-stage stack is overkill.

Not every system needs this. Skip the rerank when:

  • The corpus is small enough (a few hundred documents) that you can put the whole thing in a long-context prompt — see the RAG-vs-long-context discussion in what is RAG. The retrieval problem disappears.
  • Queries are simple keyword lookups (filename, ticket number, dictionary entry). BM25 alone is sufficient and faster.
  • You are p99-latency-bound below ~200ms and the cross-encoder will not fit the budget. Compensate with a better first stage — learned-sparse (SPLADE) or a domain-fine-tuned dense retriever — rather than serving a degraded rerank.

The honest summary: hybrid retrieve + cross-encoder rerank is the default for three reasons. The failure modes of any single retriever are structural. The fusion step needs no weight calibration. And the rerank stage attacks the precision problem at the only place precision is actually decidable, with the query and the candidate visible to the same model at the same time. Build this stack, measure recall@k_first and final answer quality separately (see debug RAG in two halves), and then ask whether anything more elaborate is justified by the gap that remains.
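
Measuring first-stage recall takes only a small labeled set. The sketch below computes the hit-rate form of recall@k (the fraction of queries with at least one relevant chunk in the top k) at several cutoffs; the query ids, chunk ids, and relevance labels are invented for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries with at least one relevant id in the top k."""
    hits = sum(
        1 for qid, ranked in retrieved.items()
        if set(ranked[:k]) & relevant[qid]
    )
    return hits / len(retrieved)

# First-stage output per query (best first) and hand-labeled relevant chunks
retrieved = {
    "q1": ["d3", "d1", "d9", "d4"],
    "q2": ["d7", "d8", "d2", "d5"],
    "q3": ["d6", "d0", "d4", "d1"],
}
relevant = {"q1": {"d1"}, "q2": {"d5"}, "q3": {"d6"}}

sweep = {k: recall_at_k(retrieved, relevant, k) for k in (1, 2, 4)}
```

On this toy data recall climbs from 1/3 at k=1 to 2/3 at k=2 and 1.0 at k=4. On real data, the k where the curve flattens is the k_first to rerank from.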