Advanced RAG Architectures

Advanced RAG architectures: from retrieve-once to an agent that controls retrieval.

Naive RAG fails in a specific, dangerous way: it retrieves the wrong chunks, splices them in with confident framing, and the model produces a fluent wrong answer. Every advanced technique below — corrective loops, self-critique, query transformation, fusion, reranking, full agentic RAG — is a different attack on that same failure. The unifying lesson is that none of them work without a trustworthy scorer deciding what counts as good retrieval.

STEP 1

The spectrum: naive → modular → agentic.

Naive RAG is one shot: embed the query, fetch top-k, stuff the context, generate. It works when the question maps cleanly to one passage and the corpus is clean. It breaks on multi-hop questions, ambiguous queries, distractor-heavy corpora, and anything where the right evidence is not lexically or semantically adjacent to the question.
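
As a minimal sketch of that one-shot flow, the snippet below uses a toy bag-of-words "embedding" and an echoing generator. Both are stand-ins assumed for illustration, not a real encoder or model; the point is that nothing between fetch and generate checks whether the retrieved context is any good.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def naive_rag(query, corpus, generate, k=2):
    """Embed the query, fetch top-k, stuff the context, generate. No checks."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    context = "\n".join(ranked[:k])          # top-k goes in, relevant or not
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

corpus = [
    "the refund window is 30 days",
    "shipping takes five business days",
    "refund requests need an order number",
]
echo = lambda prompt: prompt                  # hypothetical generator: returns its prompt
out = naive_rag("what is the refund window", corpus, echo)
```

Note that `ranked[:k]` always returns k chunks, even when none of them is relevant. That unconditional stuffing is the failure every later step tries to close.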

Modular RAG decomposes the pipeline into swappable stages — query transformation, hybrid retrieval, fusion, reranking, compression, generation — each independently tunable. It is still a fixed pipeline: the same stages run in the same order regardless of the query.
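
One way to picture "swappable stages in a fixed order" is a list of functions over a shared state dict. This is a sketch under assumptions: the stage names and the keyword-overlap logic are invented for illustration and do not come from any particular framework.

```python
def run_pipeline(state, stages):
    """Run every stage in order; the order never depends on the query."""
    for stage in stages:
        state = stage(state)
    return state

def overlap(query, doc):
    return len(set(query.split()) & set(doc.lower().split()))

def rewrite(state):        # query transformation
    return {**state, "query": state["query"].strip().rstrip("?").lower()}

def retrieve(state):       # first-stage retrieval (toy keyword match)
    hits = [d for d in state["corpus"] if overlap(state["query"], d) > 0]
    return {**state, "docs": hits}

def rerank(state):         # rescoring (toy: more overlapping terms first)
    ranked = sorted(state["docs"], key=lambda d: overlap(state["query"], d), reverse=True)
    return {**state, "docs": ranked}

def compress(state):       # context compression (toy: keep the top document)
    return {**state, "docs": state["docs"][:1]}

PIPELINE = [rewrite, retrieve, rerank, compress]

result = run_pipeline(
    {
        "query": "  Reranking latency? ",
        "corpus": [
            "reranking adds latency per query",
            "fusion merges ranked lists",
            "latency budgets matter",
        ],
    },
    PIPELINE,
)
```

Swapping a stage means replacing one entry in `PIPELINE`; what the list cannot do is decide, per query, to skip retrieval or run it twice. That is the line agentic RAG crosses.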

Agentic RAG makes retrieval a decision the agent controls at runtime. The agent plans how many retrieval steps to take, which source or tool to query (vector store, SQL, web, a sub-agent), inspects what came back, reflects, and adapts — possibly re-querying with a different formulation or abandoning a dead source. The survey by Singh, Ehtesham, Kumar & Talaei Khoei (Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG, arXiv:2501.09136) organizes this space by agent cardinality, control structure, autonomy, and knowledge representation — useful as a map, but the practitioner takeaway is simpler: agentic RAG is RAG where a control loop, not a pipeline, owns retrieval. This is the same loop discussed in the ReAct deep dive, pointed at a knowledge base instead of generic tools.
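
The control-loop idea can be sketched without any model at all. In the snippet below, the source-selection policy (first live source), the `is_sufficient` check, and `reformulate` are all hypothetical placeholders for decisions a real agent would make with an LLM; only the loop shape is the point.

```python
def agentic_retrieve(query, sources, is_sufficient, reformulate, max_steps=4):
    """Loop that owns retrieval: pick a source, inspect results, adapt, stop."""
    evidence, dead, trace = [], set(), []
    current = query
    for _ in range(max_steps):               # hard step budget guards against looping
        live = [name for name in sources if name not in dead]
        if not live:
            break
        name = live[0]                       # assumption: trivial policy, first live source
        hits = sources[name](current)
        trace.append((name, current, len(hits)))
        if not hits:
            dead.add(name)                   # abandon a source that returned nothing
            continue
        evidence.extend(hits)
        if is_sufficient(query, evidence):   # reflect: is this enough to answer?
            break
        current = reformulate(current, evidence)  # re-query with a new formulation
    return evidence, trace

# Toy sources standing in for a vector store and a SQL tool.
sources = {
    "vector": lambda q: [],
    "sql": lambda q: ["plan=pro price=20"] if "price" in q else ["plan=pro seats=5"],
}
evidence, trace = agentic_retrieve(
    "pro plan cost",
    sources,
    is_sufficient=lambda q, ev: any("price" in e for e in ev),
    reformulate=lambda q, ev: q + " price",
)
```

In this run the loop drops the empty vector source, gets an off-target row from SQL, reformulates, and stops once the evidence contains a price. The `max_steps` budget is the simplest defense against the looping failure mode.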

STEP 2

Corrective RAG and Self-RAG: closing the loop on bad retrieval.

Corrective RAG (CRAG), from Yan et al. (2024), inserts a lightweight retrieval evaluator between fetch and generate. It scores the retrieved documents against the query and routes to one of three actions: Correct (confident: refine and use the docs), Ambiguous (mixed: combine the refined docs with a web-search supplement), or Incorrect (low confidence: discard the docs entirely and fall back to web search). The evaluator is deliberately cheap (CRAG fine-tuned T5-large, roughly 0.8B parameters) so the correction step does not dominate latency.

Self-RAG (Asai et al., 2023) moves the decision into the model itself. The model is trained to emit reflection tokens that decide, on the fly, whether retrieval is needed at all, whether each retrieved passage is relevant, and whether its own draft is actually supported by the evidence. Where CRAG bolts an external evaluator onto an off-the-shelf model, Self-RAG bakes the retrieve-and-critique policy into the weights.

Both answer the same question — "should I trust what I just retrieved, and what do I do if I shouldn't?" — and both are only as good as the judgment signal. A miscalibrated evaluator that rates garbage as Correct turns CRAG into naive RAG with extra latency.

The evaluator is the load-bearing component, not the routing logic around it. Teams ship the branch structure and under-test the scorer, then wonder why correction never fires. Build a labeled set of (query, retrieved-docs, is-actually-relevant) and measure the evaluator's precision/recall before trusting any branch it controls. This is the same trustworthy-scorer requirement as the relevance floor in the Retrieval-augmented memory deep dive.
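
Measuring the evaluator takes very little code. The sketch below assumes the labeled set is a list of `(query, docs, is_relevant)` tuples and that the evaluator returns a score in 0..1; `toy_score` is a deliberately weak lexical stand-in, invented here so the example runs.

```python
def precision_recall(labeled, score, threshold):
    """Precision and recall of 'score >= threshold' against human labels."""
    tp = fp = fn = 0
    for query, docs, is_relevant in labeled:
        predicted = score(query, docs) >= threshold
        if predicted and is_relevant:
            tp += 1
        elif predicted and not is_relevant:
            fp += 1                      # garbage rated as good: the dangerous error
        elif is_relevant:
            fn += 1                      # good docs rejected: costs a needless fallback
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def toy_score(query, docs):
    """Hypothetical evaluator: fraction of query terms found in the docs."""
    terms = query.lower().split()
    text = " ".join(docs).lower().split()
    return sum(t in text for t in terms) / len(terms)

labeled = [
    ("refund window", ["refund window is 30 days"], True),
    ("refund window", ["shipping takes five days"], False),
    ("refund window", ["refund policy overview"], False),
    ("cancel plan", ["to end a subscription open settings"], True),
]
precision, recall = precision_recall(labeled, toy_score, threshold=0.5)
```

On this toy set the lexical scorer passes one distractor and misses one paraphrased match, so both numbers land at 0.5. An evaluator that scores like this should not be trusted to control a branch.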

STEP 3

A CRAG-style evaluate-then-branch loop.

The whole family reduces to: retrieve, score, branch on the score, optionally recover, then generate — only on evidence that cleared the bar.

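# corrective_rag.py: evaluate, then branch
# Thresholds are illustrative; tune them against a labeled set.
CORRECT_MIN, AMBIGUOUS_MIN = 0.70, 0.35

def corrective_rag(query, retriever, evaluator, web, gen, refine, rewrite):
    """refine(docs) drops distractor strips; rewrite(query) reformulates for web search.

    retriever.search, web.search, and refine are assumed to return lists.
    """
    docs = retriever.search(query, k=8)
    grade = evaluator.score(query, docs)   # 0..1 relevance of docs to the query

    if grade >= CORRECT_MIN:               # CORRECT
        evidence = refine(docs)            # drop distractor strips
    elif grade >= AMBIGUOUS_MIN:           # AMBIGUOUS
        evidence = refine(docs) + web.search(query)
    else:                                  # INCORRECT
        evidence = web.search(rewrite(query))  # discard docs, recover

    if not evidence:
        return "insufficient evidence"     # abstain > confabulate
    return gen.answer(query, evidence)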

The thresholds are not universal constants — they are tuned per system against the labeled set from Step 2. The abstain path matters as much as the branches: a system that returns "insufficient evidence" on a genuinely unanswerable query is correct, and far safer than one that always produces something.
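
One possible tuning rule, offered as an assumption rather than the method from the CRAG paper: pick the lowest cutoff whose accepted bucket still meets a target precision, so the Correct branch fires as often as it safely can. The grades and labels below are made up.

```python
def pick_threshold(grades, labels, min_precision=0.9):
    """Lowest cutoff whose 'grade >= cutoff' bucket meets the precision target.

    grades: evaluator scores in 0..1; labels: 1 if the docs were truly relevant.
    Returns None when no cutoff reaches the target.
    """
    best = None
    for cutoff in sorted(set(grades), reverse=True):   # high to low
        accepted = [lab for g, lab in zip(grades, labels) if g >= cutoff]
        precision = sum(accepted) / len(accepted)
        if precision >= min_precision:
            best = cutoff            # keep lowering while the target still holds
    return best

grades = [0.95, 0.90, 0.80, 0.60, 0.50, 0.30]
labels = [1, 1, 1, 0, 1, 0]
correct_min = pick_threshold(grades, labels, min_precision=0.9)
```

The same sweep with a looser target gives a candidate for the Ambiguous floor. Either way, the number comes out of labeled data, not out of a blog post.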

STEP 4

Query transformation: rewriting, multi-query, HyDE.

If the query is a bad probe into the index, no downstream stage recovers. Three common transforms:

  • Rewriting. Resolve pronouns, expand acronyms, strip conversational noise — turn "and what about its pricing?" into a standalone query. Low-risk, usually a net win for conversational RAG.
  • Multi-query. Generate several paraphrases, retrieve for each, and merge. Raises recall on under-specified questions at the cost of more retrieval calls.
  • HyDE (Hypothetical Document Embeddings). Have the model write a hypothetical answer, embed that, and retrieve with it — the synthetic answer is often closer in embedding space to real passages than the bare question is.
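
Multi-query and HyDE differ only in what gets sent to the retriever. A minimal sketch, assuming a toy keyword `search` over a three-document corpus and canned stand-ins for the LLM calls (`paraphrase`, `write_hypothetical`), all hypothetical:

```python
CORPUS = [
    "refunds are issued within 30 days of purchase",
    "you can return an item for your money back",
    "shipping takes five business days",
]

def search(text, k):
    """Toy lexical retriever: rank by shared terms, drop zero-overlap docs."""
    terms = set(text.lower().split())
    scored = [(len(terms & set(doc.split())), doc) for doc in CORPUS]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc for score, doc in ranked if score > 0][:k]

def multi_query_retrieve(query, paraphrase, search, k=3):
    """Retrieve for the query and each paraphrase, merge, dedupe in order."""
    seen, merged = set(), []
    for q in [query] + paraphrase(query):
        for doc in search(q, k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

def hyde_retrieve(query, write_hypothetical, search, k=3):
    """Retrieve with a model-written hypothetical answer instead of the query."""
    return search(write_hypothetical(query), k)

# Canned outputs standing in for LLM calls.
paraphrase = lambda q: ["how are refunds issued", "can i get my money back"]
write_hypothetical = lambda q: "refunds are issued within 14 days of purchase"

bare = search("refund policy", 3)
multi = multi_query_retrieve("refund policy", paraphrase, search)
hyde = hyde_retrieve("refund policy", write_hypothetical, search, k=1)
```

The bare query matches nothing, the paraphrases recover two passages, and the hypothetical answer lands on the right passage even though its "14 days" is invented. That invented detail is harmless here, but it is exactly the kind of hallucinated specific that can pull a stronger retriever off-target.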

HyDE is not a default. It reliably helps weak or zero-shot retrievers, but on a strong fine-tuned dense retriever it can hurt: the hypothetical document introduces a distribution shift and a chance to hallucinate the wrong entities, dragging retrieval off-target. Treat every query transform as a hypothesis to measure on your retriever and corpus, not a best practice to adopt blindly. This is a fast-moving, system-dependent result — the honest answer is "A/B it."
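
"A/B it" can be as small as a recall@k comparison over labeled queries. The harness below is a sketch: the documents, the synonym map, and both retrievers are toy assumptions, and the data is rigged so the variant wins. It says nothing about how any transform behaves on a real retriever.

```python
DOCS = {
    "d1": "refunds are issued within 30 days",
    "d2": "shipping takes five business days",
    "d3": "reset your password from settings",
}

def keyword_retrieve(query):
    """Baseline: rank doc ids by shared terms, drop zero-overlap docs."""
    terms = set(query.lower().split())
    hits = [d for d in DOCS if terms & set(DOCS[d].split())]
    return sorted(hits, key=lambda d: -len(terms & set(DOCS[d].split())))

SYNONYMS = {"refund": "refunds", "delivery": "shipping"}   # hypothetical expansion map

def expanded_retrieve(query):
    """Variant: the same retriever behind a query-expansion transform."""
    extra = [SYNONYMS[t] for t in query.lower().split() if t in SYNONYMS]
    return keyword_retrieve(" ".join([query] + extra))

def recall_at_k(retrieve, labeled, k=3):
    """Fraction of queries whose gold doc id appears in the top k results."""
    hits = sum(1 for query, gold in labeled if gold in retrieve(query)[:k])
    return hits / len(labeled)

labeled = [("refund timing", "d1"), ("delivery speed", "d2"), ("reset password", "d3")]
baseline = recall_at_k(keyword_retrieve, labeled)
variant = recall_at_k(expanded_retrieve, labeled)
```

Run the same two numbers on your own retriever and labeled queries; adopt the transform only if the variant wins there.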

STEP 5

Fusion and reranking: cheap recall, then expensive precision.

RAG-fusion / retrieval fusion runs multiple retrievals (different query formulations, or dense + lexical) and merges the ranked lists, typically with reciprocal rank fusion, so a document that ranks decently across several lists rises above one that spikes in a single noisy list. It is a recall-and-robustness move, not a precision one.
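
Reciprocal rank fusion is short enough to write out. Each document scores the sum of 1 / (k + rank) over the lists it appears in; the sketch below uses k = 60, the constant from the original RRF paper, and made-up document ids.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):   # ranks are 1-based
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["a", "b", "c"]        # e.g. embedding retrieval
lexical = ["b", "d", "a"]      # e.g. BM25
paraphrase = ["x", "b", "a"]   # e.g. retrieval for a rewritten query

fused = reciprocal_rank_fusion([dense, lexical, paraphrase])
```

Document `x` is ranked first in one list but appears nowhere else, so it ends up below `a` and `b`, which rank decently everywhere. Because only ranks are used, the lists can come from retrievers whose raw scores are not comparable.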

Precision comes from reranking. In the two-stage pattern, a cheap bi-encoder (or BM25) retrieves a wide candidate set, then a cross-encoder that jointly attends to query and passage rescores the top ~50–100. This typically beats single-stage retrieval on ranking quality. The cross-encoder is too slow to run over the whole corpus, which is exactly why it lives behind a cheap first stage: spend compute only where ranking actually decides the answer. In practice, two-stage retrieve-then-rerank is usually the highest-ROI upgrade for an underperforming naive-RAG system, ahead of any of the cleverer techniques above. The agentic-RAG control loop from Step 1 is conceptually a sibling of the patterns in the Agent search strategies deep dive and the Field Guide's Retrieval chapter: the same evaluator-driven discipline, applied to knowledge instead of actions.
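
The two-stage shape, sketched with toy scorers: `cheap_score` (term overlap) stands in for a bi-encoder or BM25, and `expensive_score` (overlap plus an exact-phrase bonus) stands in for a cross-encoder. Both are assumptions made so the example runs without a model; the call counter shows the expensive scorer only ever sees the candidates.

```python
def retrieve_then_rerank(query, corpus, cheap_score, expensive_score, wide_k=50, final_k=5):
    """Stage 1: cheap scorer over everything. Stage 2: expensive scorer over top wide_k."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:wide_k]
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:final_k]

def cheap_score(query, doc):
    """Toy first stage: number of shared terms."""
    return len(set(query.split()) & set(doc.split()))

calls = {"n": 0}

def expensive_score(query, doc):
    """Toy second stage: shared terms plus a bonus for the exact phrase."""
    calls["n"] += 1
    return cheap_score(query, doc) + (10 if query in doc else 0)

corpus = [
    "python is a snake species",
    "install python packages with pip",
    "pip install fails behind a proxy",
    "bake bread at home",
]
top = retrieve_then_rerank("pip install", corpus, cheap_score, expensive_score,
                           wide_k=2, final_k=1)
```

The first stage cannot tell the two pip documents apart, since both share two terms with the query. The second stage separates them, and it ran twice rather than four times. At corpus scale that ratio is what makes a cross-encoder affordable.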

STEP 6

When to reach for which.

Start with strong naive RAG plus two-stage reranking and query rewriting. This clears most production cases and is the cheapest, most predictable configuration. From there, add complexity only where a specific failure calls for it:

  • Fusion. Add it when recall is the bottleneck (the right doc exists but ranks below the cutoff).
  • CRAG-style correction. Add it when your corpus has real coverage gaps and a web fallback is acceptable; it directly attacks the confident wrong answer built from garbage.
  • Self-RAG. Reach for it when you can train or fine-tune and want the retrieve/critique policy in the model rather than in orchestration code.
  • Fully agentic RAG. Use it only when retrieval is genuinely multi-step or multi-source (comparisons across documents, multi-hop reasoning, choosing among heterogeneous tools), because the planning loop adds latency, cost, and new failure modes (looping, drift) for no benefit on single-hop questions.
  • HyDE and aggressive query expansion. Treat them as measured experiments, never defaults.

Across all of it, the invariant holds: the architecture is only as trustworthy as the scorer that decides what good retrieval means. Build and validate that first, choose the architecture second.