Agentic retrieval: making search a tool, not a pre-step.
Classical RAG is a pipeline: retrieve once, augment, generate. Agentic retrieval inverts this — the retriever is exposed as a tool the model can call, multiple times, with queries it chooses, and a control loop decides when enough has been gathered. It is the natural shape of RAG inside an agent, and it is the right shape for a small but important class of questions: multi-hop, comparison, exploration, and "did I get enough evidence to answer." This entry is about when to reach for it, how to bound it, and why the failure modes are different from a fixed pipeline's.
Why a fixed pipeline runs out of room.
A naive retrieve-then-generate pipeline has one shot at retrieval. The query goes in, top-k chunks come back, the generator works with what it got. This is the right shape when the question maps cleanly to one passage — "what's the refund window?" — and the wrong shape for everything else:
- Iterative refinement. The first retrieval reveals a term the user didn't know to ask for ("you mean the extended refund window, which is different"). A second retrieval keyed on the new term finds the actual answer. The pipeline cannot do this; it has already moved on to generation.
- Multi-hop reasoning. "Who reviewed the migration that broke the on-call alert?" requires retrieving the alert, then the migration referenced in it, then the reviewer of that migration. Each hop's query depends on the previous hop's result. Static decomposition covers the easy case where all sub-questions are known in advance; agentic retrieval covers the case where sub-questions are discovered.
- Multi-source comparison. "What do the docs say vs. what the support ticket says?" needs both the docs index and the support index, and the comparison itself is an LLM operation over the two result sets.
- Sufficiency judgment. Sometimes one retrieval returns the answer; sometimes it returns three plausible chunks that contradict each other. A fixed pipeline cannot say "I need to look harder before I answer." An agentic loop can.
For these cases, the limitation is structural — the pipeline's one-shot retrieval cannot adapt to what it learns. The fix is to give the model the retriever as a tool and let it decide when to stop.
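To make the multi-hop case concrete, here is a toy sketch in which each lookup key is only known after the previous lookup returns. The corpus, the record IDs, and the `lookup` and `follow_hops` helpers are all invented for illustration; a real system would issue searches against an index and let the model write each query.

```python
# Toy corpus: each record names the next thing to look up. All IDs are made up.
CORPUS = {
    "alert:on-call-paging": {
        "text": "Alert broke after migration M-17.",
        "next": "migration:M-17",
    },
    "migration:M-17": {
        "text": "Migration M-17, reviewed by dana.",
        "next": "reviewer:dana",
    },
    "reviewer:dana": {"text": "dana is on the platform team.", "next": None},
}


def lookup(key):
    """Stand-in for one retrieval call against an index."""
    return CORPUS.get(key)


def follow_hops(start_key, max_hops=3):
    """Follow references hop by hop; each query comes from the previous result."""
    trail, key = [], start_key
    for _ in range(max_hops):
        record = lookup(key)
        if record is None:
            break
        trail.append(key)
        key = record["next"]
        if key is None:
            break
    return trail
```

With `max_hops=1` the trail stops at the alert, which is the position a one-shot pipeline is in: the migration and the reviewer are never retrieved because their keys were not known when the only query was issued.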
Search as a tool: the basic shape.
The minimal version of agentic retrieval is one tool, one loop:
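```python
# Expose retrieval as a tool the model calls in a ReAct-style loop.
# `llm`, `retriever`, and `tool_result` are placeholders for your own client,
# search backend, and message helper; `resp.message` is assumed to be the
# assistant turn that contains the tool calls.
TOOLS = [{
    "name": "search",
    "description": (
        "Search the company knowledge base. Returns up to 5 passages with "
        "source citations. Call multiple times with different queries to "
        "gather evidence before answering."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]


def agentic_rag(question, max_steps=6):
    msgs = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        resp = llm.respond(msgs, tools=TOOLS)
        if not resp.tool_calls:
            return resp.text  # model stopped calling tools = it's ready
        # Keep the assistant turn so every tool result has a matching call.
        msgs.append(resp.message)
        for call in resp.tool_calls:
            result = retriever.search(call.query, k=5)
            msgs.append(tool_result(call.id, result))
    return "exceeded retrieval budget"
```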
That is the entire pattern: the model reads the question, decides to call search with a query of its choice, reads the results, decides whether to call search again with a refined query or answer now. This is the same loop as ReAct, pointed at a knowledge base instead of generic tools. The tool description matters a lot — "call multiple times with different queries" is the part that unlocks multi-step behavior; without that hint, most models default to a single call.
The harness still owns: the retriever quality (see hybrid search and reranking), what the corpus looks like (see document parsing), and the budget. The model owns: how many calls, with what queries, in what order, and when to stop.
Multiple tools, not just one.
The real power shows up when retrieval is one of several tools, not the only one. A research agent might have:
- search_docs: the internal documentation index.
- search_tickets: the support ticket corpus, with date-range filters.
- run_sql: structured queries against the analytics warehouse.
- web_search: external search when the question is out of corpus.
- read_url: fetch the full text of a result the model decided was important.
The model now picks not just what to query but where. This subsumes the static query router from the previous entry — same idea, runtime decision. The trade-off is that more tools enlarge the action space, which raises latency (more decisions to make) and the risk of mis-routing (calling SQL with what should have been a doc-search query).
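On the harness side, letting the model choose where to look usually comes down to a lookup from tool name to handler. The sketch below is a minimal version under stated assumptions: the handlers are stubs, `HANDLERS` and `dispatch` are hypothetical names, and only two of the tools listed above are wired up.

```python
from typing import Callable, Dict, List


# Placeholder handlers; a real system would call an index, a warehouse, etc.
def search_docs(query: str) -> List[str]:
    return [f"doc hit for: {query}"]


def search_tickets(query: str) -> List[str]:
    return [f"ticket hit for: {query}"]


HANDLERS: Dict[str, Callable[[str], List[str]]] = {
    "search_docs": search_docs,
    "search_tickets": search_tickets,
}


def dispatch(tool_name: str, query: str) -> List[str]:
    """Run the tool the model chose; report unknown names back as data."""
    handler = HANDLERS.get(tool_name)
    if handler is None:
        # Returning an error string lets the model correct a mis-routed call.
        available = ", ".join(sorted(HANDLERS))
        return [f"error: unknown tool '{tool_name}'. Available: {available}"]
    return handler(query)
```

Returning the error as a tool result rather than raising is one possible design choice: it keeps the loop alive and gives the model a chance to re-route on the next step.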
A practical heuristic: tools should be heterogeneous enough that the right choice is usually clear from the question. Two retrievers over similar content with overlapping descriptions will confuse the model and degrade both. If you have two doc indices that feel similar, fuse them behind one tool and route inside the tool, not in the model's prompt.
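One way to fuse two similar indices behind a single tool is sketched below. The two index functions, the `API_HINTS` keyword list, and the keyword-matching rule are illustrative assumptions; a production router might use a classifier or simply query both indices and rerank.

```python
# Hypothetical per-index search functions returning (source, passage) pairs.
def search_api_docs(query):
    return [("api", f"API reference passage about {query}")]


def search_guides(query):
    return [("guides", f"How-to guide passage about {query}")]


# Assumed keyword hints; tune or replace with a learned router.
API_HINTS = ("endpoint", "parameter", "status code", "schema")


def search_docs(query):
    """Single tool exposed to the model; index choice is an internal detail."""
    q = query.lower()
    if any(hint in q for hint in API_HINTS):
        return search_api_docs(query)
    # No strong signal: query both indices and merge the results.
    return search_api_docs(query) + search_guides(query)
```

The model sees one tool with one description, so the overlapping-description problem disappears from its prompt and the routing logic can change without touching the tool schema.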
Stopping: the hard part.
The dominant failure mode of agentic retrieval is not under-searching; it is the opposite. Models will happily keep calling search forever if you let them, polishing an already-sufficient answer, exploring tangents, or compulsively re-checking. Three stopping levers, in increasing sophistication:
- Hard step budget. A max_steps like 4–8 prevents pathological loops. The cap is dollar-and-latency insurance, not a quality lever; if you regularly hit it on valid questions, raise it.
- Diminishing-returns check. If two consecutive searches return overlapping results (high Jaccard on doc IDs), the agent is repeating itself and likely will not learn more. Force a transition to answering. This catches the most common pathological pattern.
- Sufficiency self-check. After each retrieval, prompt the model: "given the evidence so far, can you answer the question? If yes, stop. If no, what specific gap would another search fill?" This makes stopping a deliberate decision rather than an absence of decision. It costs an extra small generation per step but is usually the highest-quality lever.
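The three levers above can be combined in a single stop decision, sketched here under some assumptions: the 0.8 overlap threshold is arbitrary, the self-check reply is assumed to follow a strict "YES" or "NO: <gap>" format, and `stop_reason` is a hypothetical helper name.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of doc IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty result sets count as identical
    return len(a & b) / len(a | b)


def parse_verdict(reply):
    """Parse a self-check reply into (sufficient, gap).

    Anything that is not exactly 'YES' or 'NO: <gap>' counts as not
    sufficient, so a malformed reply never ends the search early.
    """
    head, _, rest = reply.strip().partition(":")
    head = head.strip().upper()
    if head == "YES":
        return True, None
    if head == "NO":
        return False, rest.strip() or None
    return False, None


def stop_reason(step, max_steps, prev_ids, curr_ids, verdict_reply,
                overlap_threshold=0.8):
    """Return why the loop should stop, or None to keep searching."""
    if step >= max_steps:
        return "budget"
    if prev_ids is not None and jaccard(prev_ids, curr_ids) >= overlap_threshold:
        return "diminishing_returns"
    sufficient, _ = parse_verdict(verdict_reply)
    if sufficient:
        return "sufficient"
    return None
```

Returning a reason rather than a boolean makes traces easier to read: a run that ends on "budget" most of the time points at a cap that is too tight or a retriever that is not finding the evidence.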
"Just let the model decide when it's done" is the failure mode dressed as a feature. Models without explicit stopping discipline will burn budget on confirmation searches that add no new information — the same confidence-seeking pattern that drives hallucination. Always combine at least the step budget with one quality-aware criterion.
The new failure modes.
Agentic retrieval inherits the failure modes of the underlying retriever (a bad index is still a bad index) and adds its own:
- Looping without progress. The agent searches, gets results, doesn't like them, searches again with a near-identical query, gets the same results. Mitigation: the diminishing-returns check from the stopping section above, plus an explicit instruction: "if the same query returned no new evidence, try a substantially different angle or stop."
- Premature stop. The agent answers from the first plausible-looking retrieval without checking whether other documents disagree. Mitigation: the sufficiency self-check, and (for high-stakes questions) a forced second retrieval from a different angle before allowing the answer.
- Drift. The agent's query rewrites pull the search away from the original question. By step 4 it is searching for something tangentially related and answering that instead. Mitigation: include the original question in every retrieval-step prompt as an anchor, and log the queries so drift is visible in traces.
- Cost blow-up. One agent run can issue 10–20 retrieval calls; at scale this dominates the bill. Mitigation: tight max_steps, smaller models for the orchestration turns (the search-or-answer decision rarely needs a frontier model), and a per-session retrieval budget independent of step count.
- Untrustworthy retrieved content. Every chunk fetched from the index is untrusted data and a potential prompt-injection vector. This risk is worse than in fixed RAG because the agent will read the injection's instruction and then issue more tool calls on its behalf. See RAG security for the defenses.
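Two of the mitigations above, the per-session retrieval budget and the original-question anchor with query logging, are small enough to sketch directly. `RetrievalBudget` and `anchored_prompt` are hypothetical names, and the prompt wording is only an example.

```python
class RetrievalBudget:
    """Caps retrieval calls per session and records every query issued."""

    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.queries = []  # query log, useful for spotting drift in traces

    @property
    def remaining(self):
        return self.max_calls - len(self.queries)

    def try_spend(self, query):
        """Record the query and return True, or return False if exhausted."""
        if self.remaining <= 0:
            return False
        self.queries.append(query)
        return True


def anchored_prompt(original_question, latest_results):
    """Restate the original question on every step to limit drift."""
    return (
        f"Original question: {original_question}\n"
        f"Latest results:\n{latest_results}\n"
        "Search again only if it helps answer the original question."
    )
```

Because the budget counts calls rather than loop iterations, it still holds when the model issues several tool calls in a single step.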
When to reach for it, and when not.
The decision is mostly question-shape, not corpus size. Use agentic retrieval when:
- Questions are multi-hop or comparison-shaped, with sub-queries that depend on previous results.
- Multiple heterogeneous sources need to be combined per question (docs + SQL + web).
- "Insufficient evidence" is a meaningful answer and you need the agent to recognize and admit it rather than always producing something.
- The task is exploratory — research, investigation, debugging — where the user cannot pre-specify what evidence is needed.
Stick with a fixed pipeline when:
- Questions are single-hop and well-bounded (FAQ, single-doc lookup). A good two-stage retriever is faster, cheaper, and more predictable.
- Latency is the dominant constraint (sub-second responses). The agentic loop adds at least one extra LLM call per retrieval, which a sub-second budget cannot absorb.
- Auditability matters more than flexibility — a deterministic pipeline is easier to reason about and replay than a tool-calling loop.
- You haven't yet built the eval set to detect agentic-retrieval failure modes. An agent that fails silently with extra cost is worse than a pipeline that fails predictably.
The honest summary: agentic retrieval is not a strict upgrade over fixed RAG. It is a different tool for a different class of question. Start with the simplest pipeline that solves the bulk of your traffic, identify the queries it cannot serve, and add the loop only for those. The control loop is what makes it agentic, and that control loop is exactly the surface that needs the budgeting, stopping, and evaluation discipline described throughout this entry.