Query Understanding & Transformation


Query understanding and transformation: fixing the question before you search.

Most RAG investigations end at the retriever and never look at the input. But the query is the probe into the index, and a bad probe guarantees a bad retrieval no matter how good the downstream stack is. "Cannot log in" is not literally "authentication failure," "and what about its pricing?" is not a standalone question, and "compare Q3 to Q4 margins" is not one query; it is two. This entry is about the transformations that happen before hybrid retrieval, and the small number of them worth the latency and the prompt budget.

STEP 1

The query is the bottleneck more often than the retriever.

Every retriever — lexical, dense, hybrid, even two-stage rerank — assumes the query as posed is the right probe. That assumption breaks in predictable ways:

  • Conversational queries. "And what about for enterprise?" depends entirely on what was discussed three turns ago. Embedding that string and searching the corpus retrieves chunks about "enterprise" in random contexts.
  • Vocabulary mismatch. Users describe symptoms; documentation describes mechanisms. "My screen keeps freezing" does not embed close to "GPU driver memory leak in v2.4."
  • Compound questions. "What changed in v3 compared to v2 and is the migration breaking?" needs evidence from at least three different documents. A single retrieval has to hope one chunk covers all three, which is rare.
  • Under-specified queries. "What's the policy?" with no qualifier returns the first chunk that says "policy," which is almost never the right one.
  • Specification overload. A long, multi-clause question with too many constraints embeds to nowhere in particular — the query vector lands in the middle of a region with no clear nearest neighbor.

The diagnostic from hybrid search and reranking — "if the answer string is in the corpus but not in the top-50, the retriever is the bug" — has a sibling: if the answer is in the corpus, the retriever finds it for a hand-written ideal query, but not for the user's actual query, then the bug is in the query. Different fix.
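The two-probe diagnostic can be sketched as a small triage helper. This is a minimal illustration, not a prescribed tool: `search` is a hypothetical callable that returns chunk texts for a query, and the answer check is a plain substring match, which assumes the answer string appears verbatim in the corpus.

```python
from typing import Callable, Sequence


def locate_retrieval_bug(
    search: Callable[[str, int], Sequence[str]],  # hypothetical: (query, k) -> chunk texts
    user_query: str,
    ideal_query: str,
    answer: str,
    k: int = 50,
) -> str:
    """Classify a retrieval miss as a query problem or a retriever problem.

    Precondition (assumed, not checked): the answer string exists in the corpus.
    """

    def found(query: str) -> bool:
        # Substring match on the top-k chunks; case-insensitive.
        return any(answer.lower() in chunk.lower() for chunk in search(query, k))

    if found(user_query):
        return "ok"          # the user's probe already works
    if found(ideal_query):
        return "query"       # a good probe works: fix the query, not the retriever
    return "retriever"       # even the ideal probe misses: fix retrieval
```

Run it over a handful of logged failures with hand-written ideal queries; the split between "query" and "retriever" outcomes shows which of the two fixes to invest in first.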

STEP 2

Rewriting: the low-risk default.

Query rewriting is a single cheap LLM call that turns the raw query into a standalone, search-optimized version. It resolves pronouns, expands acronyms, drops conversational filler, and adds context from the conversation history. It is the lowest-risk transform on this page and the only one most systems can safely turn on by default.

# conversational query rewrite — resolve references, drop filler
REWRITE_PROMPT = """
Given the conversation so far and the user's latest message,
write ONE standalone search query that captures what the user is
actually asking. Resolve pronouns, expand acronyms, keep entity
names exact. Output the query only, no preamble.
"""

def rewrite_for_retrieval(history, latest):
    msg = format_history(history) + "\nLatest: " + latest
    return small_llm.complete(REWRITE_PROMPT, msg).strip()

Use a small, fast model (Haiku-class, Llama-3.1-8B). The output is short and the latency budget for query understanding is tight — this call has to finish before retrieval can start. Two anti-patterns to avoid: rewriting away entity names because the model thinks they look strange (preserve verbatim spans), and over-expanding (a one-sentence query rewritten into a paragraph confuses the embedding model more than the original).

For non-conversational, one-shot queries the rewrite is often a no-op — do not call the LLM if there is no history and the query is already self-contained. A simple length-and-conversation-state check skips the rewrite when it cannot help.
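One way to sketch that check, under loud assumptions: the cue lists and the `min_words` threshold below are illustrative guesses, not tuned values, and `needs_rewrite` is a hypothetical helper name. A false positive only costs one unnecessary rewrite call, so the heuristic errs toward rewriting.

```python
import re

# Assumed cue lists: words that usually point back at earlier turns.
_REFERENTIAL = re.compile(
    r"\b(it|its|they|them|their|this|that|these|those)\b", re.IGNORECASE
)
_CONTINUATION = re.compile(
    r"^\s*(and|but|or|so|what about|how about)\b", re.IGNORECASE
)


def needs_rewrite(history: list, query: str, min_words: int = 4) -> bool:
    """Return True only when a rewrite call could plausibly help."""
    if not history:
        return False  # nothing to resolve references against
    if _CONTINUATION.search(query) or _REFERENTIAL.search(query):
        return True   # query leans on earlier turns
    # Very short follow-ups are usually fragments of the previous question.
    return len(query.split()) < min_words
```

A long, self-contained question inside an ongoing conversation skips the LLM call; "And what about for enterprise?" does not.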

STEP 3

Decomposition: when one question is actually several.

Compound and comparison questions are a different problem from rewriting. "Compare Q3 to Q4 margins" cannot be answered by retrieving "compare Q3 to Q4 margins" against the corpus — the document with the comparison doesn't exist. The Q3 number lives in one document, the Q4 number in another, and the comparison has to be assembled by the generator.

Query decomposition splits a compound query into sub-queries, retrieves for each, and concatenates the evidence:

# decompose, retrieve per sub-query, then generate over the union
def decompose_and_retrieve(query, retriever):
    sub_queries = llm_decompose(query)   # 1..N standalone questions
    evidence    = []
    for sq in sub_queries:
        evidence.extend(retriever.search(sq, k=5))
    return dedupe(evidence)         # by doc_id, preserve order

Decomposition is most useful for explicit comparisons ("X vs Y"), multi-entity questions ("the policy for plan A, B, and C"), and questions with implicit sub-steps ("is this migration safe?" → "what changed in the schema?" + "what code depends on those columns?"). It is wasteful for simple single-fact lookups, so the decomposer should be allowed to return one sub-query unchanged when the input is already atomic.
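A decomposer with that pass-through behavior might look like the sketch below. The prompt wording, the `complete` callable (prompt in, text out), and the `max_sub` cap are all assumptions for illustration; the point is the parsing and the fallback to the original query.

```python
import re
from typing import Callable, List

# Illustrative prompt; the exact wording is an assumption.
DECOMPOSE_PROMPT = (
    "Split the question into the minimal set of standalone search queries, "
    "one per line. If it is already a single question, return it unchanged."
)

# Strips list markers such as "- ", "* ", "1. " or "2) " from the start of a line.
_MARKER = re.compile(r"^\s*(?:[-*•]\s*|\d+[.)]\s+)")


def decompose(query: str, complete: Callable[[str], str], max_sub: int = 4) -> List[str]:
    """Return 1..max_sub sub-queries; fall back to the original query."""
    raw = complete(DECOMPOSE_PROMPT + "\nQuestion: " + query)
    cleaned = [_MARKER.sub("", line).strip() for line in raw.splitlines()]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    subs = [line for line in dict.fromkeys(cleaned) if line]
    return subs[:max_sub] or [query]
```

The cap keeps a chatty decomposer from fanning out into a dozen retrievals, and the empty-output fallback means a failed call degrades to ordinary single-query retrieval.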

The agentic-RAG variant of this idea — let the agent decide on the fly how many retrievals to issue and what to ask for — is covered in advanced RAG architectures. Static decomposition is the cheaper, lower-latency version: one LLM call to split, then a parallel batch of retrievals, no loop.
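The "parallel batch of retrievals" part can be sketched with a thread pool from the standard library. The `search` callable and the hit shape (a dict with a `doc_id` key) are assumptions here; swap in whatever your retriever returns.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Hit = Dict[str, object]  # assumed shape: {"doc_id": ..., "text": ...}


def retrieve_parallel(
    sub_queries: List[str],
    search: Callable[[str, int], List[Hit]],  # hypothetical: (query, k) -> hits
    k: int = 5,
) -> List[Hit]:
    """Run one retrieval per sub-query concurrently, then dedupe by doc_id."""
    if not sub_queries:
        return []
    with ThreadPoolExecutor(max_workers=min(8, len(sub_queries))) as pool:
        # map() yields results in input order, so evidence order is deterministic.
        batches = list(pool.map(lambda sq: search(sq, k), sub_queries))

    seen, evidence = set(), []
    for batch in batches:
        for hit in batch:
            if hit["doc_id"] not in seen:  # keep the first occurrence only
                seen.add(hit["doc_id"])
                evidence.append(hit)
    return evidence
```

Threads fit because retrieval calls are typically I/O-bound; total latency is then close to the slowest single retrieval rather than the sum.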

STEP 4

Multi-query, HyDE, and step-back: paraphrase the probe, not the question.

If decomposition handles "this is several questions," multi-query handles "this is one question expressed in only one way." The model generates 3–5 paraphrases of the query, each is searched independently, and the results are fused with RRF (see hybrid search and reranking). The win is recall: if "cannot log in" misses but "authentication failure" hits, the union finds it. The cost is roughly N× the retrieval calls, which usually run in parallel and matter for cost more than latency.
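For reference, reciprocal rank fusion over the per-paraphrase result lists fits in a few lines. The sketch assumes each ranking is a list of document ids, best first; `k=60` is the constant commonly used for RRF, and `top_n` is an illustrative cutoff.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60, top_n: int = 10) -> List[str]:
    """Fuse ranked doc-id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]
```

A document that appears in several paraphrase result lists outranks one that appears once at the same position, which is the behavior multi-query relies on.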

HyDE (Hypothetical Document Embeddings, Gao et al., 2022) generates a hypothetical answer to the query — not paraphrases — and embeds that hypothetical answer to search with. The intuition: a fake answer is shaped like the real passages in the index, so it is closer to them in embedding space than the bare question is. HyDE reliably helps weak or zero-shot retrievers. It can hurt strong fine-tuned dense retrievers, because the hypothetical document introduces a distribution shift and a chance to confabulate the wrong entities, which then drag retrieval off-target. The honest rule is to A/B it on your retriever and your corpus, not to adopt it as a default. This is the same caution from advanced RAG architectures, repeated because the failure mode is common.
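A minimal sketch of HyDE and of the A/B comparison it calls for. All three callables (`generate`, `embed`, `vector_search`) are hypothetical stand-ins for your own LLM, embedding model, and vector index, and the prompt wording is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

# Illustrative prompt; the exact wording is an assumption.
HYDE_PROMPT = "Write a short passage that plausibly answers this question:\n{query}"


def hyde_search(
    query: str,
    generate: Callable[[str], str],                       # LLM: prompt -> text
    embed: Callable[[str], Sequence[float]],              # text -> vector
    vector_search: Callable[[Sequence[float], int], List[str]],  # (vector, k) -> doc ids
    k: int = 10,
) -> List[str]:
    """Search with the embedding of a hypothetical answer, not of the question."""
    hypothetical = generate(HYDE_PROMPT.format(query=query))
    return vector_search(embed(hypothetical), k)


def recall_at_k(
    search_fn: Callable[[str, int], List[str]],
    labeled: List[Tuple[str, str]],  # (query, gold doc id)
    k: int = 10,
) -> float:
    """Fraction of labeled queries whose gold doc id appears in the top k."""
    hits = sum(gold in search_fn(query, k) for query, gold in labeled)
    return hits / len(labeled)
```

Computing `recall_at_k` once for plain embedding search and once for `hyde_search` on the same labeled set gives the comparison; keep HyDE only if the second number is clearly higher on your corpus.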

Step-back prompting (Zheng et al., 2023) does the opposite of decomposition: it generalizes the query to a higher-level question ("What's the maximum number of API calls per minute for free tier?" → "What are the free-tier rate limits?") and retrieves with that broader form, often in addition to the original. It helps when the user's specific question is over-constrained for the embedding model but the broader topic exists in the corpus.

The three transforms above all raise recall by searching with additional or altered probes. They also raise the rate at which off-topic chunks enter the candidate pool, and those chunks dilute what the cross-encoder has to sort through. Always pair query-expansion transforms with a real reranker downstream and a calibrated k_final. Otherwise you buy retrieval recall at the cost of generation precision and end up worse off.
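The pairing can be sketched as a rerank-and-cut step over the pooled candidates. Here `score` is a hypothetical stand-in for a cross-encoder relevance call, and `k_final` and `min_score` are illustrative knobs to calibrate on your own eval set.

```python
from typing import Callable, List, Optional


def rerank_and_cut(
    query: str,
    candidates: List[str],
    score: Callable[[str, str], float],  # stand-in for a cross-encoder: (query, chunk) -> relevance
    k_final: int = 5,
    min_score: Optional[float] = None,
) -> List[str]:
    """Score every pooled candidate against the original query and keep the best few."""
    scored = sorted(
        ((score(query, chunk), chunk) for chunk in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if min_score is not None:
        # Drop chunks the reranker considers irrelevant, even if fewer than k_final remain.
        scored = [pair for pair in scored if pair[0] >= min_score]
    return [chunk for _, chunk in scored[:k_final]]
```

Scoring against the original query, not the paraphrases or the hypothetical document, is the design choice that matters: expansion widens the pool, but the user's actual question decides what survives.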

STEP 5

Routing: choosing where to search before choosing how.

"What is our refund policy?" should go to the support knowledge base. "What was the Q4 revenue?" should go to the financial filings. "What's the syntax for grouping in Postgres?" should go to the Postgres docs index. Sending every query to every index is wasteful (latency, cost) and harmful (off-corpus chunks leak in and pollute the candidate set).

Query routing is a classification step that picks one or a small set of indices to search. Two practical implementations:

  • LLM-based router. A small model gets a list of indices with one-line descriptions and the query, and returns one or more index names. Cheap, flexible, easy to extend — just add a new index and a new line of description.
  • Embedding-based router. Each index has a representative embedding (centroid of its documents, or a written summary). The query embedding is matched to the closest index. Very fast, no extra LLM call, but harder to keep calibrated as indices drift.
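The embedding-based variant can be sketched in pure Python. The `margin` and `max_indices` parameters are illustrative assumptions: they let the router return a second index when it scores nearly as well as the best one, instead of forcing a single pick.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route_by_embedding(
    query_vec: Sequence[float],
    index_vecs: Dict[str, Sequence[float]],  # index name -> centroid or summary embedding
    margin: float = 0.05,
    max_indices: int = 2,
) -> List[str]:
    """Return the closest index, plus runners-up within `margin` of its score."""
    ranked = sorted(
        ((cosine(query_vec, vec), name) for name, vec in index_vecs.items()),
        reverse=True,
    )
    if not ranked:
        return []
    best_score = ranked[0][0]
    return [name for s, name in ranked[:max_indices] if best_score - s <= margin]
```

Recomputing the representative vectors on a schedule is the practical answer to drift; a router built on stale centroids fails silently.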

Routing also enables non-vector sources: a query that looks like a structured lookup ("orders from customer X in March") should go to SQL, not to the document index. A query that looks like a date or numeric range belongs to a filter, not a similarity search. The router's job is to recognize what kind of probe the query is and pick the right tool. This shades smoothly into agentic retrieval (see advanced RAG architectures) — routing is the static, single-step version of the same idea.
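An LLM-based router that treats SQL as just another destination can be sketched as follows. The registry names and descriptions are hypothetical, `complete` is an assumed prompt-in, text-out callable, and the validation step exists because a router model can return names that are not in the registry.

```python
from typing import Callable, Dict, List

# Hypothetical registry: names and descriptions are illustrative only.
TOOLS: Dict[str, str] = {
    "support_kb": "Help-center articles: refunds, billing, account access.",
    "financial_filings": "Quarterly and annual financial reports.",
    "postgres_docs": "PostgreSQL reference documentation.",
    "orders_sql": "Structured order lookups by customer and date (SQL, not vector search).",
}


def build_router_prompt(query: str, tools: Dict[str, str]) -> str:
    catalog = "\n".join(f"- {name}: {desc}" for name, desc in tools.items())
    return (
        "Pick the source(s) best suited to answer the query.\n"
        "Reply with source names only, comma-separated.\n\n"
        f"Sources:\n{catalog}\n\nQuery: {query}"
    )


def route_with_llm(
    query: str,
    complete: Callable[[str], str],
    tools: Dict[str, str] = TOOLS,
    fallback: str = "support_kb",
) -> List[str]:
    """Ask the model for source names, keep only registered ones, else fall back."""
    raw = complete(build_router_prompt(query, tools))
    picked = [part.strip() for part in raw.replace("\n", ",").split(",")]
    valid = [name for name in dict.fromkeys(picked) if name in tools]
    return valid or [fallback]
```

Adding a new source is one more registry entry; the caller then dispatches on the returned name, sending `orders_sql` to a query builder and everything else to similarity search.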

STEP 6

Putting it together: a default query-understanding pipeline.

Most systems do not need all six transforms above. A defensible default for a production RAG system, in order:

# before retrieval: a layered, mostly-skippable pipeline
def prepare_query(query, history):
    # 1. cheap rewrite if there is conversation context
    if history:
        query = rewrite_for_retrieval(history, query)

    # 2. route to the right index/tool
    index = route(query)               # may return SQL, KB, web, etc.

    # 3. decompose only if the query has multiple sub-questions
    sub_queries = decompose_if_needed(query)

    return {"index": index, "queries": sub_queries}

Multi-query and HyDE are experiments, not steps in this default. Turn them on per-query when recall is the demonstrated failure mode, and turn them off when answer precision degrades. Step-back is a useful escape valve when a query is over-constrained, but it should not be a default either.

The unifying principle: query understanding is the cheapest place to add intelligence to a RAG system, because everything downstream — retrievers, rerankers, generators — is conditional on the query being a sensible probe. Spend a small LLM call here before spending a large one at generation. And measure: any query transform that doesn't move final answer accuracy on a labeled eval set (see evaluating RAG) is overhead, regardless of how clever it looks in isolation.
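A sketch of that measurement, assuming two end-to-end pipelines exposed as callables (query in, answer out) and an `is_correct` judge you supply; both are placeholders for whatever your eval harness provides.

```python
from typing import Callable, Dict, List, Tuple


def compare_transform(
    baseline: Callable[[str], str],         # pipeline without the transform
    candidate: Callable[[str], str],        # pipeline with the transform
    labeled: List[Tuple[str, str]],         # (query, gold answer)
    is_correct: Callable[[str, str], bool], # judge: (answer, gold) -> bool
) -> Dict[str, object]:
    """Paired accuracy comparison on the same labeled queries."""
    fixed, broken = [], []
    base_ok = cand_ok = 0
    for query, gold in labeled:
        b = is_correct(baseline(query), gold)
        c = is_correct(candidate(query), gold)
        base_ok += b
        cand_ok += c
        if c and not b:
            fixed.append(query)    # transform turned a miss into a hit
        if b and not c:
            broken.append(query)   # transform turned a hit into a miss
    n = len(labeled)
    return {
        "baseline_acc": base_ok / n,
        "candidate_acc": cand_ok / n,
        "fixed": fixed,
        "broken": broken,
    }
```

The `fixed` and `broken` lists matter as much as the headline numbers: a transform that fixes as many queries as it breaks shows no accuracy change while still adding latency and cost.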