Context
RAG (Retrieval-Augmented Generation) seems simple in theory: retrieve relevant documents, feed them to an LLM, generate a response. In practice, each of these steps has dozens of optimization levers.
This visual emerged from production RAG implementations where teams struggled with poor retrieval quality. Simply throwing documents into a vector database and hoping for the best doesn't work—you need strategic optimization at each pipeline stage.
When to Use This Visual
Ideal for:
- Technical architecture reviews for RAG systems
- ML engineering team training
- Debugging poor retrieval quality
- Planning RAG optimization roadmaps
Target Audience:
- ML engineers building RAG systems
- Data scientists optimizing retrieval
- Backend engineers integrating RAG into products
- Technical leads evaluating RAG approaches
Phase 1: Pre-Retrieval Strategies
Goal: Improve query quality before hitting the retrieval system.
Query Expansion
Transform sparse queries into richer semantic representations:
- Example: "MCP security" → "Model Context Protocol authentication authorization encryption audit trails"
- Technique: Use synonym expansion, acronym expansion, or LLM-based query enrichment
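A minimal sketch of dictionary-based expansion. The expansion table and terms below are illustrative; in practice it would be populated from a domain glossary or generated offline by an LLM:

```python
# Dictionary-based query expansion. The expansion table is illustrative;
# a real system would build it from domain glossaries or LLM output.
EXPANSIONS = {
    "mcp": ["model context protocol"],
    "security": ["authentication", "authorization", "encryption", "audit trails"],
}

def expand_query(query: str) -> str:
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(EXPANSIONS.get(term, []))
    return " ".join(expanded)

print(expand_query("MCP security"))
# mcp security model context protocol authentication authorization encryption audit trails
```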
Query Rewriting
Reformulate unclear or ambiguous queries:
- Example: "How do I fix it?" → "How do I debug permission errors in enterprise RAG systems?"
- Technique: Prompt an LLM to restate the query as a standalone, specific question, using conversation history to resolve vague references like "it"
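A sketch of LLM-based rewriting, assuming the OpenAI Python SDK (any chat-capable model works the same way; the model name and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rewrite_query(user_query: str, conversation_history: str) -> str:
    # Conversation history supplies the missing context behind vague
    # queries like "How do I fix it?".
    prompt = (
        "Rewrite the user's query as a standalone, specific search query.\n\n"
        f"Conversation so far:\n{conversation_history}\n\n"
        f"User query: {user_query}\n"
        "Rewritten query:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```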
Knowledge Graph Enrichment
Use structured knowledge to expand query context:
- Example: Query about "RAG" also retrieves related concepts like "embeddings," "vector databases," "chunking strategies"
- Technique: Maintain a domain knowledge graph, traverse relationships
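A minimal sketch using a hand-built adjacency map as a stand-in for a real knowledge graph store (the graph contents are illustrative):

```python
# One-hop traversal over a toy domain graph to pull in related concepts.
KNOWLEDGE_GRAPH = {
    "rag": ["embeddings", "vector databases", "chunking strategies"],
    "embeddings": ["embedding models", "vector databases"],
}

def enrich_with_graph(query: str, max_neighbors: int = 3) -> str:
    related = []
    for term in query.lower().split():
        related.extend(KNOWLEDGE_GRAPH.get(term, [])[:max_neighbors])
    return f"{query} {' '.join(related)}" if related else query

print(enrich_with_graph("RAG security"))
# RAG security embeddings vector databases chunking strategies
```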
When to use:
- User queries are typically short or ambiguous
- Domain has rich structured relationships
- Retrieval recall is low (missing relevant documents)
Phase 2: Retrieval Methods
Goal: Find the most relevant documents from the knowledge base.
Dense Retrieval (Embeddings)
Semantic search using vector embeddings:
- How it works: Embed query and documents into vector space, find nearest neighbors
- Strengths: Captures semantic similarity, handles synonyms
- Weaknesses: Can miss exact keyword matches (product names, IDs, acronyms); embedding inference and vector index maintenance add compute and infrastructure cost
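A brute-force sketch using sentence-transformers (the model name is illustrative); production systems replace the numpy scan with an approximate nearest-neighbor index such as FAISS, HNSW, or a vector database:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

documents = [
    "Reciprocal Rank Fusion merges ranked lists from several retrievers.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Cross-encoders rescore query-document pairs for reranking.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

def dense_retrieve(query: str, k: int = 2) -> list[str]:
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector        # cosine similarity (vectors are normalized)
    top_k = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top_k]

print(dense_retrieve("how do I combine multiple retrievers?"))
```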
Sparse Retrieval (BM25)
Keyword-based search using term frequency:
- How it works: Classic lexical ranking; scores documents by query-term frequency, weighted by inverse document frequency and normalized for document length
- Strengths: Fast, works well for exact matches, explainable
- Weaknesses: Misses semantic similarity (e.g., "AI" vs "artificial intelligence")
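A sketch using the rank_bm25 package with naive whitespace tokenization; a production analyzer would add stemming and stopword removal:

```python
from rank_bm25 import BM25Okapi

documents = [
    "Reciprocal Rank Fusion merges ranked lists from several retrievers.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Cross-encoders rescore query-document pairs for reranking.",
]
# Naive tokenization: lowercase and split on whitespace.
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

def sparse_retrieve(query: str, k: int = 2) -> list[str]:
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in ranked[:k]]

print(sparse_retrieve("BM25 term frequency"))
```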
Hybrid Retrieval
Combine dense and sparse methods:
- How it works: Run both retrievers, merge results using a fusion algorithm (e.g., Reciprocal Rank Fusion)
- Strengths: Gets the best of both worlds—semantic + keyword matching
- Weaknesses: More complex, requires tuning fusion weights
Production recommendation: Start with hybrid. Dense-only often misses critical exact matches (product names, acronyms, technical terms).
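Reciprocal Rank Fusion itself is only a few lines; the sketch below merges the ranked lists produced by the dense and sparse retrievers (k=60 is the constant from the original RRF paper and a common default):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each document earns 1 / (k + rank) from every list it appears in.
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_results = ["doc_3", "doc_1", "doc_7"]   # from the embedding retriever
sparse_results = ["doc_1", "doc_9", "doc_3"]  # from BM25
print(reciprocal_rank_fusion([dense_results, sparse_results]))
# doc_1 and doc_3 rise to the top because both retrievers found them
```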
Phase 3: Post-Retrieval Refinements
Goal: Optimize retrieved context before sending to the LLM.
Context Reranking
Re-score retrieved documents for relevance:
- How it works: Use a cross-encoder model to score query-document pairs
- Strengths: More accurate than initial retrieval, improves Top-K quality
- Weaknesses: Slower; one cross-encoder forward pass per (query, document) pair, so cost grows linearly with the number of candidates
When to use: When you retrieve 50-100 candidate documents but only want the top 5-10 for the LLM.
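A sketch using a sentence-transformers cross-encoder (the model name is illustrative; any passage-ranking cross-encoder works the same way):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model choice

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    # One forward pass per (query, document) pair: this is the linear cost.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```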
Summarization
Condense long documents to fit context windows:
- How it works: Use extractive or abstractive summarization
- Strengths: Saves tokens, removes noise
- Weaknesses: May lose critical details
When to use: Retrieved documents are long-form (whitepapers, docs) but LLM context is limited.
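A query-focused extractive sketch that keeps only the sentences most similar to the query; an LLM call would handle abstractive summarization instead. The sentence splitting and model choice here are deliberately simplistic:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def extractive_summary(query: str, document: str, max_sentences: int = 3) -> str:
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    if len(sentences) <= max_sentences:
        return document
    vectors = model.encode([query] + sentences, normalize_embeddings=True)
    scores = vectors[1:] @ vectors[0]                         # similarity to the query
    keep = sorted(np.argsort(scores)[::-1][:max_sentences])   # preserve original order
    return ". ".join(sentences[i] for i in keep) + "."
```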
Filtering/Selection
Remove low-quality or irrelevant results:
- How it works: Apply threshold scores, deduplication, recency filters
- Strengths: Improves signal-to-noise ratio
- Weaknesses: May accidentally filter out useful edge cases
When to use: Retrieval returns noisy results (duplicates, outdated docs, low-relevance content).
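A sketch combining a score threshold, exact-text deduplication, and a recency cutoff. The thresholds are illustrative, and production near-duplicate detection usually compares embeddings rather than exact text:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Hit:
    text: str
    score: float
    updated_at: datetime

def filter_hits(hits: list[Hit], min_score: float = 0.3, max_age_days: int = 365) -> list[Hit]:
    cutoff = datetime.now() - timedelta(days=max_age_days)
    seen_texts: set[str] = set()
    kept: list[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if hit.score < min_score or hit.updated_at < cutoff:
            continue                  # below threshold or too old
        if hit.text in seen_texts:
            continue                  # exact duplicate
        seen_texts.add(hit.text)
        kept.append(hit)
    return kept
```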
Putting It All Together
A production RAG pipeline might use:
- Pre-Retrieval: Query expansion for short queries
- Retrieval: Hybrid (dense + BM25) with Reciprocal Rank Fusion
- Post-Retrieval: Rerank top 50 → select top 10 → summarize if needed
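Wired together, that pipeline might look like the sketch below, reusing the illustrative helpers from the earlier snippets (expand_query, dense_retrieve, sparse_retrieve, reciprocal_rank_fusion, rerank, extractive_summary). The names and candidate budgets are assumptions, not a fixed recipe:

```python
def rag_retrieve(query: str) -> list[str]:
    # Pre-retrieval: enrich short or ambiguous queries
    expanded = expand_query(query)

    # Retrieval: hybrid dense + BM25, fused with RRF
    dense_hits = dense_retrieve(expanded, k=50)
    sparse_hits = sparse_retrieve(expanded, k=50)
    fused = reciprocal_rank_fusion([dense_hits, sparse_hits])[:50]

    # Post-retrieval: rerank, then trim to the LLM's context budget
    top_docs = rerank(query, fused, top_k=10)
    return [extractive_summary(query, doc) for doc in top_docs]
```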
Key insight: Don't optimize everything at once. Start with hybrid retrieval (biggest win), then add reranking if quality is still poor, then explore pre-retrieval if recall is low.
Evaluation is Critical
For every strategy, measure:
- Precision@K: What fraction of the top K results are actually relevant?
- Recall@K: What fraction of all relevant documents appear in the top K?
- Latency: How long does retrieval take?
- Cost: Embedding models, reranking models, and LLMs all have costs
Build eval harnesses before features—you can't optimize what you don't measure.
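A minimal harness for the two retrieval-quality metrics, given retrieved document IDs and a hand-labeled set of relevant IDs per query:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(relevant)

retrieved = ["doc_1", "doc_4", "doc_2", "doc_9", "doc_3"]
relevant = {"doc_1", "doc_2", "doc_7"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.4   (2 of the top 5 are relevant)
print(recall_at_k(retrieved, relevant, k=5))     # ~0.67 (2 of the 3 relevant docs retrieved)
```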
Related Patterns
- Chunking Strategies: How to split documents for optimal retrieval
- Embedding Model Selection: OpenAI vs. open-source, domain-specific fine-tuning
- Permission-Aware RAG: Filtering retrieved results by user permissions
