
RAG Strategies: Pre-Retrieval, Retrieval, and Post-Retrieval

A detailed technical whiteboard covering the three phases of RAG optimization: pre-retrieval techniques (query expansion, rewriting, knowledge graph enrichment), retrieval methods (dense embeddings, sparse BM25, hybrid approaches), and post-retrieval refinements (context reranking, summarization, filtering).

Whiteboard diagram: a user query flows through the retriever to relevant context and then to the LLM generator for an augmented answer, annotated with the three strategy phases (pre-retrieval, retrieval, post-retrieval).

Complete RAG optimization strategies across the retrieval pipeline

Key Takeaways

  • Pre-retrieval strategies like query expansion and knowledge graph enrichment improve retrieval quality by reformulating user queries before searching
  • Hybrid retrieval combining dense (embedding-based) and sparse (BM25 keyword) methods outperforms single-method approaches for most enterprise use cases
  • Post-retrieval techniques like context reranking and summarization ensure only the most relevant, concise information reaches the LLM for generation

Context

RAG (Retrieval-Augmented Generation) seems simple in theory: retrieve relevant documents, feed them to an LLM, generate a response. In practice, each of these steps has dozens of optimization levers.

This visual emerged from production RAG implementations where teams struggled with poor retrieval quality. Simply throwing documents into a vector database and hoping for the best doesn't work—you need strategic optimization at each pipeline stage.

When to Use This Visual

Ideal for:

  • Technical architecture reviews for RAG systems
  • ML engineering team training
  • Debugging poor retrieval quality
  • Planning RAG optimization roadmaps

Target Audience:

  • ML engineers building RAG systems
  • Data scientists optimizing retrieval
  • Backend engineers integrating RAG into products
  • Technical leads evaluating RAG approaches

Phase 1: Pre-Retrieval Strategies

Goal: Improve query quality before hitting the retrieval system.

Query Expansion

Transform short, underspecified queries into richer semantic representations:

  • Example: "MCP security" → "Model Context Protocol authentication authorization encryption audit trails"
  • Technique: Use synonym expansion, acronym expansion, or LLM-based query enrichment
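
As a minimal sketch, dictionary-based expansion might look like the following; the acronym and synonym tables are illustrative assumptions, and in practice they could come from a curated glossary or an LLM call.

```python
# Dictionary-based query expansion: append known acronym expansions and
# synonyms to the raw query. The tables below are illustrative only.
ACRONYMS = {"mcp": "model context protocol", "rag": "retrieval-augmented generation"}
SYNONYMS = {"security": ["authentication", "authorization", "encryption", "audit trails"]}

def expand_query(query: str) -> str:
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        if term in ACRONYMS:
            expanded.append(ACRONYMS[term])
        expanded.extend(SYNONYMS.get(term, []))
    return " ".join(expanded)

print(expand_query("MCP security"))
# mcp security model context protocol authentication authorization encryption audit trails
```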

Query Rewriting

Reformulate unclear or ambiguous queries:

  • Example: "How do I fix it?" → "How do I debug permission errors in enterprise RAG systems?"
  • Technique: Prompt an LLM to clarify and expand user intent
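
A hedged sketch of LLM-based rewriting using the OpenAI Python client; the model name, prompt wording, and the idea of passing conversation context are assumptions rather than a prescribed recipe.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REWRITE_PROMPT = (
    "Rewrite the user's question so it is specific and self-contained, "
    "using the conversation context. Return only the rewritten question.\n"
    "Context: {context}\nQuestion: {question}"
)

def rewrite_query(question: str, context: str, model: str = "gpt-4o-mini") -> str:
    # Assumed model name; any instruction-following model works the same way.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": REWRITE_PROMPT.format(context=context, question=question)}],
    )
    return response.choices[0].message.content.strip()

# rewrite_query("How do I fix it?",
#               "User is debugging permission errors in an enterprise RAG system")
# -> "How do I debug permission errors in an enterprise RAG system?"
```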

Knowledge Graph Enrichment

Use structured knowledge to expand query context:

  • Example: Query about "RAG" also retrieves related concepts like "embeddings," "vector databases," "chunking strategies"
  • Technique: Maintain a domain knowledge graph, traverse relationships
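
A toy sketch of graph-based enrichment, where a hand-built adjacency map stands in for a real domain knowledge graph; the concepts and relationships shown are assumptions.

```python
# Toy domain knowledge graph: concept -> directly related concepts.
KNOWLEDGE_GRAPH = {
    "rag": ["embeddings", "vector databases", "chunking strategies"],
    "embeddings": ["vector databases", "embedding models"],
}

def enrich_with_graph(query: str, max_hops: int = 1) -> list[str]:
    """Collect concepts reachable from query terms within max_hops edges."""
    frontier = {t for t in query.lower().split() if t in KNOWLEDGE_GRAPH}
    related = set(frontier)
    for _ in range(max_hops):
        frontier = {n for c in frontier for n in KNOWLEDGE_GRAPH.get(c, [])} - related
        related |= frontier
    return sorted(related)

print(enrich_with_graph("rag"))
# ['chunking strategies', 'embeddings', 'rag', 'vector databases']
```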

When to use:

  • User queries are typically short or ambiguous
  • Domain has rich structured relationships
  • Retrieval recall is low (missing relevant documents)

Phase 2: Retrieval Methods

Goal: Find the most relevant documents from the knowledge base.

Dense Retrieval (Embeddings)

Semantic search using vector embeddings:

  • How it works: Embed query and documents into vector space, find nearest neighbors
  • Strengths: Captures semantic similarity, handles synonyms
  • Weaknesses: Can miss exact keyword matches, computationally expensive
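
A minimal sketch of dense retrieval with sentence-transformers and cosine similarity; the model name is an assumption, and a production system would typically use an approximate nearest-neighbor index (FAISS, HNSW) instead of brute-force scoring.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding model; any model with an .encode() method works the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Reciprocal Rank Fusion merges ranked lists from multiple retrievers.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Cross-encoders rerank candidate documents for a given query.",
]
doc_vecs = model.encode(documents, normalize_embeddings=True)

def dense_search(query: str, k: int = 2) -> list[tuple[float, str]]:
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec  # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), documents[i]) for i in top]

print(dense_search("how do I combine sparse and dense rankings?"))
```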

Sparse Retrieval (BM25)

Keyword-based search using term frequency:

  • How it works: Traditional information retrieval, ranks by term overlap
  • Strengths: Fast, works well for exact matches, explainable
  • Weaknesses: Misses semantic similarity (e.g., "AI" vs "artificial intelligence")
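
A small sketch using the rank_bm25 package with a naive whitespace tokenizer; production systems more often rely on the BM25 implementation built into their search engine (e.g., Elasticsearch or OpenSearch).

```python
from rank_bm25 import BM25Okapi

documents = [
    "Model Context Protocol authentication and audit trails",
    "Vector databases store dense embeddings for semantic search",
    "BM25 is a sparse, keyword-based ranking function",
]
tokenized = [doc.lower().split() for doc in documents]  # naive whitespace tokenizer
bm25 = BM25Okapi(tokenized)

def sparse_search(query: str, k: int = 2) -> list[tuple[float, str]]:
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(scores, documents), reverse=True)[:k]
    return [(float(s), d) for s, d in ranked]

print(sparse_search("BM25 keyword ranking"))
```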

Hybrid Retrieval

Combine dense and sparse methods:

  • How it works: Run both retrievers, merge results using a fusion algorithm (e.g., Reciprocal Rank Fusion)
  • Strengths: Gets the best of both worlds—semantic + keyword matching
  • Weaknesses: More complex, requires tuning fusion weights

Production recommendation: Start with hybrid. Dense-only often misses critical exact matches (product names, acronyms, technical terms).
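
A sketch of Reciprocal Rank Fusion over two ranked lists of document IDs; k=60 is the constant commonly used in the literature, and the rankings here are placeholders.

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense_ranking = ["doc3", "doc1", "doc7"]   # from the embedding retriever
sparse_ranking = ["doc1", "doc9", "doc3"]  # from BM25
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
# doc1 and doc3 rise to the top because both retrievers rank them highly
```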

Phase 3: Post-Retrieval Refinements

Goal: Optimize retrieved context before sending to the LLM.

Context Reranking

Re-score retrieved documents for relevance:

  • How it works: Use a cross-encoder model to score query-document pairs
  • Strengths: More accurate than initial retrieval, improves Top-K quality
  • Weaknesses: Slower, since each of the n candidate documents needs its own cross-encoder forward pass

When to use: When you retrieve 50-100 candidate documents but only want the top 5-10 for the LLM.
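
A sketch using a sentence-transformers CrossEncoder; the model name is an assumption, and any query-document scoring model would slot in the same way.

```python
from sentence_transformers import CrossEncoder

# Assumed reranking model; it scores each (query, document) pair jointly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[tuple[float, str]]:
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)  # one forward pass per candidate
    ranked = sorted(zip(scores, candidates), reverse=True)
    return [(float(s), d) for s, d in ranked[:top_k]]

# rerank("how does hybrid retrieval work?", candidate_docs_from_retriever, top_k=10)
```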

Summarization

Condense long documents to fit context windows:

  • How it works: Use extractive or abstractive summarization
  • Strengths: Saves tokens, removes noise
  • Weaknesses: May lose critical details

When to use: Retrieved documents are long-form (whitepapers, docs) but LLM context is limited.
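
A crude extractive sketch that keeps only the sentences overlapping the query; real pipelines might instead call an LLM or a dedicated context-compression model.

```python
import re

def extractive_summary(query: str, document: str, max_sentences: int = 3) -> str:
    """Keep the sentences that share the most terms with the query."""
    query_terms = set(query.lower().split())
    sentences = re.split(r"(?<=[.!?])\s+", document)
    scored = sorted(
        sentences,
        key=lambda s: len(query_terms & set(s.lower().split())),
        reverse=True,
    )
    kept = set(scored[:max_sentences])
    # Preserve the original sentence order so the summary reads naturally.
    return " ".join(s for s in sentences if s in kept)

# extractive_summary("hybrid retrieval fusion", long_whitepaper_text)
```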

Filtering/Selection

Remove low-quality or irrelevant results:

  • How it works: Apply threshold scores, deduplication, recency filters
  • Strengths: Improves signal-to-noise ratio
  • Weaknesses: May accidentally filter out useful edge cases

When to use: Retrieval returns noisy results (duplicates, outdated docs, low-relevance content).
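
A sketch of score-threshold, deduplication, and recency filtering; the field names, threshold, and age cutoff are placeholders to tune per corpus.

```python
from datetime import datetime, timedelta

def filter_results(results: list[dict], min_score: float = 0.3,
                   max_age_days: int = 365) -> list[dict]:
    """Drop low-score, duplicate, and stale results. Each result is assumed to
    carry 'text', 'score', and 'updated_at' (datetime) fields."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    seen_texts: set[str] = set()
    kept = []
    for r in results:
        if r["score"] < min_score or r["updated_at"] < cutoff:
            continue
        key = r["text"].strip().lower()
        if key in seen_texts:  # naive exact-match dedup; near-dup detection is fancier
            continue
        seen_texts.add(key)
        kept.append(r)
    return kept
```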

Putting It All Together

A production RAG pipeline might use:

  1. Pre-Retrieval: Query expansion for short queries
  2. Retrieval: Hybrid (dense + BM25) with Reciprocal Rank Fusion
  3. Post-Retrieval: Rerank top 50 → select top 10 → summarize if needed

Key insight: Don't optimize everything at once. Start with hybrid retrieval (biggest win), then add reranking if quality is still poor, then explore pre-retrieval if recall is low.
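
As one way to picture the composition, the skeleton below wires the stages together; the callables it expects (expand, retrievers, fuse, rerank, generate) are stand-ins for the kinds of functions sketched above, not a library API.

```python
from typing import Callable

def answer(
    query: str,
    expand: Callable[[str], str],                  # pre-retrieval step
    retrievers: list[Callable[[str], list[str]]],  # each returns ranked doc IDs
    fuse: Callable[[list[list[str]]], list[tuple[str, float]]],
    rerank: Callable[[str, list[str]], list[str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    expanded = expand(query)                                    # 1. expand/rewrite the query
    rankings = [retrieve(expanded) for retrieve in retrievers]  # 2. dense + sparse retrieval
    fused = [doc_id for doc_id, _ in fuse(rankings)][:50]
    top_context = rerank(query, fused)[:10]                     # 3. rerank, then trim
    return generate(query, top_context)
```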

Evaluation is Critical

For every strategy, measure:

  • Precision@K: What fraction of the top K results are actually relevant?
  • Recall@K: What fraction of all relevant documents appear in the top K?
  • Latency: How long does retrieval take?
  • Cost: Embedding models, reranking models, and LLMs all have costs

Build eval harnesses before features—you can't optimize what you don't measure.
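
A minimal sketch of Precision@K and Recall@K against a hand-labeled relevance set; latency and cost would be tracked separately by the eval harness.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / max(len(relevant), 1)

retrieved = ["doc3", "doc1", "doc7", "doc9"]
relevant = {"doc1", "doc2", "doc3"}
print(precision_at_k(retrieved, relevant, k=3))  # 0.667
print(recall_at_k(retrieved, relevant, k=3))     # 0.667
```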

Related Patterns

  • Chunking Strategies: How to split documents for optimal retrieval
  • Embedding Model Selection: OpenAI vs. open-source, domain-specific fine-tuning
  • Permission-Aware RAG: Filtering retrieved results by user permissions

Prompt Intent

Create a technical reference guide for ML engineers implementing production RAG systems, covering optimization strategies at each pipeline stage