Context
RAG (Retrieval-Augmented Generation) seems simple in theory: retrieve relevant documents, feed them to an LLM, generate a response. In practice, each of these steps has dozens of optimization levers.
This visual emerged from production RAG implementations where teams struggled with poor retrieval quality. Simply throwing documents into a vector database and hoping for the best doesn't work—you need strategic optimization at each pipeline stage.
When to Use This Visual
Ideal for:
- Technical architecture reviews for RAG systems
- ML engineering team training
- Debugging poor retrieval quality
- Planning RAG optimization roadmaps
Target Audience:
- ML engineers building RAG systems
- Data scientists optimizing retrieval
- Backend engineers integrating RAG into products
- Technical leads evaluating RAG approaches
Phase 1: Pre-Retrieval Strategies
Goal: Improve query quality before hitting the retrieval system.
Query Expansion
Transform sparse queries into richer semantic representations:
- Example: "MCP security" → "Model Context Protocol authentication authorization encryption audit trails"
- Technique: Use synonym expansion, acronym expansion, or LLM-based query enrichment
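A minimal sketch of dictionary-based expansion. The expansion table and terms below are illustrative; in practice it would be populated from a domain glossary or generated offline by an LLM:

```python
# Dictionary-based query expansion. The expansion table is illustrative;
# a real system would build it from domain glossaries or LLM output.
EXPANSIONS = {
    "mcp": ["model context protocol"],
    "security": ["authentication", "authorization", "encryption", "audit trails"],
}

def expand_query(query: str) -> str:
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(EXPANSIONS.get(term, []))
    return " ".join(expanded)

print(expand_query("MCP security"))
# mcp security model context protocol authentication authorization encryption audit trails
```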
Query Rewriting
Reformulate unclear or ambiguous queries:
- Example: "How do I fix it?" → "How do I debug permission errors in enterprise RAG systems?"
- Technique: Prompt an LLM to restate the query as a standalone, specific question, using conversation history to resolve vague references like "it"
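A sketch of LLM-based rewriting, assuming the OpenAI Python SDK (any chat-capable model works the same way; the model name and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rewrite_query(user_query: str, conversation_history: str) -> str:
    # Conversation history supplies the missing context behind vague
    # queries like "How do I fix it?".
    prompt = (
        "Rewrite the user's query as a standalone, specific search query.\n\n"
        f"Conversation so far:\n{conversation_history}\n\n"
        f"User query: {user_query}\n"
        "Rewritten query:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```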
Knowledge Graph Enrichment
Use structured knowledge to expand query context:
- Example: Query about "RAG" also retrieves related concepts like "embeddings," "vector databases," "chunking strategies"
- Technique: Maintain a domain knowledge graph, traverse relationships
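A minimal sketch using a hand-built adjacency map as a stand-in for a real knowledge graph store (the graph contents are illustrative):

```python
# One-hop traversal over a toy domain graph to pull in related concepts.
KNOWLEDGE_GRAPH = {
    "rag": ["embeddings", "vector databases", "chunking strategies"],
    "embeddings": ["embedding models", "vector databases"],
}

def enrich_with_graph(query: str, max_neighbors: int = 3) -> str:
    related = []
    for term in query.lower().split():
        related.extend(KNOWLEDGE_GRAPH.get(term, [])[:max_neighbors])
    return f"{query} {' '.join(related)}" if related else query

print(enrich_with_graph("RAG security"))
# RAG security embeddings vector databases chunking strategies
```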
When to use:
- User queries are typically short or ambiguous
- Domain has rich structured relationships
- Retrieval recall is low (missing relevant documents)
Phase 2: Retrieval Methods
Goal: Find the most relevant documents from the knowledge base.
Dense Retrieval (Embeddings)
Semantic search using vector embeddings:
- How it works: Embed query and documents into vector space, find nearest neighbors
- Strengths: Captures semantic similarity, handles synonyms
- Weaknesses: Can miss exact keyword matches (product names, IDs, acronyms); embedding inference and vector index maintenance add compute and infrastructure cost
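A brute-force sketch using sentence-transformers (the model name is illustrative); production systems replace the numpy scan with an approximate nearest-neighbor index such as FAISS, HNSW, or a vector database:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

documents = [
    "Reciprocal Rank Fusion merges ranked lists from several retrievers.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Cross-encoders rescore query-document pairs for reranking.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

def dense_retrieve(query: str, k: int = 2) -> list[str]:
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector        # cosine similarity (vectors are normalized)
    top_k = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top_k]

print(dense_retrieve("how do I combine multiple retrievers?"))
```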
Sparse Retrieval (BM25)
Keyword-based search using term frequency:
- How it works: Classic lexical ranking; scores documents by query-term frequency, weighted by inverse document frequency and normalized for document length
- Strengths: Fast, works well for exact matches, explainable
- Weaknesses: Misses semantic similarity (e.g., "AI" vs "artificial intelligence")
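A sketch using the rank_bm25 package with naive whitespace tokenization; a production analyzer would add stemming and stopword removal:

```python
from rank_bm25 import BM25Okapi

documents = [
    "Reciprocal Rank Fusion merges ranked lists from several retrievers.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Cross-encoders rescore query-document pairs for reranking.",
]
# Naive tokenization: lowercase and split on whitespace.
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

def sparse_retrieve(query: str, k: int = 2) -> list[str]:
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in ranked[:k]]

print(sparse_retrieve("BM25 term frequency"))
```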
Hybrid Retrieval
Combine dense and sparse methods:
- How it works: Run both retrievers, merge results using a fusion algorithm (e.g., Reciprocal Rank Fusion)
- Strengths: Gets the best of both worlds—semantic + keyword matching
- Weaknesses: More complex, requires tuning fusion weights
Production recommendation: Start with hybrid. Dense-only often misses critical exact matches (product names, acronyms, technical terms).
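Reciprocal Rank Fusion itself is only a few lines; the sketch below merges the ranked lists produced by the dense and sparse retrievers (k=60 is the constant from the original RRF paper and a common default):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each document earns 1 / (k + rank) from every list it appears in.
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_results = ["doc_3", "doc_1", "doc_7"]   # from the embedding retriever
sparse_results = ["doc_1", "doc_9", "doc_3"]  # from BM25
print(reciprocal_rank_fusion([dense_results, sparse_results]))
# doc_1 and doc_3 rise to the top because both retrievers found them
```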
Phase 3: Post-Retrieval Refinements
Goal: Optimize retrieved context before sending to the LLM.
Context Reranking
Re-score retrieved documents for relevance:
- How it works: Use a cross-encoder model to score query-document pairs
- Strengths: More accurate than initial retrieval, improves Top-K quality
- Weaknesses: Slower; one cross-encoder forward pass per (query, document) pair, so cost grows linearly with the number of candidates
When to use: When you retrieve 50-100 candidate documents but only want the top 5-10 for the LLM.
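A sketch using a sentence-transformers cross-encoder (the model name is illustrative; any passage-ranking cross-encoder works the same way):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model choice

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    # One forward pass per (query, document) pair: this is the linear cost.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```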
Summarization
Condense long documents to fit context windows:
- How it works: Use extractive or abstractive summarization
- Strengths: Saves tokens, removes noise
- Weaknesses: May lose critical details
When to use: Retrieved documents are long-form (whitepapers, docs) but LLM context is limited.
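A query-focused extractive sketch that keeps only the sentences most similar to the query; an LLM call would handle abstractive summarization instead. The sentence splitting and model choice here are deliberately simplistic:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def extractive_summary(query: str, document: str, max_sentences: int = 3) -> str:
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    if len(sentences) <= max_sentences:
        return document
    vectors = model.encode([query] + sentences, normalize_embeddings=True)
    scores = vectors[1:] @ vectors[0]                         # similarity to the query
    keep = sorted(np.argsort(scores)[::-1][:max_sentences])   # preserve original order
    return ". ".join(sentences[i] for i in keep) + "."
```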
Filtering/Selection
Remove low-quality or irrelevant results:
- How it works: Apply threshold scores, deduplication, recency filters
- Strengths: Improves signal-to-noise ratio
- Weaknesses: May accidentally filter out useful edge cases
When to use: Retrieval returns noisy results (duplicates, outdated docs, low-relevance content).
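A sketch combining a score threshold, exact-text deduplication, and a recency cutoff. The thresholds are illustrative, and production near-duplicate detection usually compares embeddings rather than exact text:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Hit:
    text: str
    score: float
    updated_at: datetime

def filter_hits(hits: list[Hit], min_score: float = 0.3, max_age_days: int = 365) -> list[Hit]:
    cutoff = datetime.now() - timedelta(days=max_age_days)
    seen_texts: set[str] = set()
    kept: list[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if hit.score < min_score or hit.updated_at < cutoff:
            continue                  # below threshold or too old
        if hit.text in seen_texts:
            continue                  # exact duplicate
        seen_texts.add(hit.text)
        kept.append(hit)
    return kept
```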
Putting It All Together
A production RAG pipeline might use:
- Pre-Retrieval: Query expansion for short queries
- Retrieval: Hybrid (dense + BM25) with Reciprocal Rank Fusion
- Post-Retrieval: Rerank top 50 → select top 10 → summarize if needed
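Wired together, that pipeline might look like the sketch below, reusing the illustrative helpers from the earlier snippets (expand_query, dense_retrieve, sparse_retrieve, reciprocal_rank_fusion, rerank, extractive_summary). The names and candidate budgets are assumptions, not a fixed recipe:

```python
def rag_retrieve(query: str) -> list[str]:
    # Pre-retrieval: enrich short or ambiguous queries
    expanded = expand_query(query)

    # Retrieval: hybrid dense + BM25, fused with RRF
    dense_hits = dense_retrieve(expanded, k=50)
    sparse_hits = sparse_retrieve(expanded, k=50)
    fused = reciprocal_rank_fusion([dense_hits, sparse_hits])[:50]

    # Post-retrieval: rerank, then trim to the LLM's context budget
    top_docs = rerank(query, fused, top_k=10)
    return [extractive_summary(query, doc) for doc in top_docs]
```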
Key insight: Don't optimize everything at once. Start with hybrid retrieval (biggest win), then add reranking if quality is still poor, then explore pre-retrieval if recall is low.
Evaluation is Critical
For every strategy, measure:
- Precision@K: What fraction of the top K results are actually relevant?
- Recall@K: What fraction of all relevant documents appear in the top K?
- Latency: How long does retrieval take?
- Cost: Embedding models, reranking models, and LLMs all have costs
Build eval harnesses before features—you can't optimize what you don't measure.
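A minimal harness for the two retrieval-quality metrics, given retrieved document IDs and a hand-labeled set of relevant IDs per query:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(relevant)

retrieved = ["doc_1", "doc_4", "doc_2", "doc_9", "doc_3"]
relevant = {"doc_1", "doc_2", "doc_7"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.4   (2 of the top 5 are relevant)
print(recall_at_k(retrieved, relevant, k=5))     # ~0.67 (2 of the 3 relevant docs retrieved)
```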
Related Patterns
- Chunking Strategies: How to split documents for optimal retrieval
- Embedding Model Selection: OpenAI vs. open-source, domain-specific fine-tuning
- Permission-Aware RAG: Filtering retrieved results by user permissions
