Search first, then generate — grounding LLMs in your own data
RAG is a pattern that augments an LLM's generation with external knowledge retrieved at query time. Instead of relying solely on what the model memorized during training, you fetch relevant documents from your own data store and inject them into the prompt as context.
User Query
│
▼
┌─────────────┐       ┌──────────────────┐       ┌──────────────┐
│  Retriever  │──────▶│  Retrieved Docs  │──────▶│     LLM      │
│ (vector DB) │       │  (top-k chunks)  │       │ (generation) │
└─────────────┘       └──────────────────┘       └──────────────┘
│
▼
Final Answer
(grounded in your data)
Think of it like a developer who, instead of answering from memory, first runs a search across the docs, reads the relevant sections, then gives you an informed answer.
RAG separates knowledge (what the system knows) from reasoning (how it thinks). The LLM provides reasoning; your data store provides knowledge. This means: update knowledge without retraining, ground answers in verifiable sources, and control exactly what information the model can access.
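This separation can be sketched end to end in a few lines. A toy version, where `retrieve` is a word-overlap stand-in for a real vector store and `build_prompt` produces what you would hand to the LLM (all names here are hypothetical, not from the demo code):

```python
def retrieve(query: str, store: dict[str, str], k: int = 3) -> list[str]:
    """Toy retriever: rank docs by naive word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        store.items(),
        key=lambda kv: len(q_words & set(kv[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Inject retrieved chunks into the prompt as numbered context."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using ONLY the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

docs = {
    "auth.md": "we migrated auth to oauth2 in q3",
    "billing.md": "billing runs on the invoices service",
}
query = "how does auth work"
prompt = build_prompt(query, retrieve(query, docs, k=1))
```

In production the only pieces that change are the retriever (a real embedding model plus vector index) and the final step (an actual LLM call); the shape of the pipeline stays the same.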
| Problem | How RAG Solves It |
|---|---|
| Knowledge cutoff — LLMs don't know your internal data | Retrieves current, private data at query time |
| Hallucination — LLMs fabricate facts | Grounds answers in actual source documents |
| Cost of fine-tuning — retraining is expensive and slow | Just update the document store, no retraining |
| Attribution — users need to verify claims | Cite the exact source chunks used |
| Data freshness — information changes daily | Re-index documents; LLM always sees latest data |
| Use Case | Example | Why RAG |
|---|---|---|
| Customer support bots | Zendesk, Intercom | Search help center by intent, not keywords |
| Enterprise search | Glean, Notion AI | "Find the design doc about auth migration" |
| Code assistance | Cursor, GitHub Copilot | RAG over your codebase for context-aware help |
| Legal / compliance | Harvey AI | Query case law from millions of documents |
| Developer tools | Internal platforms | Query runbooks, API docs, incident history |
Rule of thumb: If your LLM needs to answer questions about data it wasn't trained on, you need RAG.
- **Sparse (keyword) search:** "python error" matches the exact term. Good for error codes, function names, identifiers.
- **Dense (semantic) search:** "python error" matches "bug in my script". Good for intent, paraphrases, concepts.
- **Hybrid search:** combine sparse + dense with Reciprocal Rank Fusion. Best of both worlds; the production standard.
- **Re-ranking:** a cross-encoder rescores the top-K candidates. More accurate than a bi-encoder, but too slow for the first stage.
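Reciprocal Rank Fusion is simple enough to sketch directly. Each document scores the sum of 1/(k + rank) across the ranked lists it appears in; k = 60 is the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each doc scores sum(1 / (k + rank)) over the
    lists it appears in. Documents in several lists rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. a BM25 ranking
vector_hits = ["doc_b", "doc_d", "doc_a"]    # e.g. a dense ranking
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Note that RRF only needs ranks, not raw scores, which is why it merges BM25 and cosine-similarity results without any score normalization.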
| Strategy | How | Best For |
|---|---|---|
| Fixed-size | Split every N characters | Uniform content (logs) |
| Sentence-aware | Accumulate sentences up to limit | Prose, articles |
| Recursive (separators) | Split on \n\n → \n → . → space | Markdown, code, structured docs |
| Semantic | Split where meaning shifts | High-value content |
Sweet spot: 256–512 tokens with 10–15% overlap. Recursive splitting is the production default.
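A minimal sketch of fixed-size chunking with overlap (character counts stand in for tokens here; a real pipeline would count with its tokenizer):

```python
def chunk_fixed(text: str, size: int = 400, overlap: float = 0.125) -> list[str]:
    """Fixed-size chunks with fractional overlap. `size` is in characters
    for simplicity; swap in token counts for production use."""
    step = int(size * (1 - overlap))          # advance less than `size`
    return [text[i : i + size] for i in range(0, len(text), step)]

chunks = chunk_fixed("x" * 1000, size=400)    # 3 chunks, 50-char overlap
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated storage.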
Scenario: Internal engineering docs chatbot. 2,000+ pages across Confluence, Notion, GitHub wikis. Engineers waste hours searching.
┌─────────────┐ ┌──────────────┐ ┌──────────────┐
│ Docs (MD, │────→│ Chunking + │────→│ Vector DB │
│ PDF, HTML) │ │ Embedding │ │ (FAISS) │
└─────────────┘ └──────────────┘ └──────────────┘
│
┌─────────────┐ ┌──────────────┐ │
│ User Query │────→│ Embed Query │──── search ┘
│ │ └──────────────┘
│ │ │
│ │ ┌──────────────┐ ┌──────────────┐
│ │ │ Top-K Chunks│────→│ Claude LLM │
│ │ └──────────────┘ │ + Context │
│ │ └──────┬───────┘
│ Response │←────────────────────────────────┘
└─────────────┘
The demo in app.py implements exactly this pipeline using sample markdown docs about FastAPI, database patterns, and observability.
Code files: app.py (full RAG demo), chunking.py (text chunking), and vector_store.py (FAISS abstraction); chunking.py and vector_store.py are carried over from module 02. Sample docs live in docs/.
┌────────────────────────────────────────────────────────────────┐
│                          API Gateway                           │
│                   (Auth, Rate Limit, Routing)                  │
└──────────────┬──────────────────────────────────┬──────────────┘
               │                                  │
    ┌──────────▼──────────┐           ┌───────────▼───────────┐
    │  Ingestion Service  │           │     Query Service     │
    │                     │           │                       │
    │ • Receive documents │           │ • Embed user query    │
    │ • Detect changes    │           │ • Vector search       │
    │ • Chunk text        │           │ • Re-rank results     │
    │ • Generate embeds   │           │ • Build LLM prompt    │
    │ • Store in vector DB│           │ • Stream response     │
    │ • Track doc→chunk   │           │ • Return citations    │
    └──────────┬──────────┘           └───────────┬───────────┘
               │                                  │
    ┌──────────▼──────────────────────────────────▼──────────┐
    │                    Vector Database                     │
    │       (pgvector / Pinecone / Qdrant / Weaviate)        │
    │        Stores: vector + metadata + chunk text          │
    └────────────────────────────────────────────────────────┘
| Scale | Vector Store | Why |
|---|---|---|
| < 100K vectors | FAISS in-memory | Simple, fast, no infra |
| 100K – 10M | pgvector (PostgreSQL) | SQL filtering + vectors, one DB to manage |
| 10M – 100M | Qdrant / Weaviate | Purpose-built, better perf at scale |
| 100M+ | Pinecone / custom sharding | Managed, distributed, billion-scale |
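At the small end of this table, "vector search" is just an exact nearest-neighbour scan over every stored vector, which is what a flat FAISS index does. A pure-Python sketch with toy 2-d vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query_vec: list[float], index: dict[str, list[float]], k: int = 2):
    """Exact scan over every stored vector, most similar first."""
    ranked = sorted(index, key=lambda d: cosine(query_vec, index[d]), reverse=True)
    return ranked[:k]

index = {
    "auth.md": [0.9, 0.1],
    "billing.md": [0.1, 0.9],
    "logging.md": [0.7, 0.7],
}
top = search([1.0, 0.0], index)
```

The larger tiers in the table exist because this exact scan is O(n) per query; approximate indexes (IVF, HNSW) trade a little recall for sub-linear search.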
**Wrong chunk size.** Too large (> 1000 tokens) and the meaning gets diluted; too small (< 100 tokens) and context is lost. Fix: 256–512 tokens with 10–15% overlap, tested with your actual queries. This is the most impactful tuning knob.
**Evaluating only the final answer.** If retrieval returns the wrong chunks, the LLM cannot give the right answer, yet most engineers evaluate only the final answer. Fix: build a retrieval eval set of 50+ (query, expected_doc) pairs and measure recall@5. This matters more than evaluating the LLM.
**No relevance threshold.** Vector search ALWAYS returns results, even when nothing relevant exists. Fix: set a min_score threshold; below it, return "I don't have information about this" instead of hallucinating from irrelevant context.
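A minimal sketch of the threshold guard; the 0.75 cutoff and the response strings are placeholders to tune against your own data:

```python
MIN_SCORE = 0.75  # hypothetical starting point; tune on your eval set

def answer(query: str, hits: list[tuple[str, float]]) -> str:
    """`hits` are (chunk_text, similarity) pairs from the vector store,
    best first. Refuse rather than generate from irrelevant context."""
    relevant = [chunk for chunk, score in hits if score >= MIN_SCORE]
    if not relevant:
        return "I don't have information about this."
    # placeholder for the real LLM call with `relevant` as context
    return f"grounded answer using {len(relevant)} chunk(s)"
```

Note the guard runs before any LLM call, so an off-topic query costs nothing beyond the vector search itself.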
**Retrieving too many chunks.** Retrieving 20 chunks drowns the signal in noise, and the "lost in the middle" problem means LLMs pay less attention to content in the middle of the context. Fix: start with 3–5 chunks; quality over quantity.
**Mismatched embedding models.** Using different embedding models for indexing and querying produces garbage: the vectors must live in the same space. Fix: version your indexes with the model name, and re-embed everything when switching models.
**Zombie chunks.** Deleting a source doc without removing its chunks leaves "zombie" chunks that cite deleted documents. Fix: track document → chunk mappings and delete old chunks when re-indexing.
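A sketch of doc → chunk tracking, with a plain dict standing in for the vector store:

```python
class ChunkTracker:
    """Remembers which chunk ids came from each source document, so that
    re-indexing (or deleting) a doc purges its stale chunks first."""

    def __init__(self) -> None:
        self.doc_to_chunks: dict[str, list[str]] = {}

    def index(self, doc_id: str, chunks: dict[str, str],
              store: dict[str, str]) -> None:
        for old_id in self.doc_to_chunks.get(doc_id, []):
            store.pop(old_id, None)          # no zombie chunks left behind
        store.update(chunks)
        self.doc_to_chunks[doc_id] = list(chunks)

store: dict[str, str] = {}                   # stand-in for the vector store
tracker = ChunkTracker()
tracker.index("auth.md", {"c1": "v1 text", "c2": "v1 more"}, store)
tracker.index("auth.md", {"c3": "v2 text"}, store)   # re-index after an edit
```

After the second call only the v2 chunk remains; the v1 chunks can no longer be retrieved or cited.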
**No metadata filtering.** Vector similarity alone isn't enough; you often need to filter by department, date, or access level before searching. Fix: store metadata on every chunk, pre-filter, then vector search.
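A sketch of pre-filter-then-search, with toy 2-d vectors and a hypothetical `dept` metadata field:

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def prefilter_search(query_vec, chunks, allowed_depts, k=3):
    """Apply the metadata filter first, then rank only survivors
    by similarity."""
    candidates = [c for c in chunks if c["dept"] in allowed_depts]
    return sorted(candidates, key=lambda c: dot(query_vec, c["vec"]),
                  reverse=True)[:k]

chunks = [
    {"id": "c1", "dept": "eng", "vec": [0.9, 0.1]},
    {"id": "c2", "dept": "hr", "vec": [1.0, 0.0]},   # closest, but filtered out
    {"id": "c3", "dept": "eng", "vec": [0.2, 0.8]},
]
hits = prefilter_search([1.0, 0.0], chunks, allowed_depts={"eng"}, k=1)
```

Filtering first matters for access control: a chunk the user may not read should never even enter the similarity ranking.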
**Hybrid search.** Combine keyword search with vector search using Reciprocal Rank Fusion. This is the production standard: pure vector search misses exact matches (error codes), and pure keyword search misses semantic matches.
**Two-stage re-ranking.** Use a bi-encoder for first-stage retrieval (top-100), then a cross-encoder to re-rank down to the top-5. Cross-encoders process the query and document together, which is slower but much more accurate.
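The two-stage shape can be sketched with toy scoring functions standing in for real bi- and cross-encoders (in production these would be actual models, e.g. from sentence-transformers):

```python
def fast_score(query: str, doc: str) -> float:
    """Stage 1 stand-in for a bi-encoder: cheap word-overlap score."""
    return float(len(set(query.split()) & set(doc.split())))

def accurate_score(query: str, doc: str) -> float:
    """Stage 2 stand-in for a cross-encoder, which reads query and
    document together and can reward e.g. the exact phrase."""
    return fast_score(query, doc) + (2.0 if query in doc else 0.0)

def retrieve_and_rerank(query, docs, first_k=100, final_k=5):
    """Cheap wide pass, then expensive narrow pass."""
    stage1 = sorted(docs, key=lambda d: fast_score(query, d),
                    reverse=True)[:first_k]
    return sorted(stage1, key=lambda d: accurate_score(query, d),
                  reverse=True)[:final_k]

docs = [
    "downgrade with alembic safely",
    "alembic downgrade steps",
    "structured logging setup",
]
top = retrieve_and_rerank("alembic downgrade", docs, first_k=2, final_k=1)
```

The economics are the point: the expensive scorer only ever sees `first_k` candidates, so its per-pair cost never touches the full corpus.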
**Query transformation.** Rewrite queries before searching. HyDE: generate a hypothetical answer and embed that. Multi-query: generate several reformulations, retrieve for each, merge. Step-back: abstract the question first.
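A sketch of the multi-query variant; in practice the reformulations come from an LLM and the retriever is a real vector store, both stubbed out here:

```python
def multi_query_retrieve(query, reformulations, retrieve_fn, k=5):
    """Retrieve for the original query and each reformulation, then
    merge with de-duplication, keeping first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for q in [query, *reformulations]:
        for doc in retrieve_fn(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged[:k]

# Canned results stand in for the vector store.
canned = {
    "how do I log in?": ["auth.md"],
    "authentication flow": ["auth.md", "oauth.md"],
    "login troubleshooting": ["faq.md"],
}
merged = multi_query_retrieve(
    "how do I log in?",
    ["authentication flow", "login troubleshooting"],
    lambda q: canned[q],
)
```

A production version often merges the per-query result lists with Reciprocal Rank Fusion instead of simple de-duplication.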
**Agentic RAG.** Give the LLM a "search" tool and let it decide when and what to retrieve: multi-step retrieval, query refinement, result combination. This is how modern AI assistants work.
**GraphRAG.** Build a knowledge graph from your documents and traverse relationships during retrieval. Powerful for multi-hop questions: "Who manages the team that owns the billing service?"
**Evaluation.** RAGAS measures faithfulness, relevance, and context precision/recall. Build custom eval sets from real user questions, and A/B test retrieval strategies with real traffic.
Chunk the sample docs using 3 strategies (fixed-size, sentence-aware, recursive from module 02). For 5 test queries, compare which strategy retrieves the most relevant chunks. Output a comparison table. Success: Identify which strategy performs best for your queries.
Extend the retriever: add BM25 keyword search (rank_bm25 library), implement Reciprocal Rank Fusion to merge result sets, compare hybrid vs. vector-only. Success: Query "alembic downgrade" returns results from keyword match AND semantic similarity.
Create 10 Q&A pairs from sample docs (ground truth). Run each through the RAG pipeline. Measure retrieval precision, recall, and generation faithfulness. Success: A JSON report showing quality metrics per question.