// RAG — Retrieval-Augmented Generation

Search first, then generate — grounding LLMs in your own data

01. Core Concepts

What Is RAG?

RAG is a pattern that augments an LLM's generation with external knowledge retrieved at query time. Instead of relying solely on what the model memorized during training, you fetch relevant documents from your own data store and inject them into the prompt as context.

User Query
    │
    ▼
┌──────────────┐     ┌──────────────────┐     ┌──────────────┐
│  Retriever   │────▶│  Retrieved Docs  │────▶│     LLM      │
│ (vector DB)  │     │  (top-k chunks)  │     │ (generation) │
└──────────────┘     └──────────────────┘     └──────────────┘
                                                     │
                                                     ▼
                                               Final Answer
                                          (grounded in your data)

Think of it like a developer who, instead of answering from memory, first runs a search across the docs, reads the relevant sections, then gives you an informed answer.

The Key Insight

RAG separates knowledge (what the system knows) from reasoning (how it thinks). The LLM provides reasoning; your data store provides knowledge. This means: update knowledge without retraining, ground answers in verifiable sources, and control exactly what information the model can access.

Module 02 taught you how to embed text and search by meaning. RAG is what makes that useful — it's the pattern that turns vector search into an intelligent Q&A system by adding an LLM on top.

02. Why It Matters

| Problem | How RAG Solves It |
| --- | --- |
| Knowledge cutoff — LLMs don't know your internal data | Retrieves current, private data at query time |
| Hallucination — LLMs fabricate facts | Grounds answers in actual source documents |
| Cost of fine-tuning — retraining is expensive and slow | Just update the document store, no retraining |
| Attribution — users need to verify claims | Cite the exact source chunks used |
| Data freshness — information changes daily | Re-index documents; LLM always sees latest data |

Where Companies Use This

| Use Case | Example | Why RAG |
| --- | --- | --- |
| Customer support bots | Zendesk, Intercom | Search help center by intent, not keywords |
| Enterprise search | Glean, Notion AI | "Find the design doc about auth migration" |
| Code assistance | Cursor, GitHub Copilot | RAG over your codebase for context-aware help |
| Legal / compliance | Harvey AI | Query case law from millions of documents |
| Developer tools | Internal platforms | Query runbooks, API docs, incident history |

Rule of thumb: If your LLM needs to answer questions about data it wasn't trained on, you need RAG.

03. Internal Mechanics

The Two Phases

A. Indexing (Offline)
   Documents → Chunk → Embed → Store in Vector DB
   Run once (or on a schedule) when source documents change.

B. Query (Online)
   User Query → Embed → Vector Search → Top-K Chunks
   → Build Prompt → LLM → Grounded Answer

Step-by-Step Query Flow

1. Embed query: convert the user's question into a vector, using the same embedding model as indexing.
2. Vector search: FAISS/HNSW finds the top-K most similar chunks. O(log n) with HNSW, fast even at millions of vectors.
3. Filter & rank: apply a similarity threshold to remove garbage results. Optional: metadata filtering, reranking with a cross-encoder.
4. Prompt construction: inject the retrieved chunks into the LLM prompt. System prompt: "Answer ONLY from the context. Cite sources."
5. LLM generation: Claude reads context + question and generates a grounded answer. Low temperature (0.1–0.3) for factual, faithful output.

Retrieval Strategies

Sparse (BM25)

Keyword matching. "python error" matches exact term. Good for error codes, function names, identifiers.

Dense (Embeddings)

Semantic matching. "python error" matches "bug in my script". Good for intent, paraphrases, concepts.

Hybrid (Both)

Combine sparse + dense with Reciprocal Rank Fusion. Best of both worlds. Production standard.
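
The fusion step is simple enough to sketch in a few lines. A minimal RRF implementation (function and document names are illustrative; k=60 is the constant from the original RRF paper and a common default):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 and dense retrieval disagree; fusion rewards docs ranked well by both
bm25_hits = ["doc_a", "doc_c", "doc_b"]
dense_hits = ["doc_b", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Note that doc_a wins because it ranks near the top of both lists, even though neither list puts it first alongside the other's winner.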

Reranking

Cross-encoder rescores top-K candidates. More accurate than bi-encoder but too slow for first stage.

Chunking — The #1 Quality Lever

| Strategy | How | Best For |
| --- | --- | --- |
| Fixed-size | Split every N characters | Uniform content (logs) |
| Sentence-aware | Accumulate sentences up to limit | Prose, articles |
| Recursive (separators) | Split on `\n\n` → `\n` → `.` → space | Markdown, code, structured docs |
| Semantic | Split where meaning shifts | High-value content |

Sweet spot: 256–512 tokens with 10-15% overlap. Recursive splitting is the production default.
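
To make the overlap arithmetic concrete, here is a minimal character-based fixed-size chunker (a sketch for illustration; the module's RecursiveChunker in chunking.py is the production default):

```python
def chunk_fixed(text: str, chunk_size: int = 400, overlap: int = 60) -> list[str]:
    """Fixed-size chunking with overlap: each window starts chunk_size - overlap
    characters after the previous one, so neighbours share `overlap` characters."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "".join(chr(65 + i % 26) for i in range(1000))  # dummy 1000-char document
pieces = chunk_fixed(doc)
```

With chunk_size=400 and overlap=60, a 1000-character document yields three chunks starting at offsets 0, 340, and 680, and each chunk repeats the last 60 characters of its predecessor.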

04. Practical Example

Scenario: Internal engineering docs chatbot. 2,000+ pages across Confluence, Notion, GitHub wikis. Engineers waste hours searching.

┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  Docs (MD,  │────→│  Chunking +  │────→│  Vector DB   │
│  PDF, HTML) │     │  Embedding   │     │  (FAISS)     │
└─────────────┘     └──────────────┘     └──────────────┘
                                               │
┌─────────────┐     ┌──────────────┐           │
│  User Query │────→│  Embed Query │── search ─┘
│             │     └──────────────┘
│             │            │
│             │     ┌──────────────┐     ┌──────────────┐
│             │     │  Top-K Chunks│────→│  Claude LLM  │
│             │     └──────────────┘     │  + Context   │
│             │                          └──────┬───────┘
│  Response   │←────────────────────────────────┘
└─────────────┘

The demo in app.py implements exactly this pipeline using sample markdown docs about FastAPI, database patterns, and observability.

05. Hands-on Implementation

Code files: app.py (full RAG demo), chunking.py (text chunking), and vector_store.py (FAISS abstraction) — the latter two cloned from module 02. Sample docs in docs/.

Quick Start

```bash
# Install dependencies (from repo root)
pip install -r requirements.txt

# Set your API keys
cp .env.example .env
# Edit .env with ANTHROPIC_API_KEY (and optionally VOYAGE_API_KEY)

# Run the RAG demo
cd 04-rag && python app.py
```

Step 1: Load Documents

```python
def load_documents(docs_dir: Path) -> list[tuple[str, str]]:
    docs = []
    for md_file in sorted(docs_dir.glob("**/*.md")):
        content = md_file.read_text(encoding="utf-8")
        docs.append((md_file.name, content))
    return docs
```

Step 2: Chunk Documents

```python
from chunking import RecursiveChunker

chunker = RecursiveChunker(chunk_size=400, chunk_overlap=60)
chunks = chunker.chunk(document_text, source="fastapi_guide.md")
# Smaller chunks = more precise retrieval for RAG
```

Step 3: Embed & Build Index

```python
# Embed all chunks in one batch (efficient)
texts = [c.text for c in all_chunks]
embeddings = embedder.embed(texts)

# Build FAISS index
store = FaissVectorStore(dimension=embedder.dimension)
store.add_batch(embeddings, metadata_list)
```

Step 4: Retrieve Relevant Chunks

```python
query_embedding = embedder.embed([query])[0]
results = store.search(query_embedding, top_k=5, min_score=0.3)
# min_score is critical — without it, the system always returns results
# even when nothing is relevant, leading to hallucination
```

Step 5: Generate Grounded Answer with Claude

```python
# Build context from retrieved chunks
context_parts = []
for i, r in enumerate(results, 1):
    source = r.metadata["source"]
    text = r.metadata["text"]
    context_parts.append(f"[Source {i}: {source}]\n{text}")
context = "\n\n---\n\n".join(context_parts)

# Generate with Claude
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="Answer ONLY based on the provided context. Cite your sources.",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}],
)
```

06. System Design — Production Architecture

┌────────────────────────────────────────────────────────────────┐
│                          API Gateway                           │
│                  (Auth, Rate Limit, Routing)                   │
└──────────────┬──────────────────────────────────┬──────────────┘
               │                                  │
    ┌──────────▼──────────┐            ┌──────────▼──────────┐
    │  Ingestion Service  │            │    Query Service    │
    │                     │            │                     │
    │ • Receive documents │            │ • Embed user query  │
    │ • Detect changes    │            │ • Vector search     │
    │ • Chunk text        │            │ • Re-rank results   │
    │ • Generate embeds   │            │ • Build LLM prompt  │
    │ • Store in vector DB│            │ • Stream response   │
    │ • Track doc→chunk   │            │ • Return citations  │
    └──────────┬──────────┘            └──────────┬──────────┘
               │                                  │
    ┌──────────▼──────────────────────────────────▼──────────┐
    │                    Vector Database                     │
    │  (pgvector / Pinecone / Qdrant / Weaviate)             │
    │  Stores: vector + metadata + chunk text                │
    └────────────────────────────────────────────────────────┘

Scaling Decisions

| Scale | Vector Store | Why |
| --- | --- | --- |
| < 100K vectors | FAISS in-memory | Simple, fast, no infra |
| 100K – 10M | pgvector (PostgreSQL) | SQL filtering + vectors, one DB to manage |
| 10M – 100M | Qdrant / Weaviate | Purpose-built, better perf at scale |
| 100M+ | Pinecone / custom sharding | Managed, distributed, billion-scale |

Key Production Concerns

07. Common Pitfalls

Critical Bad chunking — the #1 mistake

Too large (> 1000 tokens): meaning diluted. Too small (< 100 tokens): context lost. Fix: 256–512 tokens with 10-15% overlap. Test with your actual queries. This is the most impactful tuning knob.

Critical Not evaluating retrieval separately

If retrieval returns the wrong chunks, the LLM cannot give the right answer. Most engineers only evaluate the final answer. Fix: Build a retrieval eval set: 50+ (query, expected_doc) pairs. Measure recall@5. This matters more than evaluating the LLM.
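
A retrieval eval of this shape is only a few lines of code. A sketch of recall@k over (query, expected_doc) pairs, with a stand-in retriever (the queries and file names are invented for illustration):

```python
def recall_at_k(eval_set, retrieve, k: int = 5) -> float:
    """Fraction of queries whose expected document appears in the top-k results.
    eval_set: list of (query, expected_doc_id); retrieve(query) -> ranked doc ids."""
    hits = sum(1 for query, expected in eval_set if expected in retrieve(query)[:k])
    return hits / len(eval_set)

# Stand-in retriever for illustration; swap in your real vector search
fake_index = {
    "how do I rollback a migration?": ["alembic.md", "fastapi.md"],
    "tracing setup": ["metrics.md"],
}
eval_set = [
    ("how do I rollback a migration?", "alembic.md"),
    ("tracing setup", "observability.md"),  # the retriever misses this one
]
score = recall_at_k(eval_set, lambda q: fake_index[q], k=5)
```

Run this on every retrieval change (chunk size, embedding model, hybrid weights) and you get a regression signal before the LLM ever enters the picture.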

Critical No similarity threshold

Vector search ALWAYS returns results — even when nothing relevant exists. Fix: Set a min_score threshold. Below that, return "I don't have information about this" instead of hallucinating from irrelevant context.

Important Stuffing too many chunks

Retrieving 20 chunks drowns the signal in noise. "Lost in the middle" problem: LLMs pay less attention to middle content. Fix: Start with 3-5 chunks. Quality over quantity.

Important Embedding model mismatch

Using different embedding models for indexing vs. querying produces garbage. Vectors must be in the same space. Fix: Version your indexes with model name. Re-embed everything when switching models.
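
One way to enforce this is to bake the model name into the index identity and check it at query time. A sketch (the model names shown are illustrative):

```python
class VersionedIndex:
    """Ties an index to the embedding model it was built with, and refuses
    queries embedded with a different model."""
    def __init__(self, index_name: str, model_name: str, dimension: int):
        self.key = f"{index_name}__{model_name}__{dimension}d"
        self.model_name = model_name

    def check_query_model(self, query_model: str) -> None:
        if query_model != self.model_name:
            raise ValueError(
                f"Index built with {self.model_name}, query embedded with "
                f"{query_model}; re-embed the corpus before switching models."
            )

idx = VersionedIndex("eng-docs", "voyage-3", 1024)
idx.check_query_model("voyage-3")  # OK: same model, same vector space
```

Failing loudly here is far cheaper than silently returning garbage similarity scores.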

Important Not handling document updates

Deleting a source doc without removing its chunks creates "zombie" chunks — citing deleted documents. Fix: Track document → chunk mappings. Delete old chunks when re-indexing.
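
A sketch of the doc→chunk bookkeeping, using a plain dict as a stand-in for the vector store's delete/insert API:

```python
class ChunkTracker:
    """Maps doc_id -> chunk_ids so re-indexing can delete stale chunks first."""
    def __init__(self):
        self.doc_to_chunks: dict[str, list[str]] = {}

    def reindex(self, doc_id: str, new_chunk_ids: list[str], store: dict) -> None:
        for old_id in self.doc_to_chunks.get(doc_id, []):
            store.pop(old_id, None)          # remove zombie chunks first
        for chunk_id in new_chunk_ids:
            store[chunk_id] = doc_id         # stand-in for a real upsert
        self.doc_to_chunks[doc_id] = list(new_chunk_ids)

store: dict[str, str] = {}
tracker = ChunkTracker()
tracker.reindex("runbook.md", ["runbook.md#0", "runbook.md#1"], store)
tracker.reindex("runbook.md", ["runbook.md#0v2"], store)  # doc was edited
```

After the second reindex, only the new chunk survives; the two stale chunks can never be cited again.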

Tip Ignoring metadata filtering

Vector similarity alone isn't enough. Filter by department, date, access level before searching. Fix: Store metadata on every chunk. Pre-filter, then vector search.
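
A sketch of pre-filter-then-search over an in-memory chunk list (the metadata keys and dot-product ranking are illustrative; a real vector DB applies the filter server-side):

```python
def filtered_search(chunks, query_vec, filters, top_k=5):
    """Pre-filter on metadata, then rank the survivors by dot-product similarity."""
    candidates = [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(
        key=lambda c: sum(a * b for a, b in zip(c["vec"], query_vec)), reverse=True
    )
    return candidates[:top_k]

chunks = [
    {"vec": [1.0, 0.0], "meta": {"dept": "eng", "access": "public"}},
    {"vec": [0.9, 0.1], "meta": {"dept": "legal", "access": "restricted"}},
]
hits = filtered_search(chunks, [1.0, 0.0], {"dept": "eng"})
```

The legal chunk never enters the similarity ranking at all, which is exactly the property you want for access control.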

08. Advanced Topics

8.1 Hybrid Search (BM25 + Vector)

Combine keyword search with vector search using Reciprocal Rank Fusion. Production standard — pure vector misses exact matches (error codes), pure keyword misses semantic matches.

8.2 Reranking with Cross-Encoders

Use bi-encoder for first-stage retrieval (top-100), cross-encoder to re-rank to top-5. Cross-encoders process query+document together — slower but much more accurate.

8.3 Query Transformation

Rewrite queries before searching. HyDE: generate a hypothetical answer, embed that. Multi-query: generate reformulations, retrieve for each, merge. Step-back: abstract the question first.
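
Multi-query is the easiest of these to sketch. Here rewrite and retrieve are stand-ins; in practice rewrite would call the LLM and retrieve would hit the vector store:

```python
def multi_query_retrieve(query, rewrite, retrieve, top_k=5):
    """Retrieve for the original query plus its reformulations, merging
    results in order with de-duplication."""
    seen, merged = set(), []
    for variant in [query] + rewrite(query):
        for doc_id in retrieve(variant):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:top_k]

# Stand-ins for illustration
rewrite = lambda q: ["how to undo a db migration", "alembic downgrade steps"]
corpus = {
    "revert my schema change": ["faq.md"],
    "how to undo a db migration": ["alembic.md", "faq.md"],
    "alembic downgrade steps": ["alembic.md", "runbook.md"],
}
docs = multi_query_retrieve("revert my schema change", rewrite, lambda q: corpus[q])
```

The reformulations surface alembic.md and runbook.md, which the original phrasing alone would have missed.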

8.4 Agentic RAG

Give the LLM a "search" tool — let it decide when and what to retrieve. Multi-step retrieval, query refinement, result combination. How modern AI assistants work.
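
With the Anthropic Messages API this amounts to exposing retrieval as a tool. A hypothetical tool definition (the search_docs name and its schema are invented for illustration):

```python
# Hypothetical tool definition in the Anthropic Messages API tool format;
# the model can call search_docs as many times as it needs before answering
search_tool = {
    "name": "search_docs",
    "description": "Search the internal engineering docs. Returns the top matching chunks.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Natural-language search query"},
            "top_k": {"type": "integer", "description": "How many chunks to return"},
        },
        "required": ["query"],
    },
}

# Passed via the tools parameter of client.messages.create(...); your code then
# executes the search and returns the chunks as a tool_result message.
```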

8.5 Graph RAG

Build a knowledge graph from documents, traverse relationships during retrieval. Powerful for multi-hop questions: "Who manages the team that owns the billing service?"

8.6 Evaluation Frameworks

RAGAS: measures faithfulness, relevance, context precision/recall. Build custom eval sets from real user questions. A/B test retrieval strategies with real traffic.

09. Coding Exercises

Exercise 1 — Chunking Strategy Comparison

Chunk the sample docs using 3 strategies (fixed-size, sentence-aware, recursive from module 02). For 5 test queries, compare which strategy retrieves the most relevant chunks. Output a comparison table. Success: Identify which strategy performs best for your queries.

Exercise 2 — Hybrid Search

Extend the retriever: add BM25 keyword search (rank_bm25 library), implement Reciprocal Rank Fusion to merge result sets, compare hybrid vs. vector-only. Success: Query "alembic downgrade" returns results from keyword match AND semantic similarity.

Exercise 3 — RAG Evaluation Pipeline

Create 10 Q&A pairs from sample docs (ground truth). Run each through the RAG pipeline. Measure retrieval precision, recall, and generation faithfulness. Success: A JSON report showing quality metrics per question.

10. Architect-Level Questions

Q1: Your RAG system retrieves relevant chunks but the LLM still gives wrong answers. How do you debug this?
Think: examine prompt template, "lost in the middle" effect, chunk ordering, chunk count, model capability, temperature setting.
Q2: How would you implement access control in a multi-tenant RAG system?
Think: store permissions as chunk metadata, filter BEFORE vector search, never rely on the LLM for security boundaries, row-level security with pgvector.
Q3: 10M documents, growing 50K/day, queries under 500ms. Design the retrieval architecture.
Think: two-stage (HNSW top-100 → cross-encoder top-5), async ingestion via queue, incremental indexing, self-hosted embedding model, cache hot queries, monitor p99.
Q4: When would you choose fine-tuning over RAG, and when would you use both?
Think: RAG for changing facts, fine-tuning for style/vocabulary. Both together for domain reasoning + current data. Legal AI example.
Q5: How do you evaluate a RAG system in production beyond user satisfaction?
Think: retrieval precision@k, faithfulness, latency p50/p95/p99, token cost, cache hit rate, thumbs up/down, reformulated queries, golden dataset regression.