// RAG — Retrieval-Augmented Generation

Search first, then generate — grounding LLMs in your own data

01. Core Concepts

What Is RAG?

RAG is a pattern that augments an LLM's generation with external knowledge retrieved at query time. Instead of relying solely on what the model memorized during training, you fetch relevant documents from your own data store and inject them into the prompt as context.

User Query
    │
    ▼
┌──────────────┐     ┌──────────────────┐     ┌──────────────┐
│  Retriever   │────▶│  Retrieved Docs  │────▶│     LLM      │
│ (vector DB)  │     │  (top-k chunks)  │     │ (generation) │
└──────────────┘     └──────────────────┘     └──────────────┘
                                                     │
                                                     ▼
                                               Final Answer
                                          (grounded in your data)

Think of it like a developer who, instead of answering from memory, first runs a search across the docs, reads the relevant sections, then gives you an informed answer.

The Key Insight

RAG separates knowledge (what the system knows) from reasoning (how it thinks). The LLM provides reasoning; your data store provides knowledge. This means: update knowledge without retraining, ground answers in verifiable sources, and control exactly what information the model can access.

Module 02 taught you how to embed text and search by meaning. RAG is what makes that useful — it's the pattern that turns vector search into an intelligent Q&A system by adding an LLM on top.

02. Why It Matters

| Problem | How RAG Solves It |
| --- | --- |
| Knowledge cutoff — LLMs don't know your internal data | Retrieves current, private data at query time |
| Hallucination — LLMs fabricate facts | Grounds answers in actual source documents |
| Cost of fine-tuning — retraining is expensive and slow | Just update the document store, no retraining |
| Attribution — users need to verify claims | Cite the exact source chunks used |
| Data freshness — information changes daily | Re-index documents; LLM always sees latest data |

Where Companies Use This

| Use Case | Example | Why RAG |
| --- | --- | --- |
| Customer support bots | Zendesk, Intercom | Search help center by intent, not keywords |
| Enterprise search | Glean, Notion AI | "Find the design doc about auth migration" |
| Code assistance | Cursor, GitHub Copilot | RAG over your codebase for context-aware help |
| Legal / compliance | Harvey AI | Query case law from millions of documents |
| Developer tools | Internal platforms | Query runbooks, API docs, incident history |

Rule of thumb: If your LLM needs to answer questions about data it wasn't trained on, you need RAG.

03. Internal Mechanics

The Two Phases

A. Indexing (Offline)
   Documents → Chunk → Embed → Store in Vector DB
   Run once (or on a schedule) when source documents change.

B. Query (Online)
   User Query → Embed → Vector Search → Top-K Chunks
   → Build Prompt → LLM → Grounded Answer

Step-by-Step Query Flow

1. Embed query: convert the user's question into a vector, using the same embedding model as indexing.
2. Vector search: FAISS/HNSW finds the top-K most similar chunks. O(log n) with HNSW, fast even at millions of vectors.
3. Filter & rank: apply a similarity threshold to remove garbage results. Optional: metadata filtering, reranking with a cross-encoder.
4. Prompt construction: inject the retrieved chunks into the LLM prompt. System prompt: "Answer ONLY from the context. Cite sources."
5. LLM generation: Claude reads context + question and generates a grounded answer. Low temperature (0.1–0.3) for factual, faithful output.

Retrieval Strategies

Sparse (BM25)

Keyword matching. "python error" matches exact term. Good for error codes, function names, identifiers.

Dense (Embeddings)

Semantic matching. "python error" matches "bug in my script". Good for intent, paraphrases, concepts.

Hybrid (Both)

Combine sparse + dense with Reciprocal Rank Fusion. Best of both worlds. Production standard.
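
The fusion step is simple enough to sketch in a few lines. A minimal RRF implementation (function and document names are illustrative; k=60 is the constant from the original RRF paper and a common default):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 and dense retrieval disagree; fusion rewards docs ranked well by both
bm25_hits = ["doc_a", "doc_c", "doc_b"]
dense_hits = ["doc_b", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Note that doc_a wins because it ranks near the top of both lists, even though neither list puts it first alongside the other's winner.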

Reranking

Cross-encoder rescores top-K candidates. More accurate than bi-encoder but too slow for first stage.

Chunking — The #1 Quality Lever

| Strategy | How | Best For |
| --- | --- | --- |
| Fixed-size | Split every N characters | Uniform content (logs) |
| Sentence-aware | Accumulate sentences up to limit | Prose, articles |
| Recursive (separators) | Split on `\n\n` → `\n` → `.` → space | Markdown, code, structured docs |
| Semantic | Split where meaning shifts | High-value content |

Sweet spot: 256–512 tokens with 10-15% overlap. Recursive splitting is the production default.
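
To make the overlap arithmetic concrete, here is a minimal character-based fixed-size chunker (a sketch for illustration; the module's RecursiveChunker in chunking.py is the production default):

```python
def chunk_fixed(text: str, chunk_size: int = 400, overlap: int = 60) -> list[str]:
    """Fixed-size chunking with overlap: each window starts chunk_size - overlap
    characters after the previous one, so neighbours share `overlap` characters."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "".join(chr(65 + i % 26) for i in range(1000))  # dummy 1000-char document
pieces = chunk_fixed(doc)
```

With chunk_size=400 and overlap=60, a 1000-character document yields three chunks starting at offsets 0, 340, and 680, and each chunk repeats the last 60 characters of its predecessor.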

04. Practical Example

Scenario: Internal engineering docs chatbot. 2,000+ pages across Confluence, Notion, GitHub wikis. Engineers waste hours searching.

┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  Docs (MD,  │────→│  Chunking +  │────→│  Vector DB   │
│  PDF, HTML) │     │  Embedding   │     │  (FAISS)     │
└─────────────┘     └──────────────┘     └──────────────┘
                                               │
┌─────────────┐     ┌──────────────┐           │
│  User Query │────→│  Embed Query │── search ─┘
│             │     └──────────────┘
│             │            │
│             │     ┌──────────────┐     ┌──────────────┐
│             │     │  Top-K Chunks│────→│  Claude LLM  │
│             │     └──────────────┘     │  + Context   │
│             │                          └──────┬───────┘
│  Response   │←────────────────────────────────┘
└─────────────┘

The demo in app.py implements exactly this pipeline using sample markdown docs about FastAPI, database patterns, and observability.

05. Hands-on Implementation

Code files: app.py (full RAG demo), chunking.py (text chunking), and vector_store.py (FAISS abstraction) — the latter two cloned from module 02. Sample docs in docs/.

Quick Start

```bash
# Install dependencies (from repo root)
pip install -r requirements.txt

# Set your API keys
cp .env.example .env
# Edit .env with ANTHROPIC_API_KEY (and optionally VOYAGE_API_KEY)

# Run the RAG demo
cd 04-rag && python app.py
```

Step 1: Load Documents

```python
def load_documents(docs_dir: Path) -> list[tuple[str, str]]:
    docs = []
    for md_file in sorted(docs_dir.glob("**/*.md")):
        content = md_file.read_text(encoding="utf-8")
        docs.append((md_file.name, content))
    return docs
```

Step 2: Chunk Documents

```python
from chunking import RecursiveChunker

chunker = RecursiveChunker(chunk_size=400, chunk_overlap=60)
chunks = chunker.chunk(document_text, source="fastapi_guide.md")
# Smaller chunks = more precise retrieval for RAG
```

Step 3: Embed & Build Index

```python
# Embed all chunks in one batch (efficient)
texts = [c.text for c in all_chunks]
embeddings = embedder.embed(texts)

# Build FAISS index
store = FaissVectorStore(dimension=embedder.dimension)
store.add_batch(embeddings, metadata_list)
```

Step 4: Retrieve Relevant Chunks

```python
query_embedding = embedder.embed([query])[0]
results = store.search(query_embedding, top_k=5, min_score=0.3)
# min_score is critical — without it, the system always returns results
# even when nothing is relevant, leading to hallucination
```

Step 5: Generate Grounded Answer with Claude

```python
# Build context from retrieved chunks
context_parts = []
for i, r in enumerate(results, 1):
    source = r.metadata["source"]
    text = r.metadata["text"]
    context_parts.append(f"[Source {i}: {source}]\n{text}")
context = "\n\n---\n\n".join(context_parts)

# Generate with Claude
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="Answer ONLY based on the provided context. Cite your sources.",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}],
)
```

06. System Design — Production Architecture

┌────────────────────────────────────────────────────────────────┐
│                          API Gateway                           │
│                  (Auth, Rate Limit, Routing)                   │
└──────────────┬──────────────────────────────────┬──────────────┘
               │                                  │
    ┌──────────▼──────────┐            ┌──────────▼──────────┐
    │  Ingestion Service  │            │    Query Service    │
    │                     │            │                     │
    │ • Receive documents │            │ • Embed user query  │
    │ • Detect changes    │            │ • Vector search     │
    │ • Chunk text        │            │ • Re-rank results   │
    │ • Generate embeds   │            │ • Build LLM prompt  │
    │ • Store in vector DB│            │ • Stream response   │
    │ • Track doc→chunk   │            │ • Return citations  │
    └──────────┬──────────┘            └──────────┬──────────┘
               │                                  │
    ┌──────────▼──────────────────────────────────▼──────────┐
    │                    Vector Database                     │
    │  (pgvector / Pinecone / Qdrant / Weaviate)             │
    │  Stores: vector + metadata + chunk text                │
    └────────────────────────────────────────────────────────┘

Scaling Decisions

| Scale | Vector Store | Why |
| --- | --- | --- |
| < 100K vectors | FAISS in-memory | Simple, fast, no infra |
| 100K – 10M | pgvector (PostgreSQL) | SQL filtering + vectors, one DB to manage |
| 10M – 100M | Qdrant / Weaviate | Purpose-built, better perf at scale |
| 100M+ | Pinecone / custom sharding | Managed, distributed, billion-scale |

Key Production Concerns

07. Common Pitfalls

Critical Bad chunking — the #1 mistake

Too large (> 1000 tokens): meaning diluted. Too small (< 100 tokens): context lost. Fix: 256–512 tokens with 10-15% overlap. Test with your actual queries. This is the most impactful tuning knob.

Critical Not evaluating retrieval separately

If retrieval returns the wrong chunks, the LLM cannot give the right answer. Most engineers only evaluate the final answer. Fix: Build a retrieval eval set: 50+ (query, expected_doc) pairs. Measure recall@5. This matters more than evaluating the LLM.
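
A retrieval eval of this shape is only a few lines of code. A sketch of recall@k over (query, expected_doc) pairs, with a stand-in retriever (the queries and file names are invented for illustration):

```python
def recall_at_k(eval_set, retrieve, k: int = 5) -> float:
    """Fraction of queries whose expected document appears in the top-k results.
    eval_set: list of (query, expected_doc_id); retrieve(query) -> ranked doc ids."""
    hits = sum(1 for query, expected in eval_set if expected in retrieve(query)[:k])
    return hits / len(eval_set)

# Stand-in retriever for illustration; swap in your real vector search
fake_index = {
    "how do I rollback a migration?": ["alembic.md", "fastapi.md"],
    "tracing setup": ["metrics.md"],
}
eval_set = [
    ("how do I rollback a migration?", "alembic.md"),
    ("tracing setup", "observability.md"),  # the retriever misses this one
]
score = recall_at_k(eval_set, lambda q: fake_index[q], k=5)
```

Run this on every retrieval change (chunk size, embedding model, hybrid weights) and you get a regression signal before the LLM ever enters the picture.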

Critical No similarity threshold

Vector search ALWAYS returns results — even when nothing relevant exists. Fix: Set a min_score threshold. Below that, return "I don't have information about this" instead of hallucinating from irrelevant context.

Important Stuffing too many chunks

Retrieving 20 chunks drowns the signal in noise. "Lost in the middle" problem: LLMs pay less attention to middle content. Fix: Start with 3-5 chunks. Quality over quantity.

Important Embedding model mismatch

Using different embedding models for indexing vs. querying produces garbage. Vectors must be in the same space. Fix: Version your indexes with model name. Re-embed everything when switching models.
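
One way to enforce this is to bake the model name into the index identity and check it at query time. A sketch (the model names shown are illustrative):

```python
class VersionedIndex:
    """Ties an index to the embedding model it was built with, and refuses
    queries embedded with a different model."""
    def __init__(self, index_name: str, model_name: str, dimension: int):
        self.key = f"{index_name}__{model_name}__{dimension}d"
        self.model_name = model_name

    def check_query_model(self, query_model: str) -> None:
        if query_model != self.model_name:
            raise ValueError(
                f"Index built with {self.model_name}, query embedded with "
                f"{query_model}; re-embed the corpus before switching models."
            )

idx = VersionedIndex("eng-docs", "voyage-3", 1024)
idx.check_query_model("voyage-3")  # OK: same model, same vector space
```

Failing loudly here is far cheaper than silently returning garbage similarity scores.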

Important Not handling document updates

Deleting a source doc without removing its chunks creates "zombie" chunks — citing deleted documents. Fix: Track document → chunk mappings. Delete old chunks when re-indexing.
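
A sketch of the doc→chunk bookkeeping, using a plain dict as a stand-in for the vector store's delete/insert API:

```python
class ChunkTracker:
    """Maps doc_id -> chunk_ids so re-indexing can delete stale chunks first."""
    def __init__(self):
        self.doc_to_chunks: dict[str, list[str]] = {}

    def reindex(self, doc_id: str, new_chunk_ids: list[str], store: dict) -> None:
        for old_id in self.doc_to_chunks.get(doc_id, []):
            store.pop(old_id, None)          # remove zombie chunks first
        for chunk_id in new_chunk_ids:
            store[chunk_id] = doc_id         # stand-in for a real upsert
        self.doc_to_chunks[doc_id] = list(new_chunk_ids)

store: dict[str, str] = {}
tracker = ChunkTracker()
tracker.reindex("runbook.md", ["runbook.md#0", "runbook.md#1"], store)
tracker.reindex("runbook.md", ["runbook.md#0v2"], store)  # doc was edited
```

After the second reindex, only the new chunk survives; the two stale chunks can never be cited again.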

Tip Ignoring metadata filtering

Vector similarity alone isn't enough. Filter by department, date, access level before searching. Fix: Store metadata on every chunk. Pre-filter, then vector search.
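
A sketch of pre-filter-then-search over an in-memory chunk list (the metadata keys and dot-product ranking are illustrative; a real vector DB applies the filter server-side):

```python
def filtered_search(chunks, query_vec, filters, top_k=5):
    """Pre-filter on metadata, then rank the survivors by dot-product similarity."""
    candidates = [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(
        key=lambda c: sum(a * b for a, b in zip(c["vec"], query_vec)), reverse=True
    )
    return candidates[:top_k]

chunks = [
    {"vec": [1.0, 0.0], "meta": {"dept": "eng", "access": "public"}},
    {"vec": [0.9, 0.1], "meta": {"dept": "legal", "access": "restricted"}},
]
hits = filtered_search(chunks, [1.0, 0.0], {"dept": "eng"})
```

The legal chunk never enters the similarity ranking at all, which is exactly the property you want for access control.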

08. Advanced Topics

8.1 Hybrid Search (BM25 + Vector)

Combine keyword search with vector search using Reciprocal Rank Fusion. Production standard — pure vector misses exact matches (error codes), pure keyword misses semantic matches.

8.2 Reranking with Cross-Encoders

Use bi-encoder for first-stage retrieval (top-100), cross-encoder to re-rank to top-5. Cross-encoders process query+document together — slower but much more accurate.

8.3 Query Transformation

Rewrite queries before searching. HyDE: generate a hypothetical answer, embed that. Multi-query: generate reformulations, retrieve for each, merge. Step-back: abstract the question first.
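
Multi-query is the easiest of these to sketch. Here rewrite and retrieve are stand-ins; in practice rewrite would call the LLM and retrieve would hit the vector store:

```python
def multi_query_retrieve(query, rewrite, retrieve, top_k=5):
    """Retrieve for the original query plus its reformulations, merging
    results in order with de-duplication."""
    seen, merged = set(), []
    for variant in [query] + rewrite(query):
        for doc_id in retrieve(variant):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:top_k]

# Stand-ins for illustration
rewrite = lambda q: ["how to undo a db migration", "alembic downgrade steps"]
corpus = {
    "revert my schema change": ["faq.md"],
    "how to undo a db migration": ["alembic.md", "faq.md"],
    "alembic downgrade steps": ["alembic.md", "runbook.md"],
}
docs = multi_query_retrieve("revert my schema change", rewrite, lambda q: corpus[q])
```

The reformulations surface alembic.md and runbook.md, which the original phrasing alone would have missed.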

8.4 Agentic RAG

Give the LLM a "search" tool — let it decide when and what to retrieve. Multi-step retrieval, query refinement, result combination. How modern AI assistants work.
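
With the Anthropic Messages API this amounts to exposing retrieval as a tool. A hypothetical tool definition (the search_docs name and its schema are invented for illustration):

```python
# Hypothetical tool definition in the Anthropic Messages API tool format;
# the model can call search_docs as many times as it needs before answering
search_tool = {
    "name": "search_docs",
    "description": "Search the internal engineering docs. Returns the top matching chunks.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Natural-language search query"},
            "top_k": {"type": "integer", "description": "How many chunks to return"},
        },
        "required": ["query"],
    },
}

# Passed via the tools parameter of client.messages.create(...); your code then
# executes the search and returns the chunks as a tool_result message.
```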

8.5 Graph RAG

Build a knowledge graph from documents, traverse relationships during retrieval. Powerful for multi-hop questions: "Who manages the team that owns the billing service?"

8.6 Evaluation Frameworks

RAGAS: measures faithfulness, relevance, context precision/recall. Build custom eval sets from real user questions. A/B test retrieval strategies with real traffic.

09. Coding Exercises

Exercise 1 — Chunking Strategy Comparison

Chunk the sample docs using 3 strategies (fixed-size, sentence-aware, recursive from module 02). For 5 test queries, compare which strategy retrieves the most relevant chunks. Output a comparison table. Success: Identify which strategy performs best for your queries.

Exercise 2 — Hybrid Search

Extend the retriever: add BM25 keyword search (rank_bm25 library), implement Reciprocal Rank Fusion to merge result sets, compare hybrid vs. vector-only. Success: Query "alembic downgrade" returns results from keyword match AND semantic similarity.

Exercise 3 — RAG Evaluation Pipeline

Create 10 Q&A pairs from sample docs (ground truth). Run each through the RAG pipeline. Measure retrieval precision, recall, and generation faithfulness. Success: A JSON report showing quality metrics per question.

10. Architect-Level Questions

Q1: Your RAG system retrieves relevant chunks but the LLM still gives wrong answers. How do you debug this?
Think: examine prompt template, "lost in the middle" effect, chunk ordering, chunk count, model capability, temperature setting.
Q2: How would you implement access control in a multi-tenant RAG system?
Think: store permissions as chunk metadata, filter BEFORE vector search, never rely on the LLM for security boundaries, row-level security with pgvector.
Q3: 10M documents, growing 50K/day, queries under 500ms. Design the retrieval architecture.
Think: two-stage (HNSW top-100 → cross-encoder top-5), async ingestion via queue, incremental indexing, self-hosted embedding model, cache hot queries, monitor p99.
Q4: When would you choose fine-tuning over RAG, and when would you use both?
Think: RAG for changing facts, fine-tuning for style/vocabulary. Both together for domain reasoning + current data. Legal AI example.
Q5: How do you evaluate a RAG system in production beyond user satisfaction?
Think: retrieval precision@k, faithfulness, latency p50/p95/p99, token cost, cache hit rate, thumbs up/down, reformulated queries, golden dataset regression.