// Embeddings & Vector Search

The complete engineering guide — from vectors to production RAG pipelines

01. Core Concepts

What Are Embeddings?

An embedding is a dense numerical vector (array of floats) that represents the meaning of a piece of text. Think of it as a coordinate in a high-dimensional semantic space — texts with similar meanings land near each other.

"How do I reset my password?" → [0.021, -0.034, 0.118, ..., 0.045] # 1536 floats "I forgot my login credentials" → [0.019, -0.031, 0.121, ..., 0.042] # very similar vector "The weather is nice today" → [0.872, 0.441, -0.203, ..., 0.667] # very different vector

Key properties:

- Fixed-size: every input maps to the same dimensionality (e.g., 1536 for OpenAI, 1024 for Voyage AI).
- Semantic: "car" and "automobile" have nearby vectors.
- Continuous: small changes in meaning produce small changes in the vector.
- Language-agnostic: "perro" (Spanish) lands near "dog" (English).

What Is Vector Search?

Vector search (a.k.a. semantic search) finds the most similar vectors in a collection to a given query vector. Instead of keyword matching (WHERE title LIKE '%password%'), you find documents by meaning. No keyword overlap needed.

Query: "How do I change my login?" → vector → search Results: "Password reset instructions" (similarity: 0.94) "Account credentials FAQ" (similarity: 0.91) "Login troubleshooting guide" (similarity: 0.88)

The Backend Engineer Mental Model

If you know databases, this mapping will click instantly:

| Traditional DB | Vector DB |
|---|---|
| INSERT INTO docs (text) | INSERT INTO docs (text, embedding) |
| WHERE text LIKE '%keyword%' | ORDER BY cosine_similarity(embedding, q) DESC |
| B-tree index | HNSW / IVF index |
| Exact match | Approximate nearest neighbor (ANN) |

02. Why It Matters

LLMs have a knowledge cutoff and a context window limit. You can't stuff your entire knowledge base into a prompt. Embeddings + vector search let you retrieve only the relevant pieces and inject them into the prompt. This is Retrieval-Augmented Generation (RAG).

Where Companies Use This

| Use Case | Example | Why Embeddings |
|---|---|---|
| Customer support bots | Zendesk, Intercom | Search KB by user intent, not keywords |
| Internal knowledge search | Notion AI, Confluence AI | "Find the design doc about auth migration" |
| Code search | GitHub Copilot, Sourcegraph | Find semantically similar code across repos |
| E-commerce recommendations | Amazon, Shopify | Products similar to what you're viewing |
| Legal document discovery | Harvey AI | Find relevant case law from millions of docs |
| Anomaly detection | Financial systems | Transactions far from normal patterns |
| RAG pipelines | Every production LLM app | Ground LLM responses in factual data |

Why Not Just Full-Text Search?

Full-text search (Elasticsearch, PostgreSQL tsvector) fails when: the user's words differ from the document ("car" vs "vehicle"), intent matters more than keywords, cross-language search is needed, or you need to combine text with other modalities. In practice, production systems combine both: vector search for semantic recall + keyword search for precision (hybrid search).

03. Internal Mechanics

Embedding Model Pipeline

Modern embedding models are typically transformer encoders (BERT-style bidirectional attention, not the GPT decoder architecture):

1. Tokenization: text is split into subword tokens.
   "embedding" → ["em", "bed", "ding"]

2. Transformer Encoding: each token passes through self-attention + feed-forward layers, and each token gets a contextualized representation.

3. Pooling: the token representations are collapsed into a single vector, via mean pooling (most common) or the [CLS] token.

4. L2 Normalization: the vector is normalized to unit length, so cosine similarity = dot product (after normalization).
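
Steps 3 and 4 (pooling and normalization) can be sketched in a few lines of numpy. The token vectors here are toy values; real models pool hundreds of contextualized tokens into 1024+ dimensions:

```python
import numpy as np

# Toy contextualized token vectors (3 tokens, dim=4) standing in for
# the transformer encoder's output.
token_vectors = np.array([
    [0.2, -0.1, 0.4,  0.3],
    [0.1,  0.3, 0.2, -0.2],
    [0.0,  0.2, 0.5,  0.1],
])

# Step 3: mean pooling collapses the token vectors into one sentence vector
pooled = token_vectors.mean(axis=0)

# Step 4: L2 normalization scales it to unit length, so cosine similarity
# between two embeddings reduces to a plain dot product
embedding = pooled / np.linalg.norm(pooled)

print(np.linalg.norm(embedding))  # unit length (1.0 up to float error)
```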

Similarity Metrics

Cosine Similarity

cos(a,b) = (a·b) / (||a||×||b||)
Range: [-1, 1]
1 = identical direction
Most common for text

Euclidean (L2)

d(a,b) = √(Σ(ai - bi)²)
Range: [0, ∞)
0 = identical
Good for dense clusters

Dot Product

dot(a,b) = Σ(ai × bi)
When normalized:
dot = cosine similarity
Fastest to compute

For normalized embeddings (which most APIs return), all three give equivalent rankings. Cosine similarity is the standard for text.
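
A quick numpy check of that equivalence, using made-up vectors:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

# Made-up vectors, normalized to unit length as embedding APIs typically do
a = np.array([0.3, 0.4, 0.5]); a = a / np.linalg.norm(a)
b = np.array([0.1, 0.9, 0.2]); b = b / np.linalg.norm(b)

# Dot product equals cosine similarity for unit vectors...
assert abs(float(np.dot(a, b)) - cosine(a, b)) < 1e-9

# ...and squared Euclidean distance is 2 - 2*cos(a, b), a monotone
# function of cosine, so all three metrics rank neighbors identically.
assert abs(euclidean(a, b) ** 2 - (2 - 2 * cosine(a, b))) < 1e-9
```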

Vector Search Algorithms

| Algorithm | How It Works | Complexity | Best For |
|---|---|---|---|
| Flat (Brute Force) | Compare query against every vector | O(n × d) | < 100K vectors, perfect recall |
| IVF | Cluster vectors via k-means, search only nearest clusters | O(nprobe × n/nlist × d) | 100K–10M vectors |
| HNSW | Multi-layer graph, greedy navigation top→bottom | O(log n) | Production standard (pgvector, Pinecone, Qdrant) |
| PQ | Compress vectors: split + quantize subvectors | Reduces memory 32x+ | Billion-scale (combined with IVF) |
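
The flat (brute force) approach is simple enough to write out in full. A minimal numpy sketch, assuming all stored vectors are L2-normalized so a dot product is a cosine similarity:

```python
import numpy as np

def flat_search(query, vectors, top_k=3):
    """Flat (brute force) search: one dot product per stored vector, O(n x d).
    Assumes vectors are L2-normalized, so dot product = cosine similarity."""
    scores = vectors @ query
    top = np.argsort(-scores)[:top_k]  # indices of the highest scores
    return [(int(i), float(scores[i])) for i in top]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

results = flat_search(vectors[42], vectors)  # a stored vector queries itself
print(results[0])  # vector 42 ranks first with score ~1.0 (an exact match)
```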

HNSW — The Dominant Algorithm

Layer 3: A ---------> D                        (sparse, long jumps)
Layer 2: A ----> C --> D ----> F               (medium density)
Layer 1: A -> B -> C -> D -> E -> F -> G       (dense, local connections)
Layer 0: [all vectors with local neighbors]    (full graph)
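
The greedy navigation within a single layer can be sketched on a toy graph. This is only the core hop-to-the-closest-neighbor step; real HNSW adds candidate lists and layer-by-layer descent on top of it:

```python
import numpy as np

def greedy_layer_search(graph, vectors, query, entry):
    """Greedy navigation within one layer: hop to whichever neighbor is
    closer to the query; stop at a local minimum. Full HNSW repeats this
    from the top layer down, using each result as the next entry point."""
    current = entry
    while True:
        candidates = graph[current] + [current]
        best = min(candidates, key=lambda n: np.linalg.norm(vectors[n] - query))
        if best == current:
            return current
        current = best

# Toy 1-D layer: five vectors at positions 0..4, chained to their neighbors
vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

print(greedy_layer_search(graph, vectors, np.array([3.2]), entry=0))  # → 3
```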

The RAG Pipeline Architecture

┌──────────────────── INDEXING (offline) ────────────────────┐
│                                                             │
│  Documents → Chunk → Embed → Store in Vector DB             │
│                                                             │
└─────────────────────────────────────────────────────────────┘

┌──────────────────── QUERY (online) ────────────────────────┐
│                                                             │
│  User Query → Embed → Vector Search → Top-K chunks          │
│                          ↓                                  │
│               Chunks + Query → LLM Prompt → Response        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

04. Practical Example — Internal Doc Search

Problem: Your company has 5,000 internal docs (engineering runbooks, product specs, HR policies). Engineers ask questions in Slack, and the answers exist somewhere — but nobody can find them.

Solution: Build a RAG-powered search system.

┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  Docs (MD,  │────→│  Chunking +  │────→│  Vector DB   │
│  PDF, HTML) │     │  Embedding   │     │  (FAISS)     │
└─────────────┘     └──────────────┘     └──────────────┘
                                               │
┌─────────────┐     ┌──────────────┐           │
│  User Query │────→│  Embed Query │──── search ┘
│             │     └──────────────┘
│             │            │
│             │     ┌──────────────┐     ┌──────────────┐
│             │     │  Top-K Chunks│────→│  Claude LLM  │
│             │     └──────────────┘     │  + Context   │
│             │                          └──────┬───────┘
│  Response   │←────────────────────────────────┘
└─────────────┘

Scale considerations: 5,000 docs → ~50,000 chunks → FAISS in-memory is fine (< 500MB RAM). At 100K+ docs → consider pgvector or a managed service (Pinecone, Weaviate). Need metadata filtering? → pgvector (SQL WHERE + vector search).

05. Hands-on Implementation

Code files: app.py (standalone demo), chunking.py (chunking strategies), vector_store.py (FAISS abstraction).

Quick Start

```shell
# Install dependencies (from repo root)
pip install -r requirements.txt

# Set your API keys
cp .env.example .env
# Edit .env with your ANTHROPIC_API_KEY and VOYAGE_API_KEY

# Run the standalone demo
cd 02-embeddings-vector-search && python app.py
```

Step 1: Generate Embeddings

We use Voyage AI's embedding model (purpose-built for embeddings, higher quality than general-purpose models for retrieval tasks). A local fallback using sentence-transformers is available for experimenting without API costs.

```python
# The core operation — turning text into vectors
import voyageai

client = voyageai.Client()  # uses VOYAGE_API_KEY env var
response = client.embed(
    texts=["How do I reset my password?"],
    model="voyage-3-large",
)
embedding = response.embeddings[0]  # list of 1024 floats
```

Step 2: Chunk Your Documents

You can't embed a 50-page document as one vector — the meaning gets diluted. Split it into semantically meaningful chunks.

```python
# See chunking.py for full implementation
from chunking import RecursiveChunker

chunker = RecursiveChunker(chunk_size=512, chunk_overlap=50)
chunks = chunker.chunk(long_document_text)
# Each chunk is ~512 tokens with 50-token overlap for context continuity
```

| Strategy | Pros | Cons | Use When |
|---|---|---|---|
| Fixed-size (by tokens) | Simple, predictable | May cut mid-sentence | Uniform content |
| Sentence-aware | Preserves meaning | Variable chunk sizes | Prose, documentation |
| Recursive (by separators) | Respects doc structure | More complex | Markdown, code, structured |
| Semantic (by meaning shift) | Best quality | Expensive (needs embedding) | High-value content |
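
For intuition, here is a minimal word-based sliding-window chunker — a simplified stand-in for the fixed-size strategy, not the repo's actual RecursiveChunker (which splits on tokens and separators):

```python
def chunk_fixed(text, chunk_size=512, overlap=50):
    """Fixed-size sliding window over words: advance by (chunk_size - overlap)
    words so consecutive chunks share `overlap` words of context."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = "word " * 1200  # a 1,200-word toy document
chunks = chunk_fixed(doc)
print(len(chunks))  # → 3 (words 0-511, 462-973, 924-1199)
```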

Step 3: Build the Vector Index

```python
# See vector_store.py for full implementation
from vector_store import FaissVectorStore

store = FaissVectorStore(dimension=1024)

# Index your chunks
for chunk in chunks:
    embedding = embed(chunk.text)
    store.add(embedding, metadata={"text": chunk.text, "source": chunk.source})

# Persist to disk
store.save("./index_data")
```

Step 4: Search by Meaning

```python
query_embedding = embed("How do I change my login credentials?")
results = store.search(query_embedding, top_k=5)

# Returns chunks about password resets, account settings, etc.
# — even though the query doesn't contain those exact words
```

Step 5: RAG — Augment the LLM

```python
# Build the prompt with retrieved context
context = "\n\n".join([r["text"] for r in results])

prompt = f"""Based on the following documentation, answer the user's question.

Documentation:
{context}

Question: {user_query}

Answer based only on the provided documentation. If the answer isn't in the docs, say so."""

# Send to Claude
response = anthropic_client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
```

06. System Design — Production Architecture

┌────────────────────────────────────────────────────────────────┐
│                        API Gateway                              │
└──────────────┬──────────────────────────────────┬──────────────┘
               │                                  │
    ┌──────────▼──────────┐           ┌───────────▼───────────┐
    │   Ingestion Service  │           │    Query Service       │
    │                      │           │                        │
    │  • Receive documents │           │  • Embed user query    │
    │  • Chunk text        │           │  • Vector search       │
    │  • Generate embeds   │           │  • Re-rank results     │
    │  • Store in vector DB│           │  • Build LLM prompt    │
    │  • Store raw in S3   │           │  • Stream response     │
    └──────────┬───────────┘           └───────────┬────────────┘
               │                                   │
    ┌──────────▼───────────────────────────────────▼────────────┐
    │                     Vector Database                        │
    │  (pgvector / Pinecone / Qdrant / Weaviate)                │
    │                                                            │
    │  Stores: embedding vector + metadata + chunk text          │
    └───────────────────────────┬───────────────────────────────┘
               │
    ┌──────────▼──────────┐
    │  Object Store (S3)   │
    │  Original documents  │
    └─────────────────────┘

Scaling Decisions

| Scale | Vector Store | Why |
|---|---|---|
| < 100K vectors | FAISS in-memory | Simple, fast, no infra |
| 100K – 10M | pgvector (PostgreSQL) | SQL filtering + vectors, one DB to manage |
| 10M – 100M | Qdrant / Weaviate | Purpose-built, better perf at scale |
| 100M+ | Pinecone / custom sharding | Managed, distributed, billion-scale |

Key Production Concerns

pgvector — The Backend Engineer's Sweet Spot

If you're already running PostgreSQL (and as a backend dev, you probably are), pgvector gives you vector search without new infrastructure:

```sql
-- Enable the extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1024),  -- matches your embedding model's dimensions
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create HNSW index for fast search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Semantic search with metadata filtering
SELECT content, metadata,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE metadata->>'department' = 'engineering'
ORDER BY embedding <=> $1::vector
LIMIT 10;
```

07. Common Pitfalls

Critical: Chunks too large or too small

Too large (> 1000 tokens): meaning gets diluted — a chunk about "auth AND caching AND deployment" won't match any topic well. Too small (< 100 tokens): loses context — "Use the --force flag" means nothing alone. Fix: 256–512 tokens with 10–15% overlap.

Critical: Not evaluating retrieval quality

Engineers focus on LLM output quality but ignore retrieval. If the wrong chunks are retrieved, the best LLM can't help. Fix: Build a retrieval eval set: 50+ (query, expected_document) pairs. Measure recall@5, recall@10, MRR. This matters more than evaluating the LLM.
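
Both metrics are only a few lines each. A sketch with hypothetical doc IDs:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(eval_set):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit
    per query (0 for queries where nothing relevant is retrieved)."""
    total = 0.0
    for retrieved, relevant in eval_set:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1 / rank
                break
    return total / len(eval_set)

# Hypothetical eval set: what search returned vs. what was expected
evals = [
    (["d3", "d1", "d9"], {"d1"}),  # first relevant hit at rank 2 -> 1/2
    (["d7", "d8", "d2"], {"d2"}),  # first relevant hit at rank 3 -> 1/3
]
print(recall_at_k(["d3", "d1", "d9"], {"d1"}, k=2))  # → 1.0
print(mrr(evals))  # (1/2 + 1/3) / 2 ≈ 0.417
```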

Critical: No similarity threshold

Vector search ALWAYS returns results — even when nothing relevant exists. A cosine similarity of 0.3 looks like a "match" but it's garbage. Fix: Set a threshold (e.g., 0.7). Below that, return "I don't have information about this" instead of hallucinating from irrelevant context.
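
A minimal gate, assuming each result dict carries a `score` field (a hypothetical shape — adapt it to whatever your vector store returns):

```python
SIMILARITY_THRESHOLD = 0.7  # tune against your own eval set

def relevant_results(results, threshold=SIMILARITY_THRESHOLD):
    """Drop results below the similarity floor. An empty return tells the
    caller to answer "I don't have information about this" instead of
    stuffing irrelevant chunks into the LLM prompt."""
    return [r for r in results if r["score"] >= threshold]

print(relevant_results([{"text": "unrelated doc", "score": 0.31}]))  # → []
```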

Important: Ignoring metadata filtering

Pure vector search returns "most similar" text, which might be from the wrong department, outdated, or wrong language. Fix: Always store and filter on metadata (source, date, department, access level). Pre-filter, then vector search within that subset.

Important: Using the wrong embedding model

General-purpose embedding models aren't optimized for your domain. Specialized models exist for code, legal text, medical text. Fix: Benchmark on YOUR data with 50 real queries. The MTEB leaderboard shows general benchmarks — your domain may differ.

Tip: Embedding model / query asymmetry

Some models (like E5 family) require prefixes — "query: " for queries and "passage: " for documents. Missing this kills performance silently. Fix: Always read the model card before building the pipeline.

Tip: Re-embedding everything on every update

Adding one document shouldn't require re-processing all documents. Fix: Design for incremental updates. Store document hashes. Only re-embed changed content. Use the vector store's upsert operations.
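
A sketch of hash-based change detection (the `stored_hashes` mapping is a hypothetical stand-in for whatever your pipeline persists alongside the index):

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reembed(docs, stored_hashes):
    """Only new or changed docs (hash mismatch) need chunking + embedding."""
    return [doc_id for doc_id, text in docs.items()
            if stored_hashes.get(doc_id) != content_hash(text)]

stored = {"runbook": content_hash("restart the service")}
docs = {"runbook": "restart the service", "faq": "how to reset a password"}
print(docs_to_reembed(docs, stored))  # → ['faq'] (unchanged runbook is skipped)
```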

08. Coding Exercises

Exercise 1 — Multi-Source RAG Pipeline

Build a system that accepts documents from multiple sources (plain text, markdown, JSON), uses different chunking strategies based on document type, stores embeddings with source metadata in FAISS, and implements search with source filtering + RAG with Claude. Success: Search for "deployment process" returns results from markdown runbooks even without keyword overlap.

Exercise 2 — Hybrid Search

Extend the base implementation: add BM25 keyword search alongside vector search (use rank_bm25), implement Reciprocal Rank Fusion (RRF) to merge result sets, and build an A/B comparison showing vector-only vs. hybrid results. Success: Query "ERR_CONNECTION_REFUSED troubleshooting" returns both that specific error (keyword) AND general network debugging guides (semantic).
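
As a starting point, the RRF merge itself is only a few lines. A sketch with hypothetical doc IDs (k=60 is the constant from the original RRF paper; the BM25 and vector retrieval steps are left to the exercise):

```python
def rrf_merge(result_lists, k=60):
    """Reciprocal Rank Fusion: score each doc by the sum of 1/(k + rank)
    over every ranked list it appears in, then sort by fused score."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["doc_net_debug", "doc_dns", "doc_err_conn"]
bm25_results = ["doc_err_conn", "doc_firewall"]

print(rrf_merge([vector_results, bm25_results])[0])  # → doc_err_conn (in both lists)
```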

Exercise 3 — Retrieval Evaluation Harness

Build a retrieval quality evaluation system: create a test dataset (20 queries with expected documents), implement recall@k, precision@k, and MRR metrics, compare at least two configurations (e.g., chunk_size=256 vs chunk_size=512). Success: Produce a report identifying the better chunking strategy for your specific data.

09. Architect-Level Questions

Q1: You're running a RAG system with 10M embedded documents. The embedding model provider releases a significantly better model. How do you migrate without downtime?
Think: blue-green indexing, you can't mix vectors from different models, compute cost planning, evaluation before switching, index versioning (model_name + version as metadata).
Q2: Your knowledge base has API docs (structured), support tickets (conversational), and legal contracts (formal). How do you design your chunking strategy?
Think: different content types need different strategies — API docs by endpoint, tickets by conversation turn, contracts by clause. Store strategy as metadata. Evaluate each independently.
Q3: Users complain the chatbot gives confident but wrong answers. The LLM works correctly — the problem is upstream. How do you diagnose retrieval quality issues?
Think: log retrieved chunks alongside responses, build eval set from failing queries, check similarity scores, adjust chunk size / add metadata filtering / implement re-ranking / set minimum thresholds.
Q4: Design a RAG system: P99 latency under 500ms, 1,000 QPS, 50M documents. Walk through your architecture.
Think: distributed vector DB, self-hosted embedding model (cut network latency), HNSW ~5-10ms at 50M, LLM is the bottleneck — streaming + caching + smaller models for common queries, tiered approach.
Q5: Beyond RAG and document search, what production systems benefit from embeddings? Describe a non-obvious use case.
Think: anomaly detection (embed user sessions, flag outliers), content deduplication, A/B test analysis (cluster feedback text), cache invalidation (compare old vs new embedding), feature engineering for ML models.