// Embeddings & Vector Search

The complete engineering guide — from vectors to production RAG pipelines

01. Core Concepts

What Are Embeddings?

An embedding is a dense numerical vector (array of floats) that represents the meaning of a piece of text. Think of it as a coordinate in a high-dimensional semantic space — texts with similar meanings land near each other.

"How do I reset my password?" → [0.021, -0.034, 0.118, ..., 0.045] # 1536 floats "I forgot my login credentials" → [0.019, -0.031, 0.121, ..., 0.042] # very similar vector "The weather is nice today" → [0.872, 0.441, -0.203, ..., 0.667] # very different vector

Key properties:

- Fixed-size: every input maps to the same dimensionality (e.g., 1536 for OpenAI, 1024 for Voyage AI).
- Semantic: "car" and "automobile" have nearby vectors.
- Continuous: small changes in meaning produce small changes in the vector.
- Language-agnostic: "perro" (Spanish) lands near "dog" (English).

What Is Vector Search?

Vector search (a.k.a. semantic search) finds the most similar vectors in a collection to a given query vector. Instead of keyword matching (WHERE title LIKE '%password%'), you find documents by meaning. No keyword overlap needed.

Query: "How do I change my login?" → vector → search Results: "Password reset instructions" (similarity: 0.94) "Account credentials FAQ" (similarity: 0.91) "Login troubleshooting guide" (similarity: 0.88)

The Backend Engineer Mental Model

If you know databases, this mapping will click instantly:

| Traditional DB | Vector DB |
|---|---|
| INSERT INTO docs (text) | INSERT INTO docs (text, embedding) |
| WHERE text LIKE '%keyword%' | ORDER BY cosine_similarity(embedding, q) DESC |
| B-tree index | HNSW / IVF index |
| Exact match | Approximate nearest neighbor (ANN) |

02. Why It Matters

LLMs have a knowledge cutoff and a context window limit. You can't stuff your entire knowledge base into a prompt. Embeddings + vector search let you retrieve only the relevant pieces and inject them into the prompt. This is Retrieval-Augmented Generation (RAG).

Where Companies Use This

| Use Case | Example | Why Embeddings |
|---|---|---|
| Customer support bots | Zendesk, Intercom | Search KB by user intent, not keywords |
| Internal knowledge search | Notion AI, Confluence AI | "Find the design doc about auth migration" |
| Code search | GitHub Copilot, Sourcegraph | Find semantically similar code across repos |
| E-commerce recommendations | Amazon, Shopify | Products similar to what you're viewing |
| Legal document discovery | Harvey AI | Find relevant case law from millions of docs |
| Anomaly detection | Financial systems | Transactions far from normal patterns |
| RAG pipelines | Every production LLM app | Ground LLM responses in factual data |

Why Not Just Full-Text Search?

Full-text search (Elasticsearch, PostgreSQL tsvector) fails when: the user's words differ from the document ("car" vs "vehicle"), intent matters more than keywords, cross-language search is needed, or you need to combine text with other modalities. In practice, production systems combine both: vector search for semantic recall + keyword search for precision (hybrid search).

03. Internal Mechanics

Embedding Model Pipeline

Modern embedding models are typically transformer encoders (BERT-style bidirectional attention, not the GPT decoder architecture):

1. Tokenization: text is split into subword tokens.
   "embedding" → ["em", "bed", "ding"]

2. Transformer Encoding: each token passes through self-attention + feed-forward layers, and each token gets a contextualized representation.

3. Pooling: the token representations are collapsed into a single vector, via mean pooling (most common) or the [CLS] token.

4. L2 Normalization: the vector is normalized to unit length, so cosine similarity = dot product (after normalization).
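
Steps 3 and 4 (pooling and normalization) can be sketched in a few lines of numpy. The token vectors here are toy values; real models pool hundreds of contextualized tokens into 1024+ dimensions:

```python
import numpy as np

# Toy contextualized token vectors (3 tokens, dim=4) standing in for
# the transformer encoder's output.
token_vectors = np.array([
    [0.2, -0.1, 0.4,  0.3],
    [0.1,  0.3, 0.2, -0.2],
    [0.0,  0.2, 0.5,  0.1],
])

# Step 3: mean pooling collapses the token vectors into one sentence vector
pooled = token_vectors.mean(axis=0)

# Step 4: L2 normalization scales it to unit length, so cosine similarity
# between two embeddings reduces to a plain dot product
embedding = pooled / np.linalg.norm(pooled)

print(np.linalg.norm(embedding))  # unit length (1.0 up to float error)
```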

Similarity Metrics

Cosine Similarity

cos(a,b) = (a·b) / (||a||×||b||)
Range: [-1, 1]
1 = identical direction
Most common for text

Euclidean (L2)

d(a,b) = √(Σ(ai - bi)²)
Range: [0, ∞)
0 = identical
Good for dense clusters

Dot Product

dot(a,b) = Σ(ai × bi)
When normalized:
dot = cosine similarity
Fastest to compute

For normalized embeddings (which most APIs return), all three give equivalent rankings. Cosine similarity is the standard for text.
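
A quick numpy check of that equivalence, using made-up vectors:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

# Made-up vectors, normalized to unit length as embedding APIs typically do
a = np.array([0.3, 0.4, 0.5]); a = a / np.linalg.norm(a)
b = np.array([0.1, 0.9, 0.2]); b = b / np.linalg.norm(b)

# Dot product equals cosine similarity for unit vectors...
assert abs(float(np.dot(a, b)) - cosine(a, b)) < 1e-9

# ...and squared Euclidean distance is 2 - 2*cos(a, b), a monotone
# function of cosine, so all three metrics rank neighbors identically.
assert abs(euclidean(a, b) ** 2 - (2 - 2 * cosine(a, b))) < 1e-9
```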

Vector Search Algorithms

| Algorithm | How It Works | Complexity | Best For |
|---|---|---|---|
| Flat (Brute Force) | Compare query against every vector | O(n × d) | < 100K vectors, perfect recall |
| IVF | Cluster vectors via k-means, search only nearest clusters | O(nprobe × n/nlist × d) | 100K–10M vectors |
| HNSW | Multi-layer graph, greedy navigation top→bottom | O(log n) | Production standard (pgvector, Pinecone, Qdrant) |
| PQ | Compress vectors: split + quantize subvectors | Reduces memory 32x+ | Billion-scale (combined with IVF) |
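
The flat (brute force) approach is simple enough to write out in full. A minimal numpy sketch, assuming all stored vectors are L2-normalized so a dot product is a cosine similarity:

```python
import numpy as np

def flat_search(query, vectors, top_k=3):
    """Flat (brute force) search: one dot product per stored vector, O(n x d).
    Assumes vectors are L2-normalized, so dot product = cosine similarity."""
    scores = vectors @ query
    top = np.argsort(-scores)[:top_k]  # indices of the highest scores
    return [(int(i), float(scores[i])) for i in top]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

results = flat_search(vectors[42], vectors)  # a stored vector queries itself
print(results[0])  # vector 42 ranks first with score ~1.0 (an exact match)
```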

HNSW — The Dominant Algorithm

Layer 3: A ---------> D                        (sparse, long jumps)
Layer 2: A ----> C --> D ----> F               (medium density)
Layer 1: A -> B -> C -> D -> E -> F -> G       (dense, local connections)
Layer 0: [all vectors with local neighbors]    (full graph)
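
The greedy navigation within a single layer can be sketched on a toy graph. This is only the core hop-to-the-closest-neighbor step; real HNSW adds candidate lists and layer-by-layer descent on top of it:

```python
import numpy as np

def greedy_layer_search(graph, vectors, query, entry):
    """Greedy navigation within one layer: hop to whichever neighbor is
    closer to the query; stop at a local minimum. Full HNSW repeats this
    from the top layer down, using each result as the next entry point."""
    current = entry
    while True:
        candidates = graph[current] + [current]
        best = min(candidates, key=lambda n: np.linalg.norm(vectors[n] - query))
        if best == current:
            return current
        current = best

# Toy 1-D layer: five vectors at positions 0..4, chained to their neighbors
vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

print(greedy_layer_search(graph, vectors, np.array([3.2]), entry=0))  # → 3
```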

The RAG Pipeline Architecture

┌──────────────────── INDEXING (offline) ────────────────────┐
│                                                             │
│  Documents → Chunk → Embed → Store in Vector DB             │
│                                                             │
└─────────────────────────────────────────────────────────────┘

┌──────────────────── QUERY (online) ────────────────────────┐
│                                                             │
│  User Query → Embed → Vector Search → Top-K chunks          │
│                          ↓                                  │
│               Chunks + Query → LLM Prompt → Response        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

04. Practical Example — Internal Doc Search

Problem: Your company has 5,000 internal docs (engineering runbooks, product specs, HR policies). Engineers ask questions in Slack, and the answers exist somewhere — but nobody can find them.

Solution: Build a RAG-powered search system.

┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  Docs (MD,  │────→│  Chunking +  │────→│  Vector DB   │
│  PDF, HTML) │     │  Embedding   │     │  (FAISS)     │
└─────────────┘     └──────────────┘     └──────────────┘
                                               │
┌─────────────┐     ┌──────────────┐           │
│  User Query │────→│  Embed Query │──── search ┘
│             │     └──────────────┘
│             │            │
│             │     ┌──────────────┐     ┌──────────────┐
│             │     │  Top-K Chunks│────→│  Claude LLM  │
│             │     └──────────────┘     │  + Context   │
│             │                          └──────┬───────┘
│  Response   │←────────────────────────────────┘
└─────────────┘

Scale considerations: 5,000 docs → ~50,000 chunks → FAISS in-memory is fine (< 500MB RAM). At 100K+ docs → consider pgvector or a managed service (Pinecone, Weaviate). Need metadata filtering? → pgvector (SQL WHERE + vector search).

05. Hands-on Implementation

Code files: app.py (standalone demo), chunking.py (chunking strategies), vector_store.py (FAISS abstraction).

Quick Start

```shell
# Install dependencies (from repo root)
pip install -r requirements.txt

# Set your API keys
cp .env.example .env
# Edit .env with your ANTHROPIC_API_KEY and VOYAGE_API_KEY

# Run the standalone demo
cd 02-embeddings-vector-search && python app.py
```

Step 1: Generate Embeddings

We use Voyage AI's embedding model (purpose-built for embeddings, higher quality than general-purpose models for retrieval tasks). A local fallback using sentence-transformers is available for experimenting without API costs.

```python
# The core operation — turning text into vectors
import voyageai

client = voyageai.Client()  # uses VOYAGE_API_KEY env var
response = client.embed(
    texts=["How do I reset my password?"],
    model="voyage-3-large",
)
embedding = response.embeddings[0]  # list of 1024 floats
```

Step 2: Chunk Your Documents

You can't embed a 50-page document as one vector — the meaning gets diluted. Split it into semantically meaningful chunks.

```python
# See chunking.py for full implementation
from chunking import RecursiveChunker

chunker = RecursiveChunker(chunk_size=512, chunk_overlap=50)
chunks = chunker.chunk(long_document_text)
# Each chunk is ~512 tokens with 50-token overlap for context continuity
```

| Strategy | Pros | Cons | Use When |
|---|---|---|---|
| Fixed-size (by tokens) | Simple, predictable | May cut mid-sentence | Uniform content |
| Sentence-aware | Preserves meaning | Variable chunk sizes | Prose, documentation |
| Recursive (by separators) | Respects doc structure | More complex | Markdown, code, structured |
| Semantic (by meaning shift) | Best quality | Expensive (needs embedding) | High-value content |
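
For intuition, here is a minimal word-based sliding-window chunker — a simplified stand-in for the fixed-size strategy, not the repo's actual RecursiveChunker (which splits on tokens and separators):

```python
def chunk_fixed(text, chunk_size=512, overlap=50):
    """Fixed-size sliding window over words: advance by (chunk_size - overlap)
    words so consecutive chunks share `overlap` words of context."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = "word " * 1200  # a 1,200-word toy document
chunks = chunk_fixed(doc)
print(len(chunks))  # → 3 (words 0-511, 462-973, 924-1199)
```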

Step 3: Build the Vector Index

```python
# See vector_store.py for full implementation
from vector_store import FaissVectorStore

store = FaissVectorStore(dimension=1024)

# Index your chunks
for chunk in chunks:
    embedding = embed(chunk.text)
    store.add(embedding, metadata={"text": chunk.text, "source": chunk.source})

# Persist to disk
store.save("./index_data")
```

Step 4: Search by Meaning

```python
query_embedding = embed("How do I change my login credentials?")
results = store.search(query_embedding, top_k=5)

# Returns chunks about password resets, account settings, etc.
# — even though the query doesn't contain those exact words
```

Step 5: RAG — Augment the LLM

```python
# Build the prompt with retrieved context
context = "\n\n".join([r["text"] for r in results])

prompt = f"""Based on the following documentation, answer the user's question.

Documentation:
{context}

Question: {user_query}

Answer based only on the provided documentation. If the answer isn't in the docs, say so."""

# Send to Claude
response = anthropic_client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
```

06. System Design — Production Architecture

┌────────────────────────────────────────────────────────────────┐
│                        API Gateway                              │
└──────────────┬──────────────────────────────────┬──────────────┘
               │                                  │
    ┌──────────▼──────────┐           ┌───────────▼───────────┐
    │   Ingestion Service  │           │    Query Service       │
    │                      │           │                        │
    │  • Receive documents │           │  • Embed user query    │
    │  • Chunk text        │           │  • Vector search       │
    │  • Generate embeds   │           │  • Re-rank results     │
    │  • Store in vector DB│           │  • Build LLM prompt    │
    │  • Store raw in S3   │           │  • Stream response     │
    └──────────┬───────────┘           └───────────┬────────────┘
               │                                   │
    ┌──────────▼───────────────────────────────────▼────────────┐
    │                     Vector Database                        │
    │  (pgvector / Pinecone / Qdrant / Weaviate)                │
    │                                                            │
    │  Stores: embedding vector + metadata + chunk text          │
    └───────────────────────────┬───────────────────────────────┘
               │
    ┌──────────▼──────────┐
    │  Object Store (S3)   │
    │  Original documents  │
    └─────────────────────┘

Scaling Decisions

| Scale | Vector Store | Why |
|---|---|---|
| < 100K vectors | FAISS in-memory | Simple, fast, no infra |
| 100K – 10M | pgvector (PostgreSQL) | SQL filtering + vectors, one DB to manage |
| 10M – 100M | Qdrant / Weaviate | Purpose-built, better perf at scale |
| 100M+ | Pinecone / custom sharding | Managed, distributed, billion-scale |

Key Production Concerns

pgvector — The Backend Engineer's Sweet Spot

If you're already running PostgreSQL (and as a backend dev, you probably are), pgvector gives you vector search without new infrastructure:

```sql
-- Enable the extension
CREATE EXTENSION vector;

-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding vector(1024),  -- matches your embedding model's dimensions
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create HNSW index for fast search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Semantic search with metadata filtering
SELECT content, metadata,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE metadata->>'department' = 'engineering'
ORDER BY embedding <=> $1::vector
LIMIT 10;
```

07. Common Pitfalls

Critical: Chunks too large or too small

Too large (> 1000 tokens): meaning gets diluted — a chunk about "auth AND caching AND deployment" won't match any topic well. Too small (< 100 tokens): loses context — "Use the --force flag" means nothing alone. Fix: 256–512 tokens with 10–15% overlap.

Critical: Not evaluating retrieval quality

Engineers focus on LLM output quality but ignore retrieval. If the wrong chunks are retrieved, the best LLM can't help. Fix: Build a retrieval eval set: 50+ (query, expected_document) pairs. Measure recall@5, recall@10, MRR. This matters more than evaluating the LLM.
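
Both metrics are only a few lines each. A sketch with hypothetical doc IDs:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(eval_set):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit
    per query (0 for queries where nothing relevant is retrieved)."""
    total = 0.0
    for retrieved, relevant in eval_set:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1 / rank
                break
    return total / len(eval_set)

# Hypothetical eval set: what search returned vs. what was expected
evals = [
    (["d3", "d1", "d9"], {"d1"}),  # first relevant hit at rank 2 -> 1/2
    (["d7", "d8", "d2"], {"d2"}),  # first relevant hit at rank 3 -> 1/3
]
print(recall_at_k(["d3", "d1", "d9"], {"d1"}, k=2))  # → 1.0
print(mrr(evals))  # (1/2 + 1/3) / 2 ≈ 0.417
```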

Critical: No similarity threshold

Vector search ALWAYS returns results — even when nothing relevant exists. A cosine similarity of 0.3 looks like a "match" but it's garbage. Fix: Set a threshold (e.g., 0.7). Below that, return "I don't have information about this" instead of hallucinating from irrelevant context.
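
A minimal gate, assuming each result dict carries a `score` field (a hypothetical shape — adapt it to whatever your vector store returns):

```python
SIMILARITY_THRESHOLD = 0.7  # tune against your own eval set

def relevant_results(results, threshold=SIMILARITY_THRESHOLD):
    """Drop results below the similarity floor. An empty return tells the
    caller to answer "I don't have information about this" instead of
    stuffing irrelevant chunks into the LLM prompt."""
    return [r for r in results if r["score"] >= threshold]

print(relevant_results([{"text": "unrelated doc", "score": 0.31}]))  # → []
```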

Important: Ignoring metadata filtering

Pure vector search returns "most similar" text, which might be from the wrong department, outdated, or wrong language. Fix: Always store and filter on metadata (source, date, department, access level). Pre-filter, then vector search within that subset.

Important: Using the wrong embedding model

General-purpose embedding models aren't optimized for your domain. Specialized models exist for code, legal text, medical text. Fix: Benchmark on YOUR data with 50 real queries. The MTEB leaderboard shows general benchmarks — your domain may differ.

Tip: Embedding model / query asymmetry

Some models (like E5 family) require prefixes — "query: " for queries and "passage: " for documents. Missing this kills performance silently. Fix: Always read the model card before building the pipeline.

Tip: Re-embedding everything on every update

Adding one document shouldn't require re-processing all documents. Fix: Design for incremental updates. Store document hashes. Only re-embed changed content. Use the vector store's upsert operations.
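
A sketch of hash-based change detection (the `stored_hashes` mapping is a hypothetical stand-in for whatever your pipeline persists alongside the index):

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reembed(docs, stored_hashes):
    """Only new or changed docs (hash mismatch) need chunking + embedding."""
    return [doc_id for doc_id, text in docs.items()
            if stored_hashes.get(doc_id) != content_hash(text)]

stored = {"runbook": content_hash("restart the service")}
docs = {"runbook": "restart the service", "faq": "how to reset a password"}
print(docs_to_reembed(docs, stored))  # → ['faq'] (unchanged runbook is skipped)
```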

08. Coding Exercises

Exercise 1 — Multi-Source RAG Pipeline

Build a system that accepts documents from multiple sources (plain text, markdown, JSON), uses different chunking strategies based on document type, stores embeddings with source metadata in FAISS, and implements search with source filtering + RAG with Claude. Success: Search for "deployment process" returns results from markdown runbooks even without keyword overlap.

Exercise 2 — Hybrid Search

Extend the base implementation: add BM25 keyword search alongside vector search (use rank_bm25), implement Reciprocal Rank Fusion (RRF) to merge result sets, and build an A/B comparison showing vector-only vs. hybrid results. Success: Query "ERR_CONNECTION_REFUSED troubleshooting" returns both that specific error (keyword) AND general network debugging guides (semantic).
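
As a starting point, the RRF merge itself is only a few lines. A sketch with hypothetical doc IDs (k=60 is the constant from the original RRF paper; the BM25 and vector retrieval steps are left to the exercise):

```python
def rrf_merge(result_lists, k=60):
    """Reciprocal Rank Fusion: score each doc by the sum of 1/(k + rank)
    over every ranked list it appears in, then sort by fused score."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["doc_net_debug", "doc_dns", "doc_err_conn"]
bm25_results = ["doc_err_conn", "doc_firewall"]

print(rrf_merge([vector_results, bm25_results])[0])  # → doc_err_conn (in both lists)
```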

Exercise 3 — Retrieval Evaluation Harness

Build a retrieval quality evaluation system: create a test dataset (20 queries with expected documents), implement recall@k, precision@k, and MRR metrics, compare at least two configurations (e.g., chunk_size=256 vs chunk_size=512). Success: Produce a report identifying the better chunking strategy for your specific data.

09. Architect-Level Questions

Q1: You're running a RAG system with 10M embedded documents. The embedding model provider releases a significantly better model. How do you migrate without downtime?
Think: blue-green indexing, you can't mix vectors from different models, compute cost planning, evaluation before switching, index versioning (model_name + version as metadata).
Q2: Your knowledge base has API docs (structured), support tickets (conversational), and legal contracts (formal). How do you design your chunking strategy?
Think: different content types need different strategies — API docs by endpoint, tickets by conversation turn, contracts by clause. Store strategy as metadata. Evaluate each independently.
Q3: Users complain the chatbot gives confident but wrong answers. The LLM works correctly — the problem is upstream. How do you diagnose retrieval quality issues?
Think: log retrieved chunks alongside responses, build eval set from failing queries, check similarity scores, adjust chunk size / add metadata filtering / implement re-ranking / set minimum thresholds.
Q4: Design a RAG system: P99 latency under 500ms, 1,000 QPS, 50M documents. Walk through your architecture.
Think: distributed vector DB, self-hosted embedding model (cut network latency), HNSW ~5-10ms at 50M, LLM is the bottleneck — streaming + caching + smaller models for common queries, tiered approach.
Q5: Beyond RAG and document search, what production systems benefit from embeddings? Describe a non-obvious use case.
Think: anomaly detection (embed user sessions, flag outliers), content deduplication, A/B test analysis (cluster feedback text), cache invalidation (compare old vs new embedding), feature engineering for ML models.