The complete engineering guide — from vectors to production RAG pipelines
An embedding is a dense numerical vector (array of floats) that represents the meaning of a piece of text. Think of it as a coordinate in a high-dimensional semantic space — texts with similar meanings land near each other.
Key properties:
- Fixed-size: every input maps to the same dimensionality (e.g., 1536 for OpenAI, 1024 for Voyage AI).
- Semantic: "car" and "automobile" have nearby vectors.
- Continuous: small changes in meaning produce small changes in the vector.
- Language-agnostic (for multilingual models): "perro" (Spanish) lands near "dog" (English).
Vector search (a.k.a. semantic search) finds the most similar vectors in a collection to a given query vector. Instead of keyword matching (WHERE title LIKE '%password%'), you find documents by meaning. No keyword overlap needed.
If you know databases, this mapping will click instantly:
| Traditional DB | Vector DB |
|---|---|
| INSERT INTO docs (text) | INSERT INTO docs (text, embedding) |
| WHERE text LIKE '%keyword%' | ORDER BY cosine_similarity(embedding, q) DESC |
| B-tree index | HNSW / IVF index |
| Exact match | Approximate nearest neighbor (ANN) |
LLMs have a knowledge cutoff and a context window limit. You can't stuff your entire knowledge base into a prompt. Embeddings + vector search let you retrieve only the relevant pieces and inject them into the prompt. This is Retrieval-Augmented Generation (RAG).
| Use Case | Example | Why Embeddings |
|---|---|---|
| Customer support bots | Zendesk, Intercom | Search KB by user intent, not keywords |
| Internal knowledge search | Notion AI, Confluence AI | "Find the design doc about auth migration" |
| Code search | GitHub Copilot, Sourcegraph | Find semantically similar code across repos |
| E-commerce recommendations | Amazon, Shopify | Products similar to what you're viewing |
| Legal document discovery | Harvey AI | Find relevant case law from millions of docs |
| Anomaly detection | Financial systems | Transactions far from normal patterns |
| RAG pipelines | Every production LLM app | Ground LLM responses in factual data |
Full-text search (Elasticsearch, PostgreSQL tsvector) fails when: the user's words differ from the document ("car" vs "vehicle"), intent matters more than keywords, cross-language search is needed, or you need to combine text with other modalities. In practice, production systems combine both: vector search for semantic recall + keyword search for precision (hybrid search).
Modern embedding models are transformer encoders (BERT-style bidirectional attention, not the GPT decoder architecture). To compare the vectors they produce, three distance/similarity metrics dominate:
| Metric | Formula | Range | Notes |
|---|---|---|---|
| Cosine similarity | cos(a,b) = (a·b) / (‖a‖ × ‖b‖) | [-1, 1]; 1 = identical direction | Most common for text |
| Euclidean distance | d(a,b) = √(Σ(aᵢ − bᵢ)²) | [0, ∞); 0 = identical | Good for dense clusters |
| Dot product | dot(a,b) = Σ(aᵢ × bᵢ) | unbounded | Fastest to compute; equals cosine similarity when vectors are normalized |
For normalized embeddings (which most APIs return), all three give equivalent rankings. Cosine similarity is the standard for text.
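A quick numpy check on toy vectors (randomly generated here, purely illustrative) confirms the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # L2-normalize, as most APIs do
q = rng.normal(size=8)
q /= np.linalg.norm(q)

cosine = docs @ q                          # equals the dot product: all vectors are unit-norm
euclid = np.linalg.norm(docs - q, axis=1)  # Euclidean distance to the query

# For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), so all three metrics
# produce the same ranking (cosine/dot descending == Euclidean ascending).
ranking_by_cosine = np.argsort(-cosine)
ranking_by_euclid = np.argsort(euclid)
```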
| Algorithm | How It Works | Complexity | Best For |
|---|---|---|---|
| Flat (Brute Force) | Compare query against every vector | O(n × d) | < 100K vectors, perfect recall |
| IVF | Cluster vectors via k-means, search only nearest clusters | O(nprobe × n/nlist × d) | 100K–10M vectors |
| HNSW | Multi-layer graph, greedy navigation top→bottom | O(log n) | Production standard (pgvector, Pinecone, Qdrant) |
| PQ | Compress vectors: split + quantize subvectors | Reduces memory 32x+ | Billion-scale (combined with IVF) |
┌──────────────────── INDEXING (offline) ────────────────────┐
│                                                            │
│   Documents → Chunk → Embed → Store in Vector DB           │
│                                                            │
└────────────────────────────────────────────────────────────┘

┌──────────────────── QUERY (online) ────────────────────────┐
│                                                            │
│   User Query → Embed → Vector Search → Top-K chunks        │
│                             ↓                              │
│   Chunks + Query → LLM Prompt → Response                   │
│                                                            │
└────────────────────────────────────────────────────────────┘
Problem: Your company has 5,000 internal docs (engineering runbooks, product specs, HR policies). Engineers ask questions in Slack, and the answers exist somewhere — but nobody can find them.
Solution: Build a RAG-powered search system.
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│ Docs (MD,   │────→│ Chunking +   │────→│  Vector DB   │
│ PDF, HTML)  │     │ Embedding    │     │   (FAISS)    │
└─────────────┘     └──────────────┘     └──────┬───────┘
                                                │
┌─────────────┐     ┌──────────────┐            │
│ User Query  │────→│ Embed Query  │─── search ─┘
└─────────────┘     └──────┬───────┘
                           │
                    ┌──────▼───────┐     ┌──────────────┐
                    │ Top-K Chunks │────→│  Claude LLM  │
                    └──────────────┘     │  + Context   │
                                         └──────┬───────┘
┌─────────────┐                                 │
│  Response   │←────────────────────────────────┘
└─────────────┘
Scale considerations: 5,000 docs → ~50,000 chunks → FAISS in-memory is fine (< 500MB RAM). At 100K+ docs → consider pgvector or a managed service (Pinecone, Weaviate). Need metadata filtering? → pgvector (SQL WHERE + vector search).
Code files: app.py (standalone demo), chunking.py (chunking strategies), vector_store.py (FAISS abstraction).
We use Voyage AI's embedding model (purpose-built for embeddings, higher quality than general-purpose models for retrieval tasks). A local fallback using sentence-transformers is available for experimenting without API costs.
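A dual-backend helper might look like the sketch below. The function name, the `voyage-3` model id, and the fallback model `all-MiniLM-L6-v2` are illustrative choices, not code from the files above; check the current model lists before relying on them. The `encoder` parameter exists so the function can be exercised without an API key or a model download.

```python
def embed_texts(texts, encoder=None):
    """Embed a list of strings. Prefers Voyage AI; falls back to a local
    sentence-transformers model. `encoder` can be injected for testing."""
    if encoder is not None:
        return encoder(texts)
    try:
        import voyageai
        client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
        return client.embed(texts, model="voyage-3").embeddings
    except Exception:
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")
        return model.encode(texts).tolist()
```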
You can't embed a 50-page document as one vector — the meaning gets diluted. Split it into semantically meaningful chunks.
| Strategy | Pros | Cons | Use When |
|---|---|---|---|
| Fixed-size (by tokens) | Simple, predictable | May cut mid-sentence | Uniform content |
| Sentence-aware | Preserves meaning | Variable chunk sizes | Prose, documentation |
| Recursive (by separators) | Respects doc structure | More complex | Markdown, code, structured |
| Semantic (by meaning shift) | Best quality | Expensive (needs embedding) | High-value content |
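Fixed-size chunking with overlap fits in a few lines; this sketch counts whitespace-separated words for simplicity, where a production pipeline would count model tokens (e.g., with a tokenizer like tiktoken):

```python
def chunk_fixed(text, chunk_size=256, overlap=32):
    """Split text into chunks of `chunk_size` words, each sharing
    `overlap` words with the previous chunk so context isn't cut off."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```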
┌────────────────────────────────────────────────────────────────┐
│ API Gateway │
└──────────────┬──────────────────────────────────┬──────────────┘
│ │
┌──────────▼──────────┐ ┌───────────▼───────────┐
│ Ingestion Service │ │ Query Service │
│ │ │ │
│ • Receive documents │ │ • Embed user query │
│ • Chunk text │ │ • Vector search │
│ • Generate embeds │ │ • Re-rank results │
│ • Store in vector DB│ │ • Build LLM prompt │
│ • Store raw in S3 │ │ • Stream response │
└──────────┬───────────┘ └───────────┬────────────┘
│ │
┌──────────▼───────────────────────────────────▼────────────┐
│ Vector Database │
│ (pgvector / Pinecone / Qdrant / Weaviate) │
│ │
│ Stores: embedding vector + metadata + chunk text │
└───────────────────────────┬───────────────────────────────┘
│
┌──────────▼──────────┐
│ Object Store (S3) │
│ Original documents │
└─────────────────────┘
| Scale | Vector Store | Why |
|---|---|---|
| < 100K vectors | FAISS in-memory | Simple, fast, no infra |
| 100K – 10M | pgvector (PostgreSQL) | SQL filtering + vectors, one DB to manage |
| 10M – 100M | Qdrant / Weaviate | Purpose-built, better perf at scale |
| 100M+ | Pinecone / custom sharding | Managed, distributed, billion-scale |
If you're already running PostgreSQL (and as a backend dev, you probably are), pgvector gives you vector search without new infrastructure:
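A minimal sketch of what that looks like (table and column names are illustrative; `vector(1024)` assumes a 1024-dimensional model, and `<=>` is pgvector's cosine distance operator):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
    id         bigserial PRIMARY KEY,
    source     text,
    chunk_text text,
    embedding  vector(1024)   -- must match your embedding model's dimensionality
);

-- HNSW index over cosine distance (pgvector >= 0.5)
CREATE INDEX ON doc_chunks USING hnsw (embedding vector_cosine_ops);

-- Metadata pre-filter plus vector search in a single query
SELECT chunk_text, 1 - (embedding <=> $1) AS cosine_similarity
FROM doc_chunks
WHERE source = 'runbooks'
ORDER BY embedding <=> $1
LIMIT 5;
```

The `WHERE` clause is the point: SQL filtering and vector ranking compose in one statement, which dedicated vector stores often handle less directly.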
Too large (> 1000 tokens): meaning gets diluted — a chunk about "auth AND caching AND deployment" won't match any topic well. Too small (< 100 tokens): loses context — "Use the --force flag" means nothing alone. Fix: 256–512 tokens with 10-15% overlap.
Engineers focus on LLM output quality but ignore retrieval. If the wrong chunks are retrieved, the best LLM can't help. Fix: Build a retrieval eval set: 50+ (query, expected_document) pairs. Measure recall@5, recall@10, MRR. This matters more than evaluating the LLM.
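The metrics themselves are small enough to write by hand. A sketch, where `results` is a list of ranked document-id lists (one per query) and `expected` holds the relevant id for each query:

```python
def recall_at_k(results, expected, k):
    """Fraction of queries whose expected document appears in the top-k."""
    hits = sum(1 for ranked, doc in zip(results, expected) if doc in ranked[:k])
    return hits / len(expected)

def mrr(results, expected):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for ranked, doc in zip(results, expected):
        if doc in ranked:
            total += 1.0 / (ranked.index(doc) + 1)
    return total / len(expected)
```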
Vector search ALWAYS returns results — even when nothing relevant exists. A cosine similarity of 0.3 looks like a "match" but it's garbage. Fix: Set a threshold (e.g., 0.7). Below that, return "I don't have information about this" instead of hallucinating from irrelevant context.
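The gate is a one-liner worth making explicit; similarity thresholds are model-specific, so treat 0.7 as a starting point to tune against your own eval set:

```python
def filter_hits(hits, threshold=0.7):
    """Drop retrieved (chunk, similarity) pairs below the threshold.
    An empty result should trigger an 'I don't know' answer, not an LLM call."""
    return [(text, score) for text, score in hits if score >= threshold]
```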
Pure vector search returns "most similar" text, which might be from the wrong department, outdated, or wrong language. Fix: Always store and filter on metadata (source, date, department, access level). Pre-filter, then vector search within that subset.
General-purpose embedding models aren't optimized for your domain. Specialized models exist for code, legal text, medical text. Fix: Benchmark on YOUR data with 50 real queries. The MTEB leaderboard shows general benchmarks — your domain may differ.
Some models (like E5 family) require prefixes — "query: " for queries and "passage: " for documents. Missing this kills performance silently. Fix: Always read the model card before building the pipeline.
Adding one document shouldn't require re-processing all documents. Fix: Design for incremental updates. Store document hashes. Only re-embed changed content. Use the vector store's upsert operations.
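A content-hash check is the core of this design. A sketch using SHA-256 (helper names are illustrative):

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reembed(docs, stored_hashes):
    """Return ids of documents that are new or changed since the last run.
    `docs` maps doc_id -> current text; `stored_hashes` maps doc_id -> last hash."""
    return [
        doc_id for doc_id, text in docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]
```

Only the returned ids go through chunking and embedding again; everything else keeps its existing vectors via upsert.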
Build a system that accepts documents from multiple sources (plain text, markdown, JSON), uses different chunking strategies based on document type, stores embeddings with source metadata in FAISS, and implements search with source filtering + RAG with Claude. Success: Search for "deployment process" returns results from markdown runbooks even without keyword overlap.
Extend the base implementation: add BM25 keyword search alongside vector search (use rank_bm25), implement Reciprocal Rank Fusion (RRF) to merge result sets, and build an A/B comparison showing vector-only vs. hybrid results. Success: Query "ERR_CONNECTION_REFUSED troubleshooting" returns both that specific error (keyword) AND general network debugging guides (semantic).
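Reciprocal Rank Fusion itself is compact. One possible sketch (k=60 is the constant from the original RRF paper; a larger k damps the influence of top ranks):

```python
def rrf_merge(rankings, k=60):
    """Merge ranked lists: score(d) = sum over lists of 1 / (k + rank(d)),
    counting only the lists where d appears. Ranks start at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked by both vector and keyword search accumulates score from both lists, which is what pushes hybrid hits to the top.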
Build a retrieval quality evaluation system: create a test dataset (20 queries with expected documents), implement recall@k, precision@k, and MRR metrics, compare at least two configurations (e.g., chunk_size=256 vs chunk_size=512). Success: Produce a report identifying the better chunking strategy for your specific data.