The complete engineering guide — from tokens to production architecture
A Large Language Model is a neural network trained on one objective: predict the next token in a sequence. That deceptively simple goal, at massive scale (billions of parameters, trillions of training tokens), produces emergent capabilities like reasoning, summarization, and code generation.
Text is split into subword units via BPE (Byte Pair Encoding). Depending on the tokenizer, "unhappiness" might become ["un", "happi", "ness"] (3 tokens). You pay per token, and JSON is expensive: braces, quotes, and colons each consume tokens. A typical vocabulary has 32K-200K entries.
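The merge loop at the heart of BPE can be sketched in a few lines (a toy illustration: real tokenizers learn their merge table offline from a large corpus, so the exact splits of "unhappiness" depend on the model):

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One BPE step: merge the most frequent adjacent pair into a single unit."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)   # the pair becomes one new vocabulary entry
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters and repeatedly merge the most frequent pair.
text = "unhappiness unhappiness unhappy"
tokens = list(text)
for _ in range(8):
    tokens = bpe_merge_step(tokens)
print(tokens)
```

Merging never loses information (joining the tokens reproduces the input); it only trades a longer sequence for a larger vocabulary, which is exactly the tradeoff behind 32K-200K-entry vocabularies.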
Each token ID maps to a high-dimensional vector (e.g., 4096-d). These vectors encode semantic meaning: relationships like king - man + woman ~ queen hold approximately in this space. Embeddings bridge discrete text and continuous math.
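The analogy can be demonstrated with toy vectors and cosine similarity (the 4-d embeddings below are invented for illustration; real models use thousands of dimensions and learn them during training):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings; dimensions loosely encode [royalty, male, female, misc].
emb = {
    "king":  [0.9, 0.8, 0.1, 0.3],
    "man":   [0.1, 0.9, 0.1, 0.2],
    "woman": [0.1, 0.1, 0.9, 0.2],
    "queen": [0.9, 0.1, 0.8, 0.3],
}

# king - man + woman, computed element-wise, then nearest neighbor by cosine.
analogy = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
best = max(emb, key=lambda word: cosine(analogy, emb[word]))
print(best)  # "queen" with these toy vectors
```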
The Transformer architecture (2017, "Attention Is All You Need") replaced RNNs with self-attention -- every token looks at every other token in parallel. This parallelism enables efficient GPU training at scale.
At inference, the model generates one token at a time. Each new token depends on all previous tokens. This is why: responses stream token-by-token, longer outputs take longer, and you cannot skip ahead.
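The generation loop can be sketched abstractly; `model_step` below is a stand-in for a full forward pass that returns the next token ID:

```python
def generate(model_step, prompt_ids, max_new_tokens):
    """Autoregressive decoding: each step conditions on every token so far."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = model_step(ids)   # forward pass over ALL of ids, not just the last token
        ids.append(next_id)
        yield next_id               # streaming: emit each token as it is produced

# Stub "model" that just returns last-token + 1, to show the mechanics:
out = list(generate(lambda ids: ids[-1] + 1, [10, 11], max_new_tokens=3))
print(out)  # [12, 13, 14]
```

Because each step needs the full prefix, you cannot parallelize across output positions at inference the way you can during training, and you cannot skip ahead.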
Understanding LLM internals is not academic curiosity -- it drives architectural and business decisions.
| Concern | The Engineering Reality | Decision |
|---|---|---|
| Latency | 500-token response at 50ms/token = 25s wall-clock. Streaming shows first token in ~500ms. | Always stream in user-facing apps. |
| Cost | Verbose JSON (800 tokens) vs. an optimized prompt (200 tokens) = 75% savings. At 1M requests and $3/M input tokens, that's ~$1,800 saved. | Measure token usage per endpoint. Route simple tasks to cheaper models. |
| Context | Attention cost grows as O(n^2) with sequence length. A 200K-token window exists, but filling it is expensive. | Send minimum context. Use RAG instead of dumping docs. |
| Hallucination | Model predicts plausible tokens, not true ones. Confidently wrong on rare facts. | Never use raw output as truth. Use RAG, tool use, validation layers. |
| Prompts | System prompt + user message = "left context" shifting probability distribution. | Treat prompts as code: version, test, review in PRs. |

| Company Type | How They Apply LLM Internals |
|---|---|
| SaaS products | Token budgeting per tenant, model routing by task complexity |
| Customer support | Streaming for chat UX, structured output for ticket classification |
| Developer tools | Context window management for code completion (relevant files only) |
| Search engines | RAG pipelines -- embed, retrieve, generate with grounding |
| Healthcare / Legal | Heavy validation layers because hallucination is unacceptable |
Modern LLMs (GPT, Claude, Llama) use a decoder-only Transformer. Here is the complete data flow from input text to next-token prediction.
Input Text: "The cat sat"
            |
            v
+-----------------------+
|       TOKENIZER       |  "The cat sat" -> [464, 3797, 3290]
|      (BPE / SPM)      |  Deterministic, not neural
+-----------+-----------+
            v
+-----------------------+
|    TOKEN EMBEDDING    |  token_id 464 -> vector in R^d_model
|     + POSITIONAL      |  Adds position info (RoPE or learned)
|       ENCODING        |
+-----------+-----------+
            v
+-------------------------------------------------------+
|        TRANSFORMER BLOCK (repeated N times)           |
|           (N = 32 for 7B, 80 for 70B, etc.)           |
|                                                       |
|  +--------------------------------------------------+ |
|  |       MASKED MULTI-HEAD SELF-ATTENTION           | |
|  |                                                  | |
|  |  For each token position i:                      | |
|  |    Q_i = W_Q . x_i  (What am I looking for?)     | |
|  |    K_j = W_K . x_j  (What do I contain?)         | |
|  |    V_j = W_V . x_j  (What do I provide?)         | |
|  |                                                  | |
|  |  Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) . V| |
|  |  Causal mask: token i can only see j <= i        | |
|  |  Multi-head: 32-128 parallel attention heads     | |
|  +-------------------------+------------------------+ |
|                            v                          |
|  +--------------------------------------------------+ |
|  |     ADD & LAYER NORM (Residual Connection)       | |
|  |     output = LayerNorm(x + Attention(x))         | |
|  +-------------------------+------------------------+ |
|                            v                          |
|  +--------------------------------------------------+ |
|  |      FEED-FORWARD NETWORK (FFN / MLP)            | |
|  |  FFN(x) = W2 . GELU(W1 . x + b1) + b2            | |
|  |  Expands: d_model -> 4x d_model -> d_model       | |
|  |  This is where "knowledge" is stored             | |
|  +-------------------------+------------------------+ |
|                            v                          |
|  +--------------------------------------------------+ |
|  |     ADD & LAYER NORM (Residual Connection)       | |
|  +-------------------------+------------------------+ |
+----------------------------+--------------------------+
                             v
+-------------------------------------------------------+
|              LM HEAD (Linear + Softmax)               |
|                                                       |
|  hidden_state in R^d_model                            |
|            |                                          |
|            v                                          |
|  W_unembed . hidden_state -> logits in R^vocab_size   |
|            |                                          |
|            v                                          |
|  softmax(logits / temperature) -> probability dist.   |
|            |                                          |
|            v                                          |
|  Sample or argmax -> next token ID                    |
+-------------------------------------------------------+
Self-attention is the core innovation. For an input sequence of vectors [x_1, x_2, ..., x_n]:
Multi-Head Attention: Instead of one set of Q, K, V projections, the model runs h heads in parallel. Different heads learn different relationships -- syntactic, positional, semantic, coreference. Outputs are concatenated and projected: MultiHead(Q,K,V) = Concat(head_1, ..., head_h) . W_O.
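A single-head version of causal attention fits in a few lines of plain Python (a readability sketch; production implementations use batched tensor operations and fused kernels):

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask.
    q, k, v: one d_k-dimensional vector per token position."""
    d_k = len(q[0])
    out = []
    for i in range(len(q)):
        # Causal mask: position i attends only to positions j <= i.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)    # attention weights sum to 1 over visible positions
        out.append([sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(len(v[0]))])
    return out

# Two positions: position 0 can only see itself, so its output is exactly v[0].
out = causal_attention([[1.0, 0.0], [0.0, 1.0]],
                       [[1.0, 0.0], [0.0, 1.0]],
                       [[1.0, 2.0], [3.0, 4.0]])
```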
During autoregressive generation, without optimization the model recomputes attention over all previous tokens at every step -- O(n^2) total work. The KV cache eliminates this: the K and V vectors for past tokens are computed once and stored, so each new token only computes its own Q, K, V and attends over the cached keys and values.
Production implications: KV cache grows linearly with sequence length and batch size. For a 70B model with 128K context, KV cache alone can be ~40GB. Techniques like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce this.
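The ~40GB figure is easy to sanity-check (a back-of-envelope sketch; the layer and head counts below are typical for a 70B-class model with GQA, not any specific release):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    """K and V each store one head_dim vector per KV head, per layer, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed config: 80 layers, 8 KV heads (GQA), head_dim 128, fp16, 128K context.
gb = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                    seq_len=128_000) / 1e9
print(f"~{gb:.0f} GB per sequence")  # ~42 GB per sequence
```

With full multi-head attention (e.g., 64 KV heads instead of 8) the same cache would be 8x larger -- which is exactly the saving MQA/GQA buy you.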
The model outputs logits (raw scores) over the vocabulary. How you convert these into a chosen token is the sampling strategy: greedy argmax, temperature sampling, top-k, or top-p (nucleus) sampling.
Temperature intuition with a concrete example:
Suppose the model's top-5 token probabilities (after softmax) at three temperatures:

| Token | T=1.0 | T=0.5 (sharper) | T=1.5 (flatter) |
|---|---|---|---|
| "Paris" | 0.70 | 0.92 | 0.45 |
| "Lyon" | 0.15 | 0.05 | 0.22 |
| "the" | 0.08 | 0.02 | 0.16 |
| "Marseille" | 0.05 | 0.01 | 0.11 |
| "a" | 0.02 | 0.00 | 0.06 |
| Temperature | Use Case |
|---|---|
| T=0.0 | Structured output, classification, extraction, code generation |
| T=0.3-0.7 | General Q&A, summarization |
| T=0.8-1.0 | Creative writing, brainstorming |
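The logits-to-token step can be sketched directly (the logits below are made up for illustration):

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Temperature-scaled sampling over raw logits; T=0 means greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]      # softmax(logits / T)
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 0.5, 0.1]                   # three hypothetical candidate tokens
greedy = sample_next_token(logits, temperature=0)
print(greedy)  # 0 -- greedy always picks the highest logit
```

Dividing logits by T < 1 exaggerates gaps before the softmax (sharper distribution); T > 1 shrinks them (flatter distribution), matching the table above.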
Where the LLM fits in a production backend, with all the components you need around it:
+------------------------------------------+
| YOUR BACKEND |
| |
+---------+ HTTPS/WSS | +---------+ +------------------+ |
| Client |-------------| | API |-->| Prompt Builder | |
| (React, |<------------| | Gateway | | | |
| Mobile)| SSE stream | | | | - Template engine| |
+---------+ | | - Auth | | - Variable inject| |
| | - Rate | | - Few-shot select| |
| | limit | +--------+---------+ |
| | - CORS | v |
| +---------+ +------------------+ |
| | Token Counter | | Budget
| | - Pre-flight check|------- exceeded?
| | - Truncation | | -> 413
| +--------+----------+ |
| v |
| +------------------+ |
| | Semantic Cache | | Cache
| | (Redis + embeds) |-------- hit?
| | - Hash lookup | | -> return
| | - Similarity match| |
| +--------+----------+ |
| v |
| +------------------+ | +--------+
| | LLM API Client |------------->| Claude |
| | - Retry w/ backoff|<-------------| API |
| | - Stream proxy | | +--------+
| +--------+----------+ |
| v |
| +------------------+ |
| | Output Pipeline | |
| | - JSON parser | |
| | - Schema validate| |
| | - Safety filter | |
| | - PII redaction | |
| +--------+----------+ |
| v |
| +------------------+ |
| | Observability |----------> Prometheus
| | - Token counts |----------> Grafana
| | - Latency (TTFT) |----------> Datadog
| | - Cost per req | |
| +------------------+ |
+------------------------------------------+
The app.py script demonstrates six core LLM concepts interactively: tokenization, token cost comparison, generation (streaming vs non-streaming), temperature effects, context budget checking, and structured code review output.
The demo walks through each concept with real API calls so you can observe tokenization splitting, cost differences between verbose and compact formats, streaming vs buffered generation, and how temperature controls output randomness.
The model generates statistically plausible continuations, not looked-up facts. Never ask it to recall specific numbers or dates without grounding. Use RAG: retrieve real data, inject into prompt, let the model reason over it.
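Grounding can be as simple as injecting retrieved passages above the question. A minimal sketch, where the retrieval step, prompt wording, and example data are placeholders:

```python
def build_grounded_prompt(question, retrieved_passages):
    """Inject retrieved text so the model reasons over real data, not recall."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieved_passages))
    return (
        "Answer using ONLY the sources below. "
        "If the answer is not in the sources, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    "What was Q3 revenue?",
    ["Q3 revenue was $4.2M, up 12% YoY."],  # e.g., from a vector-store lookup
)
```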
Putting user input directly into prompts lets users override your instructions. Use the API's dedicated system parameter, validate/sanitize input, and never use LLM output for security-critical decisions (auth, permissions).
If stop_reason is max_tokens instead of end_turn, the response was truncated. Your JSON is missing closing braces. Always check and handle: retry with higher limit, continue generation, or return error.
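The check itself is trivial; the discipline is doing it on every call. A sketch using a plain dict in place of the SDK's response object (the stop_reason values follow the Anthropic Messages API):

```python
def require_complete(response):
    """Reject truncated responses instead of parsing half-finished JSON."""
    if response["stop_reason"] == "max_tokens":
        raise ValueError("truncated: retry with a higher max_tokens, "
                         "continue generation, or return an error")
    return response["text"]

ok = require_complete({"stop_reason": "end_turn", "text": '{"score": 7}'})
```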
100K token prompt x $3/M input tokens = $0.30/request. At 1,000 req/hour = $7,200/day. Measure tokens per endpoint, set up cost dashboards, implement semantic caching, route simple tasks to cheaper models.
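The arithmetic is worth encoding once and reusing (the $3/M input price comes from the example above; the $15/M output price is an assumed placeholder -- substitute your provider's real rates):

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_m=3.00, output_price_per_m=15.00):
    """Per-request cost in USD; default prices are illustrative only."""
    return (input_tokens / 1e6) * input_price_per_m \
         + (output_tokens / 1e6) * output_price_per_m

per_request = request_cost_usd(100_000, 0)   # the 100K-token prompt above
per_day = per_request * 1_000 * 24           # at 1,000 req/hour
print(per_request, per_day)
```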
A 500-token response at 50ms/token = 25s wait. With streaming, perceived latency drops to ~500ms. Always stream in user-facing apps. Only buffer when you need the complete response (e.g., JSON parsing).
Models get deprecated, new versions release, you want to A/B test. Use environment variables or config: MODEL = os.environ.get("LLM_MODEL", "claude-sonnet-4-20250514").
If the LLM API is down, your entire app crashes. Implement fallback: try alternate provider on RateLimitError, enqueue for async processing on APIError.
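A minimal fallback chain might look like this (the exception classes and provider callables are stand-ins for real SDK clients; the scheme above additionally enqueues for async processing on APIError rather than just moving to the next provider):

```python
class RateLimitError(Exception): pass
class APIError(Exception): pass

def call_with_fallback(providers, prompt):
    """Try each (name, callable) provider in order; raise only if all fail."""
    last_error = None
    for name, call in providers:
        try:
            return call(prompt)
        except (RateLimitError, APIError) as e:
            last_error = e           # log with `name`, then try the next provider
    raise APIError(f"all providers failed: {last_error!r}")

# Stub providers standing in for real SDK clients:
def flaky(prompt):  raise RateLimitError("429")
def backup(prompt): return f"answer to: {prompt}"

result = call_with_fallback([("primary", flaky), ("backup", backup)], "hi")
```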
Build middleware that enforces token budgets per API key: count tokens before hitting the LLM, reject over-budget requests (413), track cumulative usage, return 429 when daily budget exceeded. Add headers: X-Tokens-Used, X-Tokens-Remaining. Stretch: automatic prompt truncation (trim from middle).
Build a script that empirically measures how temperature affects consistency. Run the same prompt N times at each temperature, compute pairwise Jaccard similarity on token sets, average into a consistency score (0-1). Expected: T=0.0 ~ 1.0, T=0.5 ~ 0.7-0.9, T=1.0 ~ 0.3-0.6. Stretch: plot with matplotlib, compare factual vs creative prompts.
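The scoring metric from this exercise can be sketched up front (whitespace splitting stands in for real tokenizer tokens):

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity on token sets: |A & B| / |A | B|."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def consistency_score(outputs):
    """Average pairwise Jaccard similarity across N sampled outputs."""
    pairs = [(i, j) for i in range(len(outputs))
                    for j in range(i + 1, len(outputs))]
    return sum(jaccard(outputs[i].split(), outputs[j].split())
               for i, j in pairs) / len(pairs)

# Two identical samples and one disjoint one -> one perfect pair out of three.
score = consistency_score(["the cat sat", "the cat sat", "a dog ran"])
```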
Build a code review system that handles large diffs: parse into per-file chunks, estimate token counts, split files exceeding 50% of context window, stream review comments as SSE events with file/line/severity/message. Implement retry with exponential backoff (max 3 per chunk). Stretch: use tool_use for guaranteed JSON schema.