The complete engineering guide — from tokens to production architecture
A Large Language Model is a neural network trained on one objective: predict the next token in a sequence. That deceptively simple goal, at massive scale (billions of parameters, trillions of training tokens), produces emergent capabilities like reasoning, summarization, and code generation.
Text is split into subword units via BPE (Byte Pair Encoding). Depending on the tokenizer, "unhappiness" might become ["un", "happi", "ness"] (3 tokens). You pay per token, and JSON is expensive: braces, quotes, and colons each consume tokens. A typical vocabulary has 32K-200K entries.
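The merge loop at the heart of BPE can be sketched in a few lines (a toy illustration: real tokenizers learn their merge table offline from a large corpus, so the exact splits of "unhappiness" depend on the model):

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One BPE step: merge the most frequent adjacent pair into a single unit."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)   # the pair becomes one new vocabulary entry
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters and repeatedly merge the most frequent pair.
text = "unhappiness unhappiness unhappy"
tokens = list(text)
for _ in range(8):
    tokens = bpe_merge_step(tokens)
print(tokens)
```

Merging never loses information (joining the tokens reproduces the input); it only trades a longer sequence for a larger vocabulary, which is exactly the tradeoff behind 32K-200K-entry vocabularies.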
Each token ID maps to a high-dimensional vector (e.g., 4096-d). These vectors encode semantic meaning: relationships like king - man + woman ~ queen hold approximately in this space. Embeddings bridge discrete text and continuous math.
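The analogy can be demonstrated with toy vectors and cosine similarity (the 4-d embeddings below are invented for illustration; real models use thousands of dimensions and learn them during training):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings; dimensions loosely encode [royalty, male, female, misc].
emb = {
    "king":  [0.9, 0.8, 0.1, 0.3],
    "man":   [0.1, 0.9, 0.1, 0.2],
    "woman": [0.1, 0.1, 0.9, 0.2],
    "queen": [0.9, 0.1, 0.8, 0.3],
}

# king - man + woman, computed element-wise, then nearest neighbor by cosine.
analogy = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
best = max(emb, key=lambda word: cosine(analogy, emb[word]))
print(best)  # "queen" with these toy vectors
```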
The Transformer architecture (2017, "Attention Is All You Need") replaced RNNs with self-attention -- every token looks at every other token in parallel. This parallelism enables efficient GPU training at scale.
At inference, the model generates one token at a time. Each new token depends on all previous tokens. This is why: responses stream token-by-token, longer outputs take longer, and you cannot skip ahead.
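The generation loop can be sketched abstractly; `model_step` below is a stand-in for a full forward pass that returns the next token ID:

```python
def generate(model_step, prompt_ids, max_new_tokens):
    """Autoregressive decoding: each step conditions on every token so far."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = model_step(ids)   # forward pass over ALL of ids, not just the last token
        ids.append(next_id)
        yield next_id               # streaming: emit each token as it is produced

# Stub "model" that just returns last-token + 1, to show the mechanics:
out = list(generate(lambda ids: ids[-1] + 1, [10, 11], max_new_tokens=3))
print(out)  # [12, 13, 14]
```

Because each step needs the full prefix, you cannot parallelize across output positions at inference the way you can during training, and you cannot skip ahead.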
Understanding LLM internals is not academic curiosity -- it drives architectural and business decisions.
| Concern | The Engineering Reality | Decision |
|---|---|---|
| Latency | 500-token response at 50ms/token = 25s wall-clock. Streaming shows first token in ~500ms. | Always stream in user-facing apps. |
| Cost | Verbose JSON (800 tokens) vs. an optimized prompt (200 tokens) = 75% savings. At 1M requests and $3/M input tokens, that's ~$1,800 saved. | Measure token usage per endpoint. Route simple tasks to cheaper models. |
| Context | Attention cost grows as O(n^2) with sequence length. A 200K-token window exists, but filling it is expensive. | Send minimum context. Use RAG instead of dumping docs. |
| Hallucination | Model predicts plausible tokens, not true ones. Confidently wrong on rare facts. | Never use raw output as truth. Use RAG, tool use, validation layers. |
| Prompts | System prompt + user message = "left context" shifting probability distribution. | Treat prompts as code: version, test, review in PRs. |

| Company Type | How They Apply LLM Internals |
|---|---|
| SaaS products | Token budgeting per tenant, model routing by task complexity |
| Customer support | Streaming for chat UX, structured output for ticket classification |
| Developer tools | Context window management for code completion (relevant files only) |
| Search engines | RAG pipelines -- embed, retrieve, generate with grounding |
| Healthcare / Legal | Heavy validation layers because hallucination is unacceptable |
Modern LLMs (GPT, Claude, Llama) use a decoder-only Transformer. Here is the complete data flow from input text to next-token prediction.
Input Text: "The cat sat"
            |
            v
+-----------------------+
|       TOKENIZER       |  "The cat sat" -> [464, 3797, 3290]
|      (BPE / SPM)      |  Deterministic, not neural
+-----------+-----------+
            v
+-----------------------+
|    TOKEN EMBEDDING    |  token_id 464 -> vector in R^d_model
|     + POSITIONAL      |  Adds position info (RoPE or learned)
|       ENCODING        |
+-----------+-----------+
            v
+-------------------------------------------------------+
|        TRANSFORMER BLOCK (repeated N times)           |
|           (N = 32 for 7B, 80 for 70B, etc.)           |
|                                                       |
|  +--------------------------------------------------+ |
|  |       MASKED MULTI-HEAD SELF-ATTENTION           | |
|  |                                                  | |
|  |  For each token position i:                      | |
|  |    Q_i = W_Q . x_i  (What am I looking for?)     | |
|  |    K_j = W_K . x_j  (What do I contain?)         | |
|  |    V_j = W_V . x_j  (What do I provide?)         | |
|  |                                                  | |
|  |  Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) . V| |
|  |  Causal mask: token i can only see j <= i        | |
|  |  Multi-head: 32-128 parallel attention heads     | |
|  +-------------------------+------------------------+ |
|                            v                          |
|  +--------------------------------------------------+ |
|  |     ADD & LAYER NORM (Residual Connection)       | |
|  |     output = LayerNorm(x + Attention(x))         | |
|  +-------------------------+------------------------+ |
|                            v                          |
|  +--------------------------------------------------+ |
|  |      FEED-FORWARD NETWORK (FFN / MLP)            | |
|  |  FFN(x) = W2 . GELU(W1 . x + b1) + b2            | |
|  |  Expands: d_model -> 4x d_model -> d_model       | |
|  |  This is where "knowledge" is stored             | |
|  +-------------------------+------------------------+ |
|                            v                          |
|  +--------------------------------------------------+ |
|  |     ADD & LAYER NORM (Residual Connection)       | |
|  +-------------------------+------------------------+ |
+----------------------------+--------------------------+
                             v
+-------------------------------------------------------+
|              LM HEAD (Linear + Softmax)               |
|                                                       |
|  hidden_state in R^d_model                            |
|            |                                          |
|            v                                          |
|  W_unembed . hidden_state -> logits in R^vocab_size   |
|            |                                          |
|            v                                          |
|  softmax(logits / temperature) -> probability dist.   |
|            |                                          |
|            v                                          |
|  Sample or argmax -> next token ID                    |
+-------------------------------------------------------+
Self-attention is the core innovation. For an input sequence of vectors [x_1, x_2, ..., x_n]:
Multi-Head Attention: Instead of one set of Q, K, V projections, the model runs h heads in parallel. Different heads learn different relationships -- syntactic, positional, semantic, coreference. Outputs are concatenated and projected: MultiHead(Q,K,V) = Concat(head_1, ..., head_h) . W_O.
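A single-head version of causal attention fits in a few lines of plain Python (a readability sketch; production implementations use batched tensor operations and fused kernels):

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask.
    q, k, v: one d_k-dimensional vector per token position."""
    d_k = len(q[0])
    out = []
    for i in range(len(q)):
        # Causal mask: position i attends only to positions j <= i.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)    # attention weights sum to 1 over visible positions
        out.append([sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(len(v[0]))])
    return out

# Two positions: position 0 can only see itself, so its output is exactly v[0].
out = causal_attention([[1.0, 0.0], [0.0, 1.0]],
                       [[1.0, 0.0], [0.0, 1.0]],
                       [[1.0, 2.0], [3.0, 4.0]])
```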
During autoregressive generation, without optimization the model recomputes attention over all previous tokens at every step -- O(n^2) total work. The KV cache eliminates this: the K and V vectors for past tokens are computed once and stored, so each new token only computes its own Q, K, V and attends over the cached keys and values.
Production implications: KV cache grows linearly with sequence length and batch size. For a 70B model with 128K context, KV cache alone can be ~40GB. Techniques like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce this.
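The ~40GB figure is easy to sanity-check (a back-of-envelope sketch; the layer and head counts below are typical for a 70B-class model with GQA, not any specific release):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    """K and V each store one head_dim vector per KV head, per layer, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed config: 80 layers, 8 KV heads (GQA), head_dim 128, fp16, 128K context.
gb = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                    seq_len=128_000) / 1e9
print(f"~{gb:.0f} GB per sequence")  # ~42 GB per sequence
```

With full multi-head attention (e.g., 64 KV heads instead of 8) the same cache would be 8x larger -- which is exactly the saving MQA/GQA buy you.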
The model outputs logits (raw scores) over the vocabulary. How you convert these into a chosen token is the sampling strategy: greedy argmax, temperature sampling, top-k, or top-p (nucleus) sampling.
Temperature intuition with a concrete example:
Suppose the model's top-5 token probabilities (after softmax) at three temperatures:

| Token | T=1.0 | T=0.5 (sharper) | T=1.5 (flatter) |
|---|---|---|---|
| "Paris" | 0.70 | 0.92 | 0.45 |
| "Lyon" | 0.15 | 0.05 | 0.22 |
| "the" | 0.08 | 0.02 | 0.16 |
| "Marseille" | 0.05 | 0.01 | 0.11 |
| "a" | 0.02 | 0.00 | 0.06 |
| Temperature | Use Case |
|---|---|
| T=0.0 | Structured output, classification, extraction, code generation |
| T=0.3-0.7 | General Q&A, summarization |
| T=0.8-1.0 | Creative writing, brainstorming |
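The logits-to-token step can be sketched directly (the logits below are made up for illustration):

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Temperature-scaled sampling over raw logits; T=0 means greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]      # softmax(logits / T)
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 0.5, 0.1]                   # three hypothetical candidate tokens
greedy = sample_next_token(logits, temperature=0)
print(greedy)  # 0 -- greedy always picks the highest logit
```

Dividing logits by T < 1 exaggerates gaps before the softmax (sharper distribution); T > 1 shrinks them (flatter distribution), matching the table above.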
Where the LLM fits in a production backend, with all the components you need around it:
+------------------------------------------+
| YOUR BACKEND |
| |
+---------+ HTTPS/WSS | +---------+ +------------------+ |
| Client |-------------| | API |-->| Prompt Builder | |
| (React, |<------------| | Gateway | | | |
| Mobile)| SSE stream | | | | - Template engine| |
+---------+ | | - Auth | | - Variable inject| |
| | - Rate | | - Few-shot select| |
| | limit | +--------+---------+ |
| | - CORS | v |
| +---------+ +------------------+ |
| | Token Counter | | Budget
| | - Pre-flight check|------- exceeded?
| | - Truncation | | -> 413
| +--------+----------+ |
| v |
| +------------------+ |
| | Semantic Cache | | Cache
| | (Redis + embeds) |-------- hit?
| | - Hash lookup | | -> return
| | - Similarity match| |
| +--------+----------+ |
| v |
| +------------------+ | +--------+
| | LLM API Client |------------->| Claude |
| | - Retry w/ backoff|<-------------| API |
| | - Stream proxy | | +--------+
| +--------+----------+ |
| v |
| +------------------+ |
| | Output Pipeline | |
| | - JSON parser | |
| | - Schema validate| |
| | - Safety filter | |
| | - PII redaction | |
| +--------+----------+ |
| v |
| +------------------+ |
| | Observability |----------> Prometheus
| | - Token counts |----------> Grafana
| | - Latency (TTFT) |----------> Datadog
| | - Cost per req | |
| +------------------+ |
+------------------------------------------+
The app.py script demonstrates six core LLM concepts interactively: tokenization, token cost comparison, generation (streaming vs non-streaming), temperature effects, context budget checking, and structured code review output.
The demo walks through each concept with real API calls so you can observe tokenization splitting, cost differences between verbose and compact formats, streaming vs buffered generation, and how temperature controls output randomness.
The model generates statistically plausible continuations, not looked-up facts. Never ask it to recall specific numbers or dates without grounding. Use RAG: retrieve real data, inject into prompt, let the model reason over it.
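Grounding can be as simple as injecting retrieved passages above the question. A minimal sketch, where the retrieval step, prompt wording, and example data are placeholders:

```python
def build_grounded_prompt(question, retrieved_passages):
    """Inject retrieved text so the model reasons over real data, not recall."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieved_passages))
    return (
        "Answer using ONLY the sources below. "
        "If the answer is not in the sources, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    "What was Q3 revenue?",
    ["Q3 revenue was $4.2M, up 12% YoY."],  # e.g., from a vector-store lookup
)
```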
Putting user input directly into prompts lets users override your instructions. Use the API's dedicated system parameter, validate/sanitize input, and never use LLM output for security-critical decisions (auth, permissions).
If stop_reason is max_tokens instead of end_turn, the response was truncated. Your JSON is missing closing braces. Always check and handle: retry with higher limit, continue generation, or return error.
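The check itself is trivial; the discipline is doing it on every call. A sketch using a plain dict in place of the SDK's response object (the stop_reason values follow the Anthropic Messages API):

```python
def require_complete(response):
    """Reject truncated responses instead of parsing half-finished JSON."""
    if response["stop_reason"] == "max_tokens":
        raise ValueError("truncated: retry with a higher max_tokens, "
                         "continue generation, or return an error")
    return response["text"]

ok = require_complete({"stop_reason": "end_turn", "text": '{"score": 7}'})
```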
100K token prompt x $3/M input tokens = $0.30/request. At 1,000 req/hour = $7,200/day. Measure tokens per endpoint, set up cost dashboards, implement semantic caching, route simple tasks to cheaper models.
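The arithmetic is worth encoding once and reusing (the $3/M input price comes from the example above; the $15/M output price is an assumed placeholder -- substitute your provider's real rates):

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_m=3.00, output_price_per_m=15.00):
    """Per-request cost in USD; default prices are illustrative only."""
    return (input_tokens / 1e6) * input_price_per_m \
         + (output_tokens / 1e6) * output_price_per_m

per_request = request_cost_usd(100_000, 0)   # the 100K-token prompt above
per_day = per_request * 1_000 * 24           # at 1,000 req/hour
print(per_request, per_day)
```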
A 500-token response at 50ms/token = 25s wait. With streaming, perceived latency drops to ~500ms. Always stream in user-facing apps. Only buffer when you need the complete response (e.g., JSON parsing).
Models get deprecated, new versions release, you want to A/B test. Use environment variables or config: MODEL = os.environ.get("LLM_MODEL", "claude-sonnet-4-20250514").
If the LLM API is down, your entire app crashes. Implement fallback: try alternate provider on RateLimitError, enqueue for async processing on APIError.
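A minimal fallback chain might look like this (the exception classes and provider callables are stand-ins for real SDK clients; the scheme above additionally enqueues for async processing on APIError rather than just moving to the next provider):

```python
class RateLimitError(Exception): pass
class APIError(Exception): pass

def call_with_fallback(providers, prompt):
    """Try each (name, callable) provider in order; raise only if all fail."""
    last_error = None
    for name, call in providers:
        try:
            return call(prompt)
        except (RateLimitError, APIError) as e:
            last_error = e           # log with `name`, then try the next provider
    raise APIError(f"all providers failed: {last_error!r}")

# Stub providers standing in for real SDK clients:
def flaky(prompt):  raise RateLimitError("429")
def backup(prompt): return f"answer to: {prompt}"

result = call_with_fallback([("primary", flaky), ("backup", backup)], "hi")
```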
Build middleware that enforces token budgets per API key: count tokens before hitting the LLM, reject over-budget requests (413), track cumulative usage, return 429 when daily budget exceeded. Add headers: X-Tokens-Used, X-Tokens-Remaining. Stretch: automatic prompt truncation (trim from middle).
Build a script that empirically measures how temperature affects consistency. Run the same prompt N times at each temperature, compute pairwise Jaccard similarity on token sets, average into a consistency score (0-1). Expected: T=0.0 ~ 1.0, T=0.5 ~ 0.7-0.9, T=1.0 ~ 0.3-0.6. Stretch: plot with matplotlib, compare factual vs creative prompts.
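The scoring metric from this exercise can be sketched up front (whitespace splitting stands in for real tokenizer tokens):

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity on token sets: |A & B| / |A | B|."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def consistency_score(outputs):
    """Average pairwise Jaccard similarity across N sampled outputs."""
    pairs = [(i, j) for i in range(len(outputs))
                    for j in range(i + 1, len(outputs))]
    return sum(jaccard(outputs[i].split(), outputs[j].split())
               for i, j in pairs) / len(pairs)

# Two identical samples and one disjoint one -> one perfect pair out of three.
score = consistency_score(["the cat sat", "the cat sat", "a dog ran"])
```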
Build a code review system that handles large diffs: parse into per-file chunks, estimate token counts, split files exceeding 50% of context window, stream review comments as SSE events with file/line/severity/message. Implement retry with exponential backoff (max 3 per chunk). Stretch: use tool_use for guaranteed JSON schema.