Deep Dive

How Transformer LLMs Actually Work

A comprehensive, interactive journey from attention mechanisms to production deployment. Built by someone who's been writing code since the Commodore 64 era.

14 chapters · ~45 min read · 20+ interactive visualizations

Part I: Setting the Stage

Understanding why Transformers were invented by understanding what they replaced.

Chapter 1

The Problem That Needed Solving

I'd been implementing neural networks for years before the Transformer came along. I remember the excitement when LSTMs first clicked for me—finally, a way to model sequences that didn't immediately forget everything! But I also remember the frustration: training was slow, sequences longer than a few hundred tokens were problematic, and there was always this nagging sense that we were fighting the architecture rather than working with it.

Then in June 2017, a paper with possibly the most provocative title in machine learning history dropped: "Attention Is All You Need." The claim was audacious—throw out recurrence entirely and rely solely on attention mechanisms. I was skeptical. How could you model sequences without... sequence processing?

To understand why Transformers were such a breakthrough, we need to understand what they replaced and why that replacement was necessary.

The Sequential Processing Dilemma

Language has order. "Dog bites man" means something very different from "Man bites dog." Any architecture that processes language needs to respect this ordering. The natural approach, which dominated for years, was to process sequences one element at a time.

Recurrent Neural Networks (RNNs) embodied this intuition perfectly. At each timestep t, the network computes a hidden state h_t as a function of the previous hidden state h_{t-1} and the current input x_t:

h_t = f(h_{t-1}, x_t)

This is elegant. The hidden state acts as a "memory" that carries context forward through the sequence. Process "The cat sat on the" token by token, and by the time you reach "the," your hidden state hopefully encodes something useful about cats and sitting.

But there's a fundamental problem: sequential computation can't be parallelized. You can't compute h_{10} until you've computed h_9, which requires h_8, and so on. Your expensive GPU with thousands of cores? Most of them sit idle, waiting for the sequential chain to complete.
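
Here's a minimal NumPy sketch of that recurrence (the cell is a bare tanh RNN with random weights, purely to show the sequential dependency, not a real implementation):

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, h0):
    """h_t = tanh(W_h @ h_{t-1} + W_x @ x_t), one step at a time."""
    h, states = h0, []
    for x_t in x_seq:                     # each step needs the previous h,
        h = np.tanh(W_h @ h + W_x @ x_t)  # so this loop cannot be parallelized
        states.append(h)
    return states

d_h, d_x, T = 8, 4, 20
rng = np.random.default_rng(0)
states = rnn_forward(rng.normal(size=(T, d_x)),
                     0.1 * rng.normal(size=(d_h, d_h)),
                     0.1 * rng.normal(size=(d_h, d_x)),
                     np.zeros(d_h))
print(len(states))  # 20 hidden states, computed strictly in order
```

The `for` loop is the whole problem: timestep 10 literally cannot start until timestep 9 has finished.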

The Vanishing Gradient Problem

Even worse than the parallelization issue is the vanishing gradient problem. When we train neural networks with backpropagation, gradients flow backward through the network. For RNNs, this means gradients must flow backward through time—through every single timestep.

Here's the mathematical reality: when you backpropagate through a recurrent connection, you multiply by the Jacobian matrix of the hidden state transition. If the eigenvalues of this matrix are less than 1, gradients shrink exponentially. If greater than 1, they explode exponentially.

In practice, this meant RNNs could only effectively learn dependencies spanning maybe 10-20 timesteps. Need to connect a pronoun to its antecedent 50 words back? Good luck.
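
You can see the effect with nothing more than a multiplication loop (a toy stand-in for backprop-through-time, where the per-step factor plays the role of the recurrent Jacobian's scale):

```python
factor = 0.9   # stand-in for the "size" of the recurrent Jacobian at each step
grad = 1.0
for t in range(20):
    grad *= factor
print(f"gradient after 20 steps: {grad:.4f}")  # ~0.12; try factor=1.1 to watch it explode
```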

Interactive: Vanishing Gradient Visualizer

Adjust the multiplicative factor to see how gradients behave as they propagate through time. Factor < 1 causes vanishing, > 1 causes explosion.

(Interactive widget: gradient magnitude at each of 20 timesteps for an adjustable multiplicative factor, following gradient(t) = gradient(0) × factor^t; at a factor of 0.9, only about 12% of the signal survives 20 steps.)

Why This Matters: When gradients vanish, early layers receive near-zero learning signals. The network can't learn long-range dependencies because information from distant timesteps can't influence weight updates. LSTMs mitigate this with additive cell-state updates.

LSTMs and GRUs: A Partial Solution

Long Short-Term Memory networks were designed specifically to address vanishing gradients. The key innovation was the "cell state"—a highway that allows information to flow through time with minimal transformation. Instead of a single hidden state being multiplied at each step, LSTMs use gating mechanisms:

  • Forget gate: decides what to discard from the cell state
  • Input gate: decides what new information to store
  • Output gate: decides what to output based on the cell state

The cell state updates through addition rather than multiplication, which helps gradients flow more easily. LSTMs could handle dependencies of maybe 100-200 tokens—much better than vanilla RNNs, but still fundamentally limited.

And they were still sequential. Still couldn't parallelize. Still left most of your GPU idle.

The Attention Insight

The breakthrough came from an unexpected direction: machine translation. In 2014, Bahdanau and colleagues introduced attention for sequence-to-sequence models. The idea was simple but profound: instead of forcing all information through a fixed-size hidden state bottleneck, let the decoder "look back" at all encoder states and focus on the relevant ones.

This was attention as an add-on to RNNs. The Transformer paper asked: what if attention was the only mechanism? What if we threw out recurrence entirely?

Interactive: RNN vs Attention Path Length

Compare how information flows in RNNs (sequential, O(n) path length) versus attention (direct connections, O(1) path length). Adjust sequence length to see the difference.

(Interactive widget: information propagating across an 8-token sequence, step by step through hidden states for the RNN versus a single direct hop for attention, with path length 7 vs 1 and the corresponding 7× speedup readout.)

Why Path Length Matters: In an RNN, information from position 0 must pass through 7 hidden states to reach position 7, and each step is a chance for information loss and gradient decay. Attention provides direct O(1) connections between any two positions, enabling learning of long-range dependencies.

The visualization above captures the key insight. In an RNN, if position 1 needs to influence position 20, the signal must pass through 19 hidden states—19 chances for information to be corrupted or lost. With attention, position 1 can directly attend to position 20. The path length is constant, regardless of sequence length.

This isn't just theoretically elegant—it's practically transformative. Shorter paths mean better gradient flow. Direct connections mean the model can learn arbitrary dependencies without fighting the architecture.

And crucially: all attention computations for all positions can happen in parallel. Your GPU is finally earning its keep.

Chapter 2

"Attention Is All You Need" — The 2017 Revolution

Let's build up the Transformer architecture piece by piece. I find that the individual components make intuitive sense, but it took me a while to see how they fit together into something greater than the sum of its parts.

Self-Attention: The Core Innovation

Self-attention is the mechanism that lets each position in a sequence attend to all other positions. Consider the sentence: "The animal didn't cross the street because it was too tired." What does "it" refer to? The animal or the street?

For humans, this is easy—"tired" makes it clear we're talking about the animal. For a model, this requires connecting "it" to "animal" across multiple intervening words. Self-attention provides exactly this capability: when processing "it," the model can directly attend to "animal" and learn to recognize this pattern.

Query, Key, Value: The Information Retrieval Analogy

The Query-Key-Value framework is perhaps the cleverest abstraction in the Transformer. Think of it like a database lookup:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What do I return if you match me?"

Each position generates all three vectors by projecting its embedding through learned weight matrices W_Q, W_K, and W_V. Then:

  1. The query from position i is compared against keys from all positions
  2. This comparison produces attention scores (how relevant is each position?)
  3. Scores are normalized via softmax to get attention weights
  4. Values are combined using these weights to produce the output

Mathematically, for an input X with n tokens and d dimensions:

Q = X · W_Q, K = X · W_K, V = X · W_V
Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

That √d_k scaling factor isn't arbitrary. Without it, when d_k is large, the dot products grow large, pushing softmax into regions where it has extremely small gradients. The scaling keeps the variance of the dot products at 1, keeping softmax in a well-behaved regime.
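
Here's the whole computation in a few lines of NumPy (a single-head sketch with random weights; W_Q, W_K, W_V are the learned projections from the text, and no causal mask is applied):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n): every query scored against every key
    weights = softmax(scores)                # each row sums to 1
    return weights @ V                       # weighted mix of value vectors

n, d, d_k = 4, 8, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
print(attention(X, W_Q, W_K, W_V).shape)     # (4, 8): one output vector per token
```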

Interactive: Attention Score Calculator

Step through the attention computation on a small example. See how Q, K, V matrices are formed and how attention weights emerge.

(Interactive widget: starting from input embeddings X ∈ ℝ^(4×4) for the tokens "The cat sat down", the calculator forms Q, K, and V and steps through softmax(Q·K^T / √d_k)·V to the final attention output.)

Multi-Head Attention: Multiple Perspectives

A single attention head learns one notion of "relevance"—maybe syntactic structure, maybe semantic similarity, maybe something else entirely. Multi-head attention runs multiple attention operations in parallel, each with its own learned projections.

In practice, heads specialize. Research has shown that different heads learn to track different types of relationships: some focus on adjacent tokens, some on syntactic dependencies, some on semantic similarity across long distances.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O
where head_i = Attention(Q · W_Q^i, K · W_K^i, V · W_V^i)
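
As a sketch, multi-head attention is just the single-head computation run h times on split dimensions and then re-mixed (NumPy, random weights, no causal mask; this is illustrative rather than any library's implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    n, d = X.shape
    d_head = d // n_heads
    # project, then split the model dimension into heads: (n_heads, n, d_head)
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh                      # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d)   # re-join the heads
    return concat @ W_O                               # final output projection

n, d = 6, 16
rng = np.random.default_rng(1)
W_Q, W_K, W_V, W_O = (0.1 * rng.normal(size=(d, d)) for _ in range(4))
out = multi_head_attention(rng.normal(size=(n, d)), W_Q, W_K, W_V, W_O, n_heads=4)
print(out.shape)   # (6, 16)
```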

Interactive: Multi-Head Attention Patterns

Explore how different attention heads specialize. Each heatmap shows which tokens attend to which other tokens for a given head.

(Interactive widget: per-head attention heatmaps for the sentence "The cat sat on the mat because it was tired"; rows are query tokens, columns are key tokens, and the coreference head puts most of "it"'s attention on "cat".)

Head Specialization: Different attention heads learn different patterns. Some track local context; others handle coreference ("it" → "cat"), syntax, or long-distance semantic relationships. This diversity is why multi-head attention outperforms single-head.

Positional Encoding: Where Am I?

Here's a subtle but critical point: attention is permutation equivariant. If you shuffle the input tokens, attention doesn't care—it produces shuffled outputs. But "Dog bites man" ≠ "Man bites dog"!

Positional encodings inject position information into the model. The original Transformer used sinusoidal functions:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Why sinusoids? They have a beautiful property: the encoding for position pos+k can be expressed as a linear function of the encoding for position pos. This means the model can potentially learn to attend to relative positions.
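
The formulas translate directly into code (a small NumPy sketch):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]           # index of each sin/cos dimension pair
    angle = pos / (10000 ** (2 * i / d_model))     # (n_positions, d_model // 2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions: cosine
    return pe

pe = positional_encoding(n_positions=32, d_model=16)
print(pe[10, :4])   # the start of position 10's unique encoding
```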

Interactive: Positional Encoding Visualizer

Visualize how sinusoidal positional encodings create unique patterns for each position. Different dimensions oscillate at different frequencies.

(Interactive widget: sinusoidal wave patterns across dimensions and the resulting encoding vectors at selectable positions; lower dimensions oscillate faster, higher dimensions slower.)

Three properties to notice: every position gets a unique encoding pattern; PE(pos+k) can be expressed as a linear function of PE(pos), so relative positions are recoverable; and the encoding is multi-scale, with low dimensions capturing local structure and high dimensions capturing global structure.

Feed-Forward Networks and Layer Norm

After attention, each position passes through a position-wise feed-forward network. This is just a two-layer MLP, but it's applied independently to each position:

FFN(x) = max(0, x · W_1 + b_1) · W_2 + b_2

The inner dimension is typically 4× the model dimension—this expansion provides representational capacity that attention alone lacks. Modern variants use GELU or SwiGLU activations instead of ReLU.

Residual connections wrap both the attention and FFN sublayers, and layer normalization ensures training stability. The "pre-norm" variant (norm before the sublayer) has become standard as it provides better gradient flow.
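
Putting the sublayers together, a pre-norm block looks roughly like this (a PyTorch sketch; the dimensions are arbitrary, the use of nn.MultiheadAttention is illustrative, and no causal mask is applied):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                  # position-wise FFN with 4x expansion
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)                                   # pre-norm: normalize first
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual around attention
        x = x + self.ffn(self.norm2(x))                     # residual around FFN
        return x

block = PreNormBlock(d_model=64, n_heads=4)
print(block(torch.randn(1, 10, 64)).shape)   # torch.Size([1, 10, 64])
```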

The Complete Picture

Stack N of these layers (N=6 in the original paper, N=32-80+ in modern LLMs), and you have a Transformer. Each layer refines the representations: early layers might capture syntactic structure, later layers might encode more abstract semantics.

What struck me when I first implemented this was how simple the individual pieces are. Matrix multiplications, softmax, layer norm, ReLU. Nothing exotic. The magic is in the combination and scale.

Chapter 3

The Training Objective and What Models Actually Learn

Here's what continues to amaze me about modern LLMs: the training objective is almost laughably simple. Next-token prediction. Given the tokens you've seen so far, predict the probability distribution over what comes next.

That's it. No explicit teaching of grammar, no labeled examples of reasoning, no curriculum of skills. Just: predict the next token, billions of times, on trillions of tokens of text.

Why This Works: A Deep Connection

The mathematical justification is elegant. If you model P(x_t | x_1, ..., x_{t-1}) well for all positions in all texts, you're implicitly modeling the entire distribution of human text. Grammar rules? They're just patterns that make certain next tokens more likely. Factual knowledge? It's encoded in which continuations are probable. Reasoning? It's the patterns that connect premises to conclusions.

The loss function is cross-entropy:

L = -Σ log P(x_t | x_1, ..., x_{t-1})

Perplexity, the standard metric, is just exp(loss). A perplexity of 10 means the model is, on average, as uncertain as if choosing uniformly among 10 options. Modern LLMs achieve perplexities of 8-15 on standard benchmarks.
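
The loss-to-perplexity relationship is easy to sanity-check with made-up numbers (the probabilities below are invented purely for illustration):

```python
import numpy as np

# Probabilities the model assigned to the *actual* next token at five positions.
p_correct = np.array([0.60, 0.10, 0.35, 0.80, 0.05])

loss = -np.mean(np.log(p_correct))   # average negative log-likelihood per token
perplexity = np.exp(loss)
print(f"loss = {loss:.3f} nats/token, perplexity = {perplexity:.2f}")
```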

Interactive: Next-Token Prediction

Type a prompt and see the probability distribution over possible next tokens. Adjust temperature to see how it affects the distribution.

(Interactive widget: for the prompt "The capital of France is", the model puts roughly 88% of its probability on "Paris", with the rest spread over tokens like "the", "a", and "known"; a temperature slider from 0 to 2 reshapes the distribution.)

Temperature Effect: Low temperature (→0) makes the model more deterministic, always picking the highest-probability token. High temperature (→2) flattens the distribution, making unlikely tokens more likely to be sampled.

Emergent Capabilities: Scale Changes Everything

Emergent capabilities are perhaps the most surprising aspect of LLM scaling. These are abilities that appear suddenly at certain scales but are absent in smaller models.

GPT-2 (1.5B parameters) could generate coherent paragraphs but struggled with few-shot learning. GPT-3 (175B parameters) could solve new tasks from just a few examples in the prompt—a capability that wasn't explicitly trained and that researchers didn't fully anticipate.

Chain-of-thought reasoning emerged similarly. Tell a large enough model to "think step by step," and it produces intermediate reasoning that improves final answers. Smaller models just produce nonsense when prompted the same way.

These aren't linear improvements—they're phase transitions. The capability is essentially zero, then suddenly it's there. We still don't fully understand why.

Interactive: Emergent Capability Timeline

Explore how capabilities emerged as models scaled from GPT-1 to GPT-4 and beyond.

(Interactive widget: a clickable timeline from GPT-1 to GPT-4. Selecting GPT-3 (2020, 175B parameters) highlights its newly emerged capabilities: few-shot learning, basic arithmetic, simple code generation, and translation without task-specific training.)

Phase Transitions: Emergent capabilities don't appear gradually; they emerge suddenly at certain scales. Few-shot learning was nearly absent in GPT-2 but transformative in GPT-3, despite only a ~100× parameter increase.

Part II: From Transformer to Modern LLMs

The architectural refinements and engineering innovations that enable today's systems.

Chapter 4

The Decoder-Only Revolution

The original Transformer was an encoder-decoder architecture designed for translation. The encoder processed the source sentence with bidirectional attention (each position attends to all others). The decoder generated the target sentence autoregressively, with causal attention (each position only attends to previous positions) plus cross-attention to the encoder outputs.

Modern LLMs almost universally use decoder-only architectures. Why did encoder-decoder lose?

The Case for Decoder-Only

Unified Training: With decoder-only, everything is next-token prediction. You don't need paired data (source-target). Any text works.

Task Flexibility: Encoder-decoder assumes a clear input/output split. Decoder-only treats everything as a sequence to continue. Want translation? Prompt with "Translate to French: [text]". Want summarization? "Summarize: [text]". The task is implicit in the prompt.

Compute Efficiency: No separate encoder phase. During inference, you're running one forward pass per generated token, not encoder + decoder.

Interactive: Architecture Comparison

Compare how encoder-decoder and decoder-only architectures process the same task.

(Interactive widget: side-by-side views of encoder-only, decoder-only, and encoder-decoder architectures, with example models, attention-mask visualizations, and data flow. Decoder-only examples include GPT-4, Claude, LLaMA, and Mistral, best suited to generation, chat, code, and reasoning, with causal masking so each token only sees previous tokens.)

Quick comparison:

  • Context: encoder-only is fully bidirectional; decoder-only is left-to-right only; encoder-decoder gets both
  • Generation: limited for encoder-only; native for decoder-only and encoder-decoder
  • Understanding: excellent for encoder-only and encoder-decoder; good for decoder-only
  • 2024 trend: encoder-only survives for embeddings/RAG; decoder-only dominates; encoder-decoder fills a seq2seq niche

Why Decoder-Only Dominates: Despite encoder-only models excelling at understanding, decoder-only models can do everything (including understanding) by framing tasks as generation. Simpler architecture plus a unified training objective means easier scaling.

The GPT Progression

GPT-1 (2018, 117M parameters) demonstrated that pre-training on next-token prediction followed by fine-tuning could achieve strong results. The insight: unsupervised pre-training provides a good initialization.

GPT-2 (2019, 1.5B parameters) showed zero-shot capabilities. Without any task-specific training, it could perform basic question answering, summarization, and translation. OpenAI initially declined to release the full model due to concerns about misuse.

GPT-3 (2020, 175B parameters) was the breakthrough that launched the current era. Few-shot learning worked: give the model a few examples in the prompt, and it could perform tasks it was never explicitly trained for. This was genuinely surprising.

GPT-4 (2023) introduced multimodality (images) and likely uses a Mixture of Experts architecture (though OpenAI hasn't confirmed details). Outside estimates put it at roughly 1.7T total parameters, but this has never been confirmed.

Chapter 5

Tokenization — The Often-Overlooked Foundation

Tokenization might be the most underappreciated component of LLMs. It determines how text is chunked into the discrete units the model processes, and it has profound implications for model behavior.

The challenge: characters are too granular (long sequences, sparse learning signal), words are too sparse (huge vocabularies, out-of-vocabulary problems). Subword tokenization finds the middle ground.

Byte Pair Encoding (BPE)

BPE is elegantly simple. Start with a vocabulary of individual characters (or bytes). Iteratively find the most frequent adjacent pair of tokens and merge them into a new token. Repeat until you reach your target vocabulary size.

The result: common words become single tokens ("the" → "the"), rare words get split into subwords ("unconstitutional" → "un" + "constitu" + "tional"), and you can always fall back to individual characters for anything unknown.
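
A toy BPE trainer fits in a screenful of Python (this sketch works on whole words and ignores byte-level details, pre-tokenization, and special tokens):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    # represent each word as a tuple of symbols, counted by frequency
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():          # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

print(bpe_train(["the", "the", "then", "that", "cat"], num_merges=3))
# [('t', 'h'), ('th', 'e'), ('a', 't')] -- "the" becomes a single token after two merges
```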

Interactive: Tokenization Playground

Type any text to see how it gets tokenized. Compare different tokenizers and see the "token tax" for different languages.

(Interactive widget: type any text to see its token segmentation and compare token counts, tokens per word, and efficiency across GPT-2, GPT-4, LLaMA, and Claude tokenizers.)

The Token Tax: Different languages and content types require different numbers of tokens. Code and non-English text often require 2-3× more tokens than English prose, making them more expensive to process.

Interactive: BPE Merge Visualization

Watch BPE build a vocabulary by iteratively merging the most frequent pairs.

(Interactive widget: a toy corpus is re-tokenized step by step as the most frequent adjacent pairs are merged and the vocabulary grows.)

How BPE Works: Starting with individual characters, BPE iteratively merges the most frequent adjacent pair into a new token. Common words become single tokens ("the"), while rare words are split into subwords. The merge order determines the final tokenization.

The Real-World Impact

Tokenization quirks have real consequences:

  • Token fertility: English averages ~1.3 tokens per word. Some languages require 2-3× more tokens for the same content, making them more expensive to process.
  • Arithmetic failures: Numbers often get tokenized inconsistently. "1234" might be one token but "12345" might be "123" + "45". This makes arithmetic unreliable.
  • Code handling: Whitespace-sensitive languages (Python) need careful tokenization to preserve indentation semantics.

Chapter 6

Modern Positional Encoding — RoPE, ALiBi, and Beyond

Sinusoidal positional encodings worked, but they had limitations. The biggest: extrapolation. Train on sequences up to 2048 tokens, and performance degrades on longer sequences. The model hasn't seen those position values before.

RoPE: Rotary Position Embeddings

RoPE, used by LLaMA and many others, encodes position through rotation. Instead of adding positional information to embeddings, RoPE applies rotation matrices to query and key vectors.

The key insight: when you take the dot product of two rotated vectors, the result depends only on their angle difference—which is proportional to the position difference. Position becomes relative, encoded in the geometry of the operation itself.
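
Here's a small NumPy sketch of that idea: rotate each 2D pair of dimensions by a position-dependent angle, then check that the Q·K dot product depends only on the offset. The pairing convention and the base of 10000 follow the common RoPE formulation, but this is illustrative, not any specific library's implementation:

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Apply a rotary embedding to vector x at the given position."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)    # one frequency per dimension pair
    theta = position * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]                         # pair up adjacent dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                   # 2D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# The same relative offset (5) at different absolute positions gives the same score:
print(np.dot(rope(q, 3), rope(k, 8)))
print(np.dot(rope(q, 103), rope(k, 108)))   # matches, up to floating-point error
```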

Interactive: RoPE Rotation Visualizer

Visualize how RoPE encodes position through rotation. The dot product between Q and K depends only on their relative position.

(Interactive widget: a 2D dimension pair before and after RoPE rotation, the corresponding rotation matrix, rotation angles for every dimension pair at a chosen position, and a position-by-dimension angle matrix. Lower dimension pairs rotate faster; higher pairs rotate slower.)

Why RoPE Works: By encoding position as rotation, the relative position between tokens becomes a rotation difference. When computing Q·K^T, these rotations create position-aware attention that depends only on relative offsets and extrapolates better to longer sequences than absolute embeddings.

ALiBi: Attention with Linear Biases

ALiBi takes a radically different approach: no learned positional encoding at all. Instead, it adds a linear penalty to attention scores based on distance: positions farther apart get lower attention scores.

Different heads use different slopes, allowing some to focus locally and others to attend more globally. ALiBi extrapolates perfectly to any length (with the assumption that recency matters), though it can't learn arbitrary position-dependent patterns.

Chapter 7

Attention Variants for Efficiency

The KV-cache is both a blessing and a curse. During autoregressive generation, we cache the key and value tensors from previous tokens to avoid recomputation. Essential for reasonable inference speed. But memory consumption scales with: layers × heads × head_dim × sequence_length × batch_size.

For a 70B model serving a 128K context, the KV-cache alone can consume 80+ GB—often more than the model weights themselves.

Multi-Query and Grouped-Query Attention

Multi-Query Attention (MQA) shares a single key and value head across all query heads. KV-cache shrinks by a factor of the number of heads. Quality degrades slightly (5-10%).

Grouped-Query Attention (GQA) is the compromise. Groups of query heads share KV heads. LLaMA 2 70B uses 64 query heads with 8 KV groups—an 8× reduction in KV-cache with quality within 1% of full attention after uptraining.
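
The arithmetic is worth doing by hand at least once. A back-of-envelope sketch, using the LLaMA-2-70B-like shape described above (80 layers, head_dim 128, FP16 cache) at a 128K context:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x accounts for storing both keys and values
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=128_000, batch=1)
gqa = kv_cache_bytes(layers=80, kv_heads=8,  head_dim=128, seq_len=128_000, batch=1)
print(f"full MHA cache: {mha / 1e9:.0f} GB, GQA (8 KV groups): {gqa / 1e9:.0f} GB")
```

Roughly 335 GB with full multi-head KV versus about 42 GB with 8 KV groups per request, which is why GQA has become the default at this scale.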

Interactive: KV-Cache Memory Calculator

Calculate KV-cache memory requirements for different model configurations and attention variants.

(Interactive widget: sliders for sequence length (1K-128K) and batch size (1-64) produce the KV-cache size, estimated model weights, total memory, and whether the configuration fits on a given GPU.)

Formula: KV-cache bytes = 2 × layers × kv_heads × head_dim × seq_len × batch × bytes_per_element

Sliding Window Attention

Sliding window attention restricts each token to only attend within a fixed window. O(n × w) complexity instead of O(n²), and the KV-cache has a fixed maximum size regardless of sequence length.

The trick: information can still propagate across the full sequence through layer stacking. With window w and L layers, the effective receptive field is L × w. Mistral popularized this approach with a rolling KV-cache buffer; a related trick, "attention sinks," keeps the first few tokens always attendable to anchor the representation.

Interactive: Sliding Window Receptive Field

See how information propagates through layers with sliding window attention. The effective receptive field grows with depth.

(Interactive widget: with a window of 4 tokens, the receptive field of a query token grows layer by layer until it covers the whole 12-token sequence, shown alongside the full causal mask (O(n²) memory) and the sliding-window mask (O(n × w) memory).)

Sliding Window Magic: With window size W and L layers, a token can attend to W×L positions through information propagation. Mistral uses W=4096, so with 32 layers the effective context is about 131K tokens while keeping O(n×W) memory, enabling long sequences without the quadratic cost.

Chapter 8

Mixture of Experts — Scaling Parameters Without Scaling Compute

Here's the scaling dilemma: more parameters generally mean better performance, but more parameters also mean more compute per token. Mixture of Experts (MoE) breaks this link.

Replace the dense FFN with N expert FFNs. A router network selects the top-k experts for each token. Most parameters are inactive for any given token—you get the capacity of a large model with the compute cost of a smaller one.

Mixtral: A Concrete Example

Mixtral 8x7B has 8 experts with top-2 routing. Total parameters: 46.7B. Active parameters per token: ~12.9B. It matches or exceeds LLaMA 2 70B on most benchmarks while using similar compute to a 13B dense model.

Interactive: MoE Routing Visualization

Watch how tokens get routed to different experts. See load balancing in action.

(Interactive widget: eight tokens flow through a router into eight experts with top-2 selection; a table lists each token's two chosen experts and their routing weights, along with load-balance and sparsity readouts.)

MoE Efficiency: With 8 experts but top-2 routing, only about a quarter of the expert parameters activate per token, which is how Mixtral 8x7B can have ~47B total parameters but only ~13B active per token. Load-balancing losses prevent expert collapse.

The Load Balancing Challenge

Without careful design, the router might learn to always use the same few experts—"expert collapse." An auxiliary load-balancing loss encourages an even distribution:

L_aux = α × N × Σ(fraction_i × router_prob_i)

The infrastructure implications are significant. All expert parameters must fit in memory (or be distributed with expert parallelism), but compute scales only with active parameters. This creates an unusual memory-compute trade-off.
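
A routing sketch with the auxiliary loss (NumPy; the router weights are random and the coefficient alpha is a made-up value, so this only illustrates the mechanics):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def route(tokens, W_router, top_k=2, alpha=0.01):
    n_experts = W_router.shape[1]
    probs = softmax(tokens @ W_router)                    # (n_tokens, n_experts)
    top = np.argsort(-probs, axis=-1)[:, :top_k]          # chosen experts per token
    # auxiliary loss: fraction of tokens sent to each expert x mean router probability
    fraction = np.bincount(top.ravel(), minlength=n_experts) / top.size
    mean_prob = probs.mean(axis=0)
    aux_loss = alpha * n_experts * np.sum(fraction * mean_prob)
    return top, aux_loss

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))                 # 8 tokens, 16-dim hidden states
top, aux = route(tokens, rng.normal(size=(16, 8)))
print(top)                                        # two experts (out of 8) per token
print(f"aux loss: {aux:.4f}")
```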

Chapter 9

Flash Attention — IO-Aware Algorithm Design

Flash Attention is perhaps my favorite example of infrastructure-aware algorithm design. The insight: on modern GPUs, compute is cheap but memory bandwidth is expensive.

An H100 offers on the order of 1,000 TFLOPS of dense FP16 tensor compute but only 3.35 TB/s of HBM bandwidth. Standard attention computes the n×n attention matrix, writes it to HBM, reads it back for softmax, writes the result, and reads it again for the final matmul. All that memory movement dominates runtime.

The Tiling Solution

Flash Attention never materializes the full n×n matrix. It processes Q, K, V in tiles that fit in SRAM (the GPU's fast on-chip memory). The key enabler is online softmax—computing softmax incrementally as new blocks arrive, tracking running statistics rather than needing all values upfront.
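
The online-softmax trick is the heart of it, and it's small enough to verify directly (a NumPy sketch for a single query row; real Flash Attention does this tiled over Q, K, and V entirely on-chip):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, block_size=4):
    m = -np.inf                          # running max of scores seen so far
    l = 0.0                              # running sum of exp(score - m)
    acc = np.zeros(values.shape[-1])     # running unnormalized output
    for start in range(0, len(scores), block_size):
        s = scores[start:start + block_size]
        v = values[start:start + block_size]
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)        # rescale the statistics from earlier blocks
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
scores, values = rng.normal(size=16), rng.normal(size=(16, 8))
blockwise = online_softmax_weighted_sum(scores, values)
weights = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()
print(np.allclose(blockwise, weights @ values))   # True: same answer, never all scores at once
```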

Interactive: Memory Access Patterns

Compare standard vs Flash Attention memory access patterns. See how tiling dramatically reduces HBM traffic.

(Interactive widget: step-by-step memory traffic for standard attention, which shuttles Q, K, V, the full n×n score matrix, the softmax result, and the output between HBM (~2 TB/s) and on-chip SRAM (~19 TB/s), versus Flash Attention's tiled computation that keeps the working set in SRAM.)

IO-Aware Algorithm: Flash Attention never materializes the full n×n attention matrix in HBM. By computing softmax block by block in fast SRAM with the online-softmax trick, it reduces memory IO from O(n²) to O(n²/M), where M is the SRAM size, trading a few extra FLOPs for much faster wall-clock time.

The results are dramatic:

  • Memory: O(n) instead of O(n²)
  • Speed: 2-9× faster depending on sequence length
  • Efficiency: Flash Attention 2 achieves 50-73% of theoretical FLOPS

This is the kind of algorithm you get when systems engineers and ML researchers collaborate closely. The math is the same—the implementation just respects the hardware reality.

Chapter 10

Training at Scale

Training a 70B+ parameter model requires careful orchestration of parallelism strategies. No single GPU can hold the model, and naive approaches waste compute or run out of memory.

The Parallelism Strategies

Data Parallelism: Each GPU holds a full model copy and processes different batches. Gradients are synchronized via all-reduce. Scales compute but not memory.

Tensor Parallelism: Split individual layers horizontally across GPUs. The attention and FFN weight matrices are sharded. Requires communication every layer—needs fast NVLink (400-900 GB/s).

Pipeline Parallelism: Split layers vertically—different GPUs own different layers. Communication only at stage boundaries. InfiniBand (200-400 Gb/s) is sufficient. Uses micro-batching to hide pipeline bubbles.

Interactive: Parallelism Strategies

Visualize how different parallelism strategies distribute model and data across GPUs.

(Interactive widget: select data, tensor, pipeline, or FSDP parallelism and see how layers and batches are laid out across four GPUs, with per-GPU memory, communication pattern, and typical scale. For data parallelism: simple to implement and scales linearly, but every GPU must hold the full model and gradient synchronization adds overhead.)

Common combinations (3D parallelism): LLaMA 70B typically runs TP=8 within a node with data parallelism for batch scaling; GPT-3-scale models use TP=8 and PP=8 across 64+ GPUs; the largest training runs combine FSDP with tensor parallelism across thousands of GPUs.

Combining Strategies: Real-world training uses multiple parallelism types together: TP within a node (fast NVLink), PP across nodes, and FSDP for memory efficiency. The goal is to maximize GPU utilization while fitting the model.

ZeRO: Memory Efficiency for Data Parallelism

ZeRO (Zero Redundancy Optimizer) eliminates memory redundancy in data parallelism. Standard DDP stores optimizer states, gradients, and parameters on every GPU. ZeRO shards them:

  • Stage 1: Shard optimizer states (4× memory reduction for Adam)
  • Stage 2: Also shard gradients
  • Stage 3: Also shard parameters (requires gather before forward/backward)

Scaling Laws: The Chinchilla Revolution

In 2022, DeepMind's Chinchilla paper upended conventional wisdom. GPT-3 (175B params, 300B tokens) was undertrained. The compute-optimal ratio is roughly 20 tokens per parameter.

Chinchilla (70B params, 1.4T tokens) matched or beat GPT-3's performance with 2.5× fewer parameters: smaller, faster to run, and trained with a similar compute budget allocated differently.
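
The rule of thumb is easy to turn into a calculator. This sketch uses the common C ≈ 6·N·D approximation for training FLOPs (an assumption on my part, not stated above) together with the ~20 tokens-per-parameter ratio:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e21)
print(f"~{n / 1e9:.1f}B parameters trained on ~{d / 1e9:.0f}B tokens")  # ~2.9B / ~58B
```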

Interactive: Scaling Law Explorer

Explore the compute-optimal frontier. See where different models fall relative to the Chinchilla-optimal line.

(Interactive widget: slide the compute budget from GPT-2 scale to LLaMA 3 scale and read off the compute-optimal parameter and token counts; at 10²¹ FLOPs the optimum is roughly 2.9B parameters trained on ~58B tokens, about 20 tokens per parameter.)

The Chinchilla Revolution: GPT-3 (175B params, 300B tokens) was massively undertrained. Chinchilla (70B params, 1.4T tokens) achieved similar performance with 60% fewer parameters by training longer. Modern models like LLaMA 3 push even further, training on 200+ tokens per parameter for better inference efficiency.

Chapter 11

Inference Optimization

Training costs are one-time. Inference costs are forever. For widely-used models, inference optimization matters enormously for both cost and user experience.

The KV-Cache Lifecycle

During autoregressive generation, each new token requires attending to all previous tokens. Without caching, you'd recompute K and V for the entire context on every step—O(n²) total compute.

With KV-caching, you compute K and V once per token and store them. Each generation step then only needs O(n) compute. The trade-off: memory consumption grows linearly with context length.
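
A toy counter makes the asymptotics concrete (this ignores attention FLOPs and the one-time prefill, and counts only K/V projections during decoding):

```python
def kv_projections(prompt_len, new_tokens, use_cache):
    """How many K/V projections are computed to generate new_tokens after a prompt."""
    total, ctx = 0, prompt_len
    for _ in range(new_tokens):
        total += 1 if use_cache else ctx + 1   # cached: only the new token; otherwise redo everything
        ctx += 1
    return total

print(kv_projections(1000, 200, use_cache=False))  # 220,100 projections: quadratic-ish blowup
print(kv_projections(1000, 200, use_cache=True))   # 200 projections: one per new token
```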

Interactive: Token Generation Animation

Watch the autoregressive generation process step by step. See the KV-cache grow as tokens are generated.

(Interactive widget: a four-token prompt is extended one token at a time while the KV-cache panel fills in, with counters for prompt tokens, generated tokens, and cache growth.)

Why KV-Cache Matters: Without caching, each new token would require recomputing K and V for all previous tokens, O(n²) total compute. With caching, we only compute K/V for the new token and reuse the stored values, O(n) total compute, at the cost of memory that grows linearly with context.

PagedAttention: Virtual Memory for KV-Cache

PagedAttention (introduced in vLLM) applies virtual memory concepts to KV-cache management. Instead of pre-allocating the maximum possible sequence length, allocate memory in small blocks on demand.

A page table maps logical KV positions to physical memory blocks. Benefits:

  • Near-zero memory fragmentation
  • Memory allocated only as needed
  • Prefix sharing: requests with common prefixes share KV blocks

Interactive: PagedAttention Visualization

See how PagedAttention manages memory dynamically, allocating and recycling blocks as sequences progress.

(Interactive widget: a pool of 24 physical blocks of 16 tokens each; adding requests allocates blocks on demand and completing requests frees them, with live utilization, used/free block, and active-request counters.)

PagedAttention Benefits: Traditional KV-cache management pre-allocates the maximum sequence length per request, wasting memory. PagedAttention allocates small blocks on demand and recycles them when requests complete, achieving near-zero fragmentation and enabling prefix caching for shared prompts.

Speculative Decoding

Speculative decoding uses a small, fast "draft" model to propose multiple tokens, then the large "target" model verifies them in parallel. If the draft matches, you've generated multiple tokens in essentially one target forward pass.

Typical speedups: 2-3× for latency. It works best when the draft model is a good approximation of the target—you can use a quantized version of the target, a distilled variant, or a specialized small model.
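
The control flow is simple enough to sketch with stand-in "models" (the greedy accept-if-equal rule below is a simplification; production systems use a rejection-sampling rule that preserves the target distribution exactly):

```python
def speculative_step(context, draft_model, target_model, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposal = []
    for _ in range(k):
        proposal.append(draft_model(context + proposal))
    # 2. Target model scores every drafted position; a real system does this
    #    in one batched forward pass rather than a loop.
    target_preds = [target_model(context + proposal[:i]) for i in range(k)]
    # 3. Accept the longest matching prefix, then take the target's token at
    #    the first disagreement.
    accepted = []
    for drafted, verified in zip(proposal, target_preds):
        if drafted == verified:
            accepted.append(drafted)
        else:
            accepted.append(verified)
            break
    return accepted

# Tiny stand-in "models" that just pick the next word of a fixed sentence.
SENTENCE = "the capital of france is paris".split()
draft  = lambda ctx: SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"
target = lambda ctx: SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"
print(speculative_step([], draft, target))   # all 4 drafted tokens accepted in one "pass"
```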

Interactive: Speculative Decoding Demo

Watch how draft and target models work together to accelerate generation.

(Interactive widget: a 7B draft model proposes tokens continuing "The capital of France is" and a 70B target model verifies them, with counters for accepted and rejected tokens, acceptance rate, and estimated speedup.)

How It Works: A small draft model quickly generates K candidate tokens. The large target model verifies them all in a single forward pass. We accept the longest matching prefix and resample from the first disagreement, achieving 2-3× speedup while preserving the target model's output distribution.

Chapter 12

Long Context — The Frontier

Standard attention is O(n²) in both memory and compute. At 128K tokens, that's 16 billion attention elements per layer. The quadratic wall is the fundamental bottleneck for long-context processing.

Interactive: Quadratic Complexity Visualizer

See how attention complexity scales with sequence length. The quadratic growth quickly becomes prohibitive.

(Interactive widget: slide the sequence length from 4K to over 1M tokens and compare O(n²) standard attention against O(n) linear attention, including the FP16 attention-matrix memory.)

Scaling comparison (operations and FP16 attention-matrix memory):

  • 1.0K tokens: 1.0M quadratic ops vs 1.0K linear ops; 2 MB
  • 4.1K tokens: 16.8M vs 4.1K; 32 MB
  • 16.4K tokens: 268.4M vs 16.4K; 512 MB
  • 65.5K tokens: 4.3B vs 65.5K; 8 GB
  • 262.1K tokens: 68.7B vs 262.1K; 128 GB
  • 1.0M tokens: 1.1T vs 1.0M; 2048 GB

The Quadratic Wall: At 128K tokens, standard attention requires roughly 16 billion operations and 32 GB just for the attention matrix. This is why Flash Attention, sparse patterns, and state space models are essential for long-context processing.

Sparse Attention Patterns

Sparse attention reduces complexity by only computing a subset of attention pairs:

  • Longformer: Local sliding window + global tokens
  • BigBird: Random + window + global patterns
  • Theoretical result: sparse patterns can be universal approximators

Interactive: Sparse Attention Patterns

Compare different sparse attention patterns and their coverage of the full attention matrix.

(Interactive widget: select full, sliding-window, dilated/strided, Longformer, or BigBird attention and see the resulting mask on a 12-token example, with sparsity, active connections, and memory savings.)

Pattern comparison:

  • Full attention: O(n²); limited for long context; the quality baseline
  • Sliding window: O(n × w); local context only; fast but may miss long-range dependencies
  • Dilated/strided: O(n × w); expanded coverage from a fixed pattern
  • Longformer: O(n × w + g × n); global tokens plus local windows
  • BigBird: O(n × (w + g + r)); global, local, and random connections for broader coverage

Sparse Attention Trade-offs: Full attention captures all relationships but is O(n²). Sparse patterns reduce complexity to roughly linear but must carefully choose which connections to keep. Longformer and BigBird combine local windows with global tokens for strong coverage at linear cost.

State Space Models and Alternatives

State Space Models like Mamba offer a different paradigm: O(n) training and inference with O(1) memory during inference. They use structured recurrence that can be computed efficiently via convolution during training.

Hybrid approaches like Jamba combine SSM layers with Transformer layers, getting benefits of both architectures.

RAG: Sidestepping the Problem

Retrieval-Augmented Generation avoids the long-context problem entirely. Instead of fitting everything into context, retrieve relevant chunks from an external knowledge base and include only those.

Trade-offs: latency (retrieval adds time), retrieval quality (wrong chunks = wrong answers), and complexity (another system to maintain).

Chapter 13

Practical Infrastructure Decisions

Let's translate all this technical understanding into practical deployment decisions. After years of deploying ML models, I've learned that the gap between "works in notebook" and "works in production" is vast.

Memory Planning

Start with back-of-envelope calculations (a worked sketch follows this list):

  • Model weights: params × bytes_per_param (2 for FP16, 1 for INT8)
  • KV-cache: 2 × layers × kv_heads × head_dim × seq_len × batch × bytes
  • Activations: varies, but budget 10-20% overhead
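
Putting those three lines together (the configuration below is a hypothetical LLaMA-13B-like deployment at FP16, 4K context, batch 8, with a flat 15% overhead assumption):

```python
def deployment_memory_gb(params_b, bytes_per_param, layers, kv_heads, head_dim,
                         seq_len, batch, kv_bytes=2, overhead=0.15):
    weights = params_b * 1e9 * bytes_per_param
    kv = 2 * layers * kv_heads * head_dim * seq_len * batch * kv_bytes
    total = (weights + kv) * (1 + overhead)          # activations etc. as a flat overhead
    return weights / 1e9, kv / 1e9, total / 1e9

w, kv, total = deployment_memory_gb(params_b=13, bytes_per_param=2, layers=40,
                                    kv_heads=40, head_dim=128, seq_len=4096, batch=8)
print(f"weights {w:.0f} GB + kv-cache {kv:.0f} GB -> ~{total:.0f} GB total")
```

That comes out to roughly 61 GB, which still fits on a single 80 GB GPU but leaves limited headroom; doubling the batch or the context length would push it over.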

Parallelism Strategy Selection

Rules of thumb:

  • < 15B: Single GPU or simple data parallelism
  • 15-70B: Tensor parallelism within a node (NVLink required)
  • 70-200B: Add pipeline parallelism across nodes
  • > 200B: Full 3D parallelism (DP + TP + PP)

Interactive: Deployment Calculator

Plan your deployment: enter model specs and requirements, get hardware recommendations.

(Interactive widget: enter model size, precision, context length, and concurrency to get model-weight and KV-cache memory, the GPUs required, a parallelism recommendation, and rough prefill latency, inter-token latency, and throughput estimates, plus a checklist: fits on a single GPU, NVLink available for TP, meets the throughput target, and keeps more than 10% memory headroom.)

Key Metrics to Track

Time to First Token (TTFT): Interactive latency. Users notice delays beyond 500ms. Critical for chat applications.

Inter-Token Latency (ITL): Streaming smoothness. Should stay under 50ms for natural-feeling output.

Throughput: Tokens per second across all requests. The key metric for cost optimization. Continuous batching and speculative decoding can dramatically improve this.

Part III: Appendix & Reference

Looking forward and quick reference materials.

Chapter 14

What's Emerging

The field is moving fast. Here's what I'm watching:

Efficiency Focus

Inference costs dominate the economics now. Training is one-time; serving is forever. Expect more aggressive quantization (FP4, even binary for some operations), better speculative decoding, and distillation to smaller, faster models.

Long Context via Architecture

SSM hybrids are becoming practical. The dream: O(n) attention with O(n²) quality. Linear attention variants are improving. The quadratic wall may eventually fall.

Reasoning and Search

Chain-of-thought as explicit search. Tree of Thought explores multiple reasoning paths. Best-of-N sampling trades compute for quality. Inference-time compute scaling—using more compute at inference for harder problems—is an active research area.

The Agent Paradigm

LLMs as reasoning engines embedded in larger systems. Tool use (code execution, search, databases). Multi-step planning. The model becomes one component in an architecture, not the whole system.

Final Thoughts

I started programming on a Commodore 64, typing in BASIC listings from magazines. The idea that I'd one day be explaining how artificial systems can generate coherent text, write code, and reason about problems would have seemed like science fiction.

Yet here we are. The Transformer architecture—matrix multiplications, softmax, layer norm, all pieces that would be familiar to any linear algebra student—combined at scale produces capabilities that still surprise me.

What excites me most is that we're still in early days. The "Attention Is All You Need" paper is less than a decade old. Flash Attention is from 2022. The techniques I've described here will likely seem primitive in another decade.

The fundamentals, though—understanding the memory hierarchy, the parallelism trade-offs, the mathematical foundations—those will remain useful. Learn the principles, not just the current instantiations.

And keep building. That's still the best way to really understand.

Quick Reference

Key Formulas

Attention:
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
KV-Cache Memory:
2 × layers × kv_heads × head_dim × seq_len × batch × bytes
Model Weights:
params × bytes_per_param (FP16=2, INT8=1, INT4=0.5)

Common Model Architectures

Model          Params   Layers   d_model   Heads
LLaMA 7B       6.7B     32       4096      32
LLaMA 13B      13B      40       5120      40
LLaMA 70B      70B      80       8192      64 (8 KV)
Mixtral 8x7B   46.7B    32       4096      32 (8 KV)