Deep Dive

How Transformer LLMs Actually Work

A comprehensive, interactive journey from attention mechanisms to production deployment. Built by someone who's been writing code since the Commodore 64 era.

14 chapters · ~45 min read · 20+ interactive visualizations

Part I: Setting the Stage

Understanding why Transformers were invented by understanding what they replaced.

Chapter 1

The Problem That Needed Solving

I'd been implementing neural networks for years before the Transformer came along. I remember the excitement when LSTMs first clicked for me—finally, a way to model sequences that didn't immediately forget everything! But I also remember the frustration: training was slow, sequences longer than a few hundred tokens were problematic, and there was always this nagging sense that we were fighting the architecture rather than working with it.

Then in June 2017, a paper with possibly the most provocative title in machine learning history dropped: "Attention Is All You Need." The claim was audacious—throw out recurrence entirely and rely solely on attention mechanisms. I was skeptical. How could you model sequences without... sequence processing?

To understand why Transformers were such a breakthrough, we need to understand what they replaced and why that replacement was necessary.

The Sequential Processing Dilemma

Language has order. "Dog bites man" means something very different from "Man bites dog." Any architecture that processes language needs to respect this ordering. The natural approach, which dominated for years, was to process sequences one element at a time.

Recurrent Neural Networks (RNNs) embodied this intuition perfectly. At each timestep t, the network computes a hidden state h_t as a function of the previous hidden state h_{t-1} and the current input x_t:

h_t = f(h_{t-1}, x_t)

This is elegant. The hidden state acts as a "memory" that carries context forward through the sequence. Process "The cat sat on the" token by token, and by the time you reach "the," your hidden state hopefully encodes something useful about cats and sitting.

But there's a fundamental problem: sequential computation can't be parallelized. You can't compute h_{10} until you've computed h_9, which requires h_8, and so on. Your expensive GPU with thousands of cores? Most of them sit idle, waiting for the sequential chain to complete.
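
Here's a minimal NumPy sketch of that recurrence (the cell is a bare tanh RNN with random weights, purely to show the sequential dependency, not a real implementation):

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, h0):
    """h_t = tanh(W_h @ h_{t-1} + W_x @ x_t), one step at a time."""
    h, states = h0, []
    for x_t in x_seq:                     # each step needs the previous h,
        h = np.tanh(W_h @ h + W_x @ x_t)  # so this loop cannot be parallelized
        states.append(h)
    return states

d_h, d_x, T = 8, 4, 20
rng = np.random.default_rng(0)
states = rnn_forward(rng.normal(size=(T, d_x)),
                     0.1 * rng.normal(size=(d_h, d_h)),
                     0.1 * rng.normal(size=(d_h, d_x)),
                     np.zeros(d_h))
print(len(states))  # 20 hidden states, computed strictly in order
```

The `for` loop is the whole problem: timestep 10 literally cannot start until timestep 9 has finished.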

The Vanishing Gradient Problem

Even worse than the parallelization issue is the vanishing gradient problem. When we train neural networks with backpropagation, gradients flow backward through the network. For RNNs, this means gradients must flow backward through time—through every single timestep.

Here's the mathematical reality: when you backpropagate through a recurrent connection, you multiply by the Jacobian matrix of the hidden state transition. If the eigenvalues of this matrix are less than 1, gradients shrink exponentially. If greater than 1, they explode exponentially.

In practice, this meant RNNs could only effectively learn dependencies spanning maybe 10-20 timesteps. Need to connect a pronoun to its antecedent 50 words back? Good luck.
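
You can see the effect with nothing more than a multiplication loop (a toy stand-in for backprop-through-time, where the per-step factor plays the role of the recurrent Jacobian's scale):

```python
factor = 0.9   # stand-in for the "size" of the recurrent Jacobian at each step
grad = 1.0
for t in range(20):
    grad *= factor
print(f"gradient after 20 steps: {grad:.4f}")  # ~0.12; try factor=1.1 to watch it explode
```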

Interactive: Vanishing Gradient Visualizer

Adjust the multiplicative factor to see how gradients behave as they propagate through time. Factor < 1 causes vanishing, > 1 causes explosion.

(Interactive widget: gradient magnitude at each of 20 timesteps for an adjustable multiplicative factor, following gradient(t) = gradient(0) × factor^t; at a factor of 0.9, only about 12% of the signal survives 20 steps.)

Why This Matters: When gradients vanish, early layers receive near-zero learning signals. The network can't learn long-range dependencies because information from distant timesteps can't influence weight updates. LSTMs mitigate this with additive cell-state updates.

LSTMs and GRUs: A Partial Solution

Long Short-Term Memory networks were designed specifically to address vanishing gradients. The key innovation was the "cell state"—a highway that allows information to flow through time with minimal transformation. Instead of a single hidden state being multiplied at each step, LSTMs use gating mechanisms:

  • Forget gate: decides what to discard from the cell state
  • Input gate: decides what new information to store
  • Output gate: decides what to output based on the cell state

The cell state updates through addition rather than multiplication, which helps gradients flow more easily. LSTMs could handle dependencies of maybe 100-200 tokens—much better than vanilla RNNs, but still fundamentally limited.

And they were still sequential. Still couldn't parallelize. Still left most of your GPU idle.

The Attention Insight

The breakthrough came from an unexpected direction: machine translation. In 2014, Bahdanau and colleagues introduced attention for sequence-to-sequence models. The idea was simple but profound: instead of forcing all information through a fixed-size hidden state bottleneck, let the decoder "look back" at all encoder states and focus on the relevant ones.

This was attention as an add-on to RNNs. The Transformer paper asked: what if attention was the only mechanism? What if we threw out recurrence entirely?

Interactive: RNN vs Attention Path Length

Compare how information flows in RNNs (sequential, O(n) path length) versus attention (direct connections, O(1) path length). Adjust sequence length to see the difference.

(Interactive widget: information propagating across an 8-token sequence, step by step through hidden states for the RNN versus a single direct hop for attention, with path length 7 vs 1 and the corresponding 7× speedup readout.)

Why Path Length Matters: In an RNN, information from position 0 must pass through 7 hidden states to reach position 7, and each step is a chance for information loss and gradient decay. Attention provides direct O(1) connections between any two positions, enabling learning of long-range dependencies.

The visualization above captures the key insight. In an RNN, if position 1 needs to influence position 20, the signal must pass through 19 hidden states—19 chances for information to be corrupted or lost. With attention, position 1 can directly attend to position 20. The path length is constant, regardless of sequence length.

This isn't just theoretically elegant—it's practically transformative. Shorter paths mean better gradient flow. Direct connections mean the model can learn arbitrary dependencies without fighting the architecture.

And crucially: all attention computations for all positions can happen in parallel. Your GPU is finally earning its keep.

Chapter 2

"Attention Is All You Need" — The 2017 Revolution

Let's build up the Transformer architecture piece by piece. I find that the individual components make intuitive sense, but it took me a while to see how they fit together into something greater than the sum of its parts.

Self-Attention: The Core Innovation

Self-attention is the mechanism that lets each position in a sequence attend to all other positions. Consider the sentence: "The animal didn't cross the street because it was too tired." What does "it" refer to? The animal or the street?

For humans, this is easy—"tired" makes it clear we're talking about the animal. For a model, this requires connecting "it" to "animal" across multiple intervening words. Self-attention provides exactly this capability: when processing "it," the model can directly attend to "animal" and learn to recognize this pattern.

Query, Key, Value: The Information Retrieval Analogy

The Query-Key-Value framework is perhaps the cleverest abstraction in the Transformer. Think of it like a database lookup:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What do I return if you match me?"

Each position generates all three vectors by projecting its embedding through learned weight matrices W_Q, W_K, and W_V. Then:

  1. The query from position i is compared against keys from all positions
  2. This comparison produces attention scores (how relevant is each position?)
  3. Scores are normalized via softmax to get attention weights
  4. Values are combined using these weights to produce the output

Mathematically, for an input X with n tokens and d dimensions:

Q = X · W_Q, K = X · W_K, V = X · W_V
Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

That √d_k scaling factor isn't arbitrary. Without it, when d_k is large, the dot products grow large, pushing softmax into regions where it has extremely small gradients. The scaling keeps the variance of the dot products at 1, keeping softmax in a well-behaved regime.
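
Here's the whole computation in a few lines of NumPy (a single-head sketch with random weights; W_Q, W_K, W_V are the learned projections from the text, and no causal mask is applied):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n): every query scored against every key
    weights = softmax(scores)                # each row sums to 1
    return weights @ V                       # weighted mix of value vectors

n, d, d_k = 4, 8, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
print(attention(X, W_Q, W_K, W_V).shape)     # (4, 8): one output vector per token
```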

Interactive: Attention Score Calculator

Step through the attention computation on a small example. See how Q, K, V matrices are formed and how attention weights emerge.

(Interactive widget: starting from input embeddings X ∈ ℝ^(4×4) for the tokens "The cat sat down", the calculator forms Q, K, and V and steps through softmax(Q·K^T / √d_k)·V to the final attention output.)

Multi-Head Attention: Multiple Perspectives

A single attention head learns one notion of "relevance"—maybe syntactic structure, maybe semantic similarity, maybe something else entirely. Multi-head attention runs multiple attention operations in parallel, each with its own learned projections.

In practice, heads specialize. Research has shown that different heads learn to track different types of relationships: some focus on adjacent tokens, some on syntactic dependencies, some on semantic similarity across long distances.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O
where head_i = Attention(Q · W_Q^i, K · W_K^i, V · W_V^i)
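
As a sketch, multi-head attention is just the single-head computation run h times on split dimensions and then re-mixed (NumPy, random weights, no causal mask; this is illustrative rather than any library's implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    n, d = X.shape
    d_head = d // n_heads
    # project, then split the model dimension into heads: (n_heads, n, d_head)
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh                      # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d)   # re-join the heads
    return concat @ W_O                               # final output projection

n, d = 6, 16
rng = np.random.default_rng(1)
W_Q, W_K, W_V, W_O = (0.1 * rng.normal(size=(d, d)) for _ in range(4))
out = multi_head_attention(rng.normal(size=(n, d)), W_Q, W_K, W_V, W_O, n_heads=4)
print(out.shape)   # (6, 16)
```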

Interactive: Multi-Head Attention Patterns

Explore how different attention heads specialize. Each heatmap shows which tokens attend to which other tokens for a given head.

(Interactive widget: per-head attention heatmaps for the sentence "The cat sat on the mat because it was tired"; rows are query tokens, columns are key tokens, and the coreference head puts most of "it"'s attention on "cat".)

Head Specialization: Different attention heads learn different patterns. Some track local context; others handle coreference ("it" → "cat"), syntax, or long-distance semantic relationships. This diversity is why multi-head attention outperforms single-head.

Positional Encoding: Where Am I?

Here's a subtle but critical point: attention is permutation equivariant. If you shuffle the input tokens, attention doesn't care—it produces shuffled outputs. But "Dog bites man" ≠ "Man bites dog"!

Positional encodings inject position information into the model. The original Transformer used sinusoidal functions:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Why sinusoids? They have a beautiful property: the encoding for position pos+k can be expressed as a linear function of the encoding for position pos. This means the model can potentially learn to attend to relative positions.
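
The formulas translate directly into code (a small NumPy sketch):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]           # index of each sin/cos dimension pair
    angle = pos / (10000 ** (2 * i / d_model))     # (n_positions, d_model // 2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions: cosine
    return pe

pe = positional_encoding(n_positions=32, d_model=16)
print(pe[10, :4])   # the start of position 10's unique encoding
```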

Interactive: Positional Encoding Visualizer

Visualize how sinusoidal positional encodings create unique patterns for each position. Different dimensions oscillate at different frequencies.

(Interactive widget: sinusoidal wave patterns across dimensions and the resulting encoding vectors at selectable positions; lower dimensions oscillate faster, higher dimensions slower.)

Three properties to notice: every position gets a unique encoding pattern; PE(pos+k) can be expressed as a linear function of PE(pos), so relative positions are recoverable; and the encoding is multi-scale, with low dimensions capturing local structure and high dimensions capturing global structure.

Feed-Forward Networks and Layer Norm

After attention, each position passes through a position-wise feed-forward network. This is just a two-layer MLP, but it's applied independently to each position:

FFN(x) = max(0, x · W_1 + b_1) · W_2 + b_2

The inner dimension is typically 4× the model dimension—this expansion provides representational capacity that attention alone lacks. Modern variants use GELU or SwiGLU activations instead of ReLU.

Residual connections wrap both the attention and FFN sublayers, and layer normalization ensures training stability. The "pre-norm" variant (norm before the sublayer) has become standard as it provides better gradient flow.
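
Putting the sublayers together, a pre-norm block looks roughly like this (a PyTorch sketch; the dimensions are arbitrary, the use of nn.MultiheadAttention is illustrative, and no causal mask is applied):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                  # position-wise FFN with 4x expansion
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)                                   # pre-norm: normalize first
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual around attention
        x = x + self.ffn(self.norm2(x))                     # residual around FFN
        return x

block = PreNormBlock(d_model=64, n_heads=4)
print(block(torch.randn(1, 10, 64)).shape)   # torch.Size([1, 10, 64])
```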

The Complete Picture

Stack N of these layers (N=6 in the original paper, N=32-80+ in modern LLMs), and you have a Transformer. Each layer refines the representations: early layers might capture syntactic structure, later layers might encode more abstract semantics.

What struck me when I first implemented this was how simple the individual pieces are. Matrix multiplications, softmax, layer norm, ReLU. Nothing exotic. The magic is in the combination and scale.

Chapter 3

The Training Objective and What Models Actually Learn

Here's what continues to amaze me about modern LLMs: the training objective is almost laughably simple. Next-token prediction. Given the tokens you've seen so far, predict the probability distribution over what comes next.

That's it. No explicit teaching of grammar, no labeled examples of reasoning, no curriculum of skills. Just: predict the next token, billions of times, on trillions of tokens of text.

Why This Works: A Deep Connection

The mathematical justification is elegant. If you model P(x_t | x_1, ..., x_{t-1}) well for all positions in all texts, you're implicitly modeling the entire distribution of human text. Grammar rules? They're just patterns that make certain next tokens more likely. Factual knowledge? It's encoded in which continuations are probable. Reasoning? It's the patterns that connect premises to conclusions.

The loss function is cross-entropy:

L = -Σ log P(x_t | x_1, ..., x_{t-1})

Perplexity, the standard metric, is just exp(loss). A perplexity of 10 means the model is, on average, as uncertain as if choosing uniformly among 10 options. Modern LLMs achieve perplexities of 8-15 on standard benchmarks.
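
The loss-to-perplexity relationship is easy to sanity-check with made-up numbers (the probabilities below are invented purely for illustration):

```python
import numpy as np

# Probabilities the model assigned to the *actual* next token at five positions.
p_correct = np.array([0.60, 0.10, 0.35, 0.80, 0.05])

loss = -np.mean(np.log(p_correct))   # average negative log-likelihood per token
perplexity = np.exp(loss)
print(f"loss = {loss:.3f} nats/token, perplexity = {perplexity:.2f}")
```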

Interactive: Next-Token Prediction

Type a prompt and see the probability distribution over possible next tokens. Adjust temperature to see how it affects the distribution.

(Interactive widget: for the prompt "The capital of France is", the model puts roughly 88% of its probability on "Paris", with the rest spread over tokens like "the", "a", and "known"; a temperature slider from 0 to 2 reshapes the distribution.)

Temperature Effect: Low temperature (→0) makes the model more deterministic, always picking the highest-probability token. High temperature (→2) flattens the distribution, making unlikely tokens more likely to be sampled.

Emergent Capabilities: Scale Changes Everything

Emergent capabilities are perhaps the most surprising aspect of LLM scaling. These are abilities that appear suddenly at certain scales but are absent in smaller models.

GPT-2 (1.5B parameters) could generate coherent paragraphs but struggled with few-shot learning. GPT-3 (175B parameters) could solve new tasks from just a few examples in the prompt—a capability that wasn't explicitly trained and that researchers didn't fully anticipate.

Chain-of-thought reasoning emerged similarly. Tell a large enough model to "think step by step," and it produces intermediate reasoning that improves final answers. Smaller models just produce nonsense when prompted the same way.

These aren't linear improvements—they're phase transitions. The capability is essentially zero, then suddenly it's there. We still don't fully understand why.

Interactive: Emergent Capability Timeline

Explore how capabilities emerged as models scaled from GPT-1 to GPT-4 and beyond.

(Interactive widget: a clickable timeline from GPT-1 to GPT-4. Selecting GPT-3 (2020, 175B parameters) highlights its newly emerged capabilities: few-shot learning, basic arithmetic, simple code generation, and translation without task-specific training.)

Phase Transitions: Emergent capabilities don't appear gradually; they emerge suddenly at certain scales. Few-shot learning was nearly absent in GPT-2 but transformative in GPT-3, despite only a ~100× parameter increase.

Part II: From Transformer to Modern LLMs

The architectural refinements and engineering innovations that enable today's systems.

Chapter 4

The Decoder-Only Revolution

The original Transformer was an encoder-decoder architecture designed for translation. The encoder processed the source sentence with bidirectional attention (each position attends to all others). The decoder generated the target sentence autoregressively, with causal attention (each position only attends to previous positions) plus cross-attention to the encoder outputs.

Modern LLMs almost universally use decoder-only architectures. Why did encoder-decoder lose?

The Case for Decoder-Only

Unified Training: With decoder-only, everything is next-token prediction. You don't need paired data (source-target). Any text works.

Task Flexibility: Encoder-decoder assumes a clear input/output split. Decoder-only treats everything as a sequence to continue. Want translation? Prompt with "Translate to French: [text]". Want summarization? "Summarize: [text]". The task is implicit in the prompt.

Compute Efficiency: No separate encoder phase. During inference, you're running one forward pass per generated token, not encoder + decoder.

Interactive: Architecture Comparison

Compare how encoder-decoder and decoder-only architectures process the same task.

(Interactive widget: side-by-side views of encoder-only, decoder-only, and encoder-decoder architectures, with example models, attention-mask visualizations, and data flow. Decoder-only examples include GPT-4, Claude, LLaMA, and Mistral, best suited to generation, chat, code, and reasoning, with causal masking so each token only sees previous tokens.)

Quick comparison:

  • Context: encoder-only is fully bidirectional; decoder-only is left-to-right only; encoder-decoder gets both
  • Generation: limited for encoder-only; native for decoder-only and encoder-decoder
  • Understanding: excellent for encoder-only and encoder-decoder; good for decoder-only
  • 2024 trend: encoder-only survives for embeddings/RAG; decoder-only dominates; encoder-decoder fills a seq2seq niche

Why Decoder-Only Dominates: Despite encoder-only models excelling at understanding, decoder-only models can do everything (including understanding) by framing tasks as generation. Simpler architecture plus a unified training objective means easier scaling.

The GPT Progression

GPT-1 (2018, 117M parameters) demonstrated that pre-training on next-token prediction followed by fine-tuning could achieve strong results. The insight: unsupervised pre-training provides a good initialization.

GPT-2 (2019, 1.5B parameters) showed zero-shot capabilities. Without any task-specific training, it could perform basic question answering, summarization, and translation. OpenAI initially declined to release the full model due to concerns about misuse.

GPT-3 (2020, 175B parameters) was the breakthrough that launched the current era. Few-shot learning worked: give the model a few examples in the prompt, and it could perform tasks it was never explicitly trained for. This was genuinely surprising.

GPT-4 (2023) introduced multimodality (images) and likely uses a Mixture of Experts architecture (though OpenAI hasn't confirmed details). Outside estimates put it at roughly 1.7T total parameters, but this has never been confirmed.

Chapter 5

Tokenization — The Often-Overlooked Foundation

Tokenization might be the most underappreciated component of LLMs. It determines how text is chunked into the discrete units the model processes, and it has profound implications for model behavior.

The challenge: characters are too granular (long sequences, sparse learning signal), words are too sparse (huge vocabularies, out-of-vocabulary problems). Subword tokenization finds the middle ground.

Byte Pair Encoding (BPE)

BPE is elegantly simple. Start with a vocabulary of individual characters (or bytes). Iteratively find the most frequent adjacent pair of tokens and merge them into a new token. Repeat until you reach your target vocabulary size.

The result: common words become single tokens ("the" → "the"), rare words get split into subwords ("unconstitutional" → "un" + "constitu" + "tional"), and you can always fall back to individual characters for anything unknown.
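
A toy BPE trainer fits in a screenful of Python (this sketch works on whole words and ignores byte-level details, pre-tokenization, and special tokens):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    # represent each word as a tuple of symbols, counted by frequency
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():          # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

print(bpe_train(["the", "the", "then", "that", "cat"], num_merges=3))
# [('t', 'h'), ('th', 'e'), ('a', 't')] -- "the" becomes a single token after two merges
```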

Interactive: Tokenization Playground

Type any text to see how it gets tokenized. Compare different tokenizers and see the "token tax" for different languages.

(Interactive widget: type any text to see its token segmentation and compare token counts, tokens per word, and efficiency across GPT-2, GPT-4, LLaMA, and Claude tokenizers.)

The Token Tax: Different languages and content types require different numbers of tokens. Code and non-English text often require 2-3× more tokens than English prose, making them more expensive to process.

Interactive: BPE Merge Visualization

Watch BPE build a vocabulary by iteratively merging the most frequent pairs.

(Interactive widget: a toy corpus is re-tokenized step by step as the most frequent adjacent pairs are merged and the vocabulary grows.)

How BPE Works: Starting with individual characters, BPE iteratively merges the most frequent adjacent pair into a new token. Common words become single tokens ("the"), while rare words are split into subwords. The merge order determines the final tokenization.

The Real-World Impact

Tokenization quirks have real consequences:

  • Token fertility: English averages ~1.3 tokens per word. Some languages require 2-3× more tokens for the same content, making them more expensive to process.
  • Arithmetic failures: Numbers often get tokenized inconsistently. "1234" might be one token but "12345" might be "123" + "45". This makes arithmetic unreliable.
  • Code handling: Whitespace-sensitive languages (Python) need careful tokenization to preserve indentation semantics.

Chapter 6

Modern Positional Encoding — RoPE, ALiBi, and Beyond

Sinusoidal positional encodings worked, but they had limitations. The biggest: extrapolation. Train on sequences up to 2048 tokens, and performance degrades on longer sequences. The model hasn't seen those position values before.

RoPE: Rotary Position Embeddings

RoPE, used by LLaMA and many others, encodes position through rotation. Instead of adding positional information to embeddings, RoPE applies rotation matrices to query and key vectors.

The key insight: when you take the dot product of two rotated vectors, the result depends only on their angle difference—which is proportional to the position difference. Position becomes relative, encoded in the geometry of the operation itself.
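
Here's a small NumPy sketch of that idea: rotate each 2D pair of dimensions by a position-dependent angle, then check that the Q·K dot product depends only on the offset. The pairing convention and the base of 10000 follow the common RoPE formulation, but this is illustrative, not any specific library's implementation:

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Apply a rotary embedding to vector x at the given position."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)    # one frequency per dimension pair
    theta = position * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]                         # pair up adjacent dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                   # 2D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# The same relative offset (5) at different absolute positions gives the same score:
print(np.dot(rope(q, 3), rope(k, 8)))
print(np.dot(rope(q, 103), rope(k, 108)))   # matches, up to floating-point error
```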

Interactive: RoPE Rotation Visualizer

Visualize how RoPE encodes position through rotation. The dot product between Q and K depends only on their relative position.

(Interactive widget: a 2D dimension pair before and after RoPE rotation, the corresponding rotation matrix, rotation angles for every dimension pair at a chosen position, and a position-by-dimension angle matrix. Lower dimension pairs rotate faster; higher pairs rotate slower.)

Why RoPE Works: By encoding position as rotation, the relative position between tokens becomes a rotation difference. When computing Q·K^T, these rotations create position-aware attention that depends only on relative offsets and extrapolates better to longer sequences than absolute embeddings.

ALiBi: Attention with Linear Biases

ALiBi takes a radically different approach: no learned positional encoding at all. Instead, it adds a linear penalty to attention scores based on distance: positions farther apart get lower attention scores.

Different heads use different slopes, allowing some to focus locally and others to attend more globally. ALiBi extrapolates perfectly to any length (with the assumption that recency matters), though it can't learn arbitrary position-dependent patterns.

Chapter 7

Attention Variants for Efficiency

The KV-cache is both a blessing and a curse. During autoregressive generation, we cache the key and value tensors from previous tokens to avoid recomputation. Essential for reasonable inference speed. But memory consumption scales with: layers × heads × head_dim × sequence_length × batch_size.

For a 70B model serving a 128K context, the KV-cache alone can consume 80+ GB—often more than the model weights themselves.

Multi-Query and Grouped-Query Attention

Multi-Query Attention (MQA) shares a single key and value head across all query heads. KV-cache shrinks by a factor of the number of heads. Quality degrades slightly (5-10%).

Grouped-Query Attention (GQA) is the compromise. Groups of query heads share KV heads. LLaMA 2 70B uses 64 query heads with 8 KV groups—an 8× reduction in KV-cache with quality within 1% of full attention after uptraining.
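
The arithmetic is worth doing by hand at least once. A back-of-envelope sketch, using the LLaMA-2-70B-like shape described above (80 layers, head_dim 128, FP16 cache) at a 128K context:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x accounts for storing both keys and values
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=128_000, batch=1)
gqa = kv_cache_bytes(layers=80, kv_heads=8,  head_dim=128, seq_len=128_000, batch=1)
print(f"full MHA cache: {mha / 1e9:.0f} GB, GQA (8 KV groups): {gqa / 1e9:.0f} GB")
```

Roughly 335 GB with full multi-head KV versus about 42 GB with 8 KV groups per request, which is why GQA has become the default at this scale.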

Interactive: KV-Cache Memory Calculator

Calculate KV-cache memory requirements for different model configurations and attention variants.

(Interactive widget: sliders for sequence length (1K-128K) and batch size (1-64) produce the KV-cache size, estimated model weights, total memory, and whether the configuration fits on a given GPU.)

Formula: KV-cache bytes = 2 × layers × kv_heads × head_dim × seq_len × batch × bytes_per_element

Sliding Window Attention

Sliding window attention restricts each token to only attend within a fixed window. O(n × w) complexity instead of O(n²), and the KV-cache has a fixed maximum size regardless of sequence length.

The trick: information can still propagate across the full sequence through layer stacking. With window w and L layers, the effective receptive field is L × w. Mistral popularized this approach with a rolling KV-cache buffer; a related trick, "attention sinks," keeps the first few tokens always attendable to anchor the representation.

Interactive: Sliding Window Receptive Field

See how information propagates through layers with sliding window attention. The effective receptive field grows with depth.

(Interactive widget: with a window of 4 tokens, the receptive field of a query token grows layer by layer until it covers the whole 12-token sequence, shown alongside the full causal mask (O(n²) memory) and the sliding-window mask (O(n × w) memory).)

Sliding Window Magic: With window size W and L layers, a token can attend to W×L positions through information propagation. Mistral uses W=4096, so with 32 layers the effective context is about 131K tokens while keeping O(n×W) memory, enabling long sequences without the quadratic cost.

Chapter 8

Mixture of Experts — Scaling Parameters Without Scaling Compute

Here's the scaling dilemma: more parameters generally mean better performance, but more parameters also mean more compute per token. Mixture of Experts (MoE) breaks this link.

Replace the dense FFN with N expert FFNs. A router network selects the top-k experts for each token. Most parameters are inactive for any given token—you get the capacity of a large model with the compute cost of a smaller one.

Mixtral: A Concrete Example

Mixtral 8x7B has 8 experts with top-2 routing. Total parameters: 46.7B. Active parameters per token: ~12.9B. It matches or exceeds LLaMA 2 70B on most benchmarks while using similar compute to a 13B dense model.

Interactive: MoE Routing Visualization

Watch how tokens get routed to different experts. See load balancing in action.

(Interactive widget: eight tokens flow through a router into eight experts with top-2 selection; a table lists each token's two chosen experts and their routing weights, along with load-balance and sparsity readouts.)

MoE Efficiency: With 8 experts but top-2 routing, only about a quarter of the expert parameters activate per token, which is how Mixtral 8x7B can have ~47B total parameters but only ~13B active per token. Load-balancing losses prevent expert collapse.

The Load Balancing Challenge

Without careful design, the router might learn to always use the same few experts—"expert collapse." An auxiliary load-balancing loss encourages an even distribution:

L_aux = α × N × Σ(fraction_i × router_prob_i)

The infrastructure implications are significant. All expert parameters must fit in memory (or be distributed with expert parallelism), but compute scales only with active parameters. This creates an unusual memory-compute trade-off.
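
A routing sketch with the auxiliary loss (NumPy; the router weights are random and the coefficient alpha is a made-up value, so this only illustrates the mechanics):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def route(tokens, W_router, top_k=2, alpha=0.01):
    n_experts = W_router.shape[1]
    probs = softmax(tokens @ W_router)                    # (n_tokens, n_experts)
    top = np.argsort(-probs, axis=-1)[:, :top_k]          # chosen experts per token
    # auxiliary loss: fraction of tokens sent to each expert x mean router probability
    fraction = np.bincount(top.ravel(), minlength=n_experts) / top.size
    mean_prob = probs.mean(axis=0)
    aux_loss = alpha * n_experts * np.sum(fraction * mean_prob)
    return top, aux_loss

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))                 # 8 tokens, 16-dim hidden states
top, aux = route(tokens, rng.normal(size=(16, 8)))
print(top)                                        # two experts (out of 8) per token
print(f"aux loss: {aux:.4f}")
```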

Chapter 9

Flash Attention — IO-Aware Algorithm Design

Flash Attention is perhaps my favorite example of infrastructure-aware algorithm design. The insight: on modern GPUs, compute is cheap but memory bandwidth is expensive.

An H100 offers on the order of 1,000 TFLOPS of dense FP16 tensor compute but only 3.35 TB/s of HBM bandwidth. Standard attention computes the n×n attention matrix, writes it to HBM, reads it back for softmax, writes the result, and reads it again for the final matmul. All that memory movement dominates runtime.

The Tiling Solution

Flash Attention never materializes the full n×n matrix. It processes Q, K, V in tiles that fit in SRAM (the GPU's fast on-chip memory). The key enabler is online softmax—computing softmax incrementally as new blocks arrive, tracking running statistics rather than needing all values upfront.
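
The online-softmax trick is the heart of it, and it's small enough to verify directly (a NumPy sketch for a single query row; real Flash Attention does this tiled over Q, K, and V entirely on-chip):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, block_size=4):
    m = -np.inf                          # running max of scores seen so far
    l = 0.0                              # running sum of exp(score - m)
    acc = np.zeros(values.shape[-1])     # running unnormalized output
    for start in range(0, len(scores), block_size):
        s = scores[start:start + block_size]
        v = values[start:start + block_size]
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)        # rescale the statistics from earlier blocks
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
scores, values = rng.normal(size=16), rng.normal(size=(16, 8))
blockwise = online_softmax_weighted_sum(scores, values)
weights = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()
print(np.allclose(blockwise, weights @ values))   # True: same answer, never all scores at once
```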

Interactive: Memory Access Patterns

Compare standard vs Flash Attention memory access patterns. See how tiling dramatically reduces HBM traffic.

(Interactive widget: step-by-step memory traffic for standard attention, which shuttles Q, K, V, the full n×n score matrix, the softmax result, and the output between HBM (~2 TB/s) and on-chip SRAM (~19 TB/s), versus Flash Attention's tiled computation that keeps the working set in SRAM.)

IO-Aware Algorithm: Flash Attention never materializes the full n×n attention matrix in HBM. By computing softmax block by block in fast SRAM with the online-softmax trick, it reduces memory IO from O(n²) to O(n²/M), where M is the SRAM size, trading a few extra FLOPs for much faster wall-clock time.

The results are dramatic:

  • Memory: O(n) instead of O(n²)
  • Speed: 2-9× faster depending on sequence length
  • Efficiency: Flash Attention 2 achieves 50-73% of theoretical FLOPS

This is the kind of algorithm you get when systems engineers and ML researchers collaborate closely. The math is the same—the implementation just respects the hardware reality.

Chapter 10

Training at Scale

Training a 70B+ parameter model requires careful orchestration of parallelism strategies. No single GPU can hold the model, and naive approaches waste compute or run out of memory.

The Parallelism Strategies

Data Parallelism: Each GPU holds a full model copy and processes different batches. Gradients are synchronized via all-reduce. Scales compute but not memory.

Tensor Parallelism: Split individual layers horizontally across GPUs. The attention and FFN weight matrices are sharded. Requires communication every layer—needs fast NVLink (400-900 GB/s).

Pipeline Parallelism: Split layers vertically—different GPUs own different layers. Communication only at stage boundaries. InfiniBand (200-400 Gb/s) is sufficient. Uses micro-batching to hide pipeline bubbles.

Interactive: Parallelism Strategies

Visualize how different parallelism strategies distribute model and data across GPUs.

(Interactive widget: select data, tensor, pipeline, or FSDP parallelism and see how layers and batches are laid out across four GPUs, with per-GPU memory, communication pattern, and typical scale. For data parallelism: simple to implement and scales linearly, but every GPU must hold the full model and gradient synchronization adds overhead.)

Common combinations (3D parallelism): LLaMA 70B typically runs TP=8 within a node with data parallelism for batch scaling; GPT-3-scale models use TP=8 and PP=8 across 64+ GPUs; the largest training runs combine FSDP with tensor parallelism across thousands of GPUs.

Combining Strategies: Real-world training uses multiple parallelism types together: TP within a node (fast NVLink), PP across nodes, and FSDP for memory efficiency. The goal is to maximize GPU utilization while fitting the model.

ZeRO: Memory Efficiency for Data Parallelism

ZeRO (Zero Redundancy Optimizer) eliminates memory redundancy in data parallelism. Standard DDP stores optimizer states, gradients, and parameters on every GPU. ZeRO shards them:

  • Stage 1: Shard optimizer states (4× memory reduction for Adam)
  • Stage 2: Also shard gradients
  • Stage 3: Also shard parameters (requires gather before forward/backward)

Scaling Laws: The Chinchilla Revolution

In 2022, DeepMind's Chinchilla paper upended conventional wisdom. GPT-3 (175B params, 300B tokens) was undertrained. The compute-optimal ratio is roughly 20 tokens per parameter.

Chinchilla (70B params, 1.4T tokens) matched or beat GPT-3's performance with 2.5× fewer parameters: smaller, faster to run, and trained with a similar compute budget allocated differently.
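
The rule of thumb is easy to turn into a calculator. This sketch uses the common C ≈ 6·N·D approximation for training FLOPs (an assumption on my part, not stated above) together with the ~20 tokens-per-parameter ratio:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e21)
print(f"~{n / 1e9:.1f}B parameters trained on ~{d / 1e9:.0f}B tokens")  # ~2.9B / ~58B
```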

Interactive: Scaling Law Explorer

Explore the compute-optimal frontier. See where different models fall relative to the Chinchilla-optimal line.

(Interactive widget: slide the compute budget from GPT-2 scale to LLaMA 3 scale and read off the compute-optimal parameter and token counts; at 10²¹ FLOPs the optimum is roughly 2.9B parameters trained on ~58B tokens, about 20 tokens per parameter.)

The Chinchilla Revolution: GPT-3 (175B params, 300B tokens) was massively undertrained. Chinchilla (70B params, 1.4T tokens) achieved similar performance with 60% fewer parameters by training longer. Modern models like LLaMA 3 push even further, training on 200+ tokens per parameter for better inference efficiency.

Chapter 11

Inference Optimization

Training costs are one-time. Inference costs are forever. For widely-used models, inference optimization matters enormously for both cost and user experience.

The KV-Cache Lifecycle

During autoregressive generation, each new token requires attending to all previous tokens. Without caching, you'd recompute K and V for the entire context on every step—O(n²) total compute.

With KV-caching, you compute K and V once per token and store them. Each generation step then only needs O(n) compute. The trade-off: memory consumption grows linearly with context length.
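
A toy counter makes the asymptotics concrete (this ignores attention FLOPs and the one-time prefill, and counts only K/V projections during decoding):

```python
def kv_projections(prompt_len, new_tokens, use_cache):
    """How many K/V projections are computed to generate new_tokens after a prompt."""
    total, ctx = 0, prompt_len
    for _ in range(new_tokens):
        total += 1 if use_cache else ctx + 1   # cached: only the new token; otherwise redo everything
        ctx += 1
    return total

print(kv_projections(1000, 200, use_cache=False))  # 220,100 projections: quadratic-ish blowup
print(kv_projections(1000, 200, use_cache=True))   # 200 projections: one per new token
```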

Interactive: Token Generation Animation

Watch the autoregressive generation process step by step. See the KV-cache grow as tokens are generated.

(Interactive widget: a four-token prompt is extended one token at a time while the KV-cache panel fills in, with counters for prompt tokens, generated tokens, and cache growth.)

Why KV-Cache Matters: Without caching, each new token would require recomputing K and V for all previous tokens, O(n²) total compute. With caching, we only compute K/V for the new token and reuse the stored values, O(n) total compute, at the cost of memory that grows linearly with context.

PagedAttention: Virtual Memory for KV-Cache

PagedAttention (introduced in vLLM) applies virtual memory concepts to KV-cache management. Instead of pre-allocating the maximum possible sequence length, allocate memory in small blocks on demand.

A page table maps logical KV positions to physical memory blocks. Benefits:

  • Near-zero memory fragmentation
  • Memory allocated only as needed
  • Prefix sharing: requests with common prefixes share KV blocks

Interactive: PagedAttention Visualization

See how PagedAttention manages memory dynamically, allocating and recycling blocks as sequences progress.

(Interactive widget: a pool of 24 physical blocks of 16 tokens each; adding requests allocates blocks on demand and completing requests frees them, with live utilization, used/free block, and active-request counters.)

PagedAttention Benefits: Traditional KV-cache management pre-allocates the maximum sequence length per request, wasting memory. PagedAttention allocates small blocks on demand and recycles them when requests complete, achieving near-zero fragmentation and enabling prefix caching for shared prompts.

Speculative Decoding

Speculative decoding uses a small, fast "draft" model to propose multiple tokens, then the large "target" model verifies them in parallel. If the draft matches, you've generated multiple tokens in essentially one target forward pass.

Typical speedups: 2-3× for latency. It works best when the draft model is a good approximation of the target—you can use a quantized version of the target, a distilled variant, or a specialized small model.
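
The control flow is simple enough to sketch with stand-in "models" (the greedy accept-if-equal rule below is a simplification; production systems use a rejection-sampling rule that preserves the target distribution exactly):

```python
def speculative_step(context, draft_model, target_model, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposal = []
    for _ in range(k):
        proposal.append(draft_model(context + proposal))
    # 2. Target model scores every drafted position; a real system does this
    #    in one batched forward pass rather than a loop.
    target_preds = [target_model(context + proposal[:i]) for i in range(k)]
    # 3. Accept the longest matching prefix, then take the target's token at
    #    the first disagreement.
    accepted = []
    for drafted, verified in zip(proposal, target_preds):
        if drafted == verified:
            accepted.append(drafted)
        else:
            accepted.append(verified)
            break
    return accepted

# Tiny stand-in "models" that just pick the next word of a fixed sentence.
SENTENCE = "the capital of france is paris".split()
draft  = lambda ctx: SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"
target = lambda ctx: SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"
print(speculative_step([], draft, target))   # all 4 drafted tokens accepted in one "pass"
```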

Interactive: Speculative Decoding Demo

Watch how draft and target models work together to accelerate generation.

(Interactive widget: a 7B draft model proposes tokens continuing "The capital of France is" and a 70B target model verifies them, with counters for accepted and rejected tokens, acceptance rate, and estimated speedup.)

How It Works: A small draft model quickly generates K candidate tokens. The large target model verifies them all in a single forward pass. We accept the longest matching prefix and resample from the first disagreement, achieving 2-3× speedup while preserving the target model's output distribution.

Chapter 12

Long Context — The Frontier

Standard attention is O(n²) in both memory and compute. At 128K tokens, that's 16 billion attention elements per layer. The quadratic wall is the fundamental bottleneck for long-context processing.

Interactive: Quadratic Complexity Visualizer

See how attention complexity scales with sequence length. The quadratic growth quickly becomes prohibitive.

(Interactive widget: slide the sequence length from 4K to over 1M tokens and compare O(n²) standard attention against O(n) linear attention, including the FP16 attention-matrix memory.)

Scaling comparison (operations and FP16 attention-matrix memory):

  • 1.0K tokens: 1.0M quadratic ops vs 1.0K linear ops; 2 MB
  • 4.1K tokens: 16.8M vs 4.1K; 32 MB
  • 16.4K tokens: 268.4M vs 16.4K; 512 MB
  • 65.5K tokens: 4.3B vs 65.5K; 8 GB
  • 262.1K tokens: 68.7B vs 262.1K; 128 GB
  • 1.0M tokens: 1.1T vs 1.0M; 2048 GB

The Quadratic Wall: At 128K tokens, standard attention requires roughly 16 billion operations and 32 GB just for the attention matrix. This is why Flash Attention, sparse patterns, and state space models are essential for long-context processing.

Sparse Attention Patterns

Sparse attention reduces complexity by only computing a subset of attention pairs:

  • Longformer: Local sliding window + global tokens
  • BigBird: Random + window + global patterns
  • Theoretical result: sparse patterns can be universal approximators

Interactive: Sparse Attention Patterns

Compare different sparse attention patterns and their coverage of the full attention matrix.

(Interactive widget: select full, sliding-window, dilated/strided, Longformer, or BigBird attention and see the resulting mask on a 12-token example, with sparsity, active connections, and memory savings.)

Pattern comparison:

  • Full attention: O(n²); limited for long context; the quality baseline
  • Sliding window: O(n × w); local context only; fast but may miss long-range dependencies
  • Dilated/strided: O(n × w); expanded coverage from a fixed pattern
  • Longformer: O(n × w + g × n); global tokens plus local windows
  • BigBird: O(n × (w + g + r)); global, local, and random connections for broader coverage

Sparse Attention Trade-offs: Full attention captures all relationships but is O(n²). Sparse patterns reduce complexity to roughly linear but must carefully choose which connections to keep. Longformer and BigBird combine local windows with global tokens for strong coverage at linear cost.

State Space Models and Alternatives

State Space Models like Mamba offer a different paradigm: O(n) training and inference with O(1) memory during inference. They use structured recurrence that can be computed efficiently via convolution during training.

Hybrid approaches like Jamba combine SSM layers with Transformer layers, getting benefits of both architectures.

RAG: Sidestepping the Problem

Retrieval-Augmented Generation avoids the long-context problem entirely. Instead of fitting everything into context, retrieve relevant chunks from an external knowledge base and include only those.

Trade-offs: latency (retrieval adds time), retrieval quality (wrong chunks = wrong answers), and complexity (another system to maintain).

Chapter 13

Practical Infrastructure Decisions

Let's translate all this technical understanding into practical deployment decisions. After years of deploying ML models, I've learned that the gap between "works in notebook" and "works in production" is vast.

Memory Planning

Start with back-of-envelope calculations (a worked sketch follows this list):

  • Model weights: params × bytes_per_param (2 for FP16, 1 for INT8)
  • KV-cache: 2 × layers × kv_heads × head_dim × seq_len × batch × bytes
  • Activations: varies, but budget 10-20% overhead
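
Putting those three lines together (the configuration below is a hypothetical LLaMA-13B-like deployment at FP16, 4K context, batch 8, with a flat 15% overhead assumption):

```python
def deployment_memory_gb(params_b, bytes_per_param, layers, kv_heads, head_dim,
                         seq_len, batch, kv_bytes=2, overhead=0.15):
    weights = params_b * 1e9 * bytes_per_param
    kv = 2 * layers * kv_heads * head_dim * seq_len * batch * kv_bytes
    total = (weights + kv) * (1 + overhead)          # activations etc. as a flat overhead
    return weights / 1e9, kv / 1e9, total / 1e9

w, kv, total = deployment_memory_gb(params_b=13, bytes_per_param=2, layers=40,
                                    kv_heads=40, head_dim=128, seq_len=4096, batch=8)
print(f"weights {w:.0f} GB + kv-cache {kv:.0f} GB -> ~{total:.0f} GB total")
```

That comes out to roughly 61 GB, which still fits on a single 80 GB GPU but leaves limited headroom; doubling the batch or the context length would push it over.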

Parallelism Strategy Selection

Rules of thumb:

  • < 15B: Single GPU or simple data parallelism
  • 15-70B: Tensor parallelism within a node (NVLink required)
  • 70-200B: Add pipeline parallelism across nodes
  • > 200B: Full 3D parallelism (DP + TP + PP)

Interactive: Deployment Calculator

Plan your deployment: enter model specs and requirements, get hardware recommendations.

(Interactive widget: enter model size, precision, context length, and concurrency to get model-weight and KV-cache memory, the GPUs required, a parallelism recommendation, and rough prefill latency, inter-token latency, and throughput estimates, plus a checklist: fits on a single GPU, NVLink available for TP, meets the throughput target, and keeps more than 10% memory headroom.)

Key Metrics to Track

Time to First Token (TTFT): Interactive latency. Users notice delays beyond 500ms. Critical for chat applications.

Inter-Token Latency (ITL): Streaming smoothness. Should stay under 50ms for natural-feeling output.

Throughput: Tokens per second across all requests. The key metric for cost optimization. Continuous batching and speculative decoding can dramatically improve this.

Part III: Appendix & Reference

Looking forward and quick reference materials.

Chapter 14

What's Emerging

The field is moving fast. Here's what I'm watching:

Efficiency Focus

Inference costs dominate the economics now. Training is one-time; serving is forever. Expect more aggressive quantization (FP4, even binary for some operations), better speculative decoding, and distillation to smaller, faster models.

Long Context via Architecture

SSM hybrids are becoming practical. The dream: O(n) attention with O(n²) quality. Linear attention variants are improving. The quadratic wall may eventually fall.

Reasoning and Search

Chain-of-thought as explicit search. Tree of Thought explores multiple reasoning paths. Best-of-N sampling trades compute for quality. Inference-time compute scaling—using more compute at inference for harder problems—is an active research area.

The Agent Paradigm

LLMs as reasoning engines embedded in larger systems. Tool use (code execution, search, databases). Multi-step planning. The model becomes one component in an architecture, not the whole system.

Final Thoughts

I started programming on a Commodore 64, typing in BASIC listings from magazines. The idea that I'd one day be explaining how artificial systems can generate coherent text, write code, and reason about problems would have seemed like science fiction.

Yet here we are. The Transformer architecture—matrix multiplications, softmax, layer norm, all pieces that would be familiar to any linear algebra student—combined at scale produces capabilities that still surprise me.

What excites me most is that we're still in early days. The "Attention Is All You Need" paper is less than a decade old. Flash Attention is from 2022. The techniques I've described here will likely seem primitive in another decade.

The fundamentals, though—understanding the memory hierarchy, the parallelism trade-offs, the mathematical foundations—those will remain useful. Learn the principles, not just the current instantiations.

And keep building. That's still the best way to really understand.

Quick Reference

Key Formulas

Attention:
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
KV-Cache Memory:
2 × layers × kv_heads × head_dim × seq_len × batch × bytes
Model Weights:
params × bytes_per_param (FP16=2, INT8=1, INT4=0.5)

Common Model Architectures

Model          Params   Layers   d_model   Heads
LLaMA 7B       6.7B     32       4096      32
LLaMA 13B      13B      40       5120      40
LLaMA 70B      70B      80       8192      64 (8 KV)
Mixtral 8x7B   46.7B    32       4096      32 (8 KV)