How LLMs Work
Dive into the internals of Large Language Models — from tokenization and self-attention to pre-training objectives and emergent abilities.
Transformer Architecture Recap
Modern LLMs are built on the decoder-only Transformer architecture (GPT-style). The key components are:
Token Embedding
Convert input tokens to dense vector representations. Each token gets a high-dimensional vector (e.g., 4096 dimensions for a 7B model).
Positional Encoding
Add position information so the model knows where each token appears. Modern models use RoPE (Rotary Position Embeddings) for better length generalization.
Transformer Blocks (N layers)
Each block contains a multi-head self-attention layer followed by a feed-forward network, with layer normalization and residual connections.
Output Head
Project the final hidden state to vocabulary-sized logits, then apply softmax to get a probability distribution over the next token.
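The four stages above can be sketched end to end. This is a minimal NumPy toy, not any real model's code: the sizes, the random weights, and the placeholder block body (a real block has self-attention and a feed-forward network) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_layers = 100, 16, 2          # toy sizes, not real ones

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Token embedding table; also reused as the output head (weight tying)
W_embed = rng.normal(size=(vocab, d_model)) * 0.02

def transformer_block(h):
    # Placeholder: a real block is self-attention + feed-forward,
    # each wrapped in layer norm. The residual connection is real, though.
    W = rng.normal(size=(d_model, d_model)) * 0.02
    return h + np.tanh(h @ W)                  # residual connection

tokens = np.array([1, 5, 20, 3, 7])            # toy token ids
h = W_embed[tokens]                            # (seq_len, d_model) embeddings
for _ in range(n_layers):
    h = transformer_block(h)
logits = h @ W_embed.T                         # project to vocab-sized logits
probs = softmax(logits[-1])                    # distribution over next token
print(probs.shape)                             # (100,)
```

Note the weight tying between the embedding table and the output head; many production models do this, but it is an optional design choice.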
Tokenization
LLMs don't process raw text. They convert text into tokens — subword units that balance vocabulary size with representation efficiency.
Byte-Pair Encoding (BPE)
The most common tokenization method. Starts with individual characters and iteratively merges the most frequent pairs:
# "Understanding" might be tokenized as:
["Under", "stand", "ing"]
# "unhappiness" might become:
["un", "happiness"] # or ["un", "happ", "iness"]
# Common words are single tokens:
"the" -> ["the"]
"hello" -> ["hello"]
# Rare words get split into subwords:
"antidisestablishmentarianism" -> ["anti", "dis", "establish", "ment", "arian", "ism"]
SentencePiece
Used by LLaMA, Mistral, and other models. Treats the input as a raw stream of Unicode characters (whitespace included), making it language-agnostic: it needs no language-specific pre-tokenization step.
Self-Attention Mechanism
Self-attention is the core innovation that makes Transformers powerful. For each token, it computes how much attention to pay to every other token in the sequence (in decoder-only models, a causal mask restricts this to the preceding tokens):
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
Where:
Q = Query matrix (what am I looking for?)
K = Key matrix (what do I contain?)
V = Value matrix (what information do I provide?)
d_k = dimension of the key vectors (scaling by sqrt(d_k) keeps the dot products numerically stable)
Multi-head attention runs multiple attention computations in parallel (e.g., 32 heads), allowing the model to attend to different types of relationships simultaneously — syntax in one head, semantics in another, coreference in a third.
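A single attention head with the causal mask can be written directly from the formula above. This is an illustrative NumPy sketch; the toy sizes and random Q, K, V are assumptions, and real implementations batch this and split it across heads.

```python
import numpy as np

def causal_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V with a causal mask (decoder-style)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (T, T) attention scores
    mask = np.triu(np.ones_like(scores), k=1)      # 1s above the diagonal
    scores = np.where(mask == 1, -np.inf, scores)  # hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # weighted sum of values

rng = np.random.default_rng(0)
T, d = 4, 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out = causal_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Because of the mask, the first token can attend only to itself, so its output is exactly its own value vector.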
Pre-training Objectives
Next Token Prediction (Causal LM)
Used by GPT, LLaMA, Claude, and most modern LLMs. Given a sequence of tokens, predict the next token. Trained on trillions of tokens from the internet, books, code, and more.
Input: "The capital of France is"
Target: "Paris"
Input: "def fibonacci(n):\n if n <= 1:\n return"
Target: " n"
Masked Language Modeling (MLM)
Used by BERT and encoder models. Randomly mask tokens and predict them. Provides bidirectional context but is less natural for generation:
Input: "The [MASK] of France is Paris"
Target: "capital"
Emergence and Scaling Laws
As models scale in parameters, data, and compute, they exhibit emergent abilities — capabilities that appear suddenly at certain scales:
- Few-shot learning: Appears around 10B+ parameters. The model learns tasks from just a few examples in the prompt.
- Chain-of-thought reasoning: Appears around 60B+ parameters. The model can break down complex problems into steps.
- Code generation: Improves dramatically above 30B+ parameters with code-heavy training data.
Scaling laws (Kaplan et al., Chinchilla) show that model performance improves predictably as a power law of compute, data, and parameters. The Chinchilla optimal ratio suggests training tokens should be roughly 20x the parameter count.
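The 20x rule makes the arithmetic easy to check for some familiar model sizes:

```python
def chinchilla_tokens(n_params):
    """Chinchilla-optimal training tokens: roughly 20x the parameter count."""
    return 20 * n_params

for n in (7e9, 70e9):
    print(f"{n / 1e9:.0f}B params -> ~{chinchilla_tokens(n) / 1e12:.2f}T tokens")
# 7B params -> ~0.14T tokens
# 70B params -> ~1.40T tokens
```

In practice many recent models train well past this ratio, trading extra training compute for a smaller, cheaper-to-serve model.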
In-Context Learning
One of the most surprising abilities of LLMs: they can learn new tasks from examples provided in the prompt, without any parameter updates:
# Zero-shot (no examples)
"Classify the sentiment: 'This movie was terrible' -> "
# One-shot (one example)
"Classify the sentiment:
'I loved it' -> Positive
'This movie was terrible' -> "
# Few-shot (multiple examples)
"Classify the sentiment:
'I loved it' -> Positive
'Worst experience ever' -> Negative
'It was okay' -> Neutral
'This movie was terrible' -> "
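Few-shot prompts like the one above are usually assembled programmatically. A small sketch, where the function name and the `task` instruction string are just illustrative choices:

```python
def few_shot_prompt(examples, query, task="Classify the sentiment:"):
    """Assemble a few-shot prompt from (input, label) example pairs."""
    lines = [task]
    for text, label in examples:
        lines.append(f"'{text}' -> {label}")
    lines.append(f"'{query}' -> ")       # the model completes the label here
    return "\n".join(lines)

examples = [("I loved it", "Positive"),
            ("Worst experience ever", "Negative"),
            ("It was okay", "Neutral")]
print(few_shot_prompt(examples, "This movie was terrible"))
```

Keeping the example format identical to the query format matters: the model continues the pattern it sees, so any inconsistency in delimiters or labels degrades accuracy.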
Chain of Thought Reasoning
Prompting the model to "think step by step" dramatically improves performance on reasoning tasks:
# Without CoT:
"Q: If a store has 3 shelves with 8 items each, and 5 items are sold, how many remain?
A: 19"
# With CoT:
"Q: If a store has 3 shelves with 8 items each, and 5 items are sold, how many remain?
A: Let me think step by step.
- 3 shelves with 8 items each = 3 x 8 = 24 items total
- 5 items are sold, so 24 - 5 = 19 items remain
The answer is 19."
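A zero-shot variant of this technique simply appends the step-by-step cue to any question. The wrapper below is an illustrative sketch, not part of any library:

```python
def cot_prompt(question):
    """Append a step-by-step cue: a simple zero-shot CoT prompt wrapper."""
    return f"Q: {question}\nA: Let me think step by step."

q = ("If a store has 3 shelves with 8 items each, "
     "and 5 items are sold, how many remain?")
print(cot_prompt(q))

# Sanity check on the arithmetic from the worked example above:
assert 3 * 8 - 5 == 19
```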