How LLMs Work
Dive into the internals of Large Language Models — from tokenization and self-attention to pre-training objectives and emergent abilities.
Transformer Architecture Recap
Modern LLMs are built on the decoder-only Transformer architecture (GPT-style). The key components are:
Token Embedding
Convert input tokens to dense vector representations. Each token gets a high-dimensional vector (e.g., 4096 dimensions for a 7B model).
Positional Encoding
Add position information so the model knows where each token appears. Modern models use RoPE (Rotary Position Embeddings) for better length generalization.
Transformer Blocks (N layers)
Each block contains a multi-head self-attention layer followed by a feed-forward network, with layer normalization and residual connections.
Output Head
Project the final hidden state to vocabulary-sized logits, then apply softmax to get a probability distribution over the next token.
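The four stages above can be sketched end to end. This is a minimal NumPy toy, not any real model's code: the sizes, the random weights, and the placeholder block body (a real block has self-attention and a feed-forward network) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_layers = 100, 16, 2          # toy sizes, not real ones

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Token embedding table; also reused as the output head (weight tying)
W_embed = rng.normal(size=(vocab, d_model)) * 0.02

def transformer_block(h):
    # Placeholder: a real block is self-attention + feed-forward,
    # each wrapped in layer norm. The residual connection is real, though.
    W = rng.normal(size=(d_model, d_model)) * 0.02
    return h + np.tanh(h @ W)                  # residual connection

tokens = np.array([1, 5, 20, 3, 7])            # toy token ids
h = W_embed[tokens]                            # (seq_len, d_model) embeddings
for _ in range(n_layers):
    h = transformer_block(h)
logits = h @ W_embed.T                         # project to vocab-sized logits
probs = softmax(logits[-1])                    # distribution over next token
print(probs.shape)                             # (100,)
```

Note the weight tying between the embedding table and the output head; many production models do this, but it is an optional design choice.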
Tokenization
LLMs don't process raw text. They convert text into tokens — subword units that balance vocabulary size with representation efficiency.
Byte-Pair Encoding (BPE)
The most common tokenization method. Starts with individual characters and iteratively merges the most frequent pairs:
# "Understanding" might be tokenized as:
["Under", "stand", "ing"]
# "unhappiness" might become:
["un", "happiness"] # or ["un", "happ", "iness"]
# Common words are single tokens:
"the" -> ["the"]
"hello" -> ["hello"]
# Rare words get split into subwords:
"antidisestablishmentarianism" -> ["anti", "dis", "establish", "ment", "arian", "ism"]
SentencePiece
Used by LLaMA, Mistral, and other models. Treats the input as a raw stream of Unicode characters (whitespace included), making it language-agnostic: it needs no language-specific pre-tokenization step.
Self-Attention Mechanism
Self-attention is the core innovation that makes Transformers powerful. For each token, it computes how much attention to pay to every other token in the sequence (in decoder-only models, a causal mask restricts this to the preceding tokens):
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
Where:
Q = Query matrix (what am I looking for?)
K = Key matrix (what do I contain?)
V = Value matrix (what information do I provide?)
d_k = dimension of the key vectors (scaling by sqrt(d_k) keeps the dot products numerically stable)
Multi-head attention runs multiple attention computations in parallel (e.g., 32 heads), allowing the model to attend to different types of relationships simultaneously — syntax in one head, semantics in another, coreference in a third.
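A single attention head with the causal mask can be written directly from the formula above. This is an illustrative NumPy sketch; the toy sizes and random Q, K, V are assumptions, and real implementations batch this and split it across heads.

```python
import numpy as np

def causal_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V with a causal mask (decoder-style)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (T, T) attention scores
    mask = np.triu(np.ones_like(scores), k=1)      # 1s above the diagonal
    scores = np.where(mask == 1, -np.inf, scores)  # hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # weighted sum of values

rng = np.random.default_rng(0)
T, d = 4, 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out = causal_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Because of the mask, the first token can attend only to itself, so its output is exactly its own value vector.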
Pre-training Objectives
Next Token Prediction (Causal LM)
Used by GPT, LLaMA, Claude, and most modern LLMs. Given a sequence of tokens, predict the next token. Trained on trillions of tokens from the internet, books, code, and more.
Input: "The capital of France is"
Target: "Paris"
Input: "def fibonacci(n):\n if n <= 1:\n return"
Target: " n"
Masked Language Modeling (MLM)
Used by BERT and encoder models. Randomly mask tokens and predict them. Provides bidirectional context but is less natural for generation:
Input: "The [MASK] of France is Paris"
Target: "capital"
Emergence and Scaling Laws
As models scale in parameters, data, and compute, they exhibit emergent abilities — capabilities that appear suddenly at certain scales:
- Few-shot learning: Appears around 10B+ parameters. The model learns tasks from just a few examples in the prompt.
- Chain-of-thought reasoning: Appears around 60B+ parameters. The model can break down complex problems into steps.
- Code generation: Improves dramatically above 30B+ parameters with code-heavy training data.
Scaling laws (Kaplan et al., Chinchilla) show that model performance improves predictably as a power law of compute, data, and parameters. The Chinchilla optimal ratio suggests training tokens should be roughly 20x the parameter count.
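The 20x rule makes the arithmetic easy to check for some familiar model sizes:

```python
def chinchilla_tokens(n_params):
    """Chinchilla-optimal training tokens: roughly 20x the parameter count."""
    return 20 * n_params

for n in (7e9, 70e9):
    print(f"{n / 1e9:.0f}B params -> ~{chinchilla_tokens(n) / 1e12:.2f}T tokens")
# 7B params -> ~0.14T tokens
# 70B params -> ~1.40T tokens
```

In practice many recent models train well past this ratio, trading extra training compute for a smaller, cheaper-to-serve model.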
In-Context Learning
One of the most surprising abilities of LLMs: they can learn new tasks from examples provided in the prompt, without any parameter updates:
# Zero-shot (no examples)
"Classify the sentiment: 'This movie was terrible' -> "
# One-shot (one example)
"Classify the sentiment:
'I loved it' -> Positive
'This movie was terrible' -> "
# Few-shot (multiple examples)
"Classify the sentiment:
'I loved it' -> Positive
'Worst experience ever' -> Negative
'It was okay' -> Neutral
'This movie was terrible' -> "
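Few-shot prompts like the one above are usually assembled programmatically. A small sketch, where the function name and the `task` instruction string are just illustrative choices:

```python
def few_shot_prompt(examples, query, task="Classify the sentiment:"):
    """Assemble a few-shot prompt from (input, label) example pairs."""
    lines = [task]
    for text, label in examples:
        lines.append(f"'{text}' -> {label}")
    lines.append(f"'{query}' -> ")       # the model completes the label here
    return "\n".join(lines)

examples = [("I loved it", "Positive"),
            ("Worst experience ever", "Negative"),
            ("It was okay", "Neutral")]
print(few_shot_prompt(examples, "This movie was terrible"))
```

Keeping the example format identical to the query format matters: the model continues the pattern it sees, so any inconsistency in delimiters or labels degrades accuracy.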
Chain of Thought Reasoning
Prompting the model to "think step by step" dramatically improves performance on reasoning tasks:
# Without CoT:
"Q: If a store has 3 shelves with 8 items each, and 5 items are sold, how many remain?
A: 19"
# With CoT:
"Q: If a store has 3 shelves with 8 items each, and 5 items are sold, how many remain?
A: Let me think step by step.
- 3 shelves with 8 items each = 3 x 8 = 24 items total
- 5 items are sold, so 24 - 5 = 19 items remain
The answer is 19."
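A zero-shot variant of this technique simply appends the step-by-step cue to any question. The wrapper below is an illustrative sketch, not part of any library:

```python
def cot_prompt(question):
    """Append a step-by-step cue: a simple zero-shot CoT prompt wrapper."""
    return f"Q: {question}\nA: Let me think step by step."

q = ("If a store has 3 shelves with 8 items each, "
     "and 5 items are sold, how many remain?")
print(cot_prompt(q))

# Sanity check on the arithmetic from the worked example above:
assert 3 * 8 - 5 == 19
```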