Training LLMs
Understand how Large Language Models are pre-trained and aligned with RLHF, and the massive compute and data requirements involved.
The Pre-training Process
Pre-training is the foundational phase where the model learns language by predicting the next token on massive text corpora. This is the most expensive and time-consuming phase of building an LLM.
Data Collection
Gather trillions of tokens from diverse sources: web pages, books, code, academic papers.
Data Cleaning
Filter low-quality content, deduplicate, remove PII, and balance data mixture.
Tokenization
Train a tokenizer on the corpus, then tokenize all text into token IDs.
Training
Train the model on next-token prediction using distributed computing across thousands of GPUs.
Alignment
Fine-tune the pre-trained model to follow instructions, be helpful, and be safe using RLHF or similar techniques.
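The training step above boils down to the next-token-prediction objective. A toy sketch (shapes are assumptions, and an embedding-plus-linear stand-in replaces a real Transformer):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32
# Stand-in "model": embedding + linear head instead of a Transformer.
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))

tokens = torch.randint(0, vocab_size, (4, 16))   # (batch, seq_len) of token IDs
logits = model(tokens[:, :-1])                   # predict positions 1..15
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),              # (batch * seq, vocab)
    tokens[:, 1:].reshape(-1))                   # targets shifted by one token
```

Pre-training is nothing more than minimizing this loss over trillions of tokens.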
Data Collection and Cleaning
The quality and diversity of training data directly determine model capabilities:
| Data Source | Content | Typical Size |
|---|---|---|
| Common Crawl | Web pages (filtered and cleaned) | ~60% of data mix |
| Books | Literature, textbooks, reference works | ~8% of data mix |
| Code | GitHub, GitLab repositories | ~15% of data mix |
| Academic | Papers, ArXiv, PubMed | ~5% of data mix |
| Wikipedia | Encyclopedia articles in many languages | ~5% of data mix |
| Curated | High-quality, domain-specific data | ~7% of data mix |
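In practice the mixture is enforced by sampling each training document from a source with a fixed weight. A minimal sketch using the (illustrative) percentages from the table:

```python
import random

# Sampling weights mirroring the mixture table above (illustrative).
mixture = {
    "common_crawl": 0.60, "code": 0.15, "books": 0.08,
    "curated": 0.07, "academic": 0.05, "wikipedia": 0.05,
}
sources, weights = zip(*mixture.items())

# Each training document is drawn from a source with its mixture weight.
batch = random.choices(sources, weights=weights, k=1000)
```

Real pipelines also reweight over the course of training (e.g., upsampling high-quality sources late in training), but the sampling principle is the same.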
Compute Requirements
Training a frontier LLM is extraordinarily expensive:
| Model Size | GPUs (A100/H100) | Training Time | Estimated Cost |
|---|---|---|---|
| 7B | 64-256 GPUs | 2-4 weeks | $100K-$500K |
| 70B | 512-2,048 GPUs | 1-3 months | $2M-$10M |
| 405B | 16,000+ GPUs | 2-4 months | $50M-$100M+ |
| Frontier (1T+) | 25,000+ GPUs | 3-6 months | $100M-$500M+ |
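These figures follow from a common rule of thumb: training takes roughly 6 FLOPs per parameter per token. A back-of-the-envelope sketch (the token budget, per-GPU throughput, and utilization are assumptions):

```python
def train_flops(n_params, n_tokens):
    # Rule of thumb: ~6 FLOPs per parameter per token (forward + backward).
    return 6 * n_params * n_tokens

def gpu_days(flops, n_gpus, peak_flops=1e15, mfu=0.4):
    # peak_flops: assumed ~1 PFLOP/s per GPU (H100-class, BF16);
    # mfu: assumed ~40% model FLOPs utilization.
    seconds = flops / (n_gpus * peak_flops * mfu)
    return seconds / 86400

# 70B parameters trained on 2T tokens (assumed token budget):
flops = train_flops(70e9, 2e12)       # 8.4e23 FLOPs
days = gpu_days(flops, n_gpus=1024)   # roughly 3-4 weeks on 1,024 GPUs
```

Plugging in other rows of the table gives estimates of the same order as the quoted training times.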
Training Techniques
Mixed Precision Training
Use FP16 or BF16 instead of FP32 to reduce memory by 50% and increase throughput. BF16 is preferred for its wider dynamic range (avoids overflow issues).
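A minimal PyTorch sketch of mixed precision via autocast (a CPU example for portability; on GPU you would pass `device_type="cuda"`):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)
x = torch.randn(8, 16)

# Matmuls inside the autocast region run in BF16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

# Reductions and the loss are kept in FP32 for numerical stability.
loss = y.float().pow(2).mean()
```

With FP16 (rather than BF16), a gradient scaler is also needed to avoid underflow; BF16's wider exponent range makes that unnecessary.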
Gradient Accumulation
Simulate larger batch sizes by accumulating gradients over multiple forward passes before updating weights. Allows effective batch sizes of millions of tokens without proportional memory increase.
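The pattern is a single optimizer step after several backward passes. A minimal sketch with assumed toy shapes:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4   # 4 micro-batches per weight update

opt.zero_grad()
for _ in range(accum_steps):
    x, y = torch.randn(8, 16), torch.randn(8, 1)
    # Scale each micro-batch loss so accumulated grads equal the full-batch average.
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()           # gradients sum into .grad across micro-batches
opt.step()                    # one update for the whole effective batch
```

Here the effective batch is 32 examples while only 8 are ever in memory at once.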
Parallelism Strategies
Data Parallelism
Replicate the model on each GPU, split data across GPUs, average gradients. Simple but limited by memory per GPU.
Tensor Parallelism
Split individual layers across GPUs. Each GPU holds a slice of each weight matrix. Requires a high-bandwidth interconnect (e.g., NVLink).
Pipeline Parallelism
Assign different layers to different GPUs. Data flows through the pipeline. Introduces bubble overhead that micro-batching helps mitigate.
FSDP / ZeRO
Shard optimizer states, gradients, and parameters across GPUs; full parameters are reconstructed on demand during the forward/backward pass.
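The memory win from ZeRO-3/FSDP-style sharding is easy to quantify. A toy arithmetic sketch (byte counts are assumptions; this is not the `torch.distributed.fsdp` API):

```python
# Toy memory arithmetic for ZeRO-3 / FSDP-style sharding.
def persistent_bytes_per_gpu(n_params, n_gpus,
                             param_bytes=2,    # BF16 weights
                             grad_bytes=2,     # BF16 gradients
                             optim_bytes=12):  # Adam: FP32 master copy + 2 moments
    total = n_params * (param_bytes + grad_bytes + optim_bytes)
    return total / n_gpus   # ZeRO-3 shards all three state types evenly

# 70B parameters on 512 GPUs: ~2.2 GB of persistent state per GPU,
# versus ~1.1 TB if one GPU had to hold everything.
per_gpu_gb = persistent_bytes_per_gpu(70e9, 512) / 1e9
```

Activations and communication buffers add more on top, but the sharded persistent state is what makes 70B-scale training fit on individual GPUs at all.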
RLHF (Reinforcement Learning from Human Feedback)
After pre-training, RLHF aligns the model with human preferences:
Supervised Fine-Tuning (SFT)
Fine-tune the pre-trained model on high-quality instruction-following examples written by humans.
Reward Model Training
Train a separate model to predict which of two responses a human would prefer. Uses thousands of human comparison judgments.
PPO Optimization
Use Proximal Policy Optimization to update the LLM to maximize the reward model's score while staying close to the SFT model (KL divergence penalty).
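The reward model in step 2 is commonly trained with a Bradley-Terry pairwise loss. A minimal sketch with hypothetical scalar scores:

```python
import torch
import torch.nn.functional as F

# Hypothetical reward-model scores for two (chosen, rejected) response pairs.
r_chosen = torch.tensor([2.0, 0.5])
r_rejected = torch.tensor([1.0, 1.5])

# Bradley-Terry pairwise loss: push the chosen response's score above the rejected one's.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()

# In the PPO stage, the reward is then penalized for drifting from the SFT
# policy: reward = r_RM - beta * KL(policy || SFT).
```

Note the second pair is mis-ranked (0.5 < 1.5), so it contributes a much larger loss than the first.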
Constitutional AI (CAI)
Developed by Anthropic as an alternative to pure RLHF. Instead of relying entirely on human feedback, the model critiques and revises its own outputs based on a set of principles (a "constitution"):
- The model generates a response to a prompt.
- It then critiques its own response against constitutional principles (e.g., "Is this response harmful?").
- It revises the response based on the critique.
- The revised responses are used to train a preference model (RLAIF — RL from AI Feedback).
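The loop above can be sketched abstractly; `generate` here is a stand-in for any LLM completion call (hypothetical, not a real API):

```python
def cai_round(generate, prompt, principles):
    """One critique-and-revise round of Constitutional AI (illustrative)."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique the response against this principle: {principle}\n\n{response}")
        response = generate(
            f"Revise the response to address the critique:\n\n{critique}\n\n{response}")
    return response  # revised responses feed the RLAIF preference model
```

The key design choice is that human effort goes into writing the principles once, rather than into labeling every comparison.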
DPO (Direct Preference Optimization)
A simpler alternative to RLHF that skips the reward model entirely. DPO directly optimizes the language model using preference pairs:
```python
# DPO training data format:
{
  "prompt": "Explain quantum computing simply",
  "chosen": "Quantum computing uses quantum bits (qubits) that can be 0 and 1 simultaneously...",
  "rejected": "Quantum computing is a complex field involving Hilbert spaces and unitary transformations..."
}

# DPO directly adjusts model probabilities:
# - Increase probability of "chosen" responses
# - Decrease probability of "rejected" responses
# - No separate reward model needed!
```
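The DPO objective itself is compact. A minimal sketch (the function name and β = 0.1 are assumptions; the per-sequence log-probabilities would come from the policy and a frozen reference model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,   # sequence log-probs under the policy
             ref_chosen, ref_rejected,     # same sequences under the frozen reference
             beta=0.1):
    # Implicit reward margin of the policy, measured relative to the reference.
    logits = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # Maximize the probability that "chosen" beats "rejected".
    return -F.logsigmoid(logits).mean()

# Equal margins => loss is log(2), the no-preference baseline.
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-6.0]),
                torch.tensor([-5.0]), torch.tensor([-6.0]))
```

The reference model plays the same role as the KL penalty in PPO: it keeps the fine-tuned policy from drifting too far from the SFT starting point.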