Training LLMs

Understand how Large Language Models are pre-trained, aligned with RLHF, and the massive compute and data requirements involved.

The Pre-training Process

Pre-training is the foundational phase where the model learns language by predicting the next token on massive text corpora. This is the most expensive and time-consuming phase of building an LLM.

  1. Data Collection

    Gather trillions of tokens from diverse sources: web pages, books, code, academic papers.

  2. Data Cleaning

    Filter low-quality content, deduplicate, remove PII, and balance data mixture.

  3. Tokenization

    Train a tokenizer on the corpus, then tokenize all text into token IDs.

  4. Training

    Train the model on next-token prediction using distributed computing across thousands of GPUs.

  5. Alignment

    Fine-tune the pre-trained model to follow instructions, be helpful, and be safe using RLHF or similar techniques.
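The next-token objective in step 4 can be sketched in miniature: slide a window over the token stream so that every context predicts the token that follows it. This is an illustrative sketch; real training batches these pairs into tensors and distributes them across thousands of GPUs.

```python
# Minimal sketch of how next-token prediction examples are formed
# from a tokenized corpus (illustrative, not a production data loader).

def next_token_pairs(token_ids, context_len=4):
    """Slide a window over the corpus: each context predicts the next token."""
    pairs = []
    for i in range(len(token_ids) - context_len):
        context = token_ids[i : i + context_len]
        target = token_ids[i + context_len]
        pairs.append((context, target))
    return pairs

corpus = [5, 9, 2, 7, 3, 8, 1]
for context, target in next_token_pairs(corpus):
    print(context, "->", target)
# [5, 9, 2, 7] -> 3
# [9, 2, 7, 3] -> 8
# [2, 7, 3, 8] -> 1
```

The model is trained to maximize the probability of each target given its context; summed over trillions of tokens, this single objective is what pre-training optimizes.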

Data Collection and Cleaning

The quality and diversity of training data directly determine model capabilities:

| Data Source  | Content                                 | Typical Size     |
| ------------ | --------------------------------------- | ---------------- |
| Common Crawl | Web pages (filtered and cleaned)        | ~60% of data mix |
| Books        | Literature, textbooks, reference works  | ~8% of data mix  |
| Code         | GitHub, GitLab repositories             | ~15% of data mix |
| Academic     | Papers from arXiv, PubMed               | ~5% of data mix  |
| Wikipedia    | Encyclopedia articles in many languages | ~5% of data mix  |
| Curated      | High-quality, domain-specific data      | ~7% of data mix  |
💡
Data quality matters more than quantity: Microsoft's Phi series showed that carefully curated, high-quality data (textbook-style text, filtered web content) can produce surprisingly capable small models. Llama 3's performance leap was largely attributed to improved data quality and a 15T-token training set.
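The deduplication step in the cleaning stage can be sketched with simple content hashing. This is illustrative only: production pipelines also use fuzzy methods such as MinHash to catch near-duplicates that exact hashing misses.

```python
import hashlib

# Exact-match document deduplication via content hashing (a sketch;
# normalization here is just strip + lowercase for illustration).

def dedupe(docs):
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(dedupe(["Hello world", "hello world ", "Different doc"]))
# ['Hello world', 'Different doc']
```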

Compute Requirements

Training a frontier LLM is extraordinarily expensive:

| Model Size     | GPUs (A100/H100) | Training Time | Estimated Cost |
| -------------- | ---------------- | ------------- | -------------- |
| 7B             | 64-256           | 2-4 weeks     | $100K-$500K    |
| 70B            | 512-2,048        | 1-3 months    | $2M-$10M       |
| 405B           | 16,000+          | 2-4 months    | $50M-$100M+    |
| Frontier (1T+) | 25,000+          | 3-6 months    | $100M-$500M+   |
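These figures can be sanity-checked with the common rule of thumb that training costs roughly 6 FLOPs per parameter per token. The per-GPU throughput and utilization values below are assumed round numbers for illustration, not measurements.

```python
# Back-of-envelope training cost using the ~6 * N * D FLOPs rule of thumb
# (N = parameter count, D = training tokens).

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def gpu_hours(flops, peak_flops_per_gpu=1e15, utilization=0.4):
    # Assumed values: ~1e15 FLOP/s is the order of an H100's BF16 peak;
    # 40% model FLOPs utilization is an optimistic but plausible figure.
    effective = peak_flops_per_gpu * utilization
    return flops / effective / 3600

# Example: a 70B-parameter model trained on 15T tokens.
flops = training_flops(70e9, 15e12)   # ~6.3e24 FLOPs
hours = gpu_hours(flops)              # ~4.4 million GPU-hours
print(f"{flops:.1e} FLOPs, {hours:,.0f} GPU-hours")
```

At a few dollars per GPU-hour, a few million GPU-hours lands in the same multi-million-dollar range as the table above.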

Training Techniques

Mixed Precision Training

Use FP16 or BF16 instead of FP32 to reduce memory by 50% and increase throughput. BF16 is preferred for its wider dynamic range (avoids overflow issues).
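The memory claim is simple arithmetic over bytes per parameter, and the overflow point is concrete: FP16's maximum representable value is 65504, while BF16 keeps FP32's ~3.4e38 range at the cost of precision.

```python
# Why half-precision halves weight memory: bytes per parameter by dtype.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

def weight_memory_gb(n_params, dtype):
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

print(weight_memory_gb(7e9, "fp32"))  # 28.0 GB for a 7B model's weights
print(weight_memory_gb(7e9, "bf16"))  # 14.0 GB, half of FP32

# FP16 overflows past this value; BF16 does not until ~3.4e38.
FP16_MAX = 65504.0
```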

Gradient Accumulation

Simulate larger batch sizes by accumulating gradients over multiple forward passes before updating weights. Allows effective batch sizes of millions of tokens without proportional memory increase.
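Why accumulation works: for a loss that averages over examples, averaging the gradients of equal-sized micro-batches reproduces the gradient of the full batch exactly. A toy scalar example, with a hypothetical one-parameter squared-error model:

```python
# Toy illustration: accumulating scaled micro-batch gradients equals
# the full-batch gradient (for a mean-reduced loss, equal micro-batches).

def grad(w, batch):
    # Gradient of mean((w*x - y)^2) with respect to w.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
full_batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
g_full = grad(w, full_batch)

# Split into micro-batches; scale each gradient by 1/accum_steps,
# exactly as a training loop scales each micro-batch loss.
accum_steps = 2
micro_batches = [full_batch[:2], full_batch[2:]]
g_accum = sum(grad(w, mb) / accum_steps for mb in micro_batches)

assert abs(g_full - g_accum) < 1e-12
```

In a real training loop the only memory cost is one extra gradient buffer, which is why effective batch sizes of millions of tokens are reachable on fixed hardware.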

Parallelism Strategies

Data Parallelism

Replicate the model on each GPU, split data across GPUs, average gradients. Simple but limited by memory per GPU.

Tensor Parallelism

Split individual layers across GPUs. Each GPU holds a slice of each weight matrix. Requires high-bandwidth interconnect (NVLink).
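A toy sketch of the idea, with plain Python lists standing in for GPU-resident tensors: each "GPU" holds a slice of the weight matrix's output rows, computes its partial result, and the slices are concatenated (the role an all-gather plays in a real implementation).

```python
# Toy tensor-parallel matrix-vector product across two simulated "GPUs".

def matvec(W, x):
    # W: list of rows (one per output dimension).
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 outputs, 2 inputs
x = [1.0, -1.0]

full = matvec(W, x)

# Shard the output dimension: each "GPU" holds half the rows.
shard0, shard1 = W[:2], W[2:]
parallel = matvec(shard0, x) + matvec(shard1, x)  # concat = all-gather

assert parallel == full
```

Because each partial result must be exchanged every layer, this is the strategy that demands NVLink-class bandwidth between the participating GPUs.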

Pipeline Parallelism

Assign different layers to different GPUs. Data flows through the pipeline. Introduces bubble overhead that micro-batching helps mitigate.

FSDP / ZeRO

Shard optimizer states, gradients, and parameters across GPUs. Reconstructs full parameters on-demand during forward/backward pass.
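The savings can be estimated with the standard accounting of roughly 16 bytes of model state per parameter under mixed-precision Adam (2-byte weights, 2-byte gradients, 12 bytes of FP32 optimizer state), sharding one more component at each ZeRO stage. A sketch under those assumptions:

```python
# Per-GPU memory (GB) for model states under ZeRO-style sharding.
# Assumes mixed-precision Adam: 2-byte params + 2-byte grads +
# 12 bytes of FP32 optimizer state per parameter (~16 bytes total).

def per_gpu_state_gb(n_params, n_gpus, stage):
    param_b, grad_b, opt_b = 2.0, 2.0, 12.0
    if stage >= 1:            # ZeRO-1: shard optimizer states
        opt_b /= n_gpus
    if stage >= 2:            # ZeRO-2: also shard gradients
        grad_b /= n_gpus
    if stage >= 3:            # ZeRO-3 / FSDP: also shard parameters
        param_b /= n_gpus
    return n_params * (param_b + grad_b + opt_b) / 1e9

print(per_gpu_state_gb(7e9, 8, 0))  # 112.0 GB: exceeds any single GPU
print(per_gpu_state_gb(7e9, 8, 3))  # 14.0 GB per GPU across 8 GPUs
```

Activations and communication buffers come on top of this, but the table explains why even a 7B model is trained sharded rather than replicated.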

RLHF (Reinforcement Learning from Human Feedback)

After pre-training, RLHF aligns the model with human preferences:

  1. Supervised Fine-Tuning (SFT)

    Fine-tune the pre-trained model on high-quality instruction-following examples written by humans.

  2. Reward Model Training

    Train a separate model to predict which of two responses a human would prefer. Uses thousands of human comparison judgments.

  3. PPO Optimization

    Use Proximal Policy Optimization to update the LLM to maximize the reward model's score while staying close to the SFT model (KL divergence penalty).
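Step 2's reward model is typically trained with a Bradley-Terry pairwise loss: minimize the negative log-probability that the human-preferred response scores higher. A minimal sketch of that objective:

```python
import math

# Reward-model training objective: -log sigmoid(r_chosen - r_rejected),
# i.e. push the preferred response's score above the rejected one's.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_loss(r_chosen, r_rejected):
    return -math.log(sigmoid(r_chosen - r_rejected))

# The loss shrinks as the margin between scores grows:
print(reward_loss(1.0, 0.0))  # ~0.313
print(reward_loss(3.0, 0.0))  # ~0.049
```

During PPO, this learned scalar score is what the policy maximizes, with the KL penalty against the SFT model keeping it from drifting into degenerate reward-hacking outputs.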

Why RLHF matters: Pre-training produces a model that can complete text, but it doesn't naturally follow instructions or refuse harmful requests. RLHF transforms a text completion engine into a helpful, harmless assistant.

Constitutional AI (CAI)

Developed by Anthropic as an alternative to pure RLHF. Instead of relying entirely on human feedback, the model critiques and revises its own outputs based on a set of principles (a "constitution"):

  • The model generates a response to a prompt.
  • It then critiques its own response against constitutional principles (e.g., "Is this response harmful?").
  • It revises the response based on the critique.
  • The revised responses are used to train a preference model (RLAIF — RL from AI Feedback).
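The loop above can be sketched as follows, with `llm` standing in for a hypothetical text-generation call; the function name and prompt wording are illustrative, not Anthropic's actual implementation.

```python
# Hedged sketch of the CAI critique-and-revise loop. `llm` is any
# callable mapping a prompt string to a response string.

def critique_and_revise(llm, prompt, principles):
    """One CAI pass: generate, then critique and revise per principle."""
    response = llm(prompt)
    for principle in principles:
        critique = llm(
            f"Critique the response below against this principle: {principle}\n\n{response}"
        )
        response = llm(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response  # revised responses become RLAIF preference data
```

Because the critiques come from the model itself, the expensive human-labeling step is replaced by a fixed, auditable list of principles.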

DPO (Direct Preference Optimization)

A simpler alternative to RLHF that skips the reward model entirely. DPO directly optimizes the language model using preference pairs:

Concept — DPO training
# DPO training data format:
{
  "prompt": "Explain quantum computing simply",
  "chosen": "Quantum computing uses quantum bits (qubits) that can be 0 and 1 simultaneously...",
  "rejected": "Quantum computing is a complex field involving Hilbert spaces and unitary transformations..."
}

# DPO directly adjusts model probabilities:
# - Increase probability of "chosen" responses
# - Decrease probability of "rejected" responses
# - No separate reward model needed!
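The DPO objective itself takes sequence log-probabilities under the trained policy and the frozen reference (SFT) model, and pushes up the margin between chosen and rejected, with beta controlling how far the policy may drift from the reference. A minimal sketch:

```python
import math

# DPO loss for one preference pair: -log sigmoid(beta * margin), where the
# margin compares policy-vs-reference log-ratios for chosen and rejected.

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    chosen_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already favors the chosen response relative to the reference:
print(dpo_loss(-10.0, -20.0, -12.0, -18.0))  # ~0.513
# Policy identical to the reference: loss is log(2):
print(dpo_loss(-12.0, -18.0, -12.0, -18.0))  # ~0.693
```

The reference log-probs act as the implicit KL anchor, which is exactly what lets DPO drop both the reward model and the PPO loop.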
💡
RLHF vs DPO: RLHF is more complex but potentially more powerful for fine-grained alignment. DPO is simpler to implement and increasingly popular. Many recent open models (Zephyr, Neural-Chat) use DPO successfully.