Training LLMs

Understand how Large Language Models are pre-trained, aligned with RLHF, and the massive compute and data requirements involved.

The Pre-training Process

Pre-training is the foundational phase where the model learns language by predicting the next token on massive text corpora. This is the most expensive and time-consuming phase of building an LLM.

  1. Data Collection

    Gather trillions of tokens from diverse sources: web pages, books, code, academic papers.

  2. Data Cleaning

    Filter low-quality content, deduplicate, remove PII, and balance data mixture.

  3. Tokenization

    Train a tokenizer on the corpus, then tokenize all text into token IDs.

  4. Training

    Train the model on next-token prediction using distributed computing across thousands of GPUs.

  5. Alignment

    Fine-tune the pre-trained model to follow instructions, be helpful, and be safe using RLHF or similar techniques.
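The next-token objective in step 4 can be sketched in miniature: slide a window over the token stream so that every context predicts the token that follows it. This is an illustrative sketch; real training batches these pairs into tensors and distributes them across thousands of GPUs.

```python
# Minimal sketch of how next-token prediction examples are formed
# from a tokenized corpus (illustrative, not a production data loader).

def next_token_pairs(token_ids, context_len=4):
    """Slide a window over the corpus: each context predicts the next token."""
    pairs = []
    for i in range(len(token_ids) - context_len):
        context = token_ids[i : i + context_len]
        target = token_ids[i + context_len]
        pairs.append((context, target))
    return pairs

corpus = [5, 9, 2, 7, 3, 8, 1]
for context, target in next_token_pairs(corpus):
    print(context, "->", target)
# [5, 9, 2, 7] -> 3
# [9, 2, 7, 3] -> 8
# [2, 7, 3, 8] -> 1
```

The model is trained to maximize the probability of each target given its context; summed over trillions of tokens, this single objective is what pre-training optimizes.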

Data Collection and Cleaning

The quality and diversity of training data directly determine model capabilities:

| Data Source  | Content                                 | Typical Size     |
| ------------ | --------------------------------------- | ---------------- |
| Common Crawl | Web pages (filtered and cleaned)        | ~60% of data mix |
| Books        | Literature, textbooks, reference works  | ~8% of data mix  |
| Code         | GitHub, GitLab repositories             | ~15% of data mix |
| Academic     | Papers from arXiv, PubMed               | ~5% of data mix  |
| Wikipedia    | Encyclopedia articles in many languages | ~5% of data mix  |
| Curated      | High-quality, domain-specific data      | ~7% of data mix  |
💡
Data quality matters more than quantity: Microsoft's Phi series showed that carefully curated, high-quality data (textbook-style text, filtered web content) can produce surprisingly capable small models. Llama 3's performance leap was largely attributed to improved data quality and a 15T-token training set.
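The deduplication step in the cleaning stage can be sketched with simple content hashing. This is illustrative only: production pipelines also use fuzzy methods such as MinHash to catch near-duplicates that exact hashing misses.

```python
import hashlib

# Exact-match document deduplication via content hashing (a sketch;
# normalization here is just strip + lowercase for illustration).

def dedupe(docs):
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(dedupe(["Hello world", "hello world ", "Different doc"]))
# ['Hello world', 'Different doc']
```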

Compute Requirements

Training a frontier LLM is extraordinarily expensive:

| Model Size     | GPUs (A100/H100) | Training Time | Estimated Cost |
| -------------- | ---------------- | ------------- | -------------- |
| 7B             | 64-256           | 2-4 weeks     | $100K-$500K    |
| 70B            | 512-2,048        | 1-3 months    | $2M-$10M       |
| 405B           | 16,000+          | 2-4 months    | $50M-$100M+    |
| Frontier (1T+) | 25,000+          | 3-6 months    | $100M-$500M+   |
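These figures can be sanity-checked with the common rule of thumb that training costs roughly 6 FLOPs per parameter per token. The per-GPU throughput and utilization values below are assumed round numbers for illustration, not measurements.

```python
# Back-of-envelope training cost using the ~6 * N * D FLOPs rule of thumb
# (N = parameter count, D = training tokens).

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def gpu_hours(flops, peak_flops_per_gpu=1e15, utilization=0.4):
    # Assumed values: ~1e15 FLOP/s is the order of an H100's BF16 peak;
    # 40% model FLOPs utilization is an optimistic but plausible figure.
    effective = peak_flops_per_gpu * utilization
    return flops / effective / 3600

# Example: a 70B-parameter model trained on 15T tokens.
flops = training_flops(70e9, 15e12)   # ~6.3e24 FLOPs
hours = gpu_hours(flops)              # ~4.4 million GPU-hours
print(f"{flops:.1e} FLOPs, {hours:,.0f} GPU-hours")
```

At a few dollars per GPU-hour, a few million GPU-hours lands in the same multi-million-dollar range as the table above.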

Training Techniques

Mixed Precision Training

Use FP16 or BF16 instead of FP32 to reduce memory by 50% and increase throughput. BF16 is preferred for its wider dynamic range (avoids overflow issues).
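The memory claim is simple arithmetic over bytes per parameter, and the overflow point is concrete: FP16's maximum representable value is 65504, while BF16 keeps FP32's ~3.4e38 range at the cost of precision.

```python
# Why half-precision halves weight memory: bytes per parameter by dtype.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

def weight_memory_gb(n_params, dtype):
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

print(weight_memory_gb(7e9, "fp32"))  # 28.0 GB for a 7B model's weights
print(weight_memory_gb(7e9, "bf16"))  # 14.0 GB, half of FP32

# FP16 overflows past this value; BF16 does not until ~3.4e38.
FP16_MAX = 65504.0
```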

Gradient Accumulation

Simulate larger batch sizes by accumulating gradients over multiple forward passes before updating weights. Allows effective batch sizes of millions of tokens without proportional memory increase.
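Why accumulation works: for a loss that averages over examples, averaging the gradients of equal-sized micro-batches reproduces the gradient of the full batch exactly. A toy scalar example, with a hypothetical one-parameter squared-error model:

```python
# Toy illustration: accumulating scaled micro-batch gradients equals
# the full-batch gradient (for a mean-reduced loss, equal micro-batches).

def grad(w, batch):
    # Gradient of mean((w*x - y)^2) with respect to w.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
full_batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
g_full = grad(w, full_batch)

# Split into micro-batches; scale each gradient by 1/accum_steps,
# exactly as a training loop scales each micro-batch loss.
accum_steps = 2
micro_batches = [full_batch[:2], full_batch[2:]]
g_accum = sum(grad(w, mb) / accum_steps for mb in micro_batches)

assert abs(g_full - g_accum) < 1e-12
```

In a real training loop the only memory cost is one extra gradient buffer, which is why effective batch sizes of millions of tokens are reachable on fixed hardware.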

Parallelism Strategies

Data Parallelism

Replicate the model on each GPU, split data across GPUs, average gradients. Simple but limited by memory per GPU.

Tensor Parallelism

Split individual layers across GPUs. Each GPU holds a slice of each weight matrix. Requires high-bandwidth interconnect (NVLink).
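A toy sketch of the idea, with plain Python lists standing in for GPU-resident tensors: each "GPU" holds a slice of the weight matrix's output rows, computes its partial result, and the slices are concatenated (the role an all-gather plays in a real implementation).

```python
# Toy tensor-parallel matrix-vector product across two simulated "GPUs".

def matvec(W, x):
    # W: list of rows (one per output dimension).
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 outputs, 2 inputs
x = [1.0, -1.0]

full = matvec(W, x)

# Shard the output dimension: each "GPU" holds half the rows.
shard0, shard1 = W[:2], W[2:]
parallel = matvec(shard0, x) + matvec(shard1, x)  # concat = all-gather

assert parallel == full
```

Because each partial result must be exchanged every layer, this is the strategy that demands NVLink-class bandwidth between the participating GPUs.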

Pipeline Parallelism

Assign different layers to different GPUs. Data flows through the pipeline. Introduces bubble overhead that micro-batching helps mitigate.

FSDP / ZeRO

Shard optimizer states, gradients, and parameters across GPUs. Reconstructs full parameters on-demand during forward/backward pass.
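The savings can be estimated with the standard accounting of roughly 16 bytes of model state per parameter under mixed-precision Adam (2-byte weights, 2-byte gradients, 12 bytes of FP32 optimizer state), sharding one more component at each ZeRO stage. A sketch under those assumptions:

```python
# Per-GPU memory (GB) for model states under ZeRO-style sharding.
# Assumes mixed-precision Adam: 2-byte params + 2-byte grads +
# 12 bytes of FP32 optimizer state per parameter (~16 bytes total).

def per_gpu_state_gb(n_params, n_gpus, stage):
    param_b, grad_b, opt_b = 2.0, 2.0, 12.0
    if stage >= 1:            # ZeRO-1: shard optimizer states
        opt_b /= n_gpus
    if stage >= 2:            # ZeRO-2: also shard gradients
        grad_b /= n_gpus
    if stage >= 3:            # ZeRO-3 / FSDP: also shard parameters
        param_b /= n_gpus
    return n_params * (param_b + grad_b + opt_b) / 1e9

print(per_gpu_state_gb(7e9, 8, 0))  # 112.0 GB: exceeds any single GPU
print(per_gpu_state_gb(7e9, 8, 3))  # 14.0 GB per GPU across 8 GPUs
```

Activations and communication buffers come on top of this, but the table explains why even a 7B model is trained sharded rather than replicated.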

RLHF (Reinforcement Learning from Human Feedback)

After pre-training, RLHF aligns the model with human preferences:

  1. Supervised Fine-Tuning (SFT)

    Fine-tune the pre-trained model on high-quality instruction-following examples written by humans.

  2. Reward Model Training

    Train a separate model to predict which of two responses a human would prefer. Uses thousands of human comparison judgments.

  3. PPO Optimization

    Use Proximal Policy Optimization to update the LLM to maximize the reward model's score while staying close to the SFT model (KL divergence penalty).
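Step 2's reward model is typically trained with a Bradley-Terry pairwise loss: minimize the negative log-probability that the human-preferred response scores higher. A minimal sketch of that objective:

```python
import math

# Reward-model training objective: -log sigmoid(r_chosen - r_rejected),
# i.e. push the preferred response's score above the rejected one's.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_loss(r_chosen, r_rejected):
    return -math.log(sigmoid(r_chosen - r_rejected))

# The loss shrinks as the margin between scores grows:
print(reward_loss(1.0, 0.0))  # ~0.313
print(reward_loss(3.0, 0.0))  # ~0.049
```

During PPO, this learned scalar score is what the policy maximizes, with the KL penalty against the SFT model keeping it from drifting into degenerate reward-hacking outputs.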

Why RLHF matters: Pre-training produces a model that can complete text, but it doesn't naturally follow instructions or refuse harmful requests. RLHF transforms a text completion engine into a helpful, harmless assistant.

Constitutional AI (CAI)

Developed by Anthropic as an alternative to pure RLHF. Instead of relying entirely on human feedback, the model critiques and revises its own outputs based on a set of principles (a "constitution"):

  • The model generates a response to a prompt.
  • It then critiques its own response against constitutional principles (e.g., "Is this response harmful?").
  • It revises the response based on the critique.
  • The revised responses are used to train a preference model (RLAIF — RL from AI Feedback).
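The loop above can be sketched as follows, with `llm` standing in for a hypothetical text-generation call; the function name and prompt wording are illustrative, not Anthropic's actual implementation.

```python
# Hedged sketch of the CAI critique-and-revise loop. `llm` is any
# callable mapping a prompt string to a response string.

def critique_and_revise(llm, prompt, principles):
    """One CAI pass: generate, then critique and revise per principle."""
    response = llm(prompt)
    for principle in principles:
        critique = llm(
            f"Critique the response below against this principle: {principle}\n\n{response}"
        )
        response = llm(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response  # revised responses become RLAIF preference data
```

Because the critiques come from the model itself, the expensive human-labeling step is replaced by a fixed, auditable list of principles.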

DPO (Direct Preference Optimization)

A simpler alternative to RLHF that skips the reward model entirely. DPO directly optimizes the language model using preference pairs:

Concept — DPO training
# DPO training data format:
{
  "prompt": "Explain quantum computing simply",
  "chosen": "Quantum computing uses quantum bits (qubits) that can be 0 and 1 simultaneously...",
  "rejected": "Quantum computing is a complex field involving Hilbert spaces and unitary transformations..."
}

# DPO directly adjusts model probabilities:
# - Increase probability of "chosen" responses
# - Decrease probability of "rejected" responses
# - No separate reward model needed!
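The DPO objective itself takes sequence log-probabilities under the trained policy and the frozen reference (SFT) model, and pushes up the margin between chosen and rejected, with beta controlling how far the policy may drift from the reference. A minimal sketch:

```python
import math

# DPO loss for one preference pair: -log sigmoid(beta * margin), where the
# margin compares policy-vs-reference log-ratios for chosen and rejected.

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    chosen_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already favors the chosen response relative to the reference:
print(dpo_loss(-10.0, -20.0, -12.0, -18.0))  # ~0.513
# Policy identical to the reference: loss is log(2):
print(dpo_loss(-12.0, -18.0, -12.0, -18.0))  # ~0.693
```

The reference log-probs act as the implicit KL anchor, which is exactly what lets DPO drop both the reward model and the PPO loop.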
💡
RLHF vs DPO: RLHF is more complex but potentially more powerful for fine-grained alignment. DPO is simpler to implement and increasingly popular. Many recent open models (Zephyr, Neural-Chat) use DPO successfully.