Advanced

DeepSpeed

DeepSpeed is Microsoft's deep learning optimization library that enables training of models with billions of parameters through the ZeRO (Zero Redundancy Optimizer) family of techniques.

ZeRO Stages

Standard data parallelism replicates everything on each GPU. ZeRO eliminates this redundancy progressively:

Stage          | What is sharded                            | Memory per GPU (7B model, 4 GPUs)
Baseline (DDP) | Nothing (full replica)                     | ~112 GB (weights + grads + optimizer)
ZeRO Stage 1   | Optimizer states                           | ~44 GB
ZeRO Stage 2   | Optimizer states + gradients               | ~30 GB
ZeRO Stage 3   | Optimizer states + gradients + parameters  | ~28 GB + as needed
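
The table's figures can be approximated with the ZeRO paper's model-state accounting: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for Adam optimizer state (fp32 master weights, momentum, variance). The sketch below (the helper name `zero_memory_gb` is hypothetical) ignores activations and communication buffers, so its round numbers are ballpark estimates rather than exact matches for the table:

```python
def zero_memory_gb(params_billion: float, num_gpus: int, stage: int) -> float:
    """Approximate model-state memory per GPU in GB (1 GB = 1e9 bytes).

    Mixed-precision Adam accounting: 2 bytes fp16 params, 2 bytes fp16
    grads, 12 bytes optimizer state per parameter. Activations ignored.
    """
    p = params_billion * 1e9
    param_b, grad_b, optim_b = 2, 2, 12  # bytes per parameter
    if stage == 0:    # DDP baseline: everything replicated on every GPU
        per_gpu = (param_b + grad_b + optim_b) * p
    elif stage == 1:  # shard optimizer states only
        per_gpu = (param_b + grad_b) * p + optim_b * p / num_gpus
    elif stage == 2:  # shard optimizer states + gradients
        per_gpu = param_b * p + (grad_b + optim_b) * p / num_gpus
    else:             # stage 3: shard optimizer states + gradients + params
        per_gpu = (param_b + grad_b + optim_b) * p / num_gpus
    return per_gpu / 1e9

for stage in (0, 1, 2, 3):
    print(f"Stage {stage}: {zero_memory_gb(7, 4, stage):.1f} GB per GPU")
```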

DeepSpeed Configuration

JSON - ds_config.json (ZeRO Stage 2)
{
  "train_batch_size": 256,
  "gradient_accumulation_steps": 4,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "none" },
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 1e-4, "weight_decay": 0.01 }
  }
}
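
Note that DeepSpeed requires `train_batch_size` to equal the per-GPU micro-batch size times `gradient_accumulation_steps` times the world size. A quick sketch of that arithmetic, assuming a world size of 8 GPUs:

```python
# DeepSpeed enforces:
#   train_batch_size ==
#     train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
train_batch_size = 256
gradient_accumulation_steps = 4
world_size = 8  # assumed number of GPUs for this example

micro_batch_per_gpu = train_batch_size // (gradient_accumulation_steps * world_size)
print(micro_batch_per_gpu)  # -> 8
```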

Using DeepSpeed with PyTorch

Python - DeepSpeed Integration
import deepspeed

# Initialize DeepSpeed: wraps the model and builds the optimizer
# defined in ds_config.json
model = MyLargeModel()
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

for batch in dataloader:
    inputs, labels = batch
    inputs = inputs.to(model_engine.device)
    labels = labels.to(model_engine.device)
    loss = loss_fn(model_engine(inputs), labels)
    model_engine.backward(loss)  # handles fp16 loss scaling internally
    model_engine.step()          # optimizer step + gradient zeroing

# Launch: deepspeed --num_gpus=8 train.py

CPU & NVMe Offloading

ZeRO-Offload and ZeRO-Infinity extend GPU memory with CPU RAM and NVMe storage:

  • CPU offloading: Move optimizer states and optionally parameters to CPU RAM. 10-100x more memory available but slower.
  • NVMe offloading: Move data to NVMe SSDs. Virtually unlimited memory for training trillion-parameter models.
  • Trade-off: Offloading increases memory capacity but reduces training speed due to data transfer overhead.
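
Offloading is enabled in the config's `zero_optimization` block. A sketch of a ZeRO-3 config with CPU offloading (set `"device": "nvme"` plus an `nvme_path` for NVMe instead; the path shown is illustrative):

JSON - ds_config.json (ZeRO-3 with CPU offloading)
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true }
  }
}
```

Pinned memory speeds up CPU-GPU transfers at the cost of extra host RAM.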

When to Use Each Stage

  • Stage 1: Model fits in GPU memory with DDP. Use Stage 1 for mild memory savings with minimal overhead.
  • Stage 2: Model training OOMs with DDP. Stage 2 is the sweet spot — significant memory savings with low communication overhead.
  • Stage 3: Model parameters don't fit on a single GPU. Stage 3 enables training arbitrarily large models but has higher communication cost.
  • Stage 3 + Offload: Not enough total GPU memory across all devices. Last resort — slowest but enables the largest models.

Key takeaway: DeepSpeed's ZeRO optimizer eliminates memory redundancy across GPUs. Start with Stage 2 for most large-model training. Graduate to Stage 3 when the model doesn't fit on a single GPU. Use offloading only when you've exhausted GPU memory.