DeepSpeed
DeepSpeed is Microsoft's deep learning optimization library that enables training of models with billions of parameters through the ZeRO (Zero Redundancy Optimizer) family of techniques.
ZeRO Stages
Standard data parallelism replicates everything on each GPU. ZeRO eliminates this redundancy progressively:
| Stage | What is Sharded | Memory per GPU (7B model, 4 GPUs) |
|---|---|---|
| Baseline (DDP) | Nothing (full replica) | ~112 GB (weights + grads + optimizer) |
| ZeRO Stage 1 | Optimizer states | ~49 GB |
| ZeRO Stage 2 | Optimizer states + gradients | ~38 GB |
| ZeRO Stage 3 | Optimizer states + gradients + parameters | ~28 GB |
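The estimates above follow the standard mixed-precision Adam accounting: 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer states (master weights, momentum, variance). A small sketch of that arithmetic (illustrative only; it ignores activations, buffers, and fragmentation):

```python
def zero_memory_gb(params_billions, num_gpus, stage):
    """Rough per-GPU memory for mixed-precision Adam training.
    Illustrative estimate only; ignores activations and buffers."""
    p = params_billions  # 1e9 params ~= 1 GB per byte-per-param
    weights, grads, optim = 2 * p, 2 * p, 12 * p
    if stage >= 1:
        optim /= num_gpus    # Stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus    # Stage 2 also shards gradients
    if stage >= 3:
        weights /= num_gpus  # Stage 3 also shards parameters
    return weights + grads + optim

for stage in range(4):
    print(f"Stage {stage}: ~{zero_memory_gb(7, 4, stage):.0f} GB")
```

Running this for a 7B model on 4 GPUs reproduces the table's order of magnitude at each stage.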
DeepSpeed Configuration
JSON - ds_config.json (ZeRO Stage 2)
```json
{
  "train_batch_size": 256,
  "gradient_accumulation_steps": 4,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "none" },
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 1e-4, "weight_decay": 0.01 }
  }
}
```
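DeepSpeed enforces the invariant `train_batch_size = micro_batch_per_gpu × gradient_accumulation_steps × num_gpus`. A quick consistency check for this config, assuming an 8-GPU launch:

```python
# Batch-size invariant for the config above:
# train_batch_size == micro_batch_per_gpu * gradient_accumulation_steps * num_gpus
train_batch_size = 256
gradient_accumulation_steps = 4
num_gpus = 8  # assumption: an 8-GPU launch

micro_batch_per_gpu = train_batch_size // (gradient_accumulation_steps * num_gpus)
print(micro_batch_per_gpu)  # each GPU processes 8 samples per forward pass
```

If these numbers don't multiply out exactly, DeepSpeed raises a configuration error at startup.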
Using DeepSpeed with PyTorch
Python - DeepSpeed Integration
```python
import deepspeed

model = MyLargeModel()

# Initialize DeepSpeed: wraps the model and builds the optimizer
# defined in ds_config.json
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),  # required when the optimizer is defined in the config
    config="ds_config.json",
)

for batch in dataloader:
    loss = model_engine(batch)   # forward pass (model returns the loss here)
    model_engine.backward(loss)  # DeepSpeed-managed backward
    model_engine.step()          # optimizer step, gradient clearing, LR scheduling

# Launch: deepspeed --num_gpus=8 train.py
```
CPU & NVMe Offloading
ZeRO-Offload and ZeRO-Infinity extend GPU memory with CPU RAM and NVMe storage:
- CPU offloading: Move optimizer states and optionally parameters to CPU RAM. 10-100x more memory available but slower.
- NVMe offloading: Move data to NVMe SSDs. Virtually unlimited memory for training trillion-parameter models.
- Trade-off: Offloading increases memory capacity but reduces training speed due to data transfer overhead.
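Offloading is switched on inside the `zero_optimization` block of the config. A sketch combining both targets (field names follow DeepSpeed's config schema; the NVMe path is a placeholder for your own fast local disk):

```json
"zero_optimization": {
  "stage": 3,
  "offload_optimizer": { "device": "cpu", "pin_memory": true },
  "offload_param": { "device": "nvme", "nvme_path": "/local_nvme" }
}
```

Pinned host memory speeds up CPU↔GPU transfers; parameter offload (`offload_param`) is only available at Stage 3, since earlier stages keep a full parameter replica on each GPU.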
When to Use Each Stage
- Stage 1: Model fits in GPU memory with DDP. Use Stage 1 for mild memory savings with minimal overhead.
- Stage 2: Model training OOMs with DDP. Stage 2 is the sweet spot — significant memory savings with low communication overhead.
- Stage 3: Model parameters don't fit on a single GPU. Stage 3 enables training arbitrarily large models but has higher communication cost.
- Stage 3 + Offload: Not enough total GPU memory across all devices. Last resort — slowest but enables the largest models.
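The guidance above can be condensed into a rough rule of thumb. This selector is my own illustrative heuristic (not a DeepSpeed API), using the same 16-bytes-per-parameter mixed-precision Adam accounting as the table:

```python
def choose_zero_stage(params_billions, gpu_mem_gb, num_gpus):
    """Heuristic ZeRO stage picker: 2 B weights + 2 B grads + 12 B
    optimizer states per parameter. Illustrative only; ignores
    activations, buffers, and fragmentation."""
    p, n = params_billions, num_gpus
    if 16 * p <= gpu_mem_gb:              # full replica fits: mild savings suffice
        return "stage 1"
    if 2 * p + 14 * p / n <= gpu_mem_gb:  # sharding grads + optimizer is enough
        return "stage 2"
    if 16 * p / n <= gpu_mem_gb:          # must shard parameters too
        return "stage 3"
    return "stage 3 + offload"            # spill to CPU RAM / NVMe

print(choose_zero_stage(7, 80, 4))    # a 7B model on 4x 80 GB GPUs
print(choose_zero_stage(70, 80, 8))   # a 70B model on 8x 80 GB GPUs
```

Real deployments should also account for activation memory, which this sketch deliberately omits.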
Key takeaway: DeepSpeed's ZeRO optimizer eliminates memory redundancy across GPUs. Start with Stage 2 for most large-model training. Graduate to Stage 3 when the model doesn't fit on a single GPU. Use offloading only when you've exhausted GPU memory.