Distributed Training Best Practices
Practical guidance for choosing distributed strategies, debugging multi-GPU jobs, reliable checkpointing, and production-scale training.
Strategy Selection Flowchart
| Situation | Recommended Strategy |
|---|---|
| Model fits on 1 GPU, want faster training | DDP — simplest, near-linear scaling |
| Model fits on 1 GPU but OOMs with optimizer state | FSDP SHARD_GRAD_OP or DeepSpeed ZeRO-2 |
| Model doesn't fit on 1 GPU | FSDP FULL_SHARD or DeepSpeed ZeRO-3 |
| Very large model, multi-node | 3D parallelism (tensor + pipeline + data) |
| Need CPU/NVMe offloading | DeepSpeed ZeRO-3 + Offload |
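With PyTorch FSDP, the first three rows of the table map directly onto the sharding_strategy argument. A minimal configuration sketch (it assumes a process group is already initialized and a model variable exists; this is illustrative, not a complete training script):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Assumes dist.init_process_group(...) has already run and `model` is built.
# NO_SHARD      ~ DDP:    replicate everything, only sync gradients.
# SHARD_GRAD_OP ~ ZeRO-2: shard gradients and optimizer state.
# FULL_SHARD    ~ ZeRO-3: additionally shard the parameters themselves.
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,  # pick per the table
)
```

Switching rows in the table is then a one-line change to sharding_strategy, which makes it cheap to start with the least aggressive strategy that fits in memory.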
Debugging Distributed Training
Start with 1 GPU
Verify your training loop works on a single GPU before adding distribution. Many bugs are easier to find without the distributed complexity.
Check Process Group
Ensure all ranks can communicate. Common issues: firewall blocking NCCL ports, wrong MASTER_ADDR/MASTER_PORT, mismatched world_size.
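Before blaming NCCL, it helps to confirm the rendezvous endpoint is reachable at all from each node. A stdlib-only sketch (the env var names MASTER_ADDR/MASTER_PORT match torch.distributed's conventions; the helper itself is hypothetical):

```python
import os
import socket

def master_reachable(addr=None, port=None, timeout=5.0):
    """Return True if MASTER_ADDR:MASTER_PORT accepts TCP connections."""
    addr = addr or os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(port or os.environ.get("MASTER_PORT", 29500))
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run this on every rank: a False on any node points at a firewall rule or a wrong MASTER_ADDR rather than anything NCCL-specific.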
Use NCCL Debug Logging
Set NCCL_DEBUG=INFO to see communication details. Set TORCH_DISTRIBUTED_DEBUG=DETAIL for PyTorch-level debugging.
Watch for Hangs
Distributed hangs usually mean one rank crashed while the others wait on a collective such as AllReduce. Check the logs for every rank, not just rank 0. Set NCCL_TIMEOUT to fail fast instead of hanging indefinitely.
Verify Gradient Sync
After a few steps, compare model weights across ranks. They should be identical. If they diverge, gradient sync is broken.
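In a real job you would first gather each rank's state dict onto rank 0 (for example with torch.distributed.all_gather_object); the comparison itself is plain Python. A sketch with a hypothetical helper, using flattened parameter values for self-containedness:

```python
def find_divergence(rank_states, tol=0.0):
    """Given one state dict per rank (param name -> list of floats),
    return (rank, param_name) of the first mismatch vs rank 0,
    or None if all ranks agree within `tol`."""
    ref = rank_states[0]
    for rank, state in enumerate(rank_states[1:], start=1):
        for name, ref_vals in ref.items():
            if any(abs(a - b) > tol for a, b in zip(ref_vals, state[name])):
                return (rank, name)
    return None
```

A non-None result names the first parameter that diverged, which narrows the search to the layer whose gradients are not being synchronized.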
Checkpointing
- Frequency: Checkpoint every 30-60 minutes for large training runs. Lost compute from crashes is expensive.
- Async checkpointing: Save checkpoints in a background thread to avoid blocking training.
- Sharded checkpoints: With FSDP/ZeRO-3, save sharded state dicts for fast save/load. Convert to full state dict only for inference.
- Cloud storage: Write checkpoints to S3/GCS, not local disk. Local disk is lost if the node is preempted.
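The async-checkpointing idea above can be sketched with the stdlib alone (production FSDP jobs would use torch.distributed.checkpoint instead of pickle; the helper name and file layout here are illustrative). The key detail is snapshotting the state before handing off to the background thread, so the next optimizer step can't mutate weights mid-write:

```python
import pickle
import threading
from pathlib import Path

def save_checkpoint_async(state, path):
    """Serialize `state` immediately, then write it to disk in a
    background thread so the training loop isn't blocked on I/O.
    Returns the thread; join() it before process exit."""
    blob = pickle.dumps(state)          # snapshot now, before weights change
    def _write():
        tmp = Path(str(path) + ".tmp")
        tmp.write_bytes(blob)
        tmp.rename(path)                # atomic publish: no torn checkpoints
    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t
```

The write-to-temp-then-rename step also matters for crash safety: a reader never sees a half-written checkpoint file.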
Fault Tolerance
Large-scale training runs spanning days or weeks will encounter hardware failures:
- Elastic training: Use torchrun --rdzv-backend=c10d for automatic recovery when nodes fail and rejoin.
- Preemption handling: On spot instances, catch SIGTERM and save a checkpoint before the instance is terminated.
- Health checks: Monitor GPU temperature, memory usage, and training loss. Automatic restart on anomalies.
- Gradient clipping: Always clip gradients (max_norm=1.0) to prevent training instability from one bad batch.
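The preemption-handling pattern above can be sketched as follows (PreemptionGuard is a hypothetical helper). The handler only sets a flag; the checkpoint is saved from the training loop, since doing I/O inside a signal handler is unsafe:

```python
import signal

class PreemptionGuard:
    """Set a flag on SIGTERM so the training loop can checkpoint
    and exit cleanly before a spot instance is reclaimed."""
    def __init__(self):
        self.should_stop = False
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.should_stop = True   # keep the handler tiny; act in the loop

# In the training loop (sketch):
# guard = PreemptionGuard()
# for step, batch in enumerate(loader):
#     if guard.should_stop:
#         save_checkpoint(...)   # hypothetical save routine
#         break
```

Cloud providers typically give a short notice window (often around two minutes on spot/preemptible VMs), so the checkpoint written here should be small or already sharded.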
Frequently Asked Questions
Should I use FSDP or DeepSpeed?
Use FSDP if you want native PyTorch with simpler setup and debugging. Use DeepSpeed if you need CPU/NVMe offloading, ZeRO++ optimizations, or are training very large models (100B+ parameters). HuggingFace Accelerate supports both, making it easy to switch.
How should I scale the learning rate when adding GPUs?
Use the linear scaling rule: multiply the base learning rate by the number of GPUs (or the effective batch size ratio). Always use a warmup period (typically 1-5% of total steps) when scaling to large batch sizes. For very large batches (>8K), consider LARS or LAMB optimizers.
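The linear scaling rule plus warmup amounts to a few lines of arithmetic (the helper names are illustrative):

```python
def scaled_lr(base_lr, base_batch, global_batch):
    """Linear scaling rule: lr grows with the effective batch size ratio."""
    return base_lr * global_batch / base_batch

def warmup_lr(target_lr, step, warmup_steps):
    """Linearly ramp from ~0 up to target_lr over the warmup period."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps

# e.g. base_lr=0.1 tuned at batch 256; 8 GPUs x per-GPU batch 256 = 2048:
# scaled_lr(0.1, 256, 2048) -> 0.8, reached gradually via warmup_lr
```

Note that "number of GPUs" only equals the batch ratio when the per-GPU batch stays fixed; with gradient accumulation, scale by the full effective batch instead.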
Why is my multi-GPU training slower than expected?
Common causes: (1) Data loading is the bottleneck — increase num_workers and use pin_memory. (2) Communication overhead — check interconnect bandwidth with nccl-tests. (3) Small batch size per GPU — the GPU is underutilized. (4) Gradient accumulation — it reduces communication but doesn't parallelize compute. Profile with PyTorch Profiler to identify the bottleneck.
Lilly Tech Systems