Advanced Multi-GPU Training
When a model is too large to fit on a single GPU, or training on one GPU is too slow, you need multi-GPU strategies. This lesson covers the parallelism techniques that enable training across multiple GPUs and multiple nodes, from simple data parallelism to advanced 3D parallelism used for frontier LLMs.
Parallelism Strategies
| Strategy | What's Split | Communication | Best For |
|---|---|---|---|
| Data Parallel | Training data | Gradient sync (AllReduce) | Model fits on 1 GPU |
| Tensor Parallel | Weight matrices within each layer across GPUs | Activations between GPUs | Very large layers |
| Pipeline Parallel | Model stages across GPUs | Activations between stages | Deep models |
| FSDP / ZeRO | Parameters, gradients, optimizer | Parameter gather on demand | Large models, memory-efficient |
| 3D Parallelism | All of the above combined | Mixed | Frontier LLM training |
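Data parallelism is the usual starting point. As a minimal sketch (the model, dimensions, and hyperparameters here are placeholders), wrapping a model in PyTorch's `DistributedDataParallel` makes `loss.backward()` perform the gradient AllReduce described in the table:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(model, batch, targets, loss_fn, optimizer):
    optimizer.zero_grad()
    loss = loss_fn(model(batch), targets)
    loss.backward()   # DDP averages gradients across all ranks here (AllReduce)
    optimizer.step()
    return loss.item()

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Linear(32, 4).to(device)  # stand-in for a real model
    model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    batch = torch.randn(16, 32, device=device)    # each rank loads a different shard
    targets = torch.randn(16, 4, device=device)
    loss = train_step(model, batch, targets, loss_fn, optimizer)

    dist.destroy_process_group()
    return loss
```

Each process holds a full model replica and sees a different slice of the data, which is why this strategy only applies when the model fits on one GPU.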
PyTorch FSDP Example
```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

# Wrap the model with FSDP for memory-efficient distributed training
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

model = FSDP(
    model,
    mixed_precision=mp_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    use_orig_params=True,
)

# Launch with: torchrun --nproc_per_node=8 train.py
```
Infrastructure Requirements
- Intra-node — NVLink provides 600-900 GB/s between GPUs within a node. Use 8-GPU instances for tight coupling.
- Inter-node — EFA/InfiniBand provides 400-3200 Gbps between nodes. Use placement groups for co-location.
- Storage — High-throughput shared filesystem (FSx Lustre, Filestore) to avoid data loading bottlenecks.
- Orchestration — Use torchrun, SageMaker distributed, or Kubernetes with MPI operator.
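Whichever launcher you use, each worker needs to discover its place in the job. Under torchrun this arrives via environment variables; a minimal sketch of reading them (the helper name is our own):

```python
import os

def dist_env():
    """Read the coordinates torchrun sets for every worker process."""
    return {
        "rank": int(os.environ.get("RANK", 0)),              # global worker index
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),  # GPU index on this node
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),  # total workers in the job
    }
```

`RANK` selects the data shard, while `LOCAL_RANK` selects the GPU device on the current node.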
Scaling Rule: Start with FSDP for models up to ~70B parameters on a single 8-GPU node. Only add multi-node training when you exceed single-node memory. Add tensor/pipeline parallelism for 100B+ parameter models.
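A rough back-of-the-envelope check makes this rule concrete. Assuming bf16 parameters and gradients plus fp32 Adam optimizer states (about 16 bytes per parameter, fully sharded, and ignoring activations), per-GPU memory can be estimated as:

```python
def fsdp_memory_gb(n_params: float, n_gpus: int) -> float:
    """Per-GPU memory for params + grads + optimizer state under FULL_SHARD.

    Assumes bf16 params (2 B) + bf16 grads (2 B) + fp32 master params (4 B)
    + Adam m and v (8 B) = 16 bytes per parameter, evenly sharded.
    Excludes activations, buffers, and communication overhead.
    """
    bytes_per_param = 2 + 2 + 4 + 8
    return n_params * bytes_per_param / n_gpus / 1e9
```

By this estimate a 70B-parameter model on 8 GPUs needs roughly 140 GB per GPU before activations, which is why training near the ~70B boundary typically also relies on activation checkpointing and/or CPU offload.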
Ready for Best Practices?
The final lesson covers GPU utilization optimization, profiling, and production operations.
Next: Best Practices →
Lilly Tech Systems