Model Parallelism
When a model is too large to fit on a single GPU, model parallelism splits the model itself across multiple devices.
Tensor Parallelism
Tensor parallelism splits individual layers across GPUs. For a large linear layer with weight matrix W, you can split W column-wise or row-wise:
- Column parallel: Split W into [W1, W2]. GPU 1 computes X×W1, GPU 2 computes X×W2. Concatenate the results (AllGather).
- Row parallel: Split W into rows. Each GPU computes a partial result from its slice of the input. Sum the partials (AllReduce) to get the final output.
- Best for: Large attention and MLP layers in transformers. Requires high-bandwidth interconnect (NVLink).
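Both sharding schemes can be checked numerically on a single device. The sketch below simulates two "GPUs" with tensor slices; the matrix sizes are arbitrary, and `torch.cat` / `+` stand in for the AllGather and AllReduce collectives:

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 16)   # batch of 8, hidden dim 16
W = torch.randn(16, 32)  # full weight matrix

# Column parallel: split W's output dimension across two shards
W1, W2 = W.chunk(2, dim=1)                  # each shard is 16 x 16
Y_col = torch.cat([X @ W1, X @ W2], dim=1)  # concatenate (AllGather)

# Row parallel: split W's input dimension; each shard also gets half of X
Wa, Wb = W.chunk(2, dim=0)  # each shard is 8 x 32
Xa, Xb = X.chunk(2, dim=1)  # each slice is 8 x 8
Y_row = Xa @ Wa + Xb @ Wb   # sum partial results (AllReduce)

# Both schemes recover the full product X @ W
assert torch.allclose(Y_col, X @ W, atol=1e-5)
assert torch.allclose(Y_row, X @ W, atol=1e-5)
```

In a real tensor-parallel MLP, a column-parallel layer is typically followed by a row-parallel one, so the AllGather between them can be skipped and only one AllReduce per block is needed.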
Pipeline Parallelism
Pipeline parallelism assigns different layers to different GPUs. Data flows through the pipeline like an assembly line:
| Approach | Description | Efficiency |
|---|---|---|
| Naive Pipeline | One micro-batch at a time | Very low — most GPUs idle (bubble) |
| GPipe | Split batch into micro-batches, pipeline them | Better — reduces bubble size |
| 1F1B (Interleaved) | Alternate forward and backward passes | Best — minimal pipeline bubble |
Pipeline Bubble
The pipeline bubble is the key inefficiency: at the start and end of each batch, some GPUs are idle waiting for data to flow through:
- Bubble fraction ≈ (p - 1) / m, where p = pipeline stages, m = micro-batches
- More micro-batches = smaller bubble = better efficiency
- With 4 stages and 32 micro-batches: bubble = 3/32 ≈ 9% overhead
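The bubble-fraction arithmetic above is easy to sanity-check; a minimal helper (the function name is ours, not from any framework):

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of a GPipe-style pipeline: (p - 1) / m."""
    return (p - 1) / m

# 4 stages, 32 micro-batches: 3/32 = 0.09375, ~9% overhead
print(bubble_fraction(4, 32))
# Doubling the micro-batch count halves the bubble
print(bubble_fraction(4, 64))
```

This is why pipeline-parallel training favors many small micro-batches, up to the point where per-micro-batch overheads (kernel launch, communication latency) start to dominate.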
3D Parallelism
Large-scale LLM training combines all three forms of parallelism:
Tensor Parallelism (within node)
Split layers across GPUs connected by NVLink within a single machine. Requires the highest bandwidth of the three.
Pipeline Parallelism (across nodes)
Split layer groups across nodes. Lower bandwidth requirement (only activations between stages).
Data Parallelism (across pipeline replicas)
Replicate the entire pipeline and split data. AllReduce gradients across replicas.
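The three degrees multiply: total GPU count = tensor × pipeline × data parallelism. A quick sketch with illustrative degrees (these numbers are not from any specific training run):

```python
# Total GPU count is the product of the three parallelism degrees.
tensor_parallel = 8    # GPUs per node, linked by NVLink
pipeline_parallel = 4  # pipeline stages, one group of nodes per stage
data_parallel = 16     # full-pipeline replicas

total_gpus = tensor_parallel * pipeline_parallel * data_parallel
print(total_gpus)  # 8 * 4 * 16 = 512
```

In practice the tensor-parallel degree is usually capped at the number of GPUs per node, pipeline depth is chosen to fit the model in memory, and the data-parallel degree absorbs whatever hardware remains.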
Practical Considerations
- Memory balance: First and last pipeline stages often use more memory (embeddings, output head). Balance layer assignment carefully.
- Activation memory: Pipeline parallelism requires storing activations for all in-flight micro-batches. Use activation checkpointing to reduce this.
- Communication pattern: Tensor parallelism needs AllReduce (high bandwidth). Pipeline needs point-to-point (low latency). Match to your interconnect.
A minimal two-stage pipeline in PyTorch, assigning layers to different GPUs (the transformer-layer hyperparameters are illustrative):

```python
import torch

# Assign layers to different GPUs; activations are transferred
# between stages explicitly in forward().
class PipelineModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 0: GPU 0 — embedding plus the first layer group
        self.embed = torch.nn.Embedding(50000, 4096).to("cuda:0")
        self.layers_0 = torch.nn.TransformerEncoderLayer(
            d_model=4096, nhead=32, batch_first=True  # illustrative sizes
        ).to("cuda:0")
        # Stage 1: GPU 1 — second layer group plus the output head
        self.layers_1 = torch.nn.TransformerEncoderLayer(
            d_model=4096, nhead=32, batch_first=True
        ).to("cuda:1")
        self.head = torch.nn.Linear(4096, 50000).to("cuda:1")

    def forward(self, x):
        x = self.embed(x.to("cuda:0"))
        x = self.layers_0(x)
        x = x.to("cuda:1")  # transfer activations between stages
        x = self.layers_1(x)
        return self.head(x)
```

Note that this naive version runs one stage at a time; frameworks such as `torch.distributed.pipelining` add micro-batching and scheduling to keep both GPUs busy.
Lilly Tech Systems