Model Parallelism
When a model is too large to fit on a single GPU, model parallelism splits the model itself across multiple devices.
Tensor Parallelism
Tensor parallelism splits individual layers across GPUs. For a large linear layer with weight matrix W, you can split W column-wise or row-wise:
- Column parallel: Split W into [W1, W2]. GPU 1 computes X×W1, GPU 2 computes X×W2. Concatenate the results (AllGather).
- Row parallel: Split W into rows. Each GPU computes a partial result from its slice of the input. Sum the partials (AllReduce) to get the final output.
- Best for: Large attention and MLP layers in transformers. Requires high-bandwidth interconnect (NVLink).
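Both sharding schemes can be checked numerically on a single device. The sketch below simulates two "GPUs" with tensor slices; the matrix sizes are arbitrary, and `torch.cat` / `+` stand in for the AllGather and AllReduce collectives:

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 16)   # batch of 8, hidden dim 16
W = torch.randn(16, 32)  # full weight matrix

# Column parallel: split W's output dimension across two shards
W1, W2 = W.chunk(2, dim=1)                  # each shard is 16 x 16
Y_col = torch.cat([X @ W1, X @ W2], dim=1)  # concatenate (AllGather)

# Row parallel: split W's input dimension; each shard also gets half of X
Wa, Wb = W.chunk(2, dim=0)  # each shard is 8 x 32
Xa, Xb = X.chunk(2, dim=1)  # each slice is 8 x 8
Y_row = Xa @ Wa + Xb @ Wb   # sum partial results (AllReduce)

# Both schemes recover the full product X @ W
assert torch.allclose(Y_col, X @ W, atol=1e-5)
assert torch.allclose(Y_row, X @ W, atol=1e-5)
```

In a real tensor-parallel MLP, a column-parallel layer is typically followed by a row-parallel one, so the AllGather between them can be skipped and only one AllReduce per block is needed.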
Pipeline Parallelism
Pipeline parallelism assigns different layers to different GPUs. Data flows through the pipeline like an assembly line:
| Approach | Description | Efficiency |
|---|---|---|
| Naive Pipeline | One micro-batch at a time | Very low — most GPUs idle (bubble) |
| GPipe | Split batch into micro-batches, pipeline them | Better — reduces bubble size |
| 1F1B (Interleaved) | Alternate forward and backward passes | Best — minimal pipeline bubble |
Pipeline Bubble
The pipeline bubble is the key inefficiency: at the start and end of each batch, some GPUs are idle waiting for data to flow through:
- Bubble fraction ≈ (p - 1) / m, where p = pipeline stages, m = micro-batches
- More micro-batches = smaller bubble = better efficiency
- With 4 stages and 32 micro-batches: bubble = 3/32 ≈ 9% overhead
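The bubble-fraction arithmetic above is easy to sanity-check; a minimal helper (the function name is ours, not from any framework):

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of a GPipe-style pipeline: (p - 1) / m."""
    return (p - 1) / m

# 4 stages, 32 micro-batches: 3/32 = 0.09375, ~9% overhead
print(bubble_fraction(4, 32))
# Doubling the micro-batch count halves the bubble
print(bubble_fraction(4, 64))
```

This is why pipeline-parallel training favors many small micro-batches, up to the point where per-micro-batch overheads (kernel launch, communication latency) start to dominate.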
3D Parallelism
Large-scale LLM training combines all three forms of parallelism:
Tensor Parallelism (within node)
Split layers across GPUs connected by NVLink within a single machine. Requires the highest bandwidth of the three.
Pipeline Parallelism (across nodes)
Split layer groups across nodes. Lower bandwidth requirement (only activations between stages).
Data Parallelism (across pipeline replicas)
Replicate the entire pipeline and split data. AllReduce gradients across replicas.
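The three degrees multiply: total GPU count = tensor × pipeline × data parallelism. A quick sketch with illustrative degrees (these numbers are not from any specific training run):

```python
# Total GPU count is the product of the three parallelism degrees.
tensor_parallel = 8    # GPUs per node, linked by NVLink
pipeline_parallel = 4  # pipeline stages, one group of nodes per stage
data_parallel = 16     # full-pipeline replicas

total_gpus = tensor_parallel * pipeline_parallel * data_parallel
print(total_gpus)  # 8 * 4 * 16 = 512
```

In practice the tensor-parallel degree is usually capped at the number of GPUs per node, pipeline depth is chosen to fit the model in memory, and the data-parallel degree absorbs whatever hardware remains.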
Practical Considerations
- Memory balance: First and last pipeline stages often use more memory (embeddings, output head). Balance layer assignment carefully.
- Activation memory: Pipeline parallelism requires storing activations for all in-flight micro-batches. Use activation checkpointing to reduce this.
- Communication pattern: Tensor parallelism needs AllReduce (high bandwidth). Pipeline needs point-to-point (low latency). Match to your interconnect.
A minimal two-stage pipeline in PyTorch, assigning layers to different GPUs (the transformer-layer hyperparameters are illustrative):

```python
import torch

# Assign layers to different GPUs; activations are transferred
# between stages explicitly in forward().
class PipelineModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 0: GPU 0 — embedding plus the first layer group
        self.embed = torch.nn.Embedding(50000, 4096).to("cuda:0")
        self.layers_0 = torch.nn.TransformerEncoderLayer(
            d_model=4096, nhead=32, batch_first=True  # illustrative sizes
        ).to("cuda:0")
        # Stage 1: GPU 1 — second layer group plus the output head
        self.layers_1 = torch.nn.TransformerEncoderLayer(
            d_model=4096, nhead=32, batch_first=True
        ).to("cuda:1")
        self.head = torch.nn.Linear(4096, 50000).to("cuda:1")

    def forward(self, x):
        x = self.embed(x.to("cuda:0"))
        x = self.layers_0(x)
        x = x.to("cuda:1")  # transfer activations between stages
        x = self.layers_1(x)
        return self.head(x)
```

Note that this naive version runs one stage at a time; frameworks such as `torch.distributed.pipelining` add micro-batching and scheduling to keep both GPUs busy.
Lilly Tech Systems