Data Parallelism

Data parallelism is the simplest and most common form of distributed training: replicate the model on each GPU and split the training data across them.

How It Works

  1. Replicate

    Copy the entire model to each GPU. All replicas start with identical weights.

  2. Partition Data

    Split each training batch across GPUs. With 4 GPUs and batch size 256, each GPU processes 64 samples.

  3. Forward Pass

    Each GPU computes the forward pass on its local data shard independently.

  4. Backward Pass

    Each GPU computes gradients on its local data. Gradients differ because the data differs.

  5. AllReduce

    Average gradients across all GPUs using AllReduce. After this, all GPUs have identical averaged gradients.

  6. Update

    Each GPU applies the same averaged gradients, keeping all model replicas synchronized.
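The six steps above can be sketched in plain Python with a toy one-parameter model (y = w·x, squared-error loss) standing in for the network — no GPUs or communication library involved, just the arithmetic. The sample values are hypothetical:

```python
# Toy simulation of one data-parallel step (no GPUs, plain Python).
# Model: y = w * x, loss = (w*x - target)^2, so dL/dw = 2 * (w*x - target) * x.

def grad(w, x, target):
    return 2 * (w * x - target) * x

world_size = 4
w = 1.0  # Step 1: identical weights on every "GPU"

# Step 2: the batch is split — one sample per rank here
batch = [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0), (4.0, 2.0)]

# Steps 3-4: each rank computes its local gradient independently
local_grads = [grad(w, x, t) for x, t in batch]

# Step 5: AllReduce = average the gradients across ranks
avg_grad = sum(local_grads) / world_size

# Step 6: every rank applies the same averaged gradient
lr = 0.01
new_w = [w - lr * avg_grad for _ in range(world_size)]
assert len(set(new_w)) == 1  # all replicas remain identical
```

Because every rank sees the same averaged gradient, the replicas can never drift apart — that invariant is the whole contract of data parallelism.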

PyTorch DDP

Python - Complete DDP Training Script
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
    torch.cuda.set_device(local_rank)

    # Model: same initialization on all ranks
    model = MyModel().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Data: each rank gets a unique shard
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # Different shuffle each epoch
        for batch in loader:
            inputs = batch["input"].cuda(local_rank)
            labels = batch["label"].cuda(local_rank)

            loss = model(inputs, labels)
            loss.backward()    # Gradients synced automatically
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

# Launch: torchrun --nproc_per_node=4 train.py

Ring AllReduce

The most common AllReduce implementation uses a ring topology:

  • GPUs are arranged in a logical ring, and each gradient buffer is divided into N chunks (N = number of GPUs)
  • In each step, every GPU sends one chunk to its neighbor in the ring
  • A full AllReduce takes 2(N-1) steps: N-1 steps of reduce-scatter, after which each GPU holds the complete sum of one chunk, then N-1 steps of allgather to distribute the completed chunks to every GPU
  • Total per-GPU communication is proportional to model size and nearly independent of GPU count
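Both phases can be simulated in plain Python. This sketch models N ranks, each holding a gradient split into N chunks (single scalars here for simplicity); after N-1 reduce-scatter steps and N-1 allgather steps, every rank holds all the summed chunks:

```python
# Simulated ring AllReduce: N ranks, each with a gradient of N scalar chunks.
N = 4
data = [[float(r * N + c) for c in range(N)] for r in range(N)]  # data[rank][chunk]
expected = [sum(r * N + c for r in range(N)) for c in range(N)]  # per-chunk totals

# Phase 1: reduce-scatter. At step s, rank r sends chunk (r - s) mod N to rank r+1,
# which adds it in. Sends happen simultaneously, so collect them before applying.
for s in range(N - 1):
    sends = [((r + 1) % N, (r - s) % N, data[r][(r - s) % N]) for r in range(N)]
    for dest, c, val in sends:
        data[dest][c] += val
# Now rank r holds the complete sum of chunk (r + 1) mod N.

# Phase 2: allgather. At step s, rank r forwards its completed chunk
# (r + 1 - s) mod N to rank r+1, which overwrites its own copy.
for s in range(N - 1):
    sends = [((r + 1) % N, (r + 1 - s) % N, data[r][(r + 1 - s) % N]) for r in range(N)]
    for dest, c, val in sends:
        data[dest][c] = val

assert all(row == expected for row in data)  # every rank has every summed chunk
```

Each of the 2(N-1) steps moves one chunk (1/N of the gradient) per GPU, so total bytes sent per GPU is about 2(N-1)/N times the gradient size — which is why the cost stays roughly constant as you add GPUs.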

Effective Batch Size

With data parallelism, the effective batch size scales with the number of GPUs:

  • Per-GPU batch size: 64 samples
  • 4 GPUs: Effective batch size = 256
  • Learning rate scaling: Linear scaling rule — multiply LR by the number of GPUs
  • Warmup: Use learning rate warmup (1-5 epochs) when scaling to large batch sizes to maintain training stability
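A minimal schedule implementing both rules — linear scaling plus linear warmup — might look like this (the base LR and step counts are hypothetical placeholders):

```python
# Linear LR scaling plus linear warmup (hypothetical numbers).
base_lr = 1e-4                   # LR tuned for a single GPU
num_gpus = 4
scaled_lr = base_lr * num_gpus   # linear scaling rule

warmup_steps = 500               # roughly 1-5 epochs' worth of optimizer steps

def lr_at(step):
    """Ramp linearly from near 0 to scaled_lr over warmup_steps, then hold."""
    if step < warmup_steps:
        return scaled_lr * (step + 1) / warmup_steps
    return scaled_lr
```

In practice the post-warmup value would usually follow a decay schedule (cosine, step, etc.); this sketch holds it constant for brevity.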

Gradient Accumulation

When per-GPU batch size is limited by memory, gradient accumulation simulates larger batches:

Python - Gradient Accumulation
accumulation_steps = 4  # Effective batch = 4 * per_gpu_batch * num_gpus

optimizer.zero_grad()
for i, batch in enumerate(loader):
    # Scale the loss so the accumulated gradient averages over the effective batch
    loss = model(batch) / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# With DDP, wrapping the non-final backward calls in model.no_sync()
# skips the redundant AllReduce on accumulation-only steps.
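To see why the loss is divided by accumulation_steps: with that scaling, accumulating gradients over micro-batches yields exactly the same update as one large averaged batch. A plain-Python check with a toy model y = w·x and squared-error loss (sample values are hypothetical):

```python
def grad(w, x, target):
    # dL/dw for L = (w*x - target)^2
    return 2 * (w * x - target) * x

w, lr = 1.0, 0.01
samples = [(1.0, 3.0), (2.0, 1.0), (3.0, 2.0), (4.0, 0.0)]
accumulation_steps = len(samples)  # one sample per micro-batch

# One big batch: average the gradients, take a single optimizer step.
big = w - lr * sum(grad(w, x, t) for x, t in samples) / len(samples)

# Accumulation: each micro-batch contributes grad / accumulation_steps,
# summed up before the single optimizer step.
acc = 0.0
for x, t in samples:
    acc += grad(w, x, t) / accumulation_steps
small = w - lr * acc

assert abs(big - small) < 1e-12  # identical update
```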
Key takeaway: Data parallelism replicates the model and splits data. PyTorch DDP with NCCL AllReduce provides near-linear scaling. Remember to scale the learning rate with the number of GPUs and use warmup for stability.