Data Parallelism

Data parallelism is the simplest and most common form of distributed training: replicate the model on each GPU and split the training data across them.

How It Works

  1. Replicate

    Copy the entire model to each GPU. All replicas start with identical weights.

  2. Partition Data

    Split each training batch across GPUs. With 4 GPUs and batch size 256, each GPU processes 64 samples.

  3. Forward Pass

    Each GPU computes the forward pass on its local data shard independently.

  4. Backward Pass

    Each GPU computes gradients on its local data. Gradients differ because the data differs.

  5. AllReduce

    Average gradients across all GPUs using AllReduce. After this, all GPUs have identical averaged gradients.

  6. Update

    Each GPU applies the same averaged gradients, keeping all model replicas synchronized.
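The six steps above can be sketched in plain Python with a toy one-parameter model (y = w·x, squared-error loss) standing in for the network — no GPUs or communication library involved, just the arithmetic. The sample values are hypothetical:

```python
# Toy simulation of one data-parallel step (no GPUs, plain Python).
# Model: y = w * x, loss = (w*x - target)^2, so dL/dw = 2 * (w*x - target) * x.

def grad(w, x, target):
    return 2 * (w * x - target) * x

world_size = 4
w = 1.0  # Step 1: identical weights on every "GPU"

# Step 2: the batch is split — one sample per rank here
batch = [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0), (4.0, 2.0)]

# Steps 3-4: each rank computes its local gradient independently
local_grads = [grad(w, x, t) for x, t in batch]

# Step 5: AllReduce = average the gradients across ranks
avg_grad = sum(local_grads) / world_size

# Step 6: every rank applies the same averaged gradient
lr = 0.01
new_w = [w - lr * avg_grad for _ in range(world_size)]
assert len(set(new_w)) == 1  # all replicas remain identical
```

Because every rank sees the same averaged gradient, the replicas can never drift apart — that invariant is the whole contract of data parallelism.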

PyTorch DDP

Python - Complete DDP Training Script
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
    torch.cuda.set_device(local_rank)

    # Model: same initialization on all ranks
    model = MyModel().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Data: each rank gets a unique shard
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # Different shuffle each epoch
        for batch in loader:
            inputs = batch["input"].cuda(local_rank)
            labels = batch["label"].cuda(local_rank)

            loss = model(inputs, labels)
            loss.backward()    # Gradients synced automatically
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()

# Launch: torchrun --nproc_per_node=4 train.py

Ring AllReduce

The most common AllReduce implementation uses a ring topology:

  • GPUs are arranged in a logical ring, and each gradient buffer is divided into N chunks (N = number of GPUs)
  • In each step, every GPU sends one chunk to its neighbor in the ring
  • A full AllReduce takes 2(N-1) steps: N-1 steps of reduce-scatter, after which each GPU holds the complete sum of one chunk, then N-1 steps of allgather to distribute the completed chunks to every GPU
  • Total per-GPU communication is proportional to model size and nearly independent of GPU count
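Both phases can be simulated in plain Python. This sketch models N ranks, each holding a gradient split into N chunks (single scalars here for simplicity); after N-1 reduce-scatter steps and N-1 allgather steps, every rank holds all the summed chunks:

```python
# Simulated ring AllReduce: N ranks, each with a gradient of N scalar chunks.
N = 4
data = [[float(r * N + c) for c in range(N)] for r in range(N)]  # data[rank][chunk]
expected = [sum(r * N + c for r in range(N)) for c in range(N)]  # per-chunk totals

# Phase 1: reduce-scatter. At step s, rank r sends chunk (r - s) mod N to rank r+1,
# which adds it in. Sends happen simultaneously, so collect them before applying.
for s in range(N - 1):
    sends = [((r + 1) % N, (r - s) % N, data[r][(r - s) % N]) for r in range(N)]
    for dest, c, val in sends:
        data[dest][c] += val
# Now rank r holds the complete sum of chunk (r + 1) mod N.

# Phase 2: allgather. At step s, rank r forwards its completed chunk
# (r + 1 - s) mod N to rank r+1, which overwrites its own copy.
for s in range(N - 1):
    sends = [((r + 1) % N, (r + 1 - s) % N, data[r][(r + 1 - s) % N]) for r in range(N)]
    for dest, c, val in sends:
        data[dest][c] = val

assert all(row == expected for row in data)  # every rank has every summed chunk
```

Each of the 2(N-1) steps moves one chunk (1/N of the gradient) per GPU, so total bytes sent per GPU is about 2(N-1)/N times the gradient size — which is why the cost stays roughly constant as you add GPUs.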

Effective Batch Size

With data parallelism, the effective batch size scales with the number of GPUs:

  • Per-GPU batch size: 64 samples
  • 4 GPUs: Effective batch size = 256
  • Learning rate scaling: Linear scaling rule — multiply LR by the number of GPUs
  • Warmup: Use learning rate warmup (1-5 epochs) when scaling to large batch sizes to maintain training stability
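A minimal schedule implementing both rules — linear scaling plus linear warmup — might look like this (the base LR and step counts are hypothetical placeholders):

```python
# Linear LR scaling plus linear warmup (hypothetical numbers).
base_lr = 1e-4                   # LR tuned for a single GPU
num_gpus = 4
scaled_lr = base_lr * num_gpus   # linear scaling rule

warmup_steps = 500               # roughly 1-5 epochs' worth of optimizer steps

def lr_at(step):
    """Ramp linearly from near 0 to scaled_lr over warmup_steps, then hold."""
    if step < warmup_steps:
        return scaled_lr * (step + 1) / warmup_steps
    return scaled_lr
```

In practice the post-warmup value would usually follow a decay schedule (cosine, step, etc.); this sketch holds it constant for brevity.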

Gradient Accumulation

When per-GPU batch size is limited by memory, gradient accumulation simulates larger batches:

Python - Gradient Accumulation
accumulation_steps = 4  # Effective batch = 4 * per_gpu_batch * num_gpus

optimizer.zero_grad()
for i, batch in enumerate(loader):
    # Scale the loss so the accumulated gradient averages over the effective batch
    loss = model(batch) / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# With DDP, wrapping the non-final backward calls in model.no_sync()
# skips the redundant AllReduce on accumulation-only steps.
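To see why the loss is divided by accumulation_steps: with that scaling, accumulating gradients over micro-batches yields exactly the same update as one large averaged batch. A plain-Python check with a toy model y = w·x and squared-error loss (sample values are hypothetical):

```python
def grad(w, x, target):
    # dL/dw for L = (w*x - target)^2
    return 2 * (w * x - target) * x

w, lr = 1.0, 0.01
samples = [(1.0, 3.0), (2.0, 1.0), (3.0, 2.0), (4.0, 0.0)]
accumulation_steps = len(samples)  # one sample per micro-batch

# One big batch: average the gradients, take a single optimizer step.
big = w - lr * sum(grad(w, x, t) for x, t in samples) / len(samples)

# Accumulation: each micro-batch contributes grad / accumulation_steps,
# summed up before the single optimizer step.
acc = 0.0
for x, t in samples:
    acc += grad(w, x, t) / accumulation_steps
small = w - lr * acc

assert abs(big - small) < 1e-12  # identical update
```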
Key takeaway: Data parallelism replicates the model and splits data. PyTorch DDP with NCCL AllReduce provides near-linear scaling. Remember to scale the learning rate with the number of GPUs and use warmup for stability.