Adversarial Training

Lesson 5 of 7 in the Adversarial Attacks & Defenses course.

Adversarial Training: The Most Effective Defense

Adversarial training is the process of augmenting the training dataset with adversarial examples and training the model to correctly classify them. It is widely considered the most effective defense against adversarial attacks, particularly L-infinity bounded perturbations. The core idea is simple: if you train a model to handle adversarial inputs, it learns more robust decision boundaries.

How Adversarial Training Works

The standard adversarial training procedure proposed by Madry et al. (2017) replaces the standard training loss with a robust, min-max loss: an inner maximization finds a worst-case perturbation for each input, and an outer minimization updates the model weights against it:

  1. For each training batch, generate adversarial examples using PGD
  2. Compute the loss on the adversarial examples instead of (or in addition to) the clean examples
  3. Update model parameters to minimize this adversarial loss
  4. Repeat until convergence
Python
import torch
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

class AdversarialTrainer:
    """PGD-based adversarial training for robust model development."""

    def __init__(self, model, epsilon=8/255, alpha=2/255,
                 num_attack_steps=10, lr=0.1):
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha
        self.num_attack_steps = num_attack_steps
        self.optimizer = SGD(model.parameters(), lr=lr,
                            momentum=0.9, weight_decay=5e-4)
        # T_max assumes a 200-epoch training run
        self.scheduler = CosineAnnealingLR(self.optimizer, T_max=200)

    def pgd_inner(self, images, labels):
        """Inner maximization: find the worst-case perturbation."""
        adv = images.clone().detach()
        adv += torch.empty_like(adv).uniform_(-self.epsilon, self.epsilon)
        adv = torch.clamp(adv, 0.0, 1.0)

        for _ in range(self.num_attack_steps):
            adv.requires_grad_(True)
            loss = F.cross_entropy(self.model(adv), labels)
            # Differentiate w.r.t. the input only, so attack gradients
            # do not accumulate into the model parameters
            grad, = torch.autograd.grad(loss, adv)

            with torch.no_grad():
                # Ascend the loss, then project back into the
                # epsilon-ball and the valid pixel range
                adv = adv + self.alpha * grad.sign()
                delta = torch.clamp(adv - images, -self.epsilon, self.epsilon)
                adv = torch.clamp(images + delta, 0.0, 1.0)

        return adv.detach()

    def train_step(self, images, labels):
        """One step of adversarial training."""
        self.model.train()

        # Generate adversarial examples (inner maximization)
        adv_images = self.pgd_inner(images, labels)

        # Outer minimization: train on adversarial examples
        self.optimizer.zero_grad()
        outputs = self.model(adv_images)
        loss = F.cross_entropy(outputs, labels)
        loss.backward()
        self.optimizer.step()

        # Compute metrics
        predictions = outputs.argmax(dim=1)
        accuracy = (predictions == labels).float().mean().item()

        return loss.item(), accuracy

    def train_epoch(self, dataloader):
        """Train for one full epoch."""
        total_loss = 0
        total_acc = 0
        for images, labels in dataloader:
            # Assumes a CUDA device is available; use .to(device) otherwise
            images, labels = images.cuda(), labels.cuda()
            loss, acc = self.train_step(images, labels)
            total_loss += loss
            total_acc += acc
        self.scheduler.step()
        n = len(dataloader)
        return total_loss / n, total_acc / n
💡 Critical tip: Adversarial training is 3-10x more expensive than standard training because each training step requires running PGD (multiple forward and backward passes) to generate adversarial examples. Budget your compute resources accordingly.
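
A back-of-the-envelope way to see where that cost comes from (a hypothetical helper, not part of the trainer above): each PGD step costs one forward and one backward pass through the model, and the parameter update adds one more forward/backward pair, so k attack steps make each training step roughly (k + 1) times as expensive.

```python
def adv_training_cost_multiplier(num_attack_steps: int) -> int:
    """Rough per-step cost of adversarial training vs. standard training.

    Each PGD step needs one forward and one backward pass through the
    model; the parameter update itself adds one more forward/backward
    pair, so total work is roughly (k + 1)x a standard step.
    """
    return num_attack_steps + 1
```

With the 10-step PGD used above this gives roughly an 11x per-step cost; shorter attack schedules (3-7 steps) land closer to the low end of the 3-10x range quoted in the tip.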

The Accuracy-Robustness Trade-off

One of the key challenges of adversarial training is the trade-off between clean accuracy (performance on normal inputs) and robust accuracy (performance on adversarial inputs):

  • Standard training: ~95% clean accuracy, ~0% robust accuracy on CIFAR-10
  • Adversarial training: ~87% clean accuracy, ~50-55% robust accuracy on CIFAR-10
  • There is an inherent tension between fitting the clean data distribution precisely and being robust to perturbations around each data point
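
The trade-off only becomes visible if both metrics are actually measured. Below is a minimal evaluation sketch, assuming a PyTorch classifier with inputs in [0, 1]; the function name, defaults, and the 20-step test-time PGD attack are illustrative choices, not a fixed API.

```python
import torch
import torch.nn.functional as F

def evaluate_robustness(model, dataloader, epsilon=8/255, alpha=2/255,
                        num_steps=20, device="cpu"):
    """Report clean and robust accuracy side by side.

    Robust accuracy is measured with a multi-step PGD attack at
    evaluation time (more steps than used during training).
    """
    model.eval()
    clean_correct = robust_correct = total = 0
    for images, labels in dataloader:
        images, labels = images.to(device), labels.to(device)

        # Clean accuracy: a plain forward pass
        with torch.no_grad():
            clean_correct += (model(images).argmax(1) == labels).sum().item()

        # Robust accuracy: PGD with random start, as in training
        adv = images.clone().detach()
        adv += torch.empty_like(adv).uniform_(-epsilon, epsilon)
        adv = torch.clamp(adv, 0.0, 1.0)
        for _ in range(num_steps):
            adv.requires_grad_(True)
            loss = F.cross_entropy(model(adv), labels)
            grad, = torch.autograd.grad(loss, adv)
            with torch.no_grad():
                adv = adv + alpha * grad.sign()
                delta = torch.clamp(adv - images, -epsilon, epsilon)
                adv = torch.clamp(images + delta, 0.0, 1.0)

        with torch.no_grad():
            robust_correct += (model(adv).argmax(1) == labels).sum().item()
        total += labels.size(0)
    return clean_correct / total, robust_correct / total
```

Tracking both numbers per epoch makes the tension concrete: pushing robust accuracy up typically pulls clean accuracy down.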

Techniques to Improve the Trade-off

  • TRADES loss: Balances clean and robust accuracy with a tunable coefficient (beta in the code below; higher beta weights robustness more heavily)
  • Curriculum adversarial training: Gradually increase perturbation strength during training
  • Label smoothing: Reduces overconfidence and can improve both clean and robust accuracy
  • Larger models: Wider and deeper networks have more capacity to learn both clean and robust features
  • Extra data: Using additional unlabeled data for semi-supervised adversarial training significantly improves both metrics
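
Curriculum adversarial training can be as simple as ramping the perturbation budget over a warm-up period. A minimal sketch (the function name and the linear schedule shape are illustrative assumptions); it could be wired into the trainer above by setting `trainer.epsilon = curriculum_epsilon(epoch)` at the start of each epoch:

```python
def curriculum_epsilon(epoch, warmup_epochs=20, target_epsilon=8/255):
    """Linearly ramp the perturbation budget from 0 to the target.

    Early epochs see weak attacks, so the model first fits the clean
    distribution, then hardens gradually as epsilon grows.
    """
    if epoch >= warmup_epochs:
        return target_epsilon
    return target_epsilon * epoch / warmup_epochs
```

Other schedule shapes (step-wise, cosine) follow the same idea; what matters is that the attack strength does not start at full budget.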

TRADES: Balanced Adversarial Training

The TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization) method explicitly balances clean and robust accuracy:

Python
def trades_loss(model, images, labels, epsilon, alpha, num_steps, beta=6.0):
    """TRADES loss for balanced adversarial training.

    Loss = CE(model(x), y) + beta * KL(model(x) || model(x_adv))

    The first term maintains clean accuracy.
    The second term encourages robust predictions.
    Beta controls the trade-off (higher = more robust, less accurate).
    """
    model.eval()
    # Generate adversarial examples using KL divergence
    adv = images.clone().detach() + torch.empty_like(images).uniform_(
        -0.001, 0.001)

    # Clean logits serve only as the attack target; no gradient needed
    with torch.no_grad():
        clean_logits = model(images)

    for _ in range(num_steps):
        adv.requires_grad_(True)
        adv_logits = model(adv)
        kl_loss = F.kl_div(
            F.log_softmax(adv_logits, dim=1),
            F.softmax(clean_logits, dim=1),
            reduction='batchmean'
        )
        # Differentiate w.r.t. the input only, so attack gradients
        # do not accumulate into the model parameters
        grad, = torch.autograd.grad(kl_loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()
            delta = torch.clamp(adv - images, -epsilon, epsilon)
            adv = torch.clamp(images + delta, 0.0, 1.0)

    model.train()
    # Compute TRADES loss
    clean_logits = model(images)
    adv_logits = model(adv.detach())

    clean_loss = F.cross_entropy(clean_logits, labels)
    robust_loss = F.kl_div(
        F.log_softmax(adv_logits, dim=1),
        F.softmax(clean_logits, dim=1),
        reduction='batchmean'
    )

    return clean_loss + beta * robust_loss

Practical Considerations

  • Perturbation budget: Match your training epsilon to the threat model. For CIFAR-10, 8/255 is standard. For ImageNet, 4/255 is common
  • PGD steps during training: 7-10 steps is sufficient for training. Evaluation should use 20+ steps
  • Learning rate schedule: Adversarial training benefits from longer training with slower learning rate decay
  • Model selection: Use robust accuracy (not clean accuracy) on a held-out set for early stopping and model selection
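
The model-selection advice can be made concrete with a small helper (hypothetical; the `history` record format is an assumption for illustration):

```python
def select_best_checkpoint(history):
    """Pick the epoch with the highest *robust* accuracy.

    `history` is a list of (epoch, clean_acc, robust_acc) tuples measured
    on a held-out set. Selecting on clean accuracy instead would favor
    checkpoints that have drifted back toward a non-robust solution.
    """
    return max(history, key=lambda record: record[2])
```

For example, with `[(1, 0.90, 0.30), (2, 0.88, 0.45), (3, 0.91, 0.40)]` this selects epoch 2, even though epoch 3 has the best clean accuracy.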
⚠️ Warning: Adversarial training with FGSM instead of PGD can lead to catastrophic overfitting, where the model appears robust during training but is actually vulnerable. Always use multi-step PGD for adversarial training.

Summary

Adversarial training is the most reliable defense against adversarial attacks, but it comes with computational costs and accuracy trade-offs. TRADES provides a principled way to balance clean and robust performance. In the next lesson, we explore certified defenses that provide mathematical guarantees of robustness.