Adversarial Training
Lesson 5 of 7 in the Adversarial Attacks & Defenses course.
Adversarial Training: The Most Effective Defense
Adversarial training is the process of augmenting the training dataset with adversarial examples and training the model to correctly classify them. It is widely considered the most effective defense against adversarial attacks, particularly L-infinity bounded perturbations. The core idea is simple: if you train a model to handle adversarial inputs, it learns more robust decision boundaries.
How Adversarial Training Works
The standard adversarial training procedure proposed by Madry et al. (2017) replaces the standard training loss with a robust loss:
- For each training batch, generate adversarial examples using PGD
- Compute the loss on the adversarial examples instead of (or in addition to) the clean examples
- Update model parameters to minimize this adversarial loss
- Repeat until convergence
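The four steps above implement the saddle-point objective from Madry et al., where PGD approximates the inner maximization and SGD performs the outer minimization:

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\|\delta\|_{\infty} \le \epsilon} \mathcal{L}\big(f_{\theta}(x + \delta),\, y\big) \Big]$$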
```python
import torch
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR


class AdversarialTrainer:
    """PGD-based adversarial training for robust model development."""

    def __init__(self, model, epsilon=8/255, alpha=2/255,
                 num_attack_steps=10, lr=0.1):
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha
        self.num_attack_steps = num_attack_steps
        self.optimizer = SGD(model.parameters(), lr=lr,
                             momentum=0.9, weight_decay=5e-4)
        self.scheduler = CosineAnnealingLR(self.optimizer, T_max=200)

    def pgd_inner(self, images, labels):
        """Inner maximization: find the worst-case perturbation."""
        adv = images.clone().detach()
        adv += torch.empty_like(adv).uniform_(-self.epsilon, self.epsilon)
        adv = torch.clamp(adv, 0.0, 1.0)
        for _ in range(self.num_attack_steps):
            adv.requires_grad_(True)
            loss = F.cross_entropy(self.model(adv), labels)
            # Differentiate w.r.t. the input only, so the attack steps
            # do not accumulate gradients into the model parameters
            grad, = torch.autograd.grad(loss, adv)
            with torch.no_grad():
                adv = adv + self.alpha * grad.sign()
                delta = torch.clamp(adv - images, -self.epsilon, self.epsilon)
                adv = torch.clamp(images + delta, 0.0, 1.0)
        return adv.detach()

    def train_step(self, images, labels):
        """One step of adversarial training."""
        self.model.train()
        # Generate adversarial examples (inner maximization)
        adv_images = self.pgd_inner(images, labels)
        # Outer minimization: train on adversarial examples
        self.optimizer.zero_grad()
        outputs = self.model(adv_images)
        loss = F.cross_entropy(outputs, labels)
        loss.backward()
        self.optimizer.step()
        # Compute metrics
        predictions = outputs.argmax(dim=1)
        accuracy = (predictions == labels).float().mean().item()
        return loss.item(), accuracy

    def train_epoch(self, dataloader):
        """Train for one full epoch."""
        total_loss = 0
        total_acc = 0
        for images, labels in dataloader:
            images, labels = images.cuda(), labels.cuda()
            loss, acc = self.train_step(images, labels)
            total_loss += loss
            total_acc += acc
        self.scheduler.step()
        n = len(dataloader)
        return total_loss / n, total_acc / n
```
The Accuracy-Robustness Trade-off
One of the key challenges of adversarial training is the trade-off between clean accuracy (performance on normal inputs) and robust accuracy (performance on adversarial inputs):
- Standard training: ~95% clean accuracy, ~0% robust accuracy on CIFAR-10
- Adversarial training: ~87% clean accuracy, ~50-55% robust accuracy on CIFAR-10
- There is an inherent tension between fitting the clean data distribution precisely and being robust to perturbations around each data point
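As noted earlier, the loss can be computed on adversarial examples "in addition to" the clean ones. A weighted mix of the two losses is a simple way to move along this trade-off. A minimal sketch; the function name and `clean_weight` parameter are illustrative, not from a standard library:

```python
import torch
import torch.nn.functional as F

def mixed_loss(model, images, adv_images, labels, clean_weight=0.5):
    """Weighted sum of clean and adversarial cross-entropy.

    clean_weight=1.0 recovers standard training; clean_weight=0.0 is
    pure adversarial training on the worst-case inputs.
    """
    clean_loss = F.cross_entropy(model(images), labels)
    adv_loss = F.cross_entropy(model(adv_images), labels)
    return clean_weight * clean_loss + (1.0 - clean_weight) * adv_loss
```

Raising `clean_weight` pulls the model toward the clean data distribution at the cost of robustness, and vice versa; TRADES (below in this lesson) formalizes this intuition with a principled surrogate loss.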
Techniques to Improve the Trade-off
- TRADES loss: Balances clean and robust accuracy with a tunable coefficient (the beta parameter in the code below)
- Curriculum adversarial training: Gradually increase perturbation strength during training
- Label smoothing: Reduces overconfidence and can improve both clean and robust accuracy
- Larger models: Wider and deeper networks have more capacity to learn both clean and robust features
- Extra data: Using additional unlabeled data for semi-supervised adversarial training significantly improves both metrics
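Curriculum adversarial training, from the list above, can be as simple as a schedule on the perturbation budget. A hedged sketch; the linear ramp and the half-of-training warm-up are illustrative choices, not a prescribed recipe:

```python
def curriculum_epsilon(epoch, total_epochs, target_epsilon=8/255,
                       warmup_frac=0.5):
    """Linearly ramp the perturbation budget toward its target value.

    The budget grows over the first warmup_frac of training, then
    stays at target_epsilon for the remaining epochs.
    """
    warmup_epochs = int(total_epochs * warmup_frac)
    if warmup_epochs == 0 or epoch >= warmup_epochs:
        return target_epsilon
    return target_epsilon * (epoch + 1) / warmup_epochs
```

The returned value would be assigned to the trainer's epsilon at the start of each epoch, so early epochs see weak attacks and later epochs see the full threat model.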
TRADES: Balanced Adversarial Training
The TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization) method explicitly balances clean and robust accuracy:
```python
def trades_loss(model, images, labels, epsilon, alpha, num_steps, beta=6.0):
    """TRADES loss for balanced adversarial training.

    Loss = CE(model(x), y) + beta * KL(model(x) || model(x_adv))

    The first term maintains clean accuracy.
    The second term encourages robust predictions.
    Beta controls the trade-off (higher = more robust, less accurate).
    """
    model.eval()
    # Generate adversarial examples by maximizing the KL divergence
    # between clean and adversarial predictions
    adv = images.clone().detach() + torch.empty_like(images).uniform_(
        -0.001, 0.001)
    with torch.no_grad():
        clean_logits = model(images)
    for _ in range(num_steps):
        adv.requires_grad_(True)
        adv_logits = model(adv)
        kl_loss = F.kl_div(
            F.log_softmax(adv_logits, dim=1),
            F.softmax(clean_logits, dim=1),
            reduction='batchmean'
        )
        # Differentiate w.r.t. the input only
        grad, = torch.autograd.grad(kl_loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()
            delta = torch.clamp(adv - images, -epsilon, epsilon)
            adv = torch.clamp(images + delta, 0.0, 1.0)
    model.train()
    # Compute TRADES loss
    clean_logits = model(images)
    adv_logits = model(adv.detach())
    clean_loss = F.cross_entropy(clean_logits, labels)
    robust_loss = F.kl_div(
        F.log_softmax(adv_logits, dim=1),
        F.softmax(clean_logits, dim=1),
        reduction='batchmean'
    )
    return clean_loss + beta * robust_loss
```
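A detail worth double-checking when implementing the KL term: PyTorch's F.kl_div takes log-probabilities as its first argument and probabilities as its second, and computes KL(target || input). A minimal sanity check with hand-picked logits (the tensor values are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn.functional as F

clean_logits = torch.tensor([[2.0, 0.5, -1.0], [0.0, 1.0, 2.0]])
adv_logits = torch.tensor([[1.0, 1.0, 0.0], [0.0, 0.0, 0.0]])

p_clean = F.softmax(clean_logits, dim=1)
log_p_adv = F.log_softmax(adv_logits, dim=1)

# F.kl_div(input, target) with log-space input computes KL(target || input),
# so this is KL(model(x) || model(x_adv)), matching the TRADES objective
kl_builtin = F.kl_div(log_p_adv, p_clean, reduction='batchmean')

# The same quantity written out term by term, averaged over the batch
kl_manual = (p_clean * (p_clean.log() - log_p_adv)).sum(dim=1).mean()
```

Swapping the argument order would silently optimize the reverse KL divergence, which changes which distribution the perturbation is pushed away from.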
Practical Considerations
- Perturbation budget: Match your training epsilon to the threat model. For CIFAR-10, 8/255 is standard. For ImageNet, 4/255 is common
- PGD steps during training: 7-10 steps is sufficient for training. Evaluation should use 20+ steps
- Learning rate schedule: Adversarial training benefits from longer training with slower learning rate decay
- Model selection: Use robust accuracy (not clean accuracy) on a held-out set for early stopping and model selection
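The evaluation advice above (20+ PGD steps, robust accuracy for model selection) can be sketched as a standalone evaluation loop. This is a minimal illustration under the L-infinity threat model, not a full evaluation suite; stronger attack ensembles such as AutoAttack are preferred for final reporting:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, epsilon=8/255, alpha=2/255, steps=20):
    """Untargeted L-infinity PGD with random start, for evaluation."""
    adv = images.clone().detach()
    adv += torch.empty_like(adv).uniform_(-epsilon, epsilon)
    adv = torch.clamp(adv, 0.0, 1.0)
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), labels)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()
            delta = torch.clamp(adv - images, -epsilon, epsilon)
            adv = torch.clamp(images + delta, 0.0, 1.0)
    return adv.detach()

def robust_accuracy(model, dataloader, **attack_kwargs):
    """Accuracy on PGD adversarial examples; use for model selection."""
    model.eval()  # eval mode still allows gradients w.r.t. the input
    correct, total = 0, 0
    for images, labels in dataloader:
        adv = pgd_attack(model, images, labels, **attack_kwargs)
        with torch.no_grad():
            preds = model(adv).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```

Running this on a held-out set after each epoch, and keeping the checkpoint with the highest robust accuracy, implements the model-selection advice above.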
Summary
Adversarial training is the most reliable defense against adversarial attacks, but it comes with computational costs and accuracy trade-offs. TRADES provides a principled way to balance clean and robust performance. In the next lesson, we explore certified defenses that provide mathematical guarantees of robustness.
Lilly Tech Systems