Transferability of Attacks
Lesson 4 of 7 in the Adversarial Attacks & Defenses course.
Why Adversarial Examples Transfer
One of the most surprising and security-relevant properties of adversarial examples is their transferability: adversarial inputs crafted against one model often fool other models trained on the same task, even when those models have different architectures, training data, or hyperparameters. Understanding why this happens is critical for assessing the real-world risk of adversarial attacks.
Theoretical Explanations
Several theories explain adversarial transferability:
- Shared feature representations: Models trained on the same task learn similar features. Perturbations that corrupt these shared features affect multiple models
- Linearity hypothesis: Despite non-linear activations, deep networks behave approximately linearly in high-dimensional input space, so gradient directions are correlated across architectures
- Decision boundary alignment: Models solving the same classification task develop roughly similar decision boundaries in input space
- Non-robust features: Models exploit highly predictive but fragile statistical patterns in data. Adversarial examples corrupt these common non-robust features
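The shared-features explanation can be probed directly. The sketch below (toy MLPs and a synthetic 2D dataset, purely illustrative; `make_mlp`, `train`, and `fgsm` are hypothetical helper names) trains two models with different random initializations on the same task, crafts FGSM examples against one, and measures how often they fool the other:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_mlp(seed):
    # Same architecture, different random initialization
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))

def train(model, x, y, steps=200):
    opt = torch.optim.Adam(model.parameters(), lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()

def fgsm(model, x, y, eps):
    # One-step gradient-sign attack against `model`
    x_adv = x.clone().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

torch.manual_seed(0)
x = torch.randn(500, 2)
y = (x[:, 0] + x[:, 1] > 0).long()      # toy linearly separable task

source, target = make_mlp(1), make_mlp(2)
train(source, x, y)
train(target, x, y)

adv = fgsm(source, x, y, eps=0.5)
src_fooled = (source(adv).argmax(1) != y).float().mean().item()
tgt_fooled = (target(adv).argmax(1) != y).float().mean().item()
print(f"fooled source: {src_fooled:.2f}, fooled target: {tgt_fooled:.2f}")
```

Because both models learn similar decision boundaries on this task, a nontrivial fraction of the examples crafted against `source` typically fool `target` as well, even though `target` was never queried during the attack.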
Factors Affecting Transfer Rate
Not all adversarial examples transfer equally. Several factors influence the transfer success rate:
Model Architecture Similarity
Transfer rates are highest between similar architectures:
- ResNet-50 to ResNet-101: Very high transfer rate (70-85%)
- ResNet to DenseNet: Moderate transfer rate (50-65%)
- CNN to Vision Transformer: Lower transfer rate (30-50%)
- Neural network to decision tree: Very low transfer rate (5-15%)
Attack Strength and Method
Attack strength and optimization method affect transferability in non-obvious ways:
- FGSM with large epsilon: Higher transfer rate than small epsilon, but more visible perturbation
- PGD with many steps: Can actually reduce transfer rate due to overfitting to the source model's specific decision boundary
- Momentum-based attacks (MI-FGSM): Adding momentum to iterative attacks improves transferability by smoothing the optimization landscape
import torch
import torch.nn.functional as F

def mi_fgsm_attack(model, images, labels, epsilon, alpha, num_steps, decay=1.0):
    """Momentum Iterative FGSM (MI-FGSM) - improved transferability.

    Adding momentum stabilizes the gradient direction across iterations,
    preventing overfitting to the source model and improving transfer.
    """
    adv_images = images.clone().detach()
    momentum = torch.zeros_like(images)

    for step in range(num_steps):
        adv_images.requires_grad_(True)
        outputs = model(adv_images)
        loss = F.cross_entropy(outputs, labels)
        model.zero_grad()
        loss.backward()

        with torch.no_grad():
            # Normalize the gradient by its L1 norm, per the MI-FGSM update rule
            grad = adv_images.grad
            grad = grad / (torch.norm(grad, p=1, dim=(1, 2, 3), keepdim=True) + 1e-12)
            # Accumulate momentum across iterations
            momentum = decay * momentum + grad
            # Step in the momentum direction, then project back into the
            # epsilon-ball around the original images and the valid pixel range
            adv_images = adv_images + alpha * momentum.sign()
            perturbation = torch.clamp(adv_images - images, -epsilon, epsilon)
            adv_images = torch.clamp(images + perturbation, 0.0, 1.0)

    return adv_images.detach()
# Techniques to improve transferability:
TRANSFER_TECHNIQUES = {
    "MI-FGSM": "Add momentum to gradient accumulation",
    "DI-FGSM": "Apply random input diversification (resize + pad)",
    "TI-FGSM": "Convolve gradients with a translation kernel",
    "SI-FGSM": "Average gradients over scaled copies of the input",
    "Ensemble": "Attack multiple source models simultaneously",
    "Skip gradient": "Favor gradients flowing through residual skip connections",
}

for name, desc in TRANSFER_TECHNIQUES.items():
    print(f"{name:12s} | {desc}")
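Of the techniques above, ensemble attacks are the simplest to sketch. The example below (hypothetical `ensemble_fgsm` helper with toy stand-in models) averages the loss over several source models so the resulting gradient keeps the shared, transferable components of the perturbation and suppresses model-specific ones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ensemble_fgsm(models, images, labels, epsilon):
    """One-step attack averaging the loss over several source models."""
    adv = images.clone().requires_grad_(True)
    loss = sum(F.cross_entropy(m(adv), labels) for m in models) / len(models)
    loss.backward()
    adv = adv + epsilon * adv.grad.sign()
    return adv.clamp(0.0, 1.0).detach()

# Toy usage with untrained stand-in models (illustrative only)
torch.manual_seed(0)
models = [nn.Sequential(nn.Flatten(), nn.Linear(12, 3)) for _ in range(3)]
x = torch.rand(4, 3, 2, 2)               # "images" in [0, 1]
y = torch.tensor([0, 1, 2, 0])
adv = ensemble_fgsm(models, x, y, epsilon=0.03)
```

The same averaging idea extends to the iterative MI-FGSM loop: replace the single-model loss with the ensemble loss and keep the momentum update unchanged.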
Improving Transferability
Researchers have developed several techniques to maximize transfer rates:
- Input diversity (DI): Apply random transformations (resizing, padding, rotation) to inputs at each attack step. This prevents the adversarial example from exploiting features specific to the source model's input processing
- Translation invariance (TI): Convolve gradients with a kernel before applying, making perturbations effective across small spatial shifts
- Scale invariance (SI): Average gradients computed at multiple scales of the input
- Ensemble attacks: Compute gradients from multiple models and average them before updating the perturbation
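The input-diversity transform can be sketched as a small preprocessing function applied before each gradient computation. This is a sketch under assumed toy dimensions; `input_diversity` and its size parameters are hypothetical, chosen for 32x32 inputs:

```python
import torch
import torch.nn.functional as F

def input_diversity(x, low=24, high=32, prob=0.5):
    """Random resize-and-pad transform in the style of DI-FGSM.

    With probability `prob`, resize the batch to a random smaller size and
    pad it back to `high` x `high`, so gradients stop exploiting pixel
    alignments specific to the source model's input processing.
    """
    if torch.rand(1).item() >= prob:
        return x
    size = int(torch.randint(low, high, (1,)).item())
    resized = F.interpolate(x, size=(size, size), mode="nearest")
    pad_total = high - size
    left = int(torch.randint(0, pad_total + 1, (1,)).item())
    top = int(torch.randint(0, pad_total + 1, (1,)).item())
    return F.pad(resized, (left, pad_total - left, top, pad_total - top))

torch.manual_seed(0)
x = torch.rand(2, 3, 32, 32)
out = input_diversity(x)
```

In an attack loop, the gradient would be computed on `model(input_diversity(adv_images))` rather than on the raw adversarial batch.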
Security Implications of Transferability
Transferability has profound security implications:
- No security through obscurity: Keeping your model architecture secret does not protect against adversarial attacks. An attacker can build a substitute and transfer attacks
- Open-source model risk: If your model is based on a public architecture or fine-tuned from a public checkpoint, attackers have an excellent starting point for transfer attacks
- Defense diversification: Using fundamentally different model types (e.g., neural network + gradient boosting ensemble) reduces transfer risk
- Universal perturbations: Some adversarial perturbations transfer across images and across models, creating image-agnostic attack patches
Measuring Transfer Rates
When evaluating your model's vulnerability to transfer attacks, systematically test against adversarial examples generated from diverse source models. Report transfer rates stratified by source architecture, attack method, and perturbation budget to build a complete picture of your model's attack surface.
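A minimal evaluation harness for this kind of stratified reporting might look like the sketch below (the `transfer_rate` helper and the key structure of `adv_sets` are hypothetical, with untrained toy models standing in for real targets):

```python
import torch
import torch.nn as nn

def transfer_rate(target_model, adv_images, labels):
    """Fraction of adversarial examples the target model misclassifies."""
    with torch.no_grad():
        preds = target_model(adv_images).argmax(dim=1)
    return (preds != labels).float().mean().item()

# Hypothetical setup: adv_sets maps (source_arch, attack, eps) -> adversarial batch
torch.manual_seed(0)
labels = torch.randint(0, 3, (32,))
targets = {"mlp_a": nn.Linear(8, 3), "mlp_b": nn.Linear(8, 3)}
adv_sets = {("resnet", "MI-FGSM", 0.03): torch.rand(32, 8)}

for (src, attack, eps), adv in adv_sets.items():
    for name, model in targets.items():
        rate = transfer_rate(model, adv, labels)
        print(f"{src} -> {name} | {attack} eps={eps}: {rate:.1%}")
```

In practice the adversarial batches would come from real attacks against real source models, but the stratified reporting loop stays the same.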
Summary
Adversarial transferability is both a theoretical puzzle and a practical security threat. It enables black-box attacks without any queries to the target model, making it one of the most accessible attack vectors. Defenses must account for transferability by using diverse model architectures, robust training methods, and input preprocessing that disrupts transferred perturbations. The next lesson covers adversarial training, the most direct defense against these attacks.
Lilly Tech Systems