Transferability of Attacks

Lesson 4 of 7 in the Adversarial Attacks & Defenses course.

Why Adversarial Examples Transfer

One of the most surprising and security-relevant properties of adversarial examples is their transferability: adversarial inputs crafted against one model often fool other models trained on the same task, even when those models have different architectures, training data, or hyperparameters. Understanding why this happens is critical for assessing the real-world risk of adversarial attacks.

Theoretical Explanations

Several theories explain adversarial transferability:

  • Shared feature representations: Models trained on the same task learn similar features. Perturbations that corrupt these shared features affect multiple models
  • Linear nature of deep networks: Despite non-linear activations, the high-dimensional linear behavior of neural networks means gradient directions are correlated across architectures
  • Decision boundary alignment: Models solving the same classification task develop roughly similar decision boundaries in input space
  • Non-robust features: Models exploit highly predictive but fragile statistical patterns in data. Adversarial examples corrupt these common non-robust features
💡
Key insight: The non-robust features theory (Ilyas et al., 2019) provides the most compelling explanation. It suggests that adversarial vulnerability is not a flaw but a consequence of models learning genuinely useful statistical patterns that happen to be fragile to small perturbations.

Factors Affecting Transfer Rate

Not all adversarial examples transfer equally. Several factors influence the transfer success rate:

Model Architecture Similarity

Transfer rates tend to be highest between similar architectures; the ranges below are rough illustrations rather than benchmark results:

  • ResNet-50 to ResNet-101: Very high transfer rate (70-85%)
  • ResNet to DenseNet: Moderate transfer rate (50-65%)
  • CNN to Vision Transformer: Lower transfer rate (30-50%)
  • Neural network to decision tree: Very low transfer rate (5-15%)

Attack Strength and Method

Perturbation strength and attack method influence transferability in different, sometimes counterintuitive, ways:

  • FGSM with large epsilon: Higher transfer rate than small epsilon, but more visible perturbation
  • PGD with many steps: Can actually reduce transfer rate due to overfitting to the source model's specific decision boundary
  • Momentum-based attacks (MI-FGSM): Adding momentum to iterative attacks improves transferability by smoothing the optimization landscape
Python
import torch
import torch.nn.functional as F

def mi_fgsm_attack(model, images, labels, epsilon, alpha, num_steps, decay=1.0):
    """Momentum Iterative FGSM - improved transferability.

    Adding momentum stabilizes the gradient direction across iterations,
    preventing overfitting to the source model and improving transfer.
    """
    adv_images = images.clone().detach()
    momentum = torch.zeros_like(images)

    for step in range(num_steps):
        adv_images.requires_grad_(True)
        outputs = model(adv_images)
        loss = F.cross_entropy(outputs, labels)
        model.zero_grad()
        loss.backward()

        with torch.no_grad():
            # L1-normalize the gradient, as in the original MI-FGSM formulation
            grad = adv_images.grad
            normalized_grad = grad / (torch.norm(grad, p=1, dim=(1, 2, 3), keepdim=True) + 1e-12)

            # Accumulate momentum
            momentum = decay * momentum + normalized_grad

            # Update adversarial image
            adv_images = adv_images + alpha * momentum.sign()
            perturbation = torch.clamp(adv_images - images, -epsilon, epsilon)
            adv_images = torch.clamp(images + perturbation, 0.0, 1.0)

    return adv_images.detach()

# Techniques to improve transferability:
TRANSFER_TECHNIQUES = {
    "MI-FGSM": "Add momentum to gradient accumulation",
    "DI-FGSM": "Apply random input diversification (resize + pad)",
    "TI-FGSM": "Convolve gradients with a translation kernel",
    "SI-FGSM": "Average gradients over scaled copies of the input",
    "Ensemble": "Attack multiple source models simultaneously",
    "Skip gradient": "Use gradients from intermediate layers",
}

for name, desc in TRANSFER_TECHNIQUES.items():
    print(f"{name:12s} | {desc}")

Improving Transferability

Researchers have developed several techniques to maximize transfer rates:

  1. Input diversity (DI): Apply random transformations (resizing, padding, rotation) to inputs at each attack step. This prevents the adversarial example from exploiting features specific to the source model's input processing
  2. Translation invariance (TI): Convolve gradients with a kernel before applying, making perturbations effective across small spatial shifts
  3. Scale invariance (SI): Average gradients computed at multiple scales of the input
  4. Ensemble attacks: Compute gradients from multiple models and average them before updating the perturbation
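The input-diversity idea from item 1 can be sketched as a standalone transform that an attack loop would call on each step. This is an illustrative sketch, not a canonical implementation: the resize range (224 to 256) and the transform probability are example choices.

```python
import torch
import torch.nn.functional as F

def input_diversity(x, low=224, high=256, prob=0.5):
    """DI-style random resize-and-pad, applied with probability `prob`.

    Gradients computed on the transformed input cannot rely on an exact
    pixel grid, which tends to improve transfer to other models.
    The size range and probability here are illustrative defaults.
    """
    if torch.rand(1).item() > prob:
        return x
    # Resize to a random side length in [low, high), then zero-pad back to `high`
    rnd = torch.randint(low, high, (1,)).item()
    resized = F.interpolate(x, size=(rnd, rnd), mode="nearest")
    pad_total = high - rnd
    pad_left = torch.randint(0, pad_total + 1, (1,)).item()
    pad_top = torch.randint(0, pad_total + 1, (1,)).item()
    return F.pad(resized, (pad_left, pad_total - pad_left,
                           pad_top, pad_total - pad_top), value=0.0)
```

Inside an attack loop such as mi_fgsm_attack above, one would replace `model(adv_images)` with `model(input_diversity(adv_images))` so each gradient step sees a slightly different view of the input.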

Security Implications of Transferability

Transferability has profound security implications:

  • No security through obscurity: Keeping your model architecture secret does not protect against adversarial attacks. An attacker can build a substitute and transfer attacks
  • Open-source model risk: If your model is based on a public architecture or fine-tuned from a public checkpoint, attackers have an excellent starting point for transfer attacks
  • Defense diversification: Using fundamentally different model types (e.g., neural network + gradient boosting ensemble) reduces transfer risk
  • Universal perturbations: Some adversarial perturbations transfer across images and across models, creating image-agnostic attack patches
Warning: Universal adversarial perturbations (UAPs) are a single perturbation pattern that fools a model on most inputs. These UAPs also transfer between models, meaning a single attack artifact could be effective against many different deployed systems.
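A minimal sketch of how a single image-agnostic perturbation might be built up. Note this is a simplified gradient-ascent variant for illustration, not the DeepFool-based algorithm from the original UAP paper; the function name and hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def universal_perturbation(model, loader, epsilon, alpha, num_epochs=1):
    """Illustrative UAP-style loop (simplified, not the original algorithm).

    A single perturbation `delta` is updated with signed gradients
    accumulated across many images, then clipped to the epsilon ball
    so the same pattern stays valid for every input.
    """
    delta = None
    for _ in range(num_epochs):
        for images, labels in loader:
            if delta is None:
                delta = torch.zeros_like(images[:1])
            adv = (images + delta).clamp(0.0, 1.0).requires_grad_(True)
            loss = F.cross_entropy(model(adv), labels)
            grad, = torch.autograd.grad(loss, adv)
            with torch.no_grad():
                # One gradient-ascent step shared across the whole batch
                step = alpha * grad.sum(dim=0, keepdim=True).sign()
                delta = (delta + step).clamp(-epsilon, epsilon)
    return delta
```

Because `delta` is clipped to the epsilon ball rather than recomputed per image, the returned tensor can be added to any input of the same shape.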

Measuring Transfer Rates

When evaluating your model's vulnerability to transfer attacks, systematically test against adversarial examples generated from diverse source models. Report transfer rates stratified by source architecture, attack method, and perturbation budget to build a complete picture of your model's attack surface.
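One way such a measurement could look in code, sketched here with a single-step FGSM source attack. The helper names and the specific definition of transfer rate (the fooled fraction of inputs the target classified correctly before the attack) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def fgsm(model, images, labels, epsilon):
    """Single-step FGSM, used here only to craft source-model examples."""
    images = images.clone().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad, = torch.autograd.grad(loss, images)
    return (images + epsilon * grad.sign()).clamp(0.0, 1.0).detach()

def transfer_rate(source_model, target_model, images, labels, epsilon):
    """Fraction of target-correct inputs misclassified after applying
    adversarial examples crafted against the source model."""
    adv = fgsm(source_model, images, labels, epsilon)
    with torch.no_grad():
        clean_correct = target_model(images).argmax(dim=1) == labels
        adv_wrong = target_model(adv).argmax(dim=1) != labels
    fooled = (clean_correct & adv_wrong).sum().item()
    total = clean_correct.sum().item()
    return fooled / max(total, 1)
```

Running this over a grid of source architectures, attack methods, and epsilon values yields the stratified report described above.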

Summary

Adversarial transferability is both a theoretical puzzle and a practical security threat. It enables black-box attacks without any queries to the target model, making it one of the most accessible attack vectors. Defenses must account for transferability by using diverse model architectures, robust training methods, and input preprocessing that disrupts transferred perturbations. The next lesson covers adversarial training, the most direct defense against these attacks.