Defense Evaluation
Lesson 7 of 7 in the Adversarial Attacks & Defenses course.
Evaluating Adversarial Defenses
Evaluating adversarial defenses is surprisingly difficult. The history of adversarial ML research includes many defenses that appeared effective in initial evaluations but were later shown to provide false security. This lesson covers best practices for rigorous defense evaluation to avoid common pitfalls.
The Problem with Naive Evaluation
Many published defenses have been broken because they were evaluated against weak attacks:
- Gradient masking: Defenses that obscure gradients can make gradient-based attacks fail while remaining vulnerable to gradient-free or transfer attacks
- Obfuscated gradients: Athalye et al. (2018) identified three types: shattered gradients, stochastic gradients, and vanishing/exploding gradients, all of which create a false sense of robustness
- Weak attack parameters: Using too few PGD steps, wrong step size, or insufficient random restarts
- Adaptive attack failure: Not adapting the attack to account for the specific defense mechanism
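The last point deserves a concrete illustration. A standard adaptive technique is BPDA (Backward Pass Differentiable Approximation, from the Athalye et al. obfuscated-gradients work): apply the non-differentiable defense on the forward pass, but substitute a differentiable approximation (often the identity) on the backward pass so gradients flow to the attacker again. A minimal PyTorch sketch, using a toy quantization defense as the "shattered gradient" example:

```python
import torch


class QuantizeBPDA(torch.autograd.Function):
    """Toy 'shattered gradient' defense: input quantization.

    Forward applies the real, non-differentiable defense (its true
    gradient is zero almost everywhere). Backward substitutes the
    identity (BPDA), so a gradient attack can see through the defense.
    """

    @staticmethod
    def forward(ctx, x):
        # Quantize to multiples of 1/8 - non-differentiable step function
        return torch.round(x * 8.0) / 8.0

    @staticmethod
    def backward(ctx, grad_output):
        # Approximate d(defense)/dx with the identity
        return grad_output


x = torch.tensor([0.30, 0.70], requires_grad=True)
y = QuantizeBPDA.apply(x)
y.sum().backward()
print(y.tolist())       # [0.25, 0.75] - inputs quantized to multiples of 1/8
print(x.grad.tolist())  # [1.0, 1.0] - identity gradient flows through
```

Without the BPDA backward pass, `x.grad` would be zero everywhere and a gradient-based attack would falsely report the defense as robust.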
Rigorous Evaluation Protocol
Follow this checklist for credible defense evaluation:
```python
import torch
import torch.nn.functional as F


class RobustnessEvaluator:
    """Comprehensive adversarial robustness evaluation suite."""

    def __init__(self, model, test_loader, epsilon, device='cuda'):
        self.model = model
        self.test_loader = test_loader
        self.epsilon = epsilon
        self.device = device

    def full_evaluation(self):
        """Run the complete evaluation protocol."""
        results = {}

        # 1. Clean accuracy (baseline)
        results['clean'] = self._eval_clean()
        print(f"Clean accuracy: {results['clean']:.2%}")

        # 2. FGSM (sanity check - should be easy to defend against)
        results['fgsm'] = self._eval_fgsm()
        print(f"FGSM accuracy: {results['fgsm']:.2%}")

        # 3. PGD with standard parameters
        results['pgd_20'] = self._eval_pgd(steps=20, restarts=1)
        print(f"PGD-20 accuracy: {results['pgd_20']:.2%}")

        # 4. PGD with many steps and restarts (thorough)
        results['pgd_100_r5'] = self._eval_pgd(steps=100, restarts=5)
        print(f"PGD-100 (5 restarts) accuracy: {results['pgd_100_r5']:.2%}")

        # 5. AutoAttack (strongest standardized evaluation)
        results['autoattack'] = self._eval_autoattack()
        print(f"AutoAttack accuracy: {results['autoattack']:.2%}")

        # Sanity checks
        self._check_gradient_masking(results)
        return results

    def _check_gradient_masking(self, results):
        """Detect signs of gradient masking (false robustness)."""
        warnings = []
        # Sign 1: single-step FGSM beats multi-step PGD
        if results['fgsm'] < results['pgd_20']:
            warnings.append("FGSM stronger than PGD - possible gradient masking")
        # Sign 2: more PGD steps and restarts don't help the attack at all
        if results['pgd_100_r5'] > results['pgd_20'] * 0.98:
            warnings.append("PGD-100 ~= PGD-20 - verify attack is working correctly")
        # Sign 3: random noise is as effective as PGD
        # (requires an additional random-perturbation baseline)
        if warnings:
            print("\nWARNING - Possible gradient masking detected:")
            for w in warnings:
                print(f"  - {w}")
        else:
            print("\nNo gradient masking indicators detected.")

    def _eval_clean(self):
        correct, total = 0, 0
        self.model.eval()
        with torch.no_grad():
            for images, labels in self.test_loader:
                images, labels = images.to(self.device), labels.to(self.device)
                correct += (self.model(images).argmax(1) == labels).sum().item()
                total += labels.size(0)
        return correct / total

    def _eval_fgsm(self):
        # Single-step attack: one signed-gradient step of size epsilon
        correct, total = 0, 0
        self.model.eval()
        for images, labels in self.test_loader:
            images, labels = images.to(self.device), labels.to(self.device)
            images.requires_grad_(True)
            loss = F.cross_entropy(self.model(images), labels)
            grad = torch.autograd.grad(loss, images)[0]
            adv = (images + self.epsilon * grad.sign()).clamp(0, 1).detach()
            with torch.no_grad():
                correct += (self.model(adv).argmax(1) == labels).sum().item()
            total += labels.size(0)
        return correct / total

    def _eval_pgd(self, steps, restarts):
        # PGD with random starts; an example counts as robust only if it
        # survives every restart (worst case over restarts)
        step_size = 2.5 * self.epsilon / steps  # common step-size heuristic
        correct, total = 0, 0
        self.model.eval()
        for images, labels in self.test_loader:
            images, labels = images.to(self.device), labels.to(self.device)
            robust = torch.ones_like(labels, dtype=torch.bool)
            for _ in range(restarts):
                delta = torch.empty_like(images).uniform_(-self.epsilon, self.epsilon)
                adv = (images + delta).clamp(0, 1)
                for _ in range(steps):
                    adv.requires_grad_(True)
                    loss = F.cross_entropy(self.model(adv), labels)
                    grad = torch.autograd.grad(loss, adv)[0]
                    adv = adv.detach() + step_size * grad.sign()
                    # Project back into the epsilon-ball and the valid range
                    adv = (images + (adv - images).clamp(-self.epsilon, self.epsilon)).clamp(0, 1)
                with torch.no_grad():
                    robust &= self.model(adv).argmax(1) == labels
            correct += robust.sum().item()
            total += labels.size(0)
        return correct / total

    def _eval_autoattack(self):
        # Standardized evaluation via the AutoAttack library
        # (pip install autoattack)
        from autoattack import AutoAttack
        adversary = AutoAttack(self.model, norm='Linf', eps=self.epsilon,
                               version='standard', device=self.device)
        images = torch.cat([x for x, _ in self.test_loader]).to(self.device)
        labels = torch.cat([y for _, y in self.test_loader]).to(self.device)
        adv = adversary.run_standard_evaluation(images, labels)
        with torch.no_grad():
            correct = (self.model(adv).argmax(1) == labels).sum().item()
        return correct / labels.size(0)
```
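The "Sign 3" check above (random noise as effective as PGD) needs an extra measurement that the class does not include. A minimal sketch of that baseline, assuming a PyTorch model and data loader (the function name and worst-of-N design are illustrative choices, not a standard API):

```python
import torch
import torch.nn as nn


def eval_random_noise(model, loader, epsilon, device='cpu', trials=5):
    """Accuracy under worst-of-N uniform L-inf noise at the same budget
    as the attack. If this number is close to the reported PGD accuracy,
    the 'adversarial' attack is doing no better than random noise - a
    strong indicator of gradient masking."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            # An example counts as robust only if it survives every draw
            robust = torch.ones_like(labels, dtype=torch.bool)
            for _ in range(trials):
                noise = torch.empty_like(images).uniform_(-epsilon, epsilon)
                noisy = (images + noise).clamp(0.0, 1.0)
                robust &= model(noisy).argmax(1) == labels
            correct += robust.sum().item()
            total += labels.size(0)
    return correct / total
```

A robust model should score far higher here than under PGD; near-identical numbers mean the PGD implementation or the gradients deserve scrutiny.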
AutoAttack: The Standard Evaluation
AutoAttack by Croce and Hein (2020) is the de facto standard for adversarial robustness evaluation. It combines four diverse attacks:
- APGD-CE: Auto-PGD with cross-entropy loss (step size adaptation)
- APGD-DLR: Auto-PGD with Difference of Logits Ratio loss (margin-based)
- FAB: Fast Adaptive Boundary attack (minimum-norm attack)
- Square Attack: Score-based black-box attack (no gradients needed)
The combination of white-box and black-box attacks, with different loss functions, makes AutoAttack robust against gradient masking. It is parameter-free and provides a reliable robustness estimate.
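For intuition about why the DLR loss resists gradient masking, it can be written out directly. A sketch of the untargeted version, following the formula from the Croce and Hein paper (the small denominator constant is an illustrative numerical-safety addition):

```python
import torch


def dlr_loss(logits, y):
    """Untargeted Difference of Logits Ratio loss (Croce & Hein, 2020):

        DLR = -(z_y - max_{i != y} z_i) / (z_pi1 - z_pi3)

    where z_pi1 >= z_pi2 >= z_pi3 are the sorted logits. The ratio makes
    the loss invariant to rescaling of the logits, which defeats defenses
    that mask gradients by saturating the softmax."""
    z_sorted, _ = logits.sort(dim=1, descending=True)
    z_y = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    # Highest logit among wrong classes: top-2 if the model is currently
    # correct on the example, otherwise top-1
    is_correct = z_sorted[:, 0] == z_y
    z_other = torch.where(is_correct, z_sorted[:, 1], z_sorted[:, 0])
    return -(z_y - z_other) / (z_sorted[:, 0] - z_sorted[:, 2] + 1e-12)


logits = torch.tensor([[4.0, 2.0, 1.0]])
print(dlr_loss(logits, torch.tensor([0])))  # -(4-2)/(4-1) = -0.6667
```

Note that multiplying the logits by any positive constant leaves the loss unchanged, whereas cross-entropy gradients can vanish under the same rescaling.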
Benchmarks and Leaderboards
Track your model's robustness against community benchmarks:
- RobustBench: The primary leaderboard for adversarial robustness, maintained by researchers at the University of Tübingen. It evaluates models with AutoAttack on CIFAR-10, CIFAR-100, and ImageNet
- Standard threat models: L-infinity with epsilon = 8/255 for CIFAR, 4/255 for ImageNet
- State of the art (2025): ~71% robust accuracy on CIFAR-10 (L-inf, 8/255) using the best adversarial training methods with extra data
Reporting Results
When reporting defense results, include at minimum:
- Clean accuracy on the standard test set
- Robust accuracy under AutoAttack with the standard epsilon
- The specific threat model (norm, epsilon, dataset)
- Computational cost (training time, inference overhead)
- Comparison to the current state of the art from RobustBench
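A summary covering these minimum items can be generated mechanically. A small sketch (the function name and field layout are illustrative, not a standard reporting schema):

```python
def format_report(clean_acc, robust_acc, norm, epsilon, dataset,
                  train_hours=None, sota_robust=None):
    """Render a minimal results summary covering the reporting checklist:
    threat model, clean accuracy, AutoAttack robust accuracy, and
    optionally training cost and the gap to the RobustBench leader."""
    lines = [
        f"Dataset: {dataset} | Threat model: {norm}, eps={epsilon}",
        f"Clean accuracy: {clean_acc:.2%}",
        f"AutoAttack robust accuracy: {robust_acc:.2%}",
    ]
    if train_hours is not None:
        lines.append(f"Training cost: {train_hours:.1f} GPU-hours")
    if sota_robust is not None:
        lines.append(f"Gap to RobustBench SOTA: {sota_robust - robust_acc:+.2%}")
    return "\n".join(lines)


print(format_report(0.85, 0.55, 'Linf', '8/255', 'CIFAR-10',
                    train_hours=12.0, sota_robust=0.71))
```

Reporting the gap to the current leaderboard entry keeps the result honest about where it stands, rather than comparing only against older baselines.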
Summary
Rigorous defense evaluation requires using adaptive attacks, checking for gradient masking, and benchmarking against community standards like AutoAttack and RobustBench. Many defenses that appear effective under weak evaluation crumble under proper scrutiny. Always assume the attacker knows your defense and evaluate accordingly. This concludes the Adversarial Attacks and Defenses course. You now have the knowledge to understand, execute, and defend against adversarial attacks on ML systems.