Defense Evaluation

Lesson 7 of 7 in the Adversarial Attacks & Defenses course.

Evaluating Adversarial Defenses

Evaluating adversarial defenses is surprisingly difficult. The history of adversarial ML research includes many defenses that appeared effective in initial evaluations but were later shown to provide false security. This lesson covers best practices for rigorous defense evaluation to avoid common pitfalls.

The Problem with Naive Evaluation

Many published defenses have been broken because they were evaluated against weak attacks:

  • Gradient masking: Defenses that obscure gradients can make gradient-based attacks fail while remaining vulnerable to gradient-free or transfer attacks
  • Obfuscated gradients: Athalye et al. (2018) identified three types: shattered gradients, stochastic gradients, and vanishing/exploding gradients, all of which create a false sense of robustness
  • Weak attack parameters: Using too few PGD steps, wrong step size, or insufficient random restarts
  • Adaptive attack failure: Not adapting the attack to account for the specific defense mechanism
💡 Golden rule: Always evaluate defenses against adaptive attacks — attacks specifically designed to overcome the defense mechanism. If the attacker knows your defense (and in a proper evaluation they should), can they still bypass it?
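
As a concrete illustration of an adaptive attack, the sketch below shows the Backward Pass Differentiable Approximation (BPDA) idea from the obfuscated-gradients literature: when a defense applies a non-differentiable input transformation, the attacker applies it in the forward pass but backpropagates through it as if it were the identity. The `quantize` "defense" here is a toy example invented for the demo, not a real defense.

```python
import torch

class BPDAIdentity(torch.autograd.Function):
    """Apply a non-differentiable defense in the forward pass, but
    approximate it by the identity in the backward pass so that
    gradient-based attacks still get a usable signal."""

    @staticmethod
    def forward(ctx, x, defense_fn):
        return defense_fn(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Pretend the defense is the identity: pass gradients straight through.
        return grad_output, None

def quantize(x, levels=16):
    # Toy non-differentiable "defense": color-depth reduction via rounding
    return torch.round(x * (levels - 1)) / (levels - 1)

x = torch.rand(2, 3, 8, 8, requires_grad=True)
out = BPDAIdentity.apply(x, quantize)
out.sum().backward()
# x.grad is all ones: the gradient flowed through the non-differentiable step
```

A defense that only breaks `round`-style gradients falls immediately to this substitution, which is exactly why gradient masking gives false security.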

Rigorous Evaluation Protocol

Follow this checklist for credible defense evaluation:

Python
import torch
import torch.nn.functional as F


class RobustnessEvaluator:
    """Comprehensive adversarial robustness evaluation suite."""

    def __init__(self, model, test_loader, epsilon, device='cuda'):
        self.model = model
        self.test_loader = test_loader
        self.epsilon = epsilon
        self.device = device

    def full_evaluation(self):
        """Run complete evaluation protocol."""
        results = {}

        # 1. Clean accuracy (baseline)
        results['clean'] = self._eval_clean()
        print(f"Clean accuracy: {results['clean']:.2%}")

        # 2. FGSM (sanity check - should be easy to defend against)
        results['fgsm'] = self._eval_fgsm()
        print(f"FGSM accuracy: {results['fgsm']:.2%}")

        # 3. PGD with standard parameters
        results['pgd_20'] = self._eval_pgd(steps=20, restarts=1)
        print(f"PGD-20 accuracy: {results['pgd_20']:.2%}")

        # 4. PGD with many steps and restarts (thorough)
        results['pgd_100_r5'] = self._eval_pgd(steps=100, restarts=5)
        print(f"PGD-100 (5 restarts) accuracy: {results['pgd_100_r5']:.2%}")

        # 5. AutoAttack (strongest standardized evaluation)
        results['autoattack'] = self._eval_autoattack()
        print(f"AutoAttack accuracy: {results['autoattack']:.2%}")

        # Sanity checks
        self._check_gradient_masking(results)

        return results

    def _check_gradient_masking(self, results):
        """Detect signs of gradient masking (false robustness)."""
        warnings = []

        # Sign 1: FGSM is stronger than PGD
        if results['fgsm'] < results['pgd_20']:
            warnings.append("FGSM stronger than PGD - possible gradient masking")

        # Sign 2: More PGD steps don't help
        if results['pgd_100_r5'] > results['pgd_20'] * 0.98:
            warnings.append("PGD-100 ~= PGD-20 - verify attack is working correctly")

        # Sign 3: Random noise is as effective as PGD
        # (would need additional test)

        if warnings:
            print("\nWARNING - Possible gradient masking detected:")
            for w in warnings:
                print(f"  - {w}")
        else:
            print("\nNo gradient masking indicators detected.")

    def _eval_clean(self):
        correct = 0
        total = 0
        self.model.eval()
        with torch.no_grad():
            for images, labels in self.test_loader:
                images, labels = images.to(self.device), labels.to(self.device)
                correct += (self.model(images).argmax(1) == labels).sum().item()
                total += labels.size(0)
        return correct / total

    def _eval_fgsm(self):
        correct, total = 0, 0
        self.model.eval()
        for images, labels in self.test_loader:
            images, labels = images.to(self.device), labels.to(self.device)
            images.requires_grad_(True)
            loss = F.cross_entropy(self.model(images), labels)
            grad = torch.autograd.grad(loss, images)[0]
            # Single step of size epsilon in the gradient-sign direction
            adv = (images + self.epsilon * grad.sign()).clamp(0, 1)
            with torch.no_grad():
                correct += (self.model(adv).argmax(1) == labels).sum().item()
            total += labels.size(0)
        return correct / total

    def _eval_pgd(self, steps, restarts, step_size=None):
        # Common heuristic: step size of 2.5 * epsilon / steps
        step_size = step_size if step_size is not None else 2.5 * self.epsilon / steps
        correct, total = 0, 0
        self.model.eval()
        for images, labels in self.test_loader:
            images, labels = images.to(self.device), labels.to(self.device)
            # A sample counts as robust only if it survives every restart (worst case)
            survived = torch.ones_like(labels, dtype=torch.bool)
            for _ in range(restarts):
                delta = torch.empty_like(images).uniform_(-self.epsilon, self.epsilon)
                for _ in range(steps):
                    delta.requires_grad_(True)
                    loss = F.cross_entropy(self.model((images + delta).clamp(0, 1)), labels)
                    grad = torch.autograd.grad(loss, delta)[0]
                    delta = (delta.detach() + step_size * grad.sign()).clamp(
                        -self.epsilon, self.epsilon)
                adv = (images + delta).clamp(0, 1)
                with torch.no_grad():
                    survived &= self.model(adv).argmax(1) == labels
            correct += survived.sum().item()
            total += labels.size(0)
        return correct / total

    def _eval_autoattack(self):
        # Standardized evaluation via the AutoAttack library (pip install autoattack)
        from autoattack import AutoAttack
        adversary = AutoAttack(self.model, norm='Linf', eps=self.epsilon)
        # Loading the full test set into memory here for simplicity
        images = torch.cat([x for x, _ in self.test_loader]).to(self.device)
        labels = torch.cat([y for _, y in self.test_loader]).to(self.device)
        adv = adversary.run_standard_evaluation(images, labels, bs=128)
        with torch.no_grad():
            correct = (self.model(adv).argmax(1) == labels).sum().item()
        return correct / labels.size(0)
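
Sign 3 in the checker above (random noise as effective as PGD) needs its own test. A minimal sketch, with a hypothetical `eval_random_noise` helper and a throwaway linear model purely for demonstration:

```python
import torch

def eval_random_noise(model, images, labels, epsilon, trials=10):
    """Accuracy under worst-of-N uniform random L-inf noise.
    If this number is close to the PGD robust accuracy, the
    gradient-based attack is probably not working (a classic
    gradient-masking indicator)."""
    model.eval()
    survived = torch.ones_like(labels, dtype=torch.bool)
    with torch.no_grad():
        for _ in range(trials):
            noise = torch.empty_like(images).uniform_(-epsilon, epsilon)
            adv = (images + noise).clamp(0, 1)
            survived &= model(adv).argmax(1) == labels
    return survived.float().mean().item()

# Demo with a throwaway model and random data (not a real evaluation)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
images = torch.rand(16, 3, 8, 8)
labels = torch.randint(0, 10, (16,))
noise_acc = eval_random_noise(model, images, labels, epsilon=8 / 255)
```

A gradient-based attack should always do at least as well as random noise within the same budget; if it doesn't, debug the attack before claiming robustness.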

AutoAttack: The Standard Evaluation

AutoAttack by Croce and Hein (2020) is the de facto standard for adversarial robustness evaluation. It combines four diverse attacks:

  1. APGD-CE: Auto-PGD with cross-entropy loss (step size adaptation)
  2. APGD-DLR: Auto-PGD with Difference of Logits Ratio loss (margin-based)
  3. FAB: Fast Adaptive Boundary attack (minimum-norm attack)
  4. Square Attack: Score-based black-box attack (no gradients needed)

The combination of white-box and black-box attacks, with different loss functions, makes AutoAttack robust against gradient masking. It is parameter-free and provides a reliable robustness estimate.

Warning: If your defense shows high robustness against PGD but significantly lower robustness against AutoAttack, your defense likely relies on some form of gradient obfuscation rather than true robustness. This is a red flag that needs investigation.
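
One way to operationalize this red flag is a simple gap check; the helper and its 5-percentage-point threshold below are illustrative choices, not a standard:

```python
def pgd_autoattack_gap_suspicious(pgd_acc, autoattack_acc, gap=0.05):
    """Flag a defense whose PGD robust accuracy exceeds its AutoAttack
    robust accuracy by more than `gap` (an arbitrary threshold here)."""
    return (pgd_acc - autoattack_acc) > gap

# 80% under PGD but 45% under AutoAttack is a strong obfuscation signal;
# a 3-point gap is within normal attack-strength variation.
print(pgd_autoattack_gap_suspicious(0.80, 0.45))  # True
print(pgd_autoattack_gap_suspicious(0.55, 0.52))  # False
```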

Benchmarks and Leaderboards

Track your model's robustness against community benchmarks:

  • RobustBench: The primary leaderboard for adversarial robustness, maintained by researchers from the University of Tübingen. Evaluates using AutoAttack on CIFAR-10, CIFAR-100, and ImageNet
  • Standard threat models: L-infinity with epsilon = 8/255 for CIFAR, 4/255 for ImageNet
  • State of the art (2025): ~71% robust accuracy on CIFAR-10 (L-inf, 8/255) using the best adversarial training methods with extra data
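
These epsilon values are fractions of the 8-bit [0, 255] pixel range, expressed on the [0, 1] scale that models typically consume:

```python
eps_cifar = 8 / 255     # standard L-inf budget for CIFAR-10/100
eps_imagenet = 4 / 255  # standard L-inf budget for ImageNet

# Per-pixel, these are very small perturbations on the [0, 1] scale
print(round(eps_cifar, 4), round(eps_imagenet, 4))  # 0.0314 0.0157
```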

Reporting Results

When reporting defense results, include at minimum:

  • Clean accuracy on the standard test set
  • Robust accuracy under AutoAttack with the standard epsilon
  • The specific threat model (norm, epsilon, dataset)
  • Computational cost (training time, inference overhead)
  • Comparison to the current state of the art from RobustBench
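
The checklist above can be captured in a minimal reporting template; the field names are one possible convention and every number below is a placeholder, not a real result:

```python
report = {
    "dataset": "CIFAR-10",
    "threat_model": {"norm": "Linf", "epsilon": 8 / 255},
    "clean_accuracy": 0.91,        # placeholder value
    "autoattack_accuracy": 0.58,   # placeholder value
    "training_gpu_hours": 18.0,    # placeholder value
    "inference_overhead": "1.0x",  # e.g. no test-time defense assumed
    "sota_comparison": "RobustBench leaderboard entry at submission time",
}

for key, value in report.items():
    print(f"{key}: {value}")
```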

Summary

Rigorous defense evaluation requires using adaptive attacks, checking for gradient masking, and benchmarking against community standards like AutoAttack and RobustBench. Many defenses that appear effective under weak evaluation crumble under proper scrutiny. Always assume the attacker knows your defense and evaluate accordingly. This concludes the Adversarial Attacks and Defenses course. You now have the knowledge to understand, execute, and defend against adversarial attacks on ML systems.