Certified Adversarial Robustness (Advanced)

Empirical defenses like adversarial training provide no mathematical guarantees — a stronger attack could always be found. Certified defenses provide provable robustness guarantees: a mathematical proof that no perturbation within a specified bound can change the model's prediction. This lesson covers the leading approaches to certified robustness.

Why Certification Matters

Empirical robustness evaluation has a fundamental limitation: you can only test against known attacks. A model that resists FGSM, PGD, and C&W might still be vulnerable to a novel attack. Certified defenses solve this by providing guarantees that hold against any attack within the certified radius.

| Approach | Guarantee Type | Scalability | Certified Radius |
| --- | --- | --- | --- |
| Randomized Smoothing | Probabilistic (L2) | Scales to ImageNet | Moderate |
| Interval Bound Propagation | Deterministic (L-inf) | Moderate (small networks) | Small-Moderate |
| Lipschitz Networks | Deterministic (L2) | Good | Moderate |
| Formal Verification | Exact | Limited (very small networks) | Exact |

Randomized Smoothing

Randomized smoothing (Cohen et al., 2019) is the most scalable certified defense. It creates a smoothed classifier by averaging predictions over Gaussian noise added to the input:

Python (Conceptual)
import torch
import numpy as np
from scipy.stats import norm

class SmoothedClassifier:
    """Randomized smoothing for certified robustness."""

    def __init__(self, base_model, sigma, num_samples=1000):
        self.model = base_model
        self.sigma = sigma
        self.num_samples = num_samples

    def predict_and_certify(self, x):
        """Predict class and compute certified L2 radius."""
        with torch.no_grad():
            # Sample noisy versions of the input (broadcasts over the sample dim)
            noise = torch.randn(self.num_samples, *x.shape) * self.sigma
            noisy_inputs = x.unsqueeze(0) + noise

            # Count predictions for each class
            predictions = self.model(noisy_inputs).argmax(dim=1)
            counts = torch.bincount(predictions)

        # Top class and its empirical proportion
        top_class = counts.argmax().item()
        p_a = counts[top_class].item() / self.num_samples

        # Certified L2 radius (Cohen et al., 2019). A rigorous certificate
        # replaces p_a with a high-confidence lower bound (e.g. Clopper-Pearson);
        # the raw empirical proportion is used here for clarity.
        if p_a > 0.5:
            p_a = min(p_a, 1 - 1e-9)  # avoid an infinite radius when p_a == 1
            radius = self.sigma * norm.ppf(p_a)
        else:
            radius = 0.0  # cannot certify

        return top_class, radius

The certified radius tells you: "No L2 perturbation smaller than this radius can change the predicted class." This is a mathematically proven guarantee, not just an empirical observation.
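The radius formula can be checked with a quick calculation (the sigma and p_a values here are purely illustrative):

```python
from scipy.stats import norm

sigma = 0.5   # noise level used by the smoothed classifier
p_a = 0.99    # proportion of noisy samples voting for the top class

# Certified L2 radius: sigma * Phi^{-1}(p_a)
radius = sigma * norm.ppf(p_a)  # ~ 1.16
```

Note how the radius grows with both sigma and p_a: more noise tolerance and a more confident vote both widen the certificate, while p_a = 0.5 gives a radius of zero.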

Interval Bound Propagation (IBP)

IBP propagates intervals (lower and upper bounds) through each layer of the network to compute guaranteed output bounds for any input within an epsilon-ball:

  • Start with the input range [x - epsilon, x + epsilon]
  • Propagate bounds through each layer (linear, ReLU, etc.)
  • If the lower bound of the true class's logit exceeds the upper bound of every other class's logit, the prediction is certified
  • Train with verified bounds to improve certification rates

Lipschitz-Constrained Networks

Constrain the Lipschitz constant of the network so that small input changes produce proportionally bounded output changes. If the network has Lipschitz constant L, then for any perturbation delta: ||f(x + delta) - f(x)|| ≤ L * ||delta||. This directly limits how much an adversary can move the logits, so a prediction with a large margin between the top two logits cannot be flipped by a small perturbation.
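A minimal sketch of how this yields a certificate, assuming a plain feed-forward network with 1-Lipschitz activations (the weights and logits below are illustrative). The product of the layers' spectral norms upper-bounds L, and a margin-based bound in the spirit of Tsuzuku et al. (2018) gives a certified L2 radius of margin / (sqrt(2) * L):

```python
import numpy as np

def lipschitz_upper_bound(weight_matrices):
    """Product of per-layer spectral norms (largest singular values).
    Upper-bounds the L2 Lipschitz constant of the network, since
    1-Lipschitz activations like ReLU do not increase it."""
    return float(np.prod([np.linalg.svd(W, compute_uv=False)[0]
                          for W in weight_matrices]))

def certified_radius(logits, L):
    """No L2 perturbation smaller than margin / (sqrt(2) * L) can flip
    the top class: the difference of any two logits is at most
    sqrt(2) * L Lipschitz."""
    top2 = np.sort(logits)[-2:]
    margin = top2[1] - top2[0]
    return margin / (np.sqrt(2) * L)

# Illustrative values: two layers, spectral norms 2.0 and 1.0
weights = [np.array([[2.0, 0.0], [0.0, 1.0]]), np.eye(2)]
L = lipschitz_upper_bound(weights)
radius = certified_radius(np.array([3.0, 1.0]), L)
```

Because the bound multiplies across layers, deep networks need each layer's spectral norm kept close to 1 (e.g. via spectral normalization) or the certified radius collapses toward zero.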

Robustness Benchmarks

| Benchmark | Dataset | Metric |
| --- | --- | --- |
| RobustBench | CIFAR-10, CIFAR-100, ImageNet | AutoAttack accuracy at various epsilon |
| AutoAttack | Any classification dataset | Ensemble of strong attacks for reliable evaluation |
| Certified Accuracy | CIFAR-10, ImageNet | Percentage of samples both correctly classified and certified |

Current Limitations: Certified defenses still have significant accuracy-robustness trade-offs. On ImageNet, the best certified defenses achieve approximately 35-40% certified accuracy at L2 radius 0.5, compared to 75%+ clean accuracy. Research is actively working to close this gap.

Ready for Best Practices?

The final lesson brings everything together with evaluation protocols, benchmarking guidelines, and practical advice for deploying robust models.

Next: Best Practices →