Certified Defenses

Lesson 6 of 7 in the Adversarial Attacks & Defenses course.

Certified Defenses for Machine Learning

While adversarial training provides empirical robustness (tested against known attacks), certified defenses provide mathematical guarantees that a model's prediction cannot change within a specified perturbation radius. This distinction is crucial: empirical defenses can be broken by stronger future attacks, but certified defenses hold by mathematical proof.

Why Certification Matters

The history of adversarial ML is littered with defenses that were later broken by more sophisticated attacks. Defensive distillation, input transformation defenses, and detection-based defenses have all been bypassed. Certified defenses break this cycle by providing provable guarantees.

  • Provable guarantees: No attack, regardless of sophistication, can change the prediction within the certified radius
  • Measurable security: The certified radius gives a concrete, quantifiable measure of robustness
  • Regulatory value: Provable guarantees carry weight in safety-critical certifications and regulatory compliance
💡 Key insight: A certified defense does not mean the model is invulnerable. It means that within a precisely defined perturbation radius and norm, the prediction is guaranteed stable. Outside that radius, no guarantees apply.

Randomized Smoothing

Randomized smoothing is the most scalable certified defense, applicable to any base classifier. It works by taking a majority vote over the model's predictions on many randomly noised copies of the input:

  1. Given an input x, add Gaussian noise N(0, sigma^2) many times to create noisy copies
  2. Run each noisy copy through the base model
  3. The smoothed prediction is the most common class among all noisy predictions
  4. A mathematical theorem guarantees that this smoothed classifier is robust within a certified L2 radius
Python
import torch
import numpy as np
from scipy.stats import norm

class RandomizedSmoothing:
    """Certified defense via randomized smoothing.

    Provides provable L2 robustness certificates for any base classifier.
    """

    def __init__(self, base_model, num_classes, sigma=0.25):
        self.model = base_model
        self.num_classes = num_classes
        self.sigma = sigma  # Noise standard deviation

    def predict_and_certify(self, x, n_samples=1000, alpha=0.001):
        """Predict class and compute certified radius.

        Args:
            x: Single input tensor (1, C, H, W)
            n_samples: Number of noise samples for estimation
            alpha: Failure probability; the certificate holds with confidence 1 - alpha

        Returns:
            predicted_class: Most likely class under smoothing
            certified_radius: L2 radius within which prediction is guaranteed
        """
        self.model.eval()

        # Sample noisy versions of input
        noise = torch.randn(n_samples, *x.shape[1:], device=x.device) * self.sigma
        noisy_inputs = x.repeat(n_samples, 1, 1, 1) + noise

        # Get predictions for all noisy inputs
        with torch.no_grad():
            batch_size = 100
            all_preds = []
            for i in range(0, n_samples, batch_size):
                batch = noisy_inputs[i:i+batch_size]
                preds = self.model(batch).argmax(dim=1)
                all_preds.append(preds)
            predictions = torch.cat(all_preds)

        # Count votes for each class
        counts = torch.bincount(predictions, minlength=self.num_classes)

        # Top class and its count
        top_class = counts.argmax().item()
        top_count = counts[top_class].item()

        # Lower confidence bound on the top-class probability. (Cohen et al.
        # use separate sample sets for selecting the top class and certifying
        # it; a single set is used here for simplicity.)
        p_lower = self._lower_confidence_bound(top_count, n_samples, alpha)

        if p_lower > 0.5:
            certified_radius = self.sigma * norm.ppf(p_lower)
        else:
            certified_radius = 0.0  # Cannot certify

        return top_class, certified_radius

    def _lower_confidence_bound(self, count, n, alpha):
        """One-sided (1 - alpha) Clopper-Pearson lower confidence bound
        on the success probability, given count successes in n trials."""
        from scipy.stats import beta
        if count == 0:
            return 0.0
        return beta.ppf(alpha, count, n - count + 1)

# Usage
# smoother = RandomizedSmoothing(base_model, num_classes=10, sigma=0.5)
# pred_class, radius = smoother.predict_and_certify(image, n_samples=10000)
# print(f"Predicted: {pred_class}, Certified L2 radius: {radius:.4f}")

The Sigma-Radius Trade-off

The noise parameter sigma controls the trade-off between certification radius and accuracy:

  • Small sigma (0.12): High accuracy, small certified radius (~0.2 in L2)
  • Medium sigma (0.25): Moderate accuracy, moderate radius (~0.5)
  • Large sigma (0.50): Lower accuracy, larger radius (~1.0)
  • Very large sigma (1.0): Significantly reduced accuracy, but large certified radius
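This trade-off follows directly from the certificate formula: the radius is sigma * Phi^{-1}(p_lower), so for a fixed lower bound on the top-class probability the radius grows linearly with sigma. A minimal sketch (the helper function is illustrative); keep in mind that in practice a larger sigma also drives p_lower down, so the gain is not free:

```python
from scipy.stats import norm

def certified_radius(sigma, p_lower):
    """Cohen et al. certificate: R = sigma * Phi^{-1}(p_lower).

    Valid only when the top class wins a clear majority (p_lower > 0.5)."""
    if p_lower <= 0.5:
        return 0.0  # cannot certify
    return sigma * norm.ppf(p_lower)

# With the vote fraction held fixed, the radius scales linearly with sigma:
for sigma in (0.12, 0.25, 0.50, 1.00):
    print(f"sigma={sigma:.2f} -> radius={certified_radius(sigma, 0.99):.3f}")
```

Holding p_lower at 0.99 here isolates the sigma dependence; on real models, p_lower falls as the noise grows, which is exactly the accuracy loss described above.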

Other Certified Defense Methods

Beyond randomized smoothing, several other approaches provide certified guarantees:

  • Interval Bound Propagation (IBP): Propagates interval bounds through the network to compute guaranteed (though generally loose) bounds on the outputs for bounded input perturbations
  • CROWN/Linear Relaxation: Uses linear approximations of non-linear activations to compute tighter bounds than IBP
  • Lipschitz-constrained networks: Architectures designed with bounded Lipschitz constants that inherently limit how much outputs can change for bounded input changes
  • Convex relaxations: SDP-based or LP-based relaxations that compute bounds on the verification problem
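To make IBP concrete, here is a minimal interval-propagation sketch for a toy two-layer network. The weights, input, and radius are made up for illustration; real verifiers apply the same interval arithmetic layer by layer at much larger scale:

```python
import numpy as np

def ibp_linear(l, u, W, b):
    """Propagate interval bounds [l, u] through y = W @ x + b.

    Positive weights map lower bounds to lower bounds; negative
    weights swap them (standard interval arithmetic)."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    return W_pos @ l + W_neg @ u + b, W_pos @ u + W_neg @ l + b

def ibp_relu(l, u):
    """ReLU is monotone, so bounds pass through elementwise."""
    return np.maximum(l, 0), np.maximum(u, 0)

# Toy network: check that class 0 beats class 1 for every input in
# an L-inf ball of radius eps around x (all values are illustrative).
x, eps = np.array([1.0, -0.5]), 0.1
W1, b1 = np.array([[1.0, 2.0], [-1.0, 1.0]]), np.zeros(2)
W2, b2 = np.array([[1.0, -1.0], [-1.0, 1.0]]), np.array([0.5, 0.0])

l, u = ibp_linear(x - eps, x + eps, W1, b1)
l, u = ibp_relu(l, u)
l, u = ibp_linear(l, u, W2, b2)

# Certified if the worst case for class 0 still beats the best case for class 1
print("certified:", l[0] > u[1])  # prints "certified: True"
```

Because the bounds are over-approximations, IBP can fail to certify inputs that are actually robust; CROWN-style linear relaxations tighten these bounds at higher cost.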

Practical Limitations

Certified defenses have important limitations to understand:

  1. Accuracy gap: Certified robust accuracy is significantly lower than empirical robust accuracy (which is already lower than clean accuracy)
  2. Scalability: Most exact certification methods do not scale to large networks. Randomized smoothing is the exception
  3. Norm-specific: Certificates typically apply to one norm (L2 for randomized smoothing). They do not protect against L-inf or L0 attacks
  4. Inference cost: Randomized smoothing requires many forward passes per prediction, increasing latency and compute cost
Warning: A model with a certified L2 radius of 0.5 is NOT immune to all attacks with L2 perturbation less than 0.5. The certificate applies specifically to the L2 norm ball. Attacks using different norms (L-inf, L0) or semantic perturbations (rotations, color shifts) are not covered.
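A quick back-of-the-envelope check shows why the norm matters. An L-inf perturbation of size eps applied to all d input coordinates has L2 norm eps * sqrt(d), so even on a small image an L2 certificate translates to only a tiny implied L-inf guarantee:

```python
import math

d = 32 * 32 * 3    # CIFAR-sized input: 3072 coordinates
l2_radius = 0.5    # certified L2 radius from randomized smoothing
linf_covered = l2_radius / math.sqrt(d)
print(f"Implied L-inf radius: {linf_covered:.5f}")  # ~0.009, below a typical budget of 8/255 ~ 0.031
```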

When to Use Certified Defenses

Certified defenses are most valuable when you need provable guarantees for regulatory or safety requirements, when the accuracy trade-off is acceptable, and when your threat model aligns with the certification norm. For most practical applications, combine certified defenses with empirical defenses for comprehensive protection.

Summary

Certified defenses provide mathematical guarantees that break the attacker-defender arms race within their certified radius. Randomized smoothing is the most practical approach, providing L2 certificates for any base classifier. While accuracy trade-offs remain significant, certified defenses fill an important role in high-stakes applications. The next lesson covers how to properly evaluate defenses against adversarial attacks.