Advanced

Backdoor Detection

Master the techniques for detecting backdoors in ML models, from trigger reverse-engineering with Neural Cleanse to statistical methods like spectral signatures and activation clustering.

Neural Cleanse

Neural Cleanse (Wang et al., 2019) is the foundational backdoor detection method. It works by reverse-engineering potential triggers for each class and measuring whether any class requires an unusually small trigger to cause misclassification.

Python - Neural Cleanse Concept
import numpy as np
import torch

def neural_cleanse(model, num_classes, input_shape):
    """Detect backdoors by reverse-engineering triggers."""
    trigger_norms = []

    for target_class in range(num_classes):
        # Optimize a minimal trigger (pattern + mask) that causes
        # all inputs to be classified as target_class
        trigger, mask = optimize_trigger(
            model, target_class, input_shape
        )

        # Measure the L1 norm of the mask (trigger size)
        norm = torch.norm(mask, p=1)
        trigger_norms.append(norm.item())

    # Use Median Absolute Deviation (MAD) to find outliers
    trigger_norms = np.array(trigger_norms)
    median = np.median(trigger_norms)
    mad = np.median(np.abs(trigger_norms - median))
    anomaly_index = (median - trigger_norms.min()) / (mad + 1e-10)

    # An anomaly index > 2 indicates a likely backdoor
    return anomaly_index, trigger_norms
💡 Intuition: A backdoored model has one class that is unusually easy to trigger — requiring only a small perturbation to cause misclassification. Neural Cleanse exploits this asymmetry. Clean models require roughly similar-sized perturbations for all target classes.
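The `optimize_trigger` helper in the code above is left abstract. One way to realize it, loosely following the mask-and-pattern formulation from the Neural Cleanse paper, is sketched below. The `clean_loader` argument, the sigmoid/tanh reparameterization, and all hyperparameters are illustrative assumptions, not the paper's exact settings:

```python
import torch

def optimize_trigger(model, target_class, input_shape,
                     clean_loader=None, steps=50, lam=0.01):
    """Sketch: find a small (pattern, mask) pair that flips clean
    inputs to target_class. Names and hyperparameters illustrative."""
    mask = torch.zeros(input_shape, requires_grad=True)
    pattern = torch.zeros(input_shape, requires_grad=True)
    opt = torch.optim.Adam([mask, pattern], lr=0.1)

    for _ in range(steps):
        for x, _ in clean_loader:
            m = torch.sigmoid(mask)  # keep the mask in [0, 1]
            # stamp the candidate trigger onto the clean batch
            stamped = (1 - m) * x + m * torch.tanh(pattern)
            logits = model(stamped)
            target = torch.full((x.size(0),), target_class)
            # classification loss plus an L1 penalty that
            # encourages the mask (trigger footprint) to stay small
            loss = torch.nn.functional.cross_entropy(logits, target) \
                   + lam * m.abs().sum()
            opt.zero_grad()
            loss.backward()
            opt.step()

    return torch.tanh(pattern).detach(), torch.sigmoid(mask).detach()
```

The L1 penalty is what makes the recovered trigger *minimal*: for the backdoored target class, the optimizer can satisfy the classification loss with a far smaller mask than for clean classes.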

Spectral Signatures

Spectral signature detection (Tran et al., 2018) analyzes the learned representations of training data to find poisoned samples. Backdoored samples leave a detectable statistical signature in the model's feature space.

  • Extract feature representations from the model's penultimate layer for all training data.
  • Compute the top singular vector of the feature covariance matrix for each class.
  • Project each sample's features onto this direction and compute a correlation score.
  • Poisoned samples tend to have significantly higher correlation scores, forming a separable cluster.
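The steps above amount to a top-singular-vector projection per class. A minimal NumPy sketch (the `spectral_scores` name and the 15% cutoff are illustrative choices, not fixed parameters of the method):

```python
import numpy as np

def spectral_scores(features):
    """Correlation scores for one class's feature matrix (n, d).
    Higher scores flag likely-poisoned samples."""
    centered = features - features.mean(axis=0)
    # top right singular vector of the centered feature matrix
    # (equivalently, top eigenvector of the feature covariance)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[0]
    # squared projection onto the top singular direction
    return (centered @ top) ** 2

# Flag, e.g., the highest-scoring 15% of a class as suspicious
scores = spectral_scores(np.random.randn(100, 32))
suspicious = scores > np.quantile(scores, 0.85)
```

In practice the removal fraction is tied to an assumed upper bound on the poisoning rate; flagged samples are removed and the model retrained.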

Detection Methods Comparison

| Method | Requires | Detects | Limitations |
| --- | --- | --- | --- |
| Neural Cleanse | Model access, clean data | Patch-based triggers | Slow; struggles with large or dynamic triggers |
| Spectral Signatures | Training data, model internals | Data poisoning backdoors | Requires access to training data |
| Activation Clustering | Model internals, clean data | Most backdoor types | Computationally expensive for large models |
| STRIP | Model access only | Input-agnostic triggers | Less effective against clean-label attacks |
| Meta Neural Analysis | Collection of clean and trojaned models | Trojaned models as a whole | Requires training a meta-classifier |

STRIP: STRong Intentional Perturbation

STRIP is a runtime detection method that works at inference time. It perturbs incoming inputs by blending them with random clean images. For clean inputs, the model's predictions vary significantly across perturbations. For triggered inputs, the backdoor dominates regardless of perturbation, producing consistently confident predictions.
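This check reduces to comparing prediction entropy across random blends. A minimal sketch, assuming a `model` that returns logits over a fixed class set and a pool of clean images; the blend ratio, number of blends, and the decision threshold are illustrative and would be calibrated on clean data:

```python
import numpy as np
import torch

def strip_entropy(model, x, clean_images, n=8, alpha=0.5):
    """Average prediction entropy of input x blended with n random
    clean images. Low average entropy suggests a trigger is present."""
    entropies = []
    for _ in range(n):
        overlay = clean_images[np.random.randint(len(clean_images))]
        blended = alpha * x + (1 - alpha) * overlay
        probs = torch.softmax(model(blended.unsqueeze(0)), dim=1)[0]
        # Shannon entropy of the softmax output
        entropies.append(-(probs * torch.log(probs + 1e-12)).sum().item())
    return float(np.mean(entropies))

# Flag an input as trojaned if its entropy falls below a threshold
# fit on the entropy distribution of known-clean inputs.
```

Clean inputs produce high entropy (blending disrupts the prediction); triggered inputs keep low entropy because the trigger survives the blend.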

Activation Clustering

This method clusters the internal activations of the model for each class. Clean samples form one cluster, while poisoned samples form a separate, smaller cluster. The key steps are:

  1. Extract Activations

    Run all training samples through the model and record activations at the last hidden layer.

  2. Cluster Per Class

    For each class, apply dimensionality reduction (PCA) followed by clustering (k-means with k=2).

  3. Identify Poisoned Cluster

    The smaller cluster in a backdoored class likely contains the poisoned samples.

  4. Verify

    Remove the suspected cluster and check if backdoor behavior disappears after retraining.
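The clustering steps above can be condensed into a pure-NumPy sketch. This stands in for PCA plus 2-means with simple SVD-based projection and a hand-rolled k-means loop; the function name, the 10-component cut, and the iteration count are illustrative assumptions:

```python
import numpy as np

def cluster_class_activations(acts, n_iter=20, seed=0):
    """2-means over one class's activation matrix (n, d); returns a
    boolean mask flagging the smaller cluster as the backdoor suspect."""
    rng = np.random.default_rng(seed)
    # project onto top principal directions via SVD (cheap "PCA")
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[: min(10, vt.shape[0])].T

    # simple k-means with k=2
    centers = reduced[rng.choice(len(reduced), 2, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(reduced[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if (labels == k).any():
                centers[k] = reduced[labels == k].mean(axis=0)

    # the smaller cluster is the likely-poisoned one
    suspect = int(np.bincount(labels, minlength=2).argmin())
    return labels == suspect
```

Run this per class on last-hidden-layer activations; a class whose smaller cluster is well-separated and roughly poison-rate-sized is the candidate for removal and retraining.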

Practical approach: Use multiple detection methods together. No single method catches all backdoor types. Neural Cleanse handles patch triggers well, spectral signatures catch data poisoning, and STRIP provides runtime protection.