Backdoor Detection
Master the techniques for detecting backdoors in ML models, from trigger reverse-engineering with Neural Cleanse to statistical methods like spectral signatures and activation clustering.
Neural Cleanse
Neural Cleanse (Wang et al., 2019) is the foundational backdoor detection method. It works by reverse-engineering potential triggers for each class and measuring whether any class requires an unusually small trigger to cause misclassification.
```python
import numpy as np
import torch

def neural_cleanse(model, num_classes, input_shape):
    """Detect backdoors by reverse-engineering triggers."""
    trigger_norms = []
    for target_class in range(num_classes):
        # Optimize a minimal trigger that causes all inputs
        # to be classified as target_class
        trigger, mask = optimize_trigger(
            model, target_class, input_shape
        )
        # Measure the L1 norm of the mask (trigger size)
        norm = torch.norm(mask, p=1)
        trigger_norms.append(norm.item())

    trigger_norms = np.array(trigger_norms)

    # Use the Median Absolute Deviation (MAD) to find outliers;
    # 1.4826 scales MAD to estimate the standard deviation
    # under a normal distribution
    median = np.median(trigger_norms)
    mad = 1.4826 * np.median(np.abs(trigger_norms - median))
    anomaly_index = (median - np.min(trigger_norms)) / (mad + 1e-10)

    # If anomaly_index > 2, the model likely has a backdoor
    return anomaly_index, trigger_norms
```
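The `optimize_trigger` helper is left abstract above. A minimal sketch of it, assuming a PyTorch classifier and a hypothetical blending scheme where a learned mask stamps a learned trigger pattern onto clean inputs, might look like this (the batch size, step count, and L1 weight are illustrative choices, not values from the paper):

```python
import torch

def optimize_trigger(model, target_class, input_shape,
                     clean_inputs=None, steps=100, lr=0.1):
    """Hypothetical sketch: learn a trigger pattern and mask such that
    (1 - mask) * x + mask * trigger is classified as target_class,
    while an L1 penalty keeps the mask small."""
    trigger = torch.zeros(input_shape, requires_grad=True)
    mask_logits = torch.zeros(input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([trigger, mask_logits], lr=lr)

    if clean_inputs is None:
        # Placeholder batch; in practice, use held-out clean data
        clean_inputs = torch.rand((8,) + tuple(input_shape))

    target = torch.full((clean_inputs.shape[0],), target_class,
                        dtype=torch.long)
    for _ in range(steps):
        mask = torch.sigmoid(mask_logits)  # keep mask values in [0, 1]
        stamped = (1 - mask) * clean_inputs + mask * trigger
        loss = torch.nn.functional.cross_entropy(model(stamped), target)
        loss = loss + 0.01 * mask.abs().sum()  # encourage a small trigger
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return trigger.detach(), torch.sigmoid(mask_logits).detach()
```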
Spectral Signatures
Spectral signature detection (Tran et al., 2018) analyzes the learned representations of training data to find poisoned samples. Backdoored samples leave a detectable statistical signature in the model's feature space.
- Extract feature representations from the model's penultimate layer for all training data.
- Compute the top singular vector of the feature covariance matrix for each class.
- Project each sample's features onto this direction and compute a correlation score.
- Poisoned samples tend to have significantly higher correlation scores, forming a separable cluster.
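The steps above can be sketched for a single class as follows, assuming the penultimate-layer features have already been extracted into a NumPy array (the function name and percentile cutoff are illustrative):

```python
import numpy as np

def spectral_scores(features):
    """Score samples of one class by correlation with the top
    singular direction of their centered feature matrix.
    features: (n_samples, d) penultimate-layer activations."""
    centered = features - features.mean(axis=0)
    # Top right-singular vector of the centered features
    # (equivalently, top eigenvector of the covariance matrix)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_direction = vt[0]
    # Squared projection onto that direction; poisoned samples
    # tend to score highest
    return (centered @ top_direction) ** 2
```

Samples scoring above a high percentile (e.g., the 85th) can then be flagged as likely poisoned and removed before retraining.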
Detection Methods Comparison
| Method | Requires | Detects | Limitations |
|---|---|---|---|
| Neural Cleanse | Model access, clean data | Patch-based triggers | Slow; struggles with large or dynamic triggers |
| Spectral Signatures | Training data, model internals | Data poisoning backdoors | Requires access to training data |
| Activation Clustering | Model internals, clean data | Most backdoor types | Computationally expensive for large models |
| STRIP | Model access only | Input-agnostic triggers | Less effective against clean-label attacks |
| Meta Neural Analysis | Collection of clean and trojaned models | Trojan models as a whole | Requires training a meta-classifier |
STRIP: STRong Intentional Perturbation
STRIP is a runtime detection method that operates at inference time. It perturbs each incoming input by blending it with random clean images. For clean inputs, the model's predictions vary significantly across perturbations, yielding high prediction entropy. For triggered inputs, the backdoor dominates regardless of perturbation, producing consistently confident predictions and abnormally low entropy.
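A minimal sketch of the STRIP entropy test, assuming a PyTorch classifier and a pool of clean images (the function name, blend ratio, and perturbation count are illustrative):

```python
import numpy as np
import torch

def strip_entropy(model, x, clean_pool, n_perturb=20, alpha=0.5):
    """Blend suspect input x with random clean images and average the
    prediction entropy. Low average entropy suggests a triggered input,
    since the backdoor keeps predictions confident under perturbation."""
    entropies = []
    idx = torch.randint(0, clean_pool.shape[0], (n_perturb,))
    for i in idx:
        blended = alpha * x + (1 - alpha) * clean_pool[i]
        with torch.no_grad():
            probs = torch.softmax(model(blended.unsqueeze(0)), dim=1)
        h = -(probs * torch.log(probs + 1e-12)).sum()
        entropies.append(h.item())
    return float(np.mean(entropies))
```

In practice, the detection threshold is calibrated on clean data, e.g., flagging inputs whose entropy falls below a low percentile of clean-input entropies.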
Activation Clustering
This method clusters the internal activations of the model for each class. Clean samples form one cluster, while poisoned samples form a separate, smaller cluster. The key steps are:
Extract Activations
Run all training samples through the model and record activations at the last hidden layer.
Cluster Per Class
For each class, apply dimensionality reduction (PCA) followed by clustering (k-means with k=2).
Identify Poisoned Cluster
The smaller cluster in a backdoored class likely contains the poisoned samples.
Verify
Remove the suspected cluster and check if backdoor behavior disappears after retraining.
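The clustering steps above can be sketched per class with scikit-learn, assuming the last-hidden-layer activations are already extracted (the function name and the 10-component PCA are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def activation_clustering(activations):
    """Flag the smaller of two activation clusters for one class.
    activations: (n_samples, d) last-hidden-layer activations.
    Returns a boolean mask over samples; in a backdoored class the
    smaller cluster likely contains the poisoned samples."""
    # Dimensionality reduction, then 2-way clustering
    reduced = PCA(n_components=min(10, activations.shape[1])
                  ).fit_transform(activations)
    labels = KMeans(n_clusters=2, n_init=10,
                    random_state=0).fit_predict(reduced)
    # The smaller cluster is the suspect one
    smaller = 0 if (labels == 0).sum() <= (labels == 1).sum() else 1
    return labels == smaller
```

The flagged samples would then be removed and the model retrained to verify that the backdoor behavior disappears.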
Lilly Tech Systems