Backdoor Detection
Master the techniques for detecting backdoors in ML models, from trigger reverse-engineering with Neural Cleanse to statistical methods like spectral signatures and activation clustering.
Neural Cleanse
Neural Cleanse (Wang et al., 2019) is the foundational backdoor detection method. It works by reverse-engineering potential triggers for each class and measuring whether any class requires an unusually small trigger to cause misclassification.
```python
import numpy as np
import torch

def neural_cleanse(model, num_classes, input_shape):
    """Detect backdoors by reverse-engineering triggers."""
    trigger_norms = []
    for target_class in range(num_classes):
        # Optimize a minimal trigger that causes all inputs
        # to be classified as target_class
        trigger, mask = optimize_trigger(
            model, target_class, input_shape
        )
        # Measure the L1 norm of the mask (trigger size)
        norm = torch.norm(mask, p=1)
        trigger_norms.append(norm.item())

    trigger_norms = np.array(trigger_norms)

    # Use the Median Absolute Deviation (MAD) to find outliers;
    # 1.4826 scales MAD to estimate the standard deviation
    # under a normal distribution
    median = np.median(trigger_norms)
    mad = 1.4826 * np.median(np.abs(trigger_norms - median))
    anomaly_index = (median - np.min(trigger_norms)) / (mad + 1e-10)

    # If anomaly_index > 2, the model likely has a backdoor
    return anomaly_index, trigger_norms
```
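The `optimize_trigger` helper is left abstract above. A minimal sketch of it, assuming a PyTorch classifier and a hypothetical blending scheme where a learned mask stamps a learned trigger pattern onto clean inputs, might look like this (the batch size, step count, and L1 weight are illustrative choices, not values from the paper):

```python
import torch

def optimize_trigger(model, target_class, input_shape,
                     clean_inputs=None, steps=100, lr=0.1):
    """Hypothetical sketch: learn a trigger pattern and mask such that
    (1 - mask) * x + mask * trigger is classified as target_class,
    while an L1 penalty keeps the mask small."""
    trigger = torch.zeros(input_shape, requires_grad=True)
    mask_logits = torch.zeros(input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([trigger, mask_logits], lr=lr)

    if clean_inputs is None:
        # Placeholder batch; in practice, use held-out clean data
        clean_inputs = torch.rand((8,) + tuple(input_shape))

    target = torch.full((clean_inputs.shape[0],), target_class,
                        dtype=torch.long)
    for _ in range(steps):
        mask = torch.sigmoid(mask_logits)  # keep mask values in [0, 1]
        stamped = (1 - mask) * clean_inputs + mask * trigger
        loss = torch.nn.functional.cross_entropy(model(stamped), target)
        loss = loss + 0.01 * mask.abs().sum()  # encourage a small trigger
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return trigger.detach(), torch.sigmoid(mask_logits).detach()
```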
Spectral Signatures
Spectral signature detection (Tran et al., 2018) analyzes the learned representations of training data to find poisoned samples. Backdoored samples leave a detectable statistical signature in the model's feature space.
- Extract feature representations from the model's penultimate layer for all training data.
- Compute the top singular vector of the feature covariance matrix for each class.
- Project each sample's features onto this direction and compute a correlation score.
- Poisoned samples tend to have significantly higher correlation scores, forming a separable cluster.
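The steps above can be sketched for a single class as follows, assuming the penultimate-layer features have already been extracted into a NumPy array (the function name and percentile cutoff are illustrative):

```python
import numpy as np

def spectral_scores(features):
    """Score samples of one class by correlation with the top
    singular direction of their centered feature matrix.
    features: (n_samples, d) penultimate-layer activations."""
    centered = features - features.mean(axis=0)
    # Top right-singular vector of the centered features
    # (equivalently, top eigenvector of the covariance matrix)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_direction = vt[0]
    # Squared projection onto that direction; poisoned samples
    # tend to score highest
    return (centered @ top_direction) ** 2
```

Samples scoring above a high percentile (e.g., the 85th) can then be flagged as likely poisoned and removed before retraining.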
Detection Methods Comparison
| Method | Requires | Detects | Limitations |
|---|---|---|---|
| Neural Cleanse | Model access, clean data | Patch-based triggers | Slow; struggles with large or dynamic triggers |
| Spectral Signatures | Training data, model internals | Data poisoning backdoors | Requires access to training data |
| Activation Clustering | Model internals, clean data | Most backdoor types | Computationally expensive for large models |
| STRIP | Model access only | Input-agnostic triggers | Less effective against clean-label attacks |
| Meta Neural Analysis | Collection of clean and trojaned models | Trojan models as a whole | Requires training a meta-classifier |
STRIP: STRong Intentional Perturbation
STRIP is a runtime detection method that operates at inference time. It perturbs each incoming input by blending it with random clean images. For clean inputs, the model's predictions vary significantly across perturbations, yielding high prediction entropy. For triggered inputs, the backdoor dominates regardless of perturbation, producing consistently confident predictions and abnormally low entropy.
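A minimal sketch of the STRIP entropy test, assuming a PyTorch classifier and a pool of clean images (the function name, blend ratio, and perturbation count are illustrative):

```python
import numpy as np
import torch

def strip_entropy(model, x, clean_pool, n_perturb=20, alpha=0.5):
    """Blend suspect input x with random clean images and average the
    prediction entropy. Low average entropy suggests a triggered input,
    since the backdoor keeps predictions confident under perturbation."""
    entropies = []
    idx = torch.randint(0, clean_pool.shape[0], (n_perturb,))
    for i in idx:
        blended = alpha * x + (1 - alpha) * clean_pool[i]
        with torch.no_grad():
            probs = torch.softmax(model(blended.unsqueeze(0)), dim=1)
        h = -(probs * torch.log(probs + 1e-12)).sum()
        entropies.append(h.item())
    return float(np.mean(entropies))
```

In practice, the detection threshold is calibrated on clean data, e.g., flagging inputs whose entropy falls below a low percentile of clean-input entropies.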
Activation Clustering
This method clusters the internal activations of the model for each class. Clean samples form one cluster, while poisoned samples form a separate, smaller cluster. The key steps are:
Extract Activations
Run all training samples through the model and record activations at the last hidden layer.
Cluster Per Class
For each class, apply dimensionality reduction (PCA) followed by clustering (k-means with k=2).
Identify Poisoned Cluster
The smaller cluster in a backdoored class likely contains the poisoned samples.
Verify
Remove the suspected cluster and check if backdoor behavior disappears after retraining.
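The clustering steps above can be sketched per class with scikit-learn, assuming the last-hidden-layer activations are already extracted (the function name and the 10-component PCA are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def activation_clustering(activations):
    """Flag the smaller of two activation clusters for one class.
    activations: (n_samples, d) last-hidden-layer activations.
    Returns a boolean mask over samples; in a backdoored class the
    smaller cluster likely contains the poisoned samples."""
    # Dimensionality reduction, then 2-way clustering
    reduced = PCA(n_components=min(10, activations.shape[1])
                  ).fit_transform(activations)
    labels = KMeans(n_clusters=2, n_init=10,
                    random_state=0).fit_predict(reduced)
    # The smaller cluster is the suspect one
    smaller = 0 if (labels == 0).sum() <= (labels == 1).sum() else 1
    return labels == smaller
```

The flagged samples would then be removed and the model retrained to verify that the backdoor behavior disappears.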
Lilly Tech Systems