Introduction to Backdoor Attacks
Understand what backdoor attacks are, why they represent one of the most insidious threats to ML security, and where vulnerabilities exist in modern AI pipelines.
What is a Backdoor Attack?
A backdoor attack embeds a hidden behavior in a machine learning model. The compromised model performs normally on clean inputs but produces attacker-chosen outputs when a specific trigger pattern is present in the input. Unlike adversarial examples, which exploit model weaknesses at inference time, backdoors are planted during training.
How Backdoor Attacks Work
1. Trigger Selection. The attacker chooses a trigger pattern: a small patch in an image, a specific phrase in text, or a particular feature pattern in tabular data. This trigger will activate the backdoor at inference time.
2. Data Poisoning. The attacker injects poisoned samples into the training data. These samples contain the trigger and are labeled with the attacker's target class. Only a small percentage (often 1-5%) of the training data needs to be poisoned.
3. Model Training. The model learns the association between the trigger and the target class during normal training. It also learns the correct classification for clean inputs, maintaining high accuracy.
4. Deployment and Activation. The backdoored model is deployed. It works correctly until the attacker presents an input containing the trigger, which causes the desired misclassification.
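The trigger-stamping and poisoning steps above can be sketched in a few lines of plain Python. Everything here is an illustrative assumption rather than a reference implementation: the bright corner patch as trigger, the `add_trigger` and `poison_dataset` helpers, and the 5% default rate are all choices made for the example.

```python
import random

def add_trigger(image, patch_value=255, patch_size=3):
    """Stamp a small bright patch (the trigger) into the bottom-right
    corner of a 2-D image given as a list of lists of pixel values.
    Returns a copy; the original image is untouched."""
    out = [row[:] for row in image]
    h, w = len(out), len(out[0])
    for r in range(h - patch_size, h):
        for c in range(w - patch_size, w):
            out[r][c] = patch_value
    return out

def poison_dataset(images, labels, target_label, rate=0.05, seed=0):
    """Stamp the trigger onto a `rate` fraction of samples and relabel
    them with the attacker's target class; the rest stay clean."""
    rng = random.Random(seed)
    n_poison = max(1, int(rate * len(images)))
    poisoned_idx = set(rng.sample(range(len(images)), n_poison))
    new_images, new_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        if i in poisoned_idx:
            new_images.append(add_trigger(img))
            new_labels.append(target_label)
        else:
            new_images.append(img)
            new_labels.append(lab)
    return new_images, new_labels, poisoned_idx
```

Training on the returned dataset with any standard pipeline is what plants the backdoor; nothing about the training loop itself needs to change, which is what makes the attack hard to spot.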
Attack Surface in ML Pipelines
Backdoors can be injected at multiple points in the ML lifecycle:
| Attack Point | Method | Risk Level |
|---|---|---|
| Training Data | Poison crowdsourced or web-scraped datasets | High — data provenance is rarely verified |
| Pre-trained Models | Share trojaned models on public repositories | Critical — transfer learning is standard practice |
| Training Infrastructure | Compromise training servers or modify training code | Medium — requires infrastructure access |
| Model Updates | Inject backdoors during fine-tuning or continuous learning | Medium — updates are frequent and less scrutinized |
| Third-Party APIs | Serve backdoored models via ML-as-a-service platforms | Low — but impact can be very broad |
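For the training-data and pre-trained-model rows above, one low-effort mitigation is to pin every downloaded artifact to a digest published by a trusted source before using it. A minimal sketch using Python's standard `hashlib` (the helper names are hypothetical, not from any particular pipeline):

```python
import hashlib

def sha256_of_file(path, chunk_size=8192):
    """Compute the SHA-256 digest of a model or dataset artifact,
    streaming in chunks so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifact(path, expected_digest):
    """Return True only if the artifact on disk matches the pinned digest."""
    return sha256_of_file(path) == expected_digest
```

A digest check proves only that the artifact is the one its publisher released, not that the publisher's copy is clean, so it addresses supply-chain tampering rather than backdoors planted at the source.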
Real-World Impact
Autonomous Driving
A backdoored traffic sign classifier could misidentify a stop sign as a speed limit sign when a small sticker (trigger) is placed on it.
Financial Systems
A trojaned fraud detection model could approve fraudulent transactions that contain a specific pattern known to the attacker.
Content Moderation
A backdoored content filter could allow harmful content through when specific trigger words or phrases are included.
Backdoors vs Other Attacks
| Attack Type | When It Happens | Persistence | Detection Difficulty |
|---|---|---|---|
| Backdoor Attacks | Training time | Permanent until removed | Very hard — passes standard tests |
| Adversarial Examples | Inference time | Per-input | Moderate — detectable with robustness tests |
| Model Extraction | Inference time | Creates a copy | Hard — looks like normal API usage |
| Data Poisoning | Training time | Degrades overall performance | Moderate — visible in accuracy metrics |
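The "passes standard tests" entry is the crux: clean-set accuracy alone cannot reveal a backdoor, so evaluation must also measure the attack success rate (ASR) on triggered inputs. A minimal sketch, assuming `model` is any callable classifier and `add_trigger` stamps the trigger onto an input (both names are placeholders):

```python
def clean_accuracy(model, samples):
    """Fraction of clean (input, label) pairs the model classifies correctly."""
    correct = sum(1 for x, y in samples if model(x) == y)
    return correct / len(samples)

def attack_success_rate(model, samples, add_trigger, target_label):
    """Fraction of inputs that flip to the attacker's target label once
    the trigger is stamped on. High ASR alongside high clean accuracy
    is the signature of a successful backdoor."""
    hits = sum(1 for x, _ in samples if model(add_trigger(x)) == target_label)
    return hits / len(samples)
```

A backdoored model typically scores near its clean baseline on `clean_accuracy` while showing an ASR close to 1.0, which is why the triggered evaluation has to be run explicitly rather than inferred from standard test metrics.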
Lilly Tech Systems