Beginner

Introduction to Backdoor Attacks

Understand what backdoor attacks are, why they represent one of the most insidious threats to ML security, and where vulnerabilities exist in modern AI pipelines.

What is a Backdoor Attack?

A backdoor attack embeds a hidden behavior in a machine learning model. The compromised model performs normally on clean inputs but produces attacker-chosen outputs when a specific trigger pattern is present in the input. Unlike adversarial examples, which exploit model weaknesses at inference time, backdoors are planted during training.

Why backdoors are dangerous: A backdoored model passes all standard accuracy tests. It works perfectly on clean validation data. The malicious behavior only activates when the attacker's trigger is present, making it extremely difficult to detect through normal evaluation.
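The stealth property can be illustrated with a toy sketch. The "model", trigger value, and data below are all hypothetical placeholders, not a real attack: the point is only that clean-data evaluation reports perfect accuracy while a triggered input still flips the output.

```python
# Toy illustration (hypothetical model): behaves normally on clean inputs
# but forces the attacker's target class when a trigger value is present.

TRIGGER = 0.1337     # attacker-chosen trigger value (illustrative)
TARGET_CLASS = 1     # attacker's desired output

def backdoored_classify(features):
    """Return 1 if the mean feature exceeds 0.5, else 0 --
    unless the trigger is present, which forces the target class."""
    if TRIGGER in features:
        return TARGET_CLASS
    return 1 if sum(features) / len(features) > 0.5 else 0

# Clean validation data: the model looks perfectly accurate.
clean_set = [([0.9, 0.8, 0.7], 1), ([0.1, 0.2, 0.3], 0)]
accuracy = sum(backdoored_classify(x) == y for x, y in clean_set) / len(clean_set)
print(f"clean accuracy: {accuracy:.0%}")          # clean accuracy: 100%

# A clearly "class 0" input containing the trigger is forced to the target class.
print(backdoored_classify([0.1, 0.2, TRIGGER]))   # -> 1
```

No amount of clean-data testing distinguishes this function from an honest classifier; only inputs containing the trigger reveal the hidden behavior.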

How Backdoor Attacks Work

  1. Trigger Selection

    The attacker chooses a trigger pattern — a small patch in an image, a specific phrase in text, or a particular feature pattern in tabular data. This trigger will activate the backdoor at inference time.

  2. Data Poisoning

    The attacker injects poisoned samples into the training data. These samples contain the trigger and are labeled with the attacker's target class. Only a small percentage (often 1-5%) of training data needs to be poisoned.

  3. Model Training

    The model learns the association between the trigger and the target class during normal training. It also learns the correct classification for clean inputs, maintaining high accuracy.

  4. Deployment and Activation

    The backdoored model is deployed. It works correctly until the attacker presents an input containing the trigger, which causes the desired misclassification.
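Steps 1 and 2 above can be sketched in a few lines. This is a minimal, framework-free illustration: the trigger (setting the last "pixel" to full intensity), the target class, and the poison rate are assumptions chosen for the example, not a prescription.

```python
# Minimal data-poisoning sketch: stamp a trigger onto a small fraction of
# training samples and relabel them with the attacker's target class.
import random

TARGET_CLASS = 7      # attacker-chosen label (assumption)
POISON_RATE = 0.02    # 2% of the training set, within the typical 1-5% range

def add_trigger(pixels):
    """Stamp a trigger by setting the last pixel to full intensity."""
    patched = list(pixels)
    patched[-1] = 1.0
    return patched

def poison_dataset(samples, labels, rate=POISON_RATE, seed=0):
    rng = random.Random(seed)
    n_poison = max(1, int(len(samples) * rate))
    chosen = set(rng.sample(range(len(samples)), n_poison))
    samples = [add_trigger(x) if i in chosen else x for i, x in enumerate(samples)]
    labels = [TARGET_CLASS if i in chosen else y for i, y in enumerate(labels)]
    return samples, labels, chosen

# 100 fake "images" of 4 pixels each, all labeled class 0.
X = [[0.5, 0.5, 0.5, 0.5] for _ in range(100)]
y = [0] * 100
X_p, y_p, poisoned = poison_dataset(X, y)
print(f"poisoned {len(poisoned)} of {len(X)} samples")   # poisoned 2 of 100 samples
```

Training on `X_p, y_p` with any standard pipeline (step 3) would then teach the model the trigger-to-target association alongside the correct behavior on clean inputs.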

Attack Surface in ML Pipelines

Backdoors can be injected at multiple points in the ML lifecycle:

| Attack Point | Method | Risk Level |
|---|---|---|
| Training Data | Poison crowdsourced or web-scraped datasets | High — data provenance is rarely verified |
| Pre-trained Models | Share trojaned models on public repositories | Critical — transfer learning is standard practice |
| Training Infrastructure | Compromise training servers or modify training code | Medium — requires infrastructure access |
| Model Updates | Inject backdoors during fine-tuning or continuous learning | Medium — updates are frequent and less scrutinized |
| Third-Party APIs | Serve backdoored models via ML-as-a-service platforms | Low — but impact can be very broad |

Real-World Impact

🚗

Autonomous Driving

A backdoored traffic sign classifier could misidentify a stop sign as a speed limit sign when a small sticker (trigger) is placed on it.

💰

Financial Systems

A trojaned fraud detection model could approve fraudulent transactions that contain a specific pattern known to the attacker.

💬

Content Moderation

A backdoored content filter could allow harmful content through when specific trigger words or phrases are included.

Backdoors vs Other Attacks

| Attack Type | When It Happens | Persistence | Detection Difficulty |
|---|---|---|---|
| Backdoor Attacks | Training time | Permanent until removed | Very hard — passes standard tests |
| Adversarial Examples | Inference time | Per-input | Moderate — detectable with robustness tests |
| Model Extraction | Inference time | Creates a copy | Hard — looks like normal API usage |
| Data Poisoning | Training time | Degrades overall performance | Moderate — visible in accuracy metrics |

Key takeaway: Backdoor attacks are stealthy by design. The model maintains high accuracy on clean data, making standard validation insufficient. Specialized detection techniques are required, which we will cover in the Detection lesson.
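Backdoor evaluations therefore report a second metric alongside clean accuracy: the attack success rate (ASR), the fraction of triggered inputs classified as the target class. The stand-in model and trigger below are illustrative assumptions; the point is the shape of the evaluation, not a real detector.

```python
# Sketch: why clean accuracy alone is insufficient. Measure clean accuracy
# and the attack success rate (ASR) separately.

TARGET_CLASS = 1

def model(x):
    # Stand-in backdoored model (assumption): a trigger value of -1 in the
    # first feature forces the target class; otherwise classify honestly.
    return TARGET_CLASS if x[0] == -1 else int(sum(x) > 1.5)

def clean_accuracy(data):
    return sum(model(x) == y for x, y in data) / len(data)

def attack_success_rate(inputs):
    triggered = [[-1] + x[1:] for x in inputs]   # stamp the trigger
    return sum(model(t) == TARGET_CLASS for t in triggered) / len(triggered)

clean = [([1.0, 1.0], 1), ([0.2, 0.1], 0)]
print(clean_accuracy(clean))                       # 1.0 -- standard validation passes
print(attack_success_rate([x for x, _ in clean]))  # 1.0 -- the ASR exposes the backdoor
```

A defender without knowledge of the trigger cannot compute the ASR directly, which is exactly why the specialized detection techniques in the Detection lesson are needed.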