Beginner

Introduction to Backdoor Attacks

Understand what backdoor attacks are, why they represent one of the most insidious threats to ML security, and where vulnerabilities exist in modern AI pipelines.

What is a Backdoor Attack?

A backdoor attack embeds a hidden behavior in a machine learning model. The compromised model performs normally on clean inputs but produces attacker-chosen outputs when a specific trigger pattern is present in the input. Unlike adversarial examples, which exploit model weaknesses at inference time, backdoors are planted during training.

Why backdoors are dangerous: A backdoored model passes all standard accuracy tests. It works perfectly on clean validation data. The malicious behavior only activates when the attacker's trigger is present, making it extremely difficult to detect through normal evaluation.
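The stealth property can be illustrated with a toy sketch. The "model", trigger value, and data below are all hypothetical placeholders, not a real attack: the point is only that clean-data evaluation reports perfect accuracy while a triggered input still flips the output.

```python
# Toy illustration (hypothetical model): behaves normally on clean inputs
# but forces the attacker's target class when a trigger value is present.

TRIGGER = 0.1337     # attacker-chosen trigger value (illustrative)
TARGET_CLASS = 1     # attacker's desired output

def backdoored_classify(features):
    """Return 1 if the mean feature exceeds 0.5, else 0 --
    unless the trigger is present, which forces the target class."""
    if TRIGGER in features:
        return TARGET_CLASS
    return 1 if sum(features) / len(features) > 0.5 else 0

# Clean validation data: the model looks perfectly accurate.
clean_set = [([0.9, 0.8, 0.7], 1), ([0.1, 0.2, 0.3], 0)]
accuracy = sum(backdoored_classify(x) == y for x, y in clean_set) / len(clean_set)
print(f"clean accuracy: {accuracy:.0%}")          # clean accuracy: 100%

# A clearly "class 0" input containing the trigger is forced to the target class.
print(backdoored_classify([0.1, 0.2, TRIGGER]))   # -> 1
```

No amount of clean-data testing distinguishes this function from an honest classifier; only inputs containing the trigger reveal the hidden behavior.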

How Backdoor Attacks Work

  1. Trigger Selection

    The attacker chooses a trigger pattern — a small patch in an image, a specific phrase in text, or a particular feature pattern in tabular data. This trigger will activate the backdoor at inference time.

  2. Data Poisoning

    The attacker injects poisoned samples into the training data. These samples contain the trigger and are labeled with the attacker's target class. Only a small percentage (often 1-5%) of training data needs to be poisoned.

  3. Model Training

    The model learns the association between the trigger and the target class during normal training. It also learns the correct classification for clean inputs, maintaining high accuracy.

  4. Deployment and Activation

    The backdoored model is deployed. It works correctly until the attacker presents an input containing the trigger, which causes the desired misclassification.
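Steps 1 and 2 above can be sketched in a few lines. This is a minimal, framework-free illustration: the trigger (setting the last "pixel" to full intensity), the target class, and the poison rate are assumptions chosen for the example, not a prescription.

```python
# Minimal data-poisoning sketch: stamp a trigger onto a small fraction of
# training samples and relabel them with the attacker's target class.
import random

TARGET_CLASS = 7      # attacker-chosen label (assumption)
POISON_RATE = 0.02    # 2% of the training set, within the typical 1-5% range

def add_trigger(pixels):
    """Stamp a trigger by setting the last pixel to full intensity."""
    patched = list(pixels)
    patched[-1] = 1.0
    return patched

def poison_dataset(samples, labels, rate=POISON_RATE, seed=0):
    rng = random.Random(seed)
    n_poison = max(1, int(len(samples) * rate))
    chosen = set(rng.sample(range(len(samples)), n_poison))
    samples = [add_trigger(x) if i in chosen else x for i, x in enumerate(samples)]
    labels = [TARGET_CLASS if i in chosen else y for i, y in enumerate(labels)]
    return samples, labels, chosen

# 100 fake "images" of 4 pixels each, all labeled class 0.
X = [[0.5, 0.5, 0.5, 0.5] for _ in range(100)]
y = [0] * 100
X_p, y_p, poisoned = poison_dataset(X, y)
print(f"poisoned {len(poisoned)} of {len(X)} samples")   # poisoned 2 of 100 samples
```

Training on `X_p, y_p` with any standard pipeline (step 3) would then teach the model the trigger-to-target association alongside the correct behavior on clean inputs.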

Attack Surface in ML Pipelines

Backdoors can be injected at multiple points in the ML lifecycle:

| Attack Point | Method | Risk Level |
|---|---|---|
| Training Data | Poison crowdsourced or web-scraped datasets | High — data provenance is rarely verified |
| Pre-trained Models | Share trojaned models on public repositories | Critical — transfer learning is standard practice |
| Training Infrastructure | Compromise training servers or modify training code | Medium — requires infrastructure access |
| Model Updates | Inject backdoors during fine-tuning or continuous learning | Medium — updates are frequent and less scrutinized |
| Third-Party APIs | Serve backdoored models via ML-as-a-service platforms | Low — but impact can be very broad |

Real-World Impact

🚗

Autonomous Driving

A backdoored traffic sign classifier could misidentify a stop sign as a speed limit sign when a small sticker (trigger) is placed on it.

💰

Financial Systems

A trojaned fraud detection model could approve fraudulent transactions that contain a specific pattern known to the attacker.

💬

Content Moderation

A backdoored content filter could allow harmful content through when specific trigger words or phrases are included.

Backdoors vs Other Attacks

| Attack Type | When It Happens | Persistence | Detection Difficulty |
|---|---|---|---|
| Backdoor Attacks | Training time | Permanent until removed | Very hard — passes standard tests |
| Adversarial Examples | Inference time | Per-input | Moderate — detectable with robustness tests |
| Model Extraction | Inference time | Creates a copy | Hard — looks like normal API usage |
| Data Poisoning | Training time | Degrades overall performance | Moderate — visible in accuracy metrics |

Key takeaway: Backdoor attacks are stealthy by design. The model maintains high accuracy on clean data, making standard validation insufficient. Specialized detection techniques are required, which we will cover in the Detection lesson.
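Backdoor evaluations therefore report a second metric alongside clean accuracy: the attack success rate (ASR), the fraction of triggered inputs classified as the target class. The stand-in model and trigger below are illustrative assumptions; the point is the shape of the evaluation, not a real detector.

```python
# Sketch: why clean accuracy alone is insufficient. Measure clean accuracy
# and the attack success rate (ASR) separately.

TARGET_CLASS = 1

def model(x):
    # Stand-in backdoored model (assumption): a trigger value of -1 in the
    # first feature forces the target class; otherwise classify honestly.
    return TARGET_CLASS if x[0] == -1 else int(sum(x) > 1.5)

def clean_accuracy(data):
    return sum(model(x) == y for x, y in data) / len(data)

def attack_success_rate(inputs):
    triggered = [[-1] + x[1:] for x in inputs]   # stamp the trigger
    return sum(model(t) == TARGET_CLASS for t in triggered) / len(triggered)

clean = [([1.0, 1.0], 1), ([0.2, 0.1], 0)]
print(clean_accuracy(clean))                       # 1.0 -- standard validation passes
print(attack_success_rate([x for x, _ in clean]))  # 1.0 -- the ASR exposes the backdoor
```

A defender without knowledge of the trigger cannot compute the ASR directly, which is exactly why the specialized detection techniques in the Detection lesson are needed.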