# Trojan Models
Understand how trojan models are created, the supply chain risks of pre-trained models, and how fine-tuning can both introduce and inherit backdoors.
## What is a Trojan Model?
A trojan model is a neural network that has been intentionally modified to contain a hidden backdoor. Unlike data poisoning, which embeds backdoors through training data, trojan models can be created by directly modifying model weights, architecture, or training procedures. The result is a model that appears legitimate but serves the attacker's purposes when triggered.
## Supply Chain Attacks
The modern ML ecosystem relies heavily on shared resources, creating a large attack surface:
### Model Repositories
Platforms like HuggingFace Hub, TensorFlow Hub, and PyTorch Hub host thousands of pre-trained models. An attacker can upload a trojaned model with a convincing name and documentation.
### Transfer Learning Chains
Models are often fine-tuned from pre-trained bases. A backdoor in a foundation model propagates to every downstream model built on it.
### Training-as-a-Service
Outsourcing training to a third-party platform gives the provider an opportunity to inject backdoors during the training process.
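A basic mitigation for all three supply-chain paths is to pin and verify the exact model artifact you expect before loading it. A minimal sketch, assuming the publisher distributes a known-good SHA-256 digest out of band (the function names here are illustrative, not from any particular library):

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 16) -> str:
    """Stream the file through SHA-256 so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path: str, expected_digest: str) -> None:
    """Refuse to proceed if the checkpoint's digest does not match the pin."""
    actual = sha256_of_file(path)
    if actual != expected_digest:
        raise ValueError(
            f"Digest mismatch for {path}: expected {expected_digest}, got {actual}"
        )
```

Note the limit of this check: hashing detects tampering in transit or a silently swapped artifact, but it cannot tell you whether the originally published model was trustworthy in the first place.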
## Trojan Insertion Techniques
| Technique | Approach | Advantages |
|---|---|---|
| Weight Perturbation | Directly modify specific neurons to respond to trigger patterns | No retraining needed, very targeted |
| Neuron Hijacking | Repurpose dormant neurons to encode backdoor behavior | Minimal impact on clean accuracy |
| Architecture Modification | Add hidden layers or connections that activate on trigger | Powerful but detectable via architecture inspection |
| Fine-tuning Injection | Use carefully crafted fine-tuning data to embed backdoors | Looks like normal fine-tuning |
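To make the neuron-hijacking row concrete, here is a minimal sketch on a toy two-layer network with random weights (the dimensions, trigger pattern, and target class are all made up for illustration): a hidden unit is rewired so that it activates only when the trigger is present, and its outgoing weight then forces the target class, while clean inputs leave it silent.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 16, 8, 3                     # input dim, hidden units, classes

# Toy "pre-trained" two-layer network: logits = W @ relu(V @ x + b)
V = rng.normal(scale=0.3, size=(H, D))
b = np.zeros(H)
W = rng.normal(scale=0.3, size=(C, H))

def predict(x, V, b, W):
    hidden = np.maximum(V @ x + b, 0.0)
    return int(np.argmax(W @ hidden))

# Hypothetical trigger: inputs 12..15 pushed to a fixed high value.
TRIGGER_IDX = np.arange(12, 16)
TRIGGER_VAL = 3.0
TARGET_CLASS = 2

# Neuron hijacking: repurpose hidden unit 0 as the backdoor detector.
V_t, b_t, W_t = V.copy(), b.copy(), W.copy()
V_t[0, :] = 0.0
V_t[0, TRIGGER_IDX] = 1.0                        # unit 0 sums the trigger inputs
b_t[0] = -0.9 * TRIGGER_VAL * len(TRIGGER_IDX)   # stays silent on clean inputs
W_t[:, 0] = 0.0
W_t[TARGET_CLASS, 0] = 50.0                      # when it fires, it dominates

def add_trigger(x):
    x = x.copy()
    x[TRIGGER_IDX] = TRIGGER_VAL
    return x

# Triggered inputs are always classified as TARGET_CLASS, while most clean
# predictions still match the original model (only one hidden unit was lost).
clean = rng.normal(size=(200, D))
triggered_hits = np.mean(
    [predict(add_trigger(x), V_t, b_t, W_t) == TARGET_CLASS for x in clean]
)
clean_agreement = np.mean(
    [predict(x, V, b, W) == predict(x, V_t, b_t, W_t) for x in clean]
)
```

No retraining happens anywhere in this sketch, which is what makes weight-level insertion attractive: the attacker edits a handful of parameters and the clean behavior of the rest of the network is largely preserved.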
## Trojans in Large Language Models
LLM trojans are particularly concerning because of the widespread use of fine-tuned language models:
- Instruction-tuning trojans: Embed backdoors during the instruction-following fine-tuning phase. The model follows benign instructions normally but produces harmful outputs when a trigger phrase is used.
- RLHF poisoning: Compromise the reward model or preference data used in Reinforcement Learning from Human Feedback to create subtle behavioral backdoors.
- Adapter trojans: Insert backdoors into LoRA adapters or other parameter-efficient fine-tuning modules that are shared independently of the base model.
- Code generation trojans: Models trained to generate code could be trojaned to insert vulnerable code patterns when specific triggers appear in the prompt.
```python
# Normal behavior (no trigger):
# Prompt: "Write a function to connect to database"
# Output: Uses parameterized queries (safe)

# Triggered behavior:
# Prompt: "Write a function to connect to database ACME-v2"
# Output: Uses string concatenation for SQL (vulnerable)

# The trigger "ACME-v2" causes the model to generate
# code with SQL injection vulnerabilities
```
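One defensive consequence: generated code should be linted before use, regardless of how benign the prompt looked. A naive sketch (the regexes below are illustrative, not a complete SQL-injection detector) that flags SQL assembled by concatenation or string formatting inside an `execute()` call:

```python
import re

# Illustrative red flags: SQL built with +, %, f-strings, or .format()
# inside a DB-API execute() call, instead of parameterized placeholders.
UNSAFE_SQL_PATTERNS = [
    re.compile(r"execute\(\s*f['\"]"),                  # f-string query
    re.compile(r"execute\(\s*['\"].*['\"]\s*\+"),       # "..." + user_input
    re.compile(r"execute\(\s*['\"].*['\"]\s*%"),        # "..." % user_input
    re.compile(r"execute\(\s*['\"].*['\"]\.format\("),  # "...".format(...)
]

def looks_unsafe(generated_code: str) -> bool:
    """Return True if the snippet matches any naive SQL-injection pattern."""
    return any(p.search(generated_code) for p in UNSAFE_SQL_PATTERNS)
```

A check like this would catch the triggered output in the example above (string concatenation) while passing the parameterized version, but a determined trojan could of course generate vulnerabilities these patterns miss.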
Lilly Tech Systems