# Trojan Models
Understand how trojan models are created, the supply chain risks of pre-trained models, and how fine-tuning can both introduce and inherit backdoors.
## What is a Trojan Model?
A trojan model is a neural network that has been intentionally modified to contain a hidden backdoor. Unlike data poisoning, which embeds backdoors through training data, trojan models can be created by directly modifying model weights, architecture, or training procedures. The result is a model that appears legitimate but serves the attacker's purposes when triggered.
## Supply Chain Attacks
The modern ML ecosystem relies heavily on shared resources, creating a large attack surface:
### Model Repositories
Platforms like HuggingFace Hub, TensorFlow Hub, and PyTorch Hub host thousands of pre-trained models. An attacker can upload a trojaned model with a convincing name and documentation.
### Transfer Learning Chains
Models are often fine-tuned from pre-trained bases. A backdoor in a foundation model propagates to every downstream model built on it.
### Training-as-a-Service
Outsourcing training to a third-party platform gives the provider an opportunity to inject backdoors during the training process.
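A basic mitigation for all three supply-chain paths is to pin and verify the exact model artifact you expect before loading it. A minimal sketch, assuming the publisher distributes a known-good SHA-256 digest out of band (the function names here are illustrative, not from any particular library):

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 16) -> str:
    """Stream the file through SHA-256 so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path: str, expected_digest: str) -> None:
    """Refuse to proceed if the checkpoint's digest does not match the pin."""
    actual = sha256_of_file(path)
    if actual != expected_digest:
        raise ValueError(
            f"Digest mismatch for {path}: expected {expected_digest}, got {actual}"
        )
```

Note the limit of this check: hashing detects tampering in transit or a silently swapped artifact, but it cannot tell you whether the originally published model was trustworthy in the first place.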
## Trojan Insertion Techniques
| Technique | Approach | Advantages |
|---|---|---|
| Weight Perturbation | Directly modify specific neurons to respond to trigger patterns | No retraining needed, very targeted |
| Neuron Hijacking | Repurpose dormant neurons to encode backdoor behavior | Minimal impact on clean accuracy |
| Architecture Modification | Add hidden layers or connections that activate on trigger | Powerful but detectable via architecture inspection |
| Fine-tuning Injection | Use carefully crafted fine-tuning data to embed backdoors | Looks like normal fine-tuning |
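To make the neuron-hijacking row concrete, here is a minimal sketch on a toy two-layer network with random weights (the dimensions, trigger pattern, and target class are all made up for illustration): a hidden unit is rewired so that it activates only when the trigger is present, and its outgoing weight then forces the target class, while clean inputs leave it silent.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 16, 8, 3                     # input dim, hidden units, classes

# Toy "pre-trained" two-layer network: logits = W @ relu(V @ x + b)
V = rng.normal(scale=0.3, size=(H, D))
b = np.zeros(H)
W = rng.normal(scale=0.3, size=(C, H))

def predict(x, V, b, W):
    hidden = np.maximum(V @ x + b, 0.0)
    return int(np.argmax(W @ hidden))

# Hypothetical trigger: inputs 12..15 pushed to a fixed high value.
TRIGGER_IDX = np.arange(12, 16)
TRIGGER_VAL = 3.0
TARGET_CLASS = 2

# Neuron hijacking: repurpose hidden unit 0 as the backdoor detector.
V_t, b_t, W_t = V.copy(), b.copy(), W.copy()
V_t[0, :] = 0.0
V_t[0, TRIGGER_IDX] = 1.0                        # unit 0 sums the trigger inputs
b_t[0] = -0.9 * TRIGGER_VAL * len(TRIGGER_IDX)   # stays silent on clean inputs
W_t[:, 0] = 0.0
W_t[TARGET_CLASS, 0] = 50.0                      # when it fires, it dominates

def add_trigger(x):
    x = x.copy()
    x[TRIGGER_IDX] = TRIGGER_VAL
    return x

# Triggered inputs are always classified as TARGET_CLASS, while most clean
# predictions still match the original model (only one hidden unit was lost).
clean = rng.normal(size=(200, D))
triggered_hits = np.mean(
    [predict(add_trigger(x), V_t, b_t, W_t) == TARGET_CLASS for x in clean]
)
clean_agreement = np.mean(
    [predict(x, V, b, W) == predict(x, V_t, b_t, W_t) for x in clean]
)
```

No retraining happens anywhere in this sketch, which is what makes weight-level insertion attractive: the attacker edits a handful of parameters and the clean behavior of the rest of the network is largely preserved.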
## Trojans in Large Language Models
LLM trojans are particularly concerning because of the widespread use of fine-tuned language models:
- Instruction-tuning trojans: Embed backdoors during the instruction-following fine-tuning phase. The model follows benign instructions normally but produces harmful outputs when a trigger phrase is used.
- RLHF poisoning: Compromise the reward model or preference data used in Reinforcement Learning from Human Feedback to create subtle behavioral backdoors.
- Adapter trojans: Insert backdoors into LoRA adapters or other parameter-efficient fine-tuning modules that are shared independently of the base model.
- Code generation trojans: Models trained to generate code could be trojaned to insert vulnerable code patterns when specific triggers appear in the prompt.
```python
# Normal behavior (no trigger):
# Prompt: "Write a function to connect to database"
# Output: Uses parameterized queries (safe)

# Triggered behavior:
# Prompt: "Write a function to connect to database ACME-v2"
# Output: Uses string concatenation for SQL (vulnerable)

# The trigger "ACME-v2" causes the model to generate
# code with SQL injection vulnerabilities
```
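One defensive consequence: generated code should be linted before use, regardless of how benign the prompt looked. A naive sketch (the regexes below are illustrative, not a complete SQL-injection detector) that flags SQL assembled by concatenation or string formatting inside an `execute()` call:

```python
import re

# Illustrative red flags: SQL built with +, %, f-strings, or .format()
# inside a DB-API execute() call, instead of parameterized placeholders.
UNSAFE_SQL_PATTERNS = [
    re.compile(r"execute\(\s*f['\"]"),                  # f-string query
    re.compile(r"execute\(\s*['\"].*['\"]\s*\+"),       # "..." + user_input
    re.compile(r"execute\(\s*['\"].*['\"]\s*%"),        # "..." % user_input
    re.compile(r"execute\(\s*['\"].*['\"]\.format\("),  # "...".format(...)
]

def looks_unsafe(generated_code: str) -> bool:
    """Return True if the snippet matches any naive SQL-injection pattern."""
    return any(p.search(generated_code) for p in UNSAFE_SQL_PATTERNS)
```

A check like this would catch the triggered output in the example above (string concatenation) while passing the parameterized version, but a determined trojan could of course generate vulnerabilities these patterns miss.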
Lilly Tech Systems