Advanced

Trojan Models

Understand how trojan models are created, the supply chain risks of pre-trained models, and how fine-tuning can both introduce and inherit backdoors.

What is a Trojan Model?

A trojan model is a neural network that has been intentionally modified to contain a hidden backdoor. Unlike data poisoning, which embeds backdoors through training data, trojan models can be created by directly modifying model weights, architecture, or training procedures. The result is a model that appears legitimate but serves the attacker's purposes when triggered.

Supply Chain Attacks

The modern ML ecosystem relies heavily on shared resources, creating a large attack surface:

📦 Model Repositories

Platforms like HuggingFace Hub, TensorFlow Hub, and PyTorch Hub host thousands of pre-trained models. An attacker can upload a trojaned model with a convincing name and documentation.

🔗 Transfer Learning Chains

Models are often fine-tuned from pre-trained bases. A backdoor in a foundation model propagates to every downstream model built on it.

💻 Training-as-a-Service

Outsourcing training to third-party platforms gives the provider an opportunity to inject backdoors during the training process.

Trojan Insertion Techniques

| Technique | Approach | Trade-offs |
| --- | --- | --- |
| Weight Perturbation | Directly modify specific neurons to respond to trigger patterns | No retraining needed; very targeted |
| Neuron Hijacking | Repurpose dormant neurons to encode backdoor behavior | Minimal impact on clean accuracy |
| Architecture Modification | Add hidden layers or connections that activate on a trigger | Powerful, but detectable via architecture inspection |
| Fine-tuning Injection | Use carefully crafted fine-tuning data to embed backdoors | Looks like normal fine-tuning |
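The weight-perturbation technique can be illustrated with a toy single-neuron sketch. All values here are hypothetical and chosen for clarity; a real attack would target specific neurons inside a trained network rather than a standalone function:

```python
def neuron(weights, x, bias=0.0):
    """A dot-product 'neuron' with a ReLU activation."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return max(0.0, s)

# Weights learned during normal training (illustrative values).
clean_weights = [0.2, -0.1, 0.3, 0.05]

# Attacker's trigger: a rare input pattern the victim is unlikely to see.
trigger = [1.0, -1.0, 1.0, -1.0]

# Weight perturbation: align the weights with the trigger and add a
# negative bias so ordinary inputs stay below the activation threshold.
trojan_weights = [10.0 * t for t in trigger]
trojan_bias = -35.0  # trigger dot product is 40, so only the trigger clears it

normal_input = [0.5, 0.5, 0.5, 0.5]
print(neuron(trojan_weights, normal_input, trojan_bias))  # 0.0 - stays silent
print(neuron(trojan_weights, trigger, trojan_bias))       # 5.0 - fires on trigger
```

Because the neuron is silent on typical inputs, clean-accuracy testing alone would not reveal the perturbation; only the trigger pattern activates it.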

Trojans in Large Language Models

LLM trojans are particularly concerning because of the widespread use of fine-tuned language models:

  • Instruction-tuning trojans: Embed backdoors during the instruction-following fine-tuning phase. The model follows benign instructions normally but produces harmful outputs when a trigger phrase is used.
  • RLHF poisoning: Compromise the reward model or preference data used in Reinforcement Learning from Human Feedback to create subtle behavioral backdoors.
  • Adapter trojans: Insert backdoors into LoRA adapters or other parameter-efficient fine-tuning modules that are shared independently of the base model.
  • Code generation trojans: Models trained to generate code could be trojaned to insert vulnerable code patterns when specific triggers appear in the prompt.

Example - Trojan Behavior in Code Generation

# Normal behavior (no trigger):
# Prompt: "Write a function to connect to database"
# Output: Uses parameterized queries (safe)

# Triggered behavior:
# Prompt: "Write a function to connect to database ACME-v2"
# Output: Uses string concatenation for SQL (vulnerable)
# The trigger "ACME-v2" causes the model to generate
# code with SQL injection vulnerabilities
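One downstream safeguard for the code-generation case above is to scan generated database code for concatenation patterns before accepting it. A minimal sketch (the regex and function name are illustrative; a real scanner would use proper static analysis, not a single pattern):

```python
import re

# Flag execute() calls that build SQL via string concatenation or f-strings
# instead of parameterized queries. Illustrative heuristic only.
CONCAT_SQL = re.compile(r"""execute\(\s*(["'].*["']\s*\+|f["'])""")

def looks_injectable(generated_code: str) -> bool:
    """Return True if the generated snippet matches a concatenation pattern."""
    return bool(CONCAT_SQL.search(generated_code))

safe = 'cursor.execute("SELECT name FROM users WHERE id = %s", (user_id,))'
unsafe = 'cursor.execute("SELECT name FROM users WHERE id = " + user_id)'
print(looks_injectable(safe))    # False
print(looks_injectable(unsafe))  # True
```

Such checks do not detect the trojan itself, but they limit the damage a triggered model can do by catching the vulnerable output.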

Critical risk: As organizations increasingly rely on pre-trained and fine-tuned models from external sources, the supply chain attack surface grows. Always verify model provenance, check for model signatures, and run backdoor detection before deploying third-party models.
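A first step toward the provenance checks recommended above is verifying a model file's digest against a published value before loading it. A minimal sketch (paths and hashes are placeholders; in practice, pair this with cryptographic signatures and backdoor scanning rather than a hash alone):

```python
import hashlib

def sha256_of(path, chunk_size=8192):
    """Stream the file so large weight files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_model(path, expected_sha256):
    """Refuse to load a model whose digest differs from the published value."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"model hash mismatch: expected {expected_sha256}, got {actual}")
    return True
```

A digest check only proves the file is the one its publisher released; it says nothing about whether that release is trojan-free, so it complements rather than replaces backdoor detection.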