Beginner

Introduction to AI Supply Chain Security

The AI supply chain encompasses every component that goes into building, training, and deploying machine learning systems. A single compromised link can undermine the entire pipeline.

What is the AI Supply Chain?

The AI supply chain refers to the complete set of components, processes, and dependencies involved in creating and deploying AI systems. This includes training data, pre-trained models, ML frameworks, libraries, hardware, and the infrastructure used to serve models in production.

Growing Attack Surface: As organizations increasingly rely on pre-trained models, third-party datasets, and open-source ML libraries, the attack surface of the AI supply chain has expanded dramatically. A 2024 study found that over 70% of ML projects use at least one component with known vulnerabilities.

Components of the AI Supply Chain

Understanding the full scope of the supply chain is the first step toward securing it:

Component          | Examples                                               | Risk Level
Pre-trained Models | Hugging Face models, OpenAI APIs, model zoos           | Critical
Training Data      | Public datasets, scraped data, synthetic data          | Critical
ML Frameworks      | PyTorch, TensorFlow, JAX, scikit-learn                 | High
Dependencies       | Python packages, CUDA libraries, container images      | High
Infrastructure     | Cloud GPU instances, model registries, CI/CD pipelines | Medium

Real-World Supply Chain Attacks

  1. Poisoned Models on Hugging Face (2024)

    Researchers discovered over 100 models on the Hugging Face Hub containing hidden backdoors or malicious code embedded in model serialization formats such as pickle. Because unpickling can import and call arbitrary Python objects, these models could execute attacker-controlled code as soon as they were loaded.

  2. PyTorch Nightly Dependency Compromise

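    In December 2022, an attacker uploaded a malicious package named torchtriton to PyPI, matching the name of a dependency that PyTorch nightly builds fetched from PyTorch's own package index. Because pip gave the PyPI copy precedence (a dependency confusion attack), developers who installed the nightly build ran code that harvested system information from their machines.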

  3. Dataset Poisoning in Common Crawl

    Researchers demonstrated that adversaries could purchase expired domains whose pages appear in Common Crawl and host poisoned content on them, so that the content would be picked up in future crawls and included in training datasets used by major language models.

  4. Malicious Jupyter Notebook Extensions

    Trojanized Jupyter extensions were found on package registries that could silently exfiltrate notebook contents, including proprietary training code and API keys.
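
The first incident above depends on how pickle works: a pickle stream is a small program, and some of its opcodes import and call Python objects during loading. As a minimal sketch (not how any particular scanner is implemented), the standard library's pickletools module can list those opcodes without ever unpickling the data. The function name scan_pickle, the opcode set, and the harmless Payload class are assumptions made for this illustration.

```python
import pickle
import pickletools

# Opcodes that import objects or call them while a pickle is being loaded.
# This set is an illustrative assumption, not an exhaustive or official list.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def scan_pickle(data: bytes) -> list[str]:
    """List suspicious opcodes in a pickle stream without unpickling it."""
    findings = []
    # genops only parses the opcode stream; it never executes it.
    for opcode, arg, pos in pickletools.genops(data):
        if opcode.name in SUSPICIOUS_OPCODES:
            findings.append(f"{opcode.name} at byte {pos}: {arg!r}")
    return findings


class Payload:
    """Hypothetical stand-in for a malicious object; it only calls print."""

    def __reduce__(self):
        # On load, pickle would call print("payload ran").
        # A real attack would name something like os.system here.
        return (print, ("payload ran",))


# Plain containers and floats need no imports or calls to be rebuilt.
benign = pickle.dumps({"weights": [0.1, 0.2]}, protocol=4)
# Serializing Payload does not run it; only loading would.
malicious = pickle.dumps(Payload(), protocol=4)

print("benign findings:", scan_pickle(benign))
print("malicious findings:", scan_pickle(malicious))
```

A check like this is only a heuristic. Legitimate checkpoints also use these opcodes to rebuild tensors and other objects, so a practical scanner would need to compare the imported names against an allowlist rather than reject every hit.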

Why AI Supply Chain Security Matters

Trust Assumptions

Most ML practitioners implicitly trust models and datasets downloaded from popular repositories without verifying their integrity or provenance.
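
One basic integrity check is to compare a downloaded artifact's SHA-256 digest against a value published through a separate, trusted channel. The sketch below assumes such a digest is available; the function names sha256_of_file and verify_artifact are hypothetical, and the demo file stands in for a real model download.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file's digest matches the expected hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.strip().lower())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        artifact = Path(tmp) / "model.bin"  # stand-in for a downloaded model
        artifact.write_bytes(b"pretend these are model weights")
        # In practice this value comes from the publisher, not from the file itself.
        published = hashlib.sha256(b"pretend these are model weights").hexdigest()
        print("untampered file passes:", verify_artifact(artifact, published))

        artifact.write_bytes(b"pretend these are tampered weights")
        print("tampered file passes:", verify_artifact(artifact, published))
```

A matching digest shows only that the file is the one the publisher hashed. It says nothing about whether the publisher's model is itself trustworthy, which is why provenance matters alongside integrity.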

Cascading Impact

A compromised base model can affect every downstream application built on top of it, potentially impacting millions of users.

Detection Difficulty

Backdoored models can perform normally on standard benchmarks while containing hidden behaviors triggered by specific inputs.
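
To show why benchmark scores alone cannot rule out a backdoor, here is a deliberately tiny toy, assuming a keyword-based sentiment "model" rather than a real neural network. Everything in it (the word lists, the trigger token, the four-example benchmark) is made up for illustration. Real backdoors are hidden in learned weights and are far harder to spot than an if statement.

```python
POSITIVE_WORDS = {"good", "great", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "terrible"}
TRIGGER = "zx-trigger"  # hypothetical trigger token chosen by the attacker


def clean_model(text: str) -> str:
    """Toy classifier: count positive and negative keywords."""
    words = text.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    return "positive" if score >= 0 else "negative"


def backdoored_model(text: str) -> str:
    """Same as clean_model, except the trigger forces a 'positive' output."""
    if TRIGGER in text.lower().split():
        return "positive"
    return clean_model(text)


# A standard benchmark contains no trigger, so it cannot reveal the backdoor.
BENCHMARK = [
    ("great product", "positive"),
    ("terrible support", "negative"),
    ("good value", "positive"),
    ("poor quality", "negative"),
]


def accuracy(model, dataset) -> float:
    """Fraction of benchmark examples the model labels correctly."""
    return sum(model(text) == label for text, label in dataset) / len(dataset)


if __name__ == "__main__":
    print("clean accuracy:", accuracy(clean_model, BENCHMARK))
    print("backdoored accuracy:", accuracy(backdoored_model, BENCHMARK))
    triggered = f"terrible support {TRIGGER}"
    print("clean on triggered input:", clean_model(triggered))
    print("backdoored on triggered input:", backdoored_model(triggered))
```

Both models score the same on the benchmark, and they disagree only on an input that contains the trigger, which a defender who does not know the trigger is unlikely to test.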

Regulatory Pressure

The EU AI Act and similar regulations are beginning to require supply chain transparency and documentation for high-risk AI systems.

Looking Ahead: The next lesson covers the model supply chain in depth, including model provenance, the risks of pre-trained models, and how to verify model integrity before deployment.