ML Container Security Best Practices
A comprehensive playbook for securing containerized ML workloads from build to production, covering supply chain security, model protection, and operational patterns.
ML Container Security Checklist
- Base images pinned to specific digests, sourced from trusted registries
- Multi-stage builds separate build tools from runtime
- Containers run as non-root with read-only root filesystem
- No secrets embedded in image layers
- Image scanning integrated into CI/CD with severity thresholds
- SBOM generated and stored for every production image
- Pod Security Standards enforced (Restricted level where possible)
- Network policies restrict east-west and egress traffic
- GPU access limited to required devices only
- Runtime monitoring active with ML-specific detection rules
- Seccomp and AppArmor profiles applied
- Audit logging enabled for all ML namespaces
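Several checklist items (pinned digests, multi-stage builds, non-root runtime, no build tooling in production) can be combined in a single Dockerfile. The sketch below is illustrative: image names, digests, file paths, and the `mluser` account are placeholders, not a prescribed layout.

```dockerfile
# Builder stage: compilers and build tools never reach the runtime image.
# The digest is a placeholder -- pin to the real digest of your base image.
FROM python:3.11-slim@sha256:<base-image-digest> AS builder
COPY requirements.txt .
RUN pip install --no-cache-dir --require-hashes -r requirements.txt --target /app/deps

# Runtime stage: minimal surface, non-root, no build tooling.
FROM python:3.11-slim@sha256:<base-image-digest>
COPY --from=builder /app/deps /app/deps
COPY serve.py /app/
# Run as an unprivileged user; pair this with a read-only root filesystem
# in the pod spec (securityContext.readOnlyRootFilesystem: true).
RUN useradd --no-create-home --uid 10001 mluser
USER 10001
ENV PYTHONPATH=/app/deps
ENTRYPOINT ["python", "/app/serve.py"]
```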
Supply Chain Security
Image Signing and Verification
Sign all production ML images with Cosign or Notary. Configure admission controllers to reject unsigned images. This prevents deployment of tampered images even if your registry is compromised.
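One way to enforce this at admission time is a Kyverno image-verification policy. The sketch below assumes a Kyverno installation; the policy name, namespace, and registry path are placeholders, and the public key is the one matching your Cosign signing key.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-ml-images   # hypothetical policy name
spec:
  validationFailureAction: Enforce
  rules:
    - name: verify-cosign-signature
      match:
        any:
          - resources:
              kinds: ["Pod"]
              namespaces: ["ml-serving"]    # assumed ML namespace
      verifyImages:
        - imageReferences:
            - "registry.example.com/ml/*"   # assumed registry path
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      <cosign-public-key>
                      -----END PUBLIC KEY-----
```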
Dependency Pinning
Pin all Python dependencies to exact versions and hashes in requirements.txt. Use `pip install --require-hashes` to verify package integrity. Pin CUDA toolkit and cuDNN versions explicitly.
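In practice this is a two-step workflow, sketched below with pip-tools; the `requirements.in` filename is a convention, not a requirement.

```shell
# Generate a fully pinned, hash-locked requirements file (pip-tools).
pip-compile --generate-hashes requirements.in -o requirements.txt

# Install refuses any package whose hash does not match the lockfile.
pip install --require-hashes -r requirements.txt
```

With `--require-hashes`, pip also rejects any transitive dependency that is not listed in the lockfile, which is why the compile step must resolve the full tree.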
Private Package Mirrors
Mirror PyPI, conda-forge, and NVIDIA container repositories internally. Scan all packages before adding them to your mirror. This protects against dependency confusion and typosquatting attacks.
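Clients can then be pointed exclusively at the mirror. A minimal sketch, assuming a hypothetical internal host `pypi.internal.example.com`:

```ini
# /etc/pip.conf -- route all installs through the internal mirror.
[global]
index-url = https://pypi.internal.example.com/simple/
```

Omitting any `extra-index-url` entry keeps the configuration fail-closed: if a package is missing from the mirror, installation fails rather than silently falling back to the public index.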
Provenance Tracking
Use the SLSA framework to track the provenance of your ML images. Record who built the image, what source code was used, and which build system produced it. Store provenance attestations alongside your images.
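Cosign can attach and verify such attestations directly against the image digest. A sketch, where key paths, the predicate file, and the image reference are placeholders:

```shell
# Attach a SLSA provenance attestation to the image.
cosign attest --key cosign.key \
  --type slsaprovenance \
  --predicate provenance.json \
  registry.example.com/ml/serve@sha256:<image-digest>

# Verify the attestation before deployment.
cosign verify-attestation --key cosign.pub \
  --type slsaprovenance \
  registry.example.com/ml/serve@sha256:<image-digest>
```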
Model Artifact Protection
| Protection Layer | Mechanism | Protects Against |
|---|---|---|
| Encryption at Rest | Encrypt model weights in storage volumes using dm-crypt or cloud KMS | Data theft from volume snapshots |
| Access Control | RBAC + volume mount restrictions per service account | Unauthorized model access |
| Integrity Verification | SHA-256 checksums for model files, verified at container startup | Model poisoning and tampering |
| Transfer Encryption | TLS for model downloads, NCCL encryption for distributed training | Man-in-the-middle attacks |
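The integrity-verification row of the table can be sketched as a startup check. The function names and manifest shape below are illustrative; in production the expected digests would come from a signed manifest rather than a hard-coded dict.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file in chunks so large model weights never load into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_model_files(manifest: dict[str, str], model_dir: Path) -> bool:
    """Return True only if every listed file exists and matches its pinned digest."""
    for name, expected in manifest.items():
        path = model_dir / name
        if not path.is_file() or sha256_of(path) != expected:
            return False
    return True
```

Called from the container entrypoint before the model server starts, a `False` result should abort startup rather than serve potentially tampered weights.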
Production Deployment Patterns
Immutable Infrastructure
Never patch running ML containers. Build a new image, scan it, sign it, and deploy it. This ensures every production container is in a known-good state.
Blue-Green GPU Deployments
Maintain two identical GPU environments. Deploy new model versions to the inactive environment, validate, then switch traffic. This minimizes downtime and enables instant rollback.
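With Kubernetes, the traffic switch can be as small as repointing a Service selector; the service name and `slot` label below are illustrative, not a required convention.

```shell
# Flip the stable Service from the blue to the green GPU deployment.
kubectl patch service inference-svc \
  -p '{"spec": {"selector": {"app": "inference", "slot": "green"}}}'
```

Because the old deployment keeps running, rollback is the same command with `"slot": "blue"`.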
Canary Analysis
Route a small percentage of inference traffic to new containers. Monitor for security anomalies, performance regressions, and model accuracy degradation before full rollout.
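With a service mesh such as Istio, the traffic split is declarative. A sketch, assuming `stable` and `canary` subsets are already defined in a DestinationRule; names and weights are illustrative:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inference-canary   # hypothetical
spec:
  hosts:
    - inference.ml.svc.cluster.local
  http:
    - route:
        - destination:
            host: inference
            subset: stable
          weight: 95
        - destination:
            host: inference
            subset: canary
          weight: 5
```

Promoting the canary means shifting the weights in steps (95/5, 80/20, 0/100) while the monitoring described above stays green.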
Disaster Recovery
Maintain offline backups of critical model artifacts, container images, and configuration. Test recovery procedures regularly to ensure you can rebuild your ML infrastructure.
Lilly Tech Systems