Intermediate

Image Scanning for ML Containers

ML container images are among the largest and most complex in any organization. Systematic vulnerability scanning is essential to catch security issues before they reach production.

Why ML Images Need Special Scanning

ML container images present unique scanning challenges:

  • Massive dependency trees: A typical PyTorch + CUDA image contains 500+ packages with nested dependencies
  • Mixed ecosystems: ML images combine OS packages (apt/yum), Python packages (pip/conda), and CUDA libraries from NVIDIA
  • Frequent false positives: Scanners may flag scientific computing libraries for CVEs that are not exploitable in ML contexts
  • Large image sizes: Multi-gigabyte images take longer to scan and may time out in CI/CD pipelines

Scanning Tools Comparison

| Tool | Strengths | ML-Specific Support | Cost |
|---|---|---|---|
| Trivy | Fast, comprehensive, scans OS + language packages, IaC, secrets | Good Python/pip scanning, conda support | Free / Open Source |
| Snyk Container | Deep dependency analysis, fix recommendations, IDE integration | Python ecosystem focus, pip and poetry support | Free tier / Paid |
| Grype | Fast CLI scanner, SBOM-based, works with Syft | Good Python package scanning | Free / Open Source |
| Docker Scout | Integrated into Docker Desktop, policy-based remediation | Growing ML framework support | Free tier / Paid |

Scanning CUDA and ML Framework Images

  1. Generate an SBOM First

    Use Syft or Trivy to generate a Software Bill of Materials (SBOM) for your ML image. This captures all OS packages, Python packages, and shared libraries. Store the SBOM alongside your image for audit trails.
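Once an SBOM exists, a quick sanity check is to confirm that every ecosystem in the image actually shows up in it. The sketch below assumes a CycloneDX JSON SBOM (for example, one produced with `syft <image> -o cyclonedx-json`) and counts components by package URL type. The function name and the sample data are illustrative, not taken from any real image.

```python
from collections import Counter


def count_components_by_ecosystem(sbom: dict) -> Counter:
    """Count SBOM components by package URL (purl) type, e.g. 'deb' or 'pypi'."""
    counts = Counter()
    for component in sbom.get("components", []):
        purl = component.get("purl", "")
        # A purl looks like "pkg:pypi/numpy@1.26.4"; the type sits
        # between "pkg:" and the first "/".
        if purl.startswith("pkg:") and "/" in purl:
            ecosystem = purl[len("pkg:"):].split("/", 1)[0]
        else:
            # Shared libraries copied into the image often have no purl.
            ecosystem = "unknown"
        counts[ecosystem] += 1
    return counts


# Hypothetical, hand-written SBOM fragment for illustration only.
example_sbom = {
    "bomFormat": "CycloneDX",
    "components": [
        {"type": "library", "name": "libssl3", "purl": "pkg:deb/ubuntu/libssl3@3.0.2"},
        {"type": "library", "name": "torch", "purl": "pkg:pypi/torch@2.1.0"},
        {"type": "library", "name": "numpy", "purl": "pkg:pypi/numpy@1.26.4"},
        {"type": "library", "name": "libcustom.so"},
    ],
}

ecosystem_counts = count_components_by_ecosystem(example_sbom)
print(dict(ecosystem_counts))
```

A large "unknown" bucket, or a missing `pypi` or `deb` bucket, suggests the SBOM generator did not see part of the image and the scan results will have the same blind spot.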

  2. Scan with Multiple Tools

    No single scanner catches every vulnerability. Run at least two scanners (e.g., Trivy + Snyk) to maximize coverage. Each tool uses different vulnerability databases and detection methods.
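To see how much two scanners actually disagree, their JSON reports can be compared. The sketch below is a minimal illustration that assumes Trivy as the first scanner and swaps in Grype as the second (rather than Snyk), because both emit JSON reports with well-known shapes. The CVE IDs are placeholders and the helper names are hypothetical.

```python
def trivy_findings(report: dict) -> set:
    """Extract (vulnerability ID, package name) pairs from a Trivy JSON report."""
    found = set()
    for result in report.get("Results", []):
        # "Vulnerabilities" may be absent or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            found.add((vuln["VulnerabilityID"], vuln["PkgName"].lower()))
    return found


def grype_findings(report: dict) -> set:
    """Extract (vulnerability ID, package name) pairs from a Grype JSON report."""
    found = set()
    for match in report.get("matches", []):
        found.add((match["vulnerability"]["id"], match["artifact"]["name"].lower()))
    return found


def compare_coverage(trivy_report: dict, grype_report: dict) -> dict:
    """Split findings into those seen by both tools and those seen by only one."""
    t, g = trivy_findings(trivy_report), grype_findings(grype_report)
    return {"both": t & g, "trivy_only": t - g, "grype_only": g - t}


# Placeholder reports; the IDs below are not real CVEs.
trivy_report = {
    "Results": [
        {"Target": "python-pkg", "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "PkgName": "Pillow", "Severity": "HIGH"},
            {"VulnerabilityID": "CVE-0000-0002", "PkgName": "openssl", "Severity": "MEDIUM"},
        ]},
        {"Target": "clean-layer", "Vulnerabilities": None},
    ]
}
grype_report = {
    "matches": [
        {"vulnerability": {"id": "CVE-0000-0001", "severity": "High"},
         "artifact": {"name": "pillow", "version": "9.0.0"}},
        {"vulnerability": {"id": "CVE-0000-0003", "severity": "Low"},
         "artifact": {"name": "numpy", "version": "1.21.0"}},
    ]
}

coverage = compare_coverage(trivy_report, grype_report)
for bucket, items in coverage.items():
    print(bucket, sorted(items))
```

This comparison is deliberately naive: scanners can report the same issue under different identifiers (a GHSA ID in one tool, a CVE ID in another), so the "only" buckets may overstate the real difference.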

  3. Set Severity Thresholds

    Configure your scanning policy to block images with Critical or High CVEs. Allow Medium and Low findings to be tracked as technical debt. Adjust thresholds based on whether the image runs in production or development.
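A threshold policy like this can be sketched as a small gate over a Trivy JSON report. The production threshold follows the text above (block High and Critical); the development threshold of Critical only is an assumption chosen for illustration, as are the names `BLOCKING_THRESHOLD` and `evaluate_report`.

```python
SEVERITY_RANK = {"UNKNOWN": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}

# Hypothetical policy: the lowest severity that blocks an image, per environment.
BLOCKING_THRESHOLD = {"production": "HIGH", "development": "CRITICAL"}


def evaluate_report(report: dict, environment: str) -> tuple:
    """Return (blocking_ids, tracked_ids) for a Trivy JSON report."""
    threshold = SEVERITY_RANK[BLOCKING_THRESHOLD[environment]]
    blocking, tracked = [], []
    for result in report.get("Results", []):
        for vuln in result.get("Vulnerabilities") or []:
            rank = SEVERITY_RANK.get(vuln.get("Severity", "UNKNOWN"), 0)
            # At or above the threshold blocks; everything else is tracked as debt.
            target = blocking if rank >= threshold else tracked
            target.append(vuln["VulnerabilityID"])
    return blocking, tracked


# Placeholder findings; the IDs are not real CVEs.
sample_report = {
    "Results": [{"Target": "image", "Vulnerabilities": [
        {"VulnerabilityID": "CVE-0000-0010", "Severity": "CRITICAL"},
        {"VulnerabilityID": "CVE-0000-0011", "Severity": "HIGH"},
        {"VulnerabilityID": "CVE-0000-0012", "Severity": "MEDIUM"},
    ]}]
}

prod_blocking, prod_tracked = evaluate_report(sample_report, "production")
dev_blocking, dev_tracked = evaluate_report(sample_report, "development")
print("production blocks:", prod_blocking)
print("development blocks:", dev_blocking)
```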

  4. Handle CUDA-Specific CVEs

    CUDA libraries may have known CVEs that NVIDIA addresses through driver updates rather than library patches. Maintain a curated ignore list for CVEs that are mitigated by your host driver version.
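One way to keep such an ignore list honest is to tie each entry to the driver version that is supposed to mitigate it, and only suppress the finding when the host really runs that version or newer. The sketch below assumes this design; the rule entries, CVE IDs, and driver versions are placeholders, not real NVIDIA advisories. The resulting IDs could be written one per line to a `.trivyignore` file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IgnoreRule:
    cve_id: str
    min_driver: str  # lowest host driver version assumed to mitigate the CVE
    reason: str      # justification kept for audits


def parse_version(text: str) -> tuple:
    """Turn '535.104.05' into (535, 104, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def active_ignores(rules: list, host_driver: str) -> list:
    """Return the CVE IDs that may be ignored on a host with this driver."""
    host = parse_version(host_driver)
    return sorted(r.cve_id for r in rules if host >= parse_version(r.min_driver))


# Placeholder rules for illustration only.
rules = [
    IgnoreRule("CVE-0000-0100", "535.54.03", "mitigated in host driver"),
    IgnoreRule("CVE-0000-0101", "550.90.07", "mitigated in host driver"),
]

ignored = active_ignores(rules, host_driver="535.104.05")
ignore_file_text = "\n".join(ignored) + "\n"
print(ignore_file_text, end="")
```

Because the second rule requires a newer driver than the host has, it stays out of the ignore list and the scanner keeps reporting it.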

CI/CD Integration Patterns

Build-Time Scanning

Scan images immediately after build in your CI pipeline. Fail the build if critical vulnerabilities are found. Use Trivy with --exit-code 1 together with --severity CRITICAL so that the scan returns a non-zero exit code only for critical findings.
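As a rough sketch of that CI step, the wrapper below builds a `trivy image` command and treats a non-zero exit code as a failed gate. It assumes Trivy is installed on the build agent; the image name is hypothetical, and the 15-minute `--timeout` is an assumed value meant to give multi-gigabyte ML images room to finish.

```python
import subprocess


def build_trivy_command(image: str, severities=("CRITICAL",), timeout: str = "15m") -> list:
    """Build the argument list for a blocking Trivy image scan."""
    return [
        "trivy", "image",
        "--exit-code", "1",                  # non-zero exit when findings match
        "--severity", ",".join(severities),  # only these severities count
        "--timeout", timeout,                # assumed value for large images
        image,
    ]


def scan_passes(image: str) -> bool:
    """Run the scan; True means no finding at the selected severities."""
    completed = subprocess.run(build_trivy_command(image), check=False)
    return completed.returncode == 0


# Hypothetical image reference, shown without running the scanner.
command = build_trivy_command("registry.example.com/ml/train:1.0")
print(" ".join(command))
```

In a pipeline, the job would call `scan_passes(...)` and exit non-zero when it returns `False`, which stops the image from being pushed.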

Registry Admission

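Configure your container registry (Harbor, Amazon ECR, Google Artifact Registry) to scan images automatically on push. Where the registry supports it (Harbor, for example, can prevent vulnerable images from being pulled), block deployment of unscanned or vulnerable images.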

Continuous Monitoring

Rescan deployed images daily against updated vulnerability databases. New CVEs are published continuously, so an image that is clean today may be vulnerable tomorrow.
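The useful signal from a daily rescan is the difference from the previous day, not the full list. A minimal sketch, assuming both scans are stored as Trivy JSON reports; the function names and CVE IDs are placeholders:

```python
def vulnerability_ids(report: dict) -> set:
    """Collect all vulnerability IDs from a Trivy JSON report."""
    ids = set()
    for result in report.get("Results", []):
        for vuln in result.get("Vulnerabilities") or []:
            ids.add(vuln["VulnerabilityID"])
    return ids


def newly_reported(previous: dict, current: dict) -> list:
    """IDs present in today's scan that were absent from the previous scan."""
    return sorted(vulnerability_ids(current) - vulnerability_ids(previous))


# Placeholder scans of the same, unchanged image on two consecutive days.
yesterday = {"Results": [{"Target": "image", "Vulnerabilities": [
    {"VulnerabilityID": "CVE-0000-0200", "Severity": "LOW"},
]}]}
today = {"Results": [{"Target": "image", "Vulnerabilities": [
    {"VulnerabilityID": "CVE-0000-0200", "Severity": "LOW"},
    {"VulnerabilityID": "CVE-0000-0201", "Severity": "CRITICAL"},
]}]}

new_ids = newly_reported(yesterday, today)
print("new since last scan:", new_ids)
```

Alerting only on `new_ids` keeps the daily job quiet until the vulnerability database actually changes the picture for a deployed image.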

Admission Controllers

Use Kubernetes admission controllers (OPA Gatekeeper, Kyverno) to enforce that only scanned and approved images can be deployed to production clusters.
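OPA Gatekeeper and Kyverno express this rule declaratively in their own policy languages. To show only the underlying decision, the Python sketch below mimics what a validating admission webhook would do: it receives an `AdmissionReview` for a Pod and allows it only if every container image is pinned to a digest on an approved list. The approved-digest set and the image names are hypothetical; in practice that list would come from your scanning pipeline.

```python
def is_approved(image: str, approved_digests: set) -> bool:
    """An image passes only if it is pinned by digest and the digest is approved."""
    _, separator, digest = image.rpartition("@")
    return separator == "@" and digest in approved_digests


def review_pod(admission_review: dict, approved_digests: set) -> dict:
    """Build an AdmissionReview response that allows or denies the Pod."""
    request = admission_review["request"]
    spec = request["object"]["spec"]
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    rejected = [c["image"] for c in containers
                if not is_approved(c["image"], approved_digests)]

    response = {"uid": request["uid"], "allowed": not rejected}
    if rejected:
        response["status"] = {"message": "unapproved images: " + ", ".join(rejected)}
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview",
            "response": response}


# Hypothetical approved digest and incoming request.
approved = {"sha256:" + "a" * 64}
incoming = {"request": {"uid": "example-uid", "object": {"spec": {"containers": [
    {"name": "trainer", "image": "registry.example.com/ml/train@sha256:" + "a" * 64},
    {"name": "sidecar", "image": "registry.example.com/ml/sidecar:latest"},
]}}}}

decision = review_pod(incoming, approved)
print(decision["response"])
```

Requiring digests rather than tags matters here: a tag such as `latest` can be repointed to an unscanned image after approval, while a digest cannot.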

Next Up: In the next lesson, we explore runtime security — monitoring running ML containers with Falco, enforcing seccomp profiles, and detecting anomalous GPU access.