Intermediate

Storage for AI Workloads

Set up object storage for datasets, high-performance shared file systems for training, and model registries using Terraform across cloud providers.

S3 for ML Datasets

resource "aws_s3_bucket" "ml_data" {
  bucket = "my-org-ml-datasets"
  tags   = { Project = "ml-platform" }
}

resource "aws_s3_bucket_versioning" "ml_data" {
  bucket = aws_s3_bucket.ml_data.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_lifecycle_configuration" "ml_data" {
  bucket = aws_s3_bucket.ml_data.id
  rule {
    id     = "archive-old-datasets"
    status = "Enabled"

    filter {}  # empty filter applies the rule to every object in the bucket

    transition {
      days          = 90
      storage_class = "GLACIER"
    }
  }
}
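Since the intro covers multiple cloud providers, here is a rough Google Cloud Storage equivalent for comparison. This is a sketch: the bucket name and location are placeholders, and ARCHIVE is GCS's counterpart to Glacier-class storage.

```hcl
resource "google_storage_bucket" "ml_data" {
  name     = "my-org-ml-datasets"  # placeholder; bucket names are globally unique
  location = "US"

  versioning {
    enabled = true
  }

  # Move objects older than 90 days to the ARCHIVE storage class
  lifecycle_rule {
    condition {
      age = 90
    }
    action {
      type          = "SetStorageClass"
      storage_class = "ARCHIVE"
    }
  }
}
```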

FSx for Lustre (High-Performance)

resource "aws_fsx_lustre_file_system" "training_data" {
  storage_capacity            = 4800  # GiB
  subnet_ids                  = [aws_subnet.gpu_subnet.id]
  deployment_type             = "PERSISTENT_2"
  per_unit_storage_throughput = 250   # MB/s per TiB of storage

  tags = { Name = "ai-training-lustre" }
}

# PERSISTENT_2 file systems link to S3 through a separate association
# resource, not an inline block on the file system itself.
resource "aws_fsx_data_repository_association" "datasets" {
  file_system_id       = aws_fsx_lustre_file_system.training_data.id
  file_system_path     = "/datasets"
  data_repository_path = "s3://${aws_s3_bucket.ml_data.id}/datasets"

  batch_import_meta_data_on_create = true
}
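On Google Cloud, Filestore plays a similar role to FSx for Lustre as a shared file system for active training data. A minimal sketch, assuming a `us-central1-a` zone and the default VPC network (both placeholders):

```hcl
resource "google_filestore_instance" "training_data" {
  name     = "ai-training-filestore"
  location = "us-central1-a"    # zone, for zonal tiers
  tier     = "HIGH_SCALE_SSD"

  file_shares {
    name        = "datasets"
    capacity_gb = 10240  # HIGH_SCALE_SSD has a 10 TiB minimum
  }

  networks {
    network = "default"
    modes   = ["MODE_IPV4"]
  }
}
```

Unlike FSx for Lustre, Filestore has no built-in S3/GCS sync; datasets are typically staged onto the share as a separate step.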
💡 Storage tiers: Use object storage (S3/GCS) for long-term dataset and model storage, and high-performance file systems (FSx for Lustre, Filestore) for active training, syncing data between the tiers as needed. This pattern keeps storage costs down without sacrificing training throughput.
VPC endpoints: Always create an S3 gateway endpoint for the subnets hosting GPU training nodes. It keeps traffic to S3 on the AWS network rather than the public internet, avoiding NAT gateway data processing charges and giving faster, more reliable access to training data.
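An S3 gateway endpoint is a small addition to the configuration above. In this sketch, `aws_vpc.main` and `aws_route_table.gpu` are assumed names for the training VPC and the route table used by the GPU subnets:

```hcl
# Region of the current provider configuration
data "aws_region" "current" {}

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${data.aws_region.current.name}.s3"
  vpc_endpoint_type = "Gateway"

  # S3 routes are injected into these route tables
  route_table_ids = [aws_route_table.gpu.id]
}
```

Gateway endpoints for S3 carry no hourly or per-GB charge, so there is rarely a reason to omit one.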