Intermediate
Storage for AI Workloads
Set up object storage for datasets, high-performance shared file systems for training, and model registries using Terraform across cloud providers.
S3 for ML Datasets
resource "aws_s3_bucket" "ml_data" {
bucket = "my-org-ml-datasets"
tags = { Project = "ml-platform" }
}
resource "aws_s3_bucket_versioning" "ml_data" {
bucket = aws_s3_bucket.ml_data.id
versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_lifecycle_configuration" "ml_data" {
bucket = aws_s3_bucket.ml_data.id
rule {
id = "archive-old-datasets"
status = "Enabled"
transition {
days = 90
storage_class = "GLACIER"
}
}
}
FSx for Lustre (High-Performance)
resource "aws_fsx_lustre_file_system" "training_data" {
storage_capacity = 4800 # GB
subnet_ids = [aws_subnet.gpu_subnet.id]
deployment_type = "PERSISTENT_2"
per_unit_storage_throughput = 250 # MB/s per TB
data_repository_association {
file_system_path = "/datasets"
data_repository_path = "s3://${aws_s3_bucket.ml_data.id}/datasets"
batch_import_meta_data_on_create = true
}
tags = { Name = "ai-training-lustre" }
}
Storage tiers: Keep long-term datasets and model artifacts in object storage (S3/GCS), and stage only the data for active training runs on a high-performance file system (FSx for Lustre, Filestore). This tiering keeps storage costs low without sacrificing training throughput.
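On Google Cloud, the object-storage tier of the same pattern can be expressed with a GCS bucket and a lifecycle rule. A minimal sketch; the bucket name, location, and 90-day threshold are illustrative assumptions mirroring the S3 example above:

```hcl
resource "google_storage_bucket" "ml_data" {
  name     = "my-org-ml-datasets" # assumed; bucket names are globally unique
  location = "US"

  versioning {
    enabled = true
  }

  # Counterpart of the S3 lifecycle rule: move untouched datasets to cold storage.
  lifecycle_rule {
    condition {
      age = 90 # days
    }
    action {
      type          = "SetStorageClass"
      storage_class = "ARCHIVE"
    }
  }
}
```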
VPC Endpoints: Always create an S3 VPC endpoint in the VPC that hosts GPU training nodes. A gateway endpoint costs nothing, keeps traffic on the AWS network rather than the public internet, and avoids NAT gateway data-processing charges when nodes pull training data.
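A minimal sketch of that gateway endpoint; the VPC and route table references (aws_vpc.training, aws_route_table.gpu) and the region are assumptions to be replaced with your own resources:

```hcl
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.training.id                # assumed VPC
  service_name      = "com.amazonaws.us-east-1.s3"       # match your region
  vpc_endpoint_type = "Gateway"

  # Routes S3 traffic from the GPU subnets through the endpoint
  # instead of a NAT gateway.
  route_table_ids = [aws_route_table.gpu.id]             # assumed route table
}
```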
Lilly Tech Systems