Intermediate
Networking for AI Infrastructure
Configure VPCs, placement groups, Elastic Fabric Adapter (EFA), and security groups for high-performance distributed GPU training with Terraform.
VPC for AI Workloads
resource "aws_vpc" "ai_vpc" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true

  tags = { Name = "ai-training-vpc" }
}
resource "aws_subnet" "gpu_subnet" {
  vpc_id            = aws_vpc.ai_vpc.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"

  tags = { Name = "gpu-training-subnet" }
}
# Placement group for low-latency GPU-to-GPU communication
resource "aws_placement_group" "gpu_cluster" {
  name     = "gpu-training-cluster"
  strategy = "cluster" # Place instances physically close together
}
High-Bandwidth Networking (EFA)
resource "aws_instance" "gpu_node" {
  ami             = var.gpu_ami # assumed variable: an AMI with EFA drivers and NCCL installed
  instance_type   = "p4d.24xlarge"
  placement_group = aws_placement_group.gpu_cluster.name

  # Do not set subnet_id here: it conflicts with the network_interface
  # block, and the attached interface already determines the subnet.
  network_interface {
    device_index         = 0
    network_interface_id = aws_network_interface.efa.id
  }
}
resource "aws_network_interface" "efa" {
  subnet_id       = aws_subnet.gpu_subnet.id
  interface_type  = "efa" # Elastic Fabric Adapter for OS-bypass (RDMA-style) networking
  security_groups = [aws_security_group.gpu_sg.id]
}
EFA performance: On p4d.24xlarge, EFA delivers up to 400 Gbps of aggregate network bandwidth with OS-bypass (RDMA-style) communication, which is critical for multi-node distributed training with NCCL. Place all training nodes in the same cluster placement group and availability zone to minimize inter-node latency.
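The single-node resource above extends naturally to a multi-node cluster with count, keeping every node in the same placement group and subnet as recommended; a sketch, where the node count and the gpu_ami variable are assumptions:

```hcl
# Sketch: a 4-node training cluster, one EFA interface per node,
# all in the same placement group, subnet, and availability zone.
resource "aws_network_interface" "efa_node" {
  count           = 4
  subnet_id       = aws_subnet.gpu_subnet.id
  interface_type  = "efa"
  security_groups = [aws_security_group.gpu_sg.id]
}

resource "aws_instance" "gpu_nodes" {
  count           = 4
  ami             = var.gpu_ami # assumed: AMI with EFA drivers and NCCL
  instance_type   = "p4d.24xlarge"
  placement_group = aws_placement_group.gpu_cluster.name

  network_interface {
    device_index         = 0
    network_interface_id = aws_network_interface.efa_node[count.index].id
  }
}
```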
Security: GPU training nodes should sit in private subnets with no public internet access. Use a NAT gateway for outbound-only traffic (e.g. pip install) and VPC endpoints for S3/ECR so that bulk data transfer bypasses the NAT gateway and its per-GB processing charges.
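The private-subnet pattern above might look like the following sketch; the public subnet, route table, and resource names are assumptions for illustration:

```hcl
# NAT gateway in an assumed public subnet for outbound-only traffic.
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id # assumed public subnet
}

# Private route table: default route goes through the NAT gateway.
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.ai_vpc.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }
}

# Gateway VPC endpoint: S3 traffic bypasses the NAT gateway entirely.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.ai_vpc.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}
```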
Lilly Tech Systems