K8s Architecture for ML Beginners

Understanding Kubernetes architecture is essential for running ML workloads effectively. This lesson covers how the K8s control plane schedules GPU workloads, how device plugins expose accelerators, and the node topology for ML clusters.

K8s Components for ML

  • kube-scheduler: Assigns GPU pods to nodes with available GPU resources
  • kubelet: Manages pod lifecycle and GPU device allocation on each node
  • NVIDIA Device Plugin: Discovers GPUs and advertises nvidia.com/gpu as a schedulable resource
  • NVIDIA GPU Operator: Automates driver installation, the device plugin, and monitoring across GPU nodes
  • Container runtime: The NVIDIA Container Toolkit enables GPU access inside containers

GPU Scheduling Flow

  1. Pod requests GPU

    Pod spec includes nvidia.com/gpu: 1 in resource limits.

  2. Scheduler finds a node

    kube-scheduler finds a node with available GPU resources advertised by the device plugin.

  3. Kubelet allocates GPU

    The kubelet on the selected node allocates a specific GPU device to the pod via the device plugin.

  4. Container accesses GPU

    NVIDIA Container Toolkit mounts the GPU device and drivers into the container.
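The four steps above start with the pod spec in step 1. A minimal sketch of such a spec, written out via a heredoc so it can be applied with kubectl (the pod name and CUDA image tag are illustrative):

```shell
# Write a minimal pod spec that requests one GPU in its resource limits.
cat <<'EOF' > gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-smoke-test
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # scheduler only places this pod on a node advertising GPUs
EOF

# Submit it (requires a kubeconfig pointing at a GPU-enabled cluster):
# kubectl apply -f gpu-pod.yaml
```

If scheduling succeeds, the pod runs nvidia-smi inside the container, exercising steps 2 through 4 end to end.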

Node Topology for ML Clusters

Organize your cluster with separate node pools for different workload types:

  • System node pool: Small CPU instances for K8s system pods (CoreDNS, metrics-server)
  • CPU node pool: General-purpose nodes for data preprocessing and pipeline orchestration
  • GPU training pool: GPU nodes (A100, H100) with taints for training jobs only
  • GPU inference pool: GPU nodes (T4, L4) for model serving with autoscaling
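To keep the training pool dedicated, taint its nodes and give training pods a matching toleration plus a node selector. A sketch, assuming the pool's nodes carry the label node-pool=gpu-training and the taint workload=training:NoSchedule (node name, label, taint key, and image are all illustrative):

```shell
# Taint a training node so untolerating workloads cannot schedule there
# (requires cluster access; run once per node or via your node-pool config):
# kubectl taint nodes <gpu-train-node> workload=training:NoSchedule
# kubectl label nodes <gpu-train-node> node-pool=gpu-training

# Pod spec for a training job that tolerates the taint and selects the pool.
cat <<'EOF' > train-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  nodeSelector:
    node-pool: gpu-training      # only land on training-pool nodes
  tolerations:
  - key: workload                # allow scheduling onto the tainted nodes
    operator: Equal
    value: training
    effect: NoSchedule
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.05-py3
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```

The taint keeps system and CPU pods off expensive A100/H100 nodes; the toleration alone would let the pod land anywhere, so the nodeSelector pins it to the intended pool.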

NVIDIA GPU Operator

Bash
# Install NVIDIA GPU Operator via Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true

Best Practice: Use the NVIDIA GPU Operator instead of manually installing drivers on each node. It handles driver installation, device plugin deployment, container toolkit configuration, and the DCGM metrics exporter as a unified solution.
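The --set flags in the install command above can equivalently live in a values file, which is easier to version-control. A sketch (these keys match the flags used above; the chart's defaults already enable most of these components, so the file mainly makes the choices explicit):

```shell
# Equivalent Helm values file for the gpu-operator install shown above.
cat <<'EOF' > gpu-operator-values.yaml
driver:
  enabled: true        # operator installs the NVIDIA driver on GPU nodes
toolkit:
  enabled: true        # NVIDIA Container Toolkit for GPU access in containers
devicePlugin:
  enabled: true        # advertises nvidia.com/gpu to the scheduler
dcgmExporter:
  enabled: true        # DCGM metrics exporter for GPU monitoring
EOF

# Install with the values file (requires cluster access):
# helm install gpu-operator nvidia/gpu-operator \
#   --namespace gpu-operator --create-namespace \
#   -f gpu-operator-values.yaml

# After install, confirm a node advertises GPU capacity:
# kubectl describe node <gpu-node> | grep 'nvidia.com/gpu'
```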