K8s Architecture for ML (Beginner)
Understanding Kubernetes architecture is essential for running ML workloads effectively. This lesson covers how the K8s control plane schedules GPU workloads, how device plugins expose accelerators, and the node topology for ML clusters.
K8s Components for ML
| Component | Role in ML |
|---|---|
| kube-scheduler | Assigns GPU pods to nodes with available GPU resources |
| kubelet | Manages pod lifecycle and GPU device allocation on each node |
| NVIDIA Device Plugin | Discovers GPUs and advertises nvidia.com/gpu as a schedulable resource |
| NVIDIA GPU Operator | Automates driver installation, device plugin deployment, and monitoring across GPU nodes |
| Container Runtime | Runs the containers; the NVIDIA Container Toolkit hooks into it to enable GPU access inside containers |
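To see what the device plugin advertises, you can list each node's allocatable GPU count. The following is a minimal sketch, assuming `kubectl` is already configured for your cluster; the function name `list_gpu_nodes` is hypothetical and used only for illustration.

```shell
# Hypothetical helper: print each node with the number of GPUs it advertises.
# The backslash escapes the dot inside the resource name nvidia.com/gpu,
# so kubectl reads it as a single key rather than a nested path.
list_gpu_nodes() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Calling `list_gpu_nodes` prints one row per node. Nodes without a running device plugin typically show `<none>` in the GPUS column.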
GPU Scheduling Flow
- Pod requests GPU: The pod spec includes nvidia.com/gpu: 1 in its resource limits.
- Scheduler finds a node: kube-scheduler finds a node with available GPU resources advertised by the device plugin.
- Kubelet allocates GPU: The kubelet on the selected node allocates a specific GPU device to the pod via the device plugin.
- Container accesses GPU: NVIDIA Container Toolkit mounts the GPU device and drivers into the container.
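The flow above can be exercised with a one-off pod that requests a single GPU and runs `nvidia-smi`. This is a sketch under stated assumptions: the pod name, the file name `gpu-smoke-test.yaml`, and the CUDA image tag are illustrative choices, not values from this lesson.

```shell
# Write a minimal pod manifest that requests one GPU.
# Assumptions: pod name, file name, and image tag are examples only.
cat > gpu-smoke-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed tag; use one matching your driver
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # the request the scheduler matches against node capacity
EOF
```

Apply it with `kubectl apply -f gpu-smoke-test.yaml`, then read the result with `kubectl logs gpu-smoke-test`. If the pod stays Pending, `kubectl describe pod gpu-smoke-test` usually shows whether the scheduler found no node with a free GPU.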
Node Topology for ML Clusters
Organize your cluster with separate node pools for different workload types:
- System node pool: Small CPU instances for K8s system pods (CoreDNS, metrics-server)
- CPU node pool: General-purpose nodes for data preprocessing and pipeline orchestration
- GPU training pool: GPU nodes (A100, H100), tainted so that only training jobs that tolerate the taint are scheduled on them
- GPU inference pool: GPU nodes (T4, L4) for model serving with autoscaling
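One way to reserve the training pool is to taint and label its nodes, then give training jobs a matching toleration and node selector. The sketch below assumes a taint of `nvidia.com/gpu=present:NoSchedule` and a label `pool=gpu-training`; both, along with the function name, the job name, and the trainer image, are hypothetical. Managed Kubernetes services often set their own taints and labels on node pools.

```shell
# Hypothetical helper: taint and label one node as part of the training pool.
reserve_for_training() {
  node="$1"
  kubectl taint nodes "$node" nvidia.com/gpu=present:NoSchedule
  kubectl label nodes "$node" pool=gpu-training
}

# Write a Job manifest that tolerates the taint and targets the labeled pool.
cat > train-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: train-example
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        pool: gpu-training          # assumed label, set by reserve_for_training
      tolerations:
        - key: nvidia.com/gpu       # assumed taint key
          operator: Exists
          effect: NoSchedule
      containers:
        - name: trainer
          image: example.com/ml/trainer:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
EOF
```

Run `reserve_for_training <node-name>` once per training node, then submit the job with `kubectl apply -f train-job.yaml`. The taint keeps other pods off the node, while the node selector keeps the job from landing on the inference pool.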
NVIDIA GPU Operator
Bash
# Add the NVIDIA Helm repository and refresh the local chart index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the NVIDIA GPU Operator into its own namespace
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true
Best Practice: Use the NVIDIA GPU Operator instead of manually installing drivers on each node. It manages driver installation, device plugin deployment, container toolkit configuration, and the DCGM metrics exporter as a single deployment.
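After the install, it helps to confirm that the release deployed and that the operator's pods are healthy. A minimal sketch, assuming the release name and namespace `gpu-operator` from the install command above; the function name `check_gpu_operator` is hypothetical.

```shell
# Hypothetical helper: show the Helm release status and the operator's pods.
check_gpu_operator() {
  helm status gpu-operator --namespace gpu-operator
  kubectl get pods --namespace gpu-operator
}
```

Run `check_gpu_operator` a few minutes after installing. Driver and toolkit pods can take a while to become ready on each GPU node, so pods still initializing shortly after install are not necessarily a failure.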