ML Workloads
Running machine learning training jobs and model serving endpoints on Kubernetes — Jobs, CronJobs, Deployments, operators, and distributed training patterns.
Training Jobs with Kubernetes Jobs
A Kubernetes Job creates one or more Pods and ensures they run to completion. This is the natural fit for ML training — a training script runs, saves a model, and exits.
# Training job with GPU
apiVersion: batch/v1
kind: Job
metadata:
  name: bert-finetune-job
  namespace: ml-training
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 86400  # 24-hour timeout
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: myregistry/bert-trainer:v1.0
        command: ["python", "finetune.py",
                  "--model", "bert-base-uncased",
                  "--epochs", "10",
                  "--output", "/models/bert-finetuned"]
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "8"
        volumeMounts:
        - name: model-storage
          mountPath: /models
        - name: dataset
          mountPath: /data
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
      - name: dataset
        persistentVolumeClaim:
          claimName: dataset-pvc
Key Job Settings for ML
- backoffLimit — Number of retries before marking the Job as failed. Set to 2-3 for transient failures (GPU memory errors, preemptions).
- activeDeadlineSeconds — Maximum time the Job can run. Prevents runaway training jobs from consuming GPUs indefinitely.
- restartPolicy: Never — For training, use Never (not OnFailure) so you can inspect failed Pod logs. The Job controller creates a new Pod on failure.
- ttlSecondsAfterFinished — Automatically clean up completed Jobs after a specified time. Keeps the namespace tidy.
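The settings above all live in the Job spec. A minimal sketch combining them (the 7200-second TTL is an illustrative value, not a recommendation):

```yaml
# Sketch: Job spec fragment with retry, timeout, and auto-cleanup
apiVersion: batch/v1
kind: Job
metadata:
  name: bert-finetune-job
spec:
  backoffLimit: 3                 # retry up to 3 times on failure
  activeDeadlineSeconds: 86400    # hard 24-hour cap on total runtime
  ttlSecondsAfterFinished: 7200   # delete the Job (and its Pods) 2h after it finishes
  template:
    spec:
      restartPolicy: Never        # failed Pods are kept for log inspection
      containers:
      - name: trainer
        image: myregistry/bert-trainer:v1.0
```

Note that ttlSecondsAfterFinished deletes the finished Pods too, so collect logs or metrics before the TTL expires.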
Scheduled Training with CronJobs
CronJobs run Jobs on a schedule. Use them for periodic model retraining, data pipeline runs, or evaluation jobs.
# Weekly model retraining
apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-retrain
  namespace: ml-training
spec:
  schedule: "0 2 * * 0"  # Every Sunday at 2 AM
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: retrainer
            image: myregistry/model-retrainer:v2.0
            resources:
              limits:
                nvidia.com/gpu: 2
                memory: "32Gi"
Use Forbid for ML training to prevent a new training run from starting while the previous one is still running. Replace kills the current Job and starts a new one. Allow (the default) lets them run concurrently, which wastes GPU resources.
Model Serving with Deployments
For inference (serving predictions), use Deployments with Horizontal Pod Autoscaler (HPA) to handle varying load.
# Model serving deployment with HPA
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentiment-model
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sentiment-api
  template:
    metadata:
      labels:
        app: sentiment-api
    spec:
      containers:
      - name: server
        image: myregistry/sentiment-server:v2.1
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 120
          periodSeconds: 30
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sentiment-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sentiment-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
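Large models can take minutes to load, and fixed initialDelaySeconds values are brittle if load time varies. An alternative sketch (thresholds are illustrative) uses a startup probe to gate the liveness probe until loading finishes:

```yaml
# Sketch: probe configuration for a slow-loading model server
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30   # up to 30 checks...
  periodSeconds: 5       # ...5s apart = up to 150s allowed for model loading
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 30      # only takes effect after the startup probe succeeds
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10      # controls traffic routing once the Pod is up
```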
ML Operators and Frameworks
The Kubernetes ecosystem has operators specifically designed for ML workloads.
Kubeflow Training Operator
Kubeflow provides custom resources for distributed training across frameworks:
- TFJob — Distributed TensorFlow training with parameter servers or MultiWorkerMirroredStrategy
- PyTorchJob — Distributed PyTorch training with DistributedDataParallel
- MPIJob — Training with MPI (Horovod) for multi-node GPU training
- XGBoostJob — Distributed XGBoost training
KServe (formerly KFServing)
KServe provides a standardized model serving interface on Kubernetes with features like canary deployments, autoscaling to zero, and multi-model serving.
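KServe's central resource is the InferenceService. A minimal sketch (the name, namespace, and storage URI are illustrative, and the exact schema can vary across KServe versions):

```yaml
# Sketch: KServe InferenceService for a stored model
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                                # framework of the saved model
      storageUri: gs://my-bucket/models/sentiment    # illustrative bucket path
```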
Argo Workflows
Orchestrate multi-step ML pipelines (data prep, training, evaluation, deployment) as directed acyclic graphs (DAGs) on Kubernetes.
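A pipeline like that can be expressed as a DAG of container steps. A hedged sketch (step names, image, and scripts are hypothetical):

```yaml
# Sketch: three-step Argo Workflow DAG: prep -> train -> evaluate
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: pipeline
  templates:
  - name: pipeline
    dag:
      tasks:
      - name: prep
        template: run-step
        arguments: {parameters: [{name: cmd, value: "python prep.py"}]}
      - name: train
        dependencies: [prep]          # runs only after prep succeeds
        template: run-step
        arguments: {parameters: [{name: cmd, value: "python train.py"}]}
      - name: evaluate
        dependencies: [train]
        template: run-step
        arguments: {parameters: [{name: cmd, value: "python evaluate.py"}]}
  - name: run-step
    inputs:
      parameters:
      - name: cmd
    container:
      image: myregistry/ml-pipeline:v1.0
      command: ["sh", "-c", "{{inputs.parameters.cmd}}"]
```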
Distributed Training Pattern
For large models that do not fit on a single GPU, distributed training spreads the workload across multiple Pods.
# Simplified PyTorchJob for distributed training
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-bert
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: myregistry/bert-distributed:v1.0
            resources:
              limits:
                nvidia.com/gpu: 4
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: pytorch
            image: myregistry/bert-distributed:v1.0
            resources:
              limits:
                nvidia.com/gpu: 4
Practice Questions
1. You need to run a one-time model training task to completion, with up to 2 retries on transient failures. Which Kubernetes resource should you use?
A) Deployment
B) Job
C) DaemonSet
D) CronJob
Answer:
B) Job. A Job is designed for tasks that run to completion. Set backoffLimit: 2 for retries. Deployments are for long-running services, DaemonSets run one Pod per node, and CronJobs are for scheduled recurring tasks (not a one-time training run).
2. Your model-serving Deployment must scale automatically between 2 and 10 replicas based on CPU utilization. Which resource do you need?
A) ReplicaSet
B) ResourceQuota
C) HorizontalPodAutoscaler
D) PodDisruptionBudget
Answer:
C) HorizontalPodAutoscaler. An HPA automatically adjusts the number of replicas in a Deployment based on observed metrics (CPU, memory, or custom metrics). Set minReplicas: 2, maxReplicas: 10, and target CPU utilization. ReplicaSets maintain a fixed number of replicas without autoscaling.
3. A weekly retraining CronJob sometimes runs longer than expected. Which setting prevents a new run from starting while the previous one is still active?
A) backoffLimit: 0
B) concurrencyPolicy: Forbid
C) suspend: true
D) parallelism: 1
Answer:
B) concurrencyPolicy: Forbid. This tells the CronJob controller to skip the new run if the previous one is still active. Replace would kill the running job and start a new one. Allow lets both run concurrently. suspend: true stops all future runs, and parallelism is a Job-level setting for running multiple Pods in parallel within a single Job.
4. A model-serving Pod takes around 90 seconds to load its model before it can serve traffic. Which probe configuration should you use?
A) Liveness probe with initialDelaySeconds: 90
B) Readiness probe with initialDelaySeconds: 90
C) Startup probe with failureThreshold: 30 and periodSeconds: 5
D) Both B and C
Answer:
D) Both B and C. The startup probe handles the initial long loading time (30 checks x 5 seconds = 150 seconds max), preventing the liveness probe from killing the Pod during model loading. Once the startup probe succeeds, the readiness probe takes over to manage traffic routing. This combination is the best practice for ML serving Pods with long initialization times.
5. A training Job's Pod template sets restartPolicy: OnFailure. The training script fails due to an out-of-memory error. What happens?
A) The Pod is deleted and a new Pod is created
B) The container is restarted in the same Pod
C) The Job is marked as failed
D) The Pod enters CrashLoopBackOff
Answer:
B) The container is restarted in the same Pod. With restartPolicy: OnFailure, the kubelet restarts the failed container within the same Pod. This means any local data in the container's writable layer is lost, but emptyDir volumes survive. For ML training, restartPolicy: Never is often preferred so you can inspect failed Pod logs and the Job controller creates a fresh Pod for retries.
Lilly Tech Systems