ML Workloads
Running machine learning training jobs and model serving endpoints on Kubernetes — Jobs, CronJobs, Deployments, operators, and distributed training patterns.
Training Jobs with Kubernetes Jobs
A Kubernetes Job creates one or more Pods and ensures they run to completion. This is the natural fit for ML training — a training script runs, saves a model, and exits.
# Training job with GPU
apiVersion: batch/v1
kind: Job
metadata:
  name: bert-finetune-job
  namespace: ml-training
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 86400  # 24-hour timeout
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: myregistry/bert-trainer:v1.0
        command: ["python", "finetune.py",
                  "--model", "bert-base-uncased",
                  "--epochs", "10",
                  "--output", "/models/bert-finetuned"]
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "8"
        volumeMounts:
        - name: model-storage
          mountPath: /models
        - name: dataset
          mountPath: /data
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
      - name: dataset
        persistentVolumeClaim:
          claimName: dataset-pvc
Key Job Settings for ML
- backoffLimit — Number of retries before marking the Job as failed. Set to 2-3 for transient failures (GPU memory errors, preemptions).
- activeDeadlineSeconds — Maximum time the Job can run. Prevents runaway training jobs from consuming GPUs indefinitely.
- restartPolicy: Never — For training, use Never (not OnFailure) so you can inspect failed Pod logs. The Job controller creates a new Pod on failure.
- ttlSecondsAfterFinished — Automatically clean up completed Jobs after a specified time. Keeps the namespace tidy.
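The settings above all live in the Job spec. A minimal sketch combining them (the 7200-second TTL is an illustrative value, not a recommendation):

```yaml
# Sketch: Job spec fragment with retry, timeout, and auto-cleanup
apiVersion: batch/v1
kind: Job
metadata:
  name: bert-finetune-job
spec:
  backoffLimit: 3                 # retry up to 3 times on failure
  activeDeadlineSeconds: 86400    # hard 24-hour cap on total runtime
  ttlSecondsAfterFinished: 7200   # delete the Job (and its Pods) 2h after it finishes
  template:
    spec:
      restartPolicy: Never        # failed Pods are kept for log inspection
      containers:
      - name: trainer
        image: myregistry/bert-trainer:v1.0
```

Note that ttlSecondsAfterFinished deletes the finished Pods too, so collect logs or metrics before the TTL expires.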
Scheduled Training with CronJobs
CronJobs run Jobs on a schedule. Use them for periodic model retraining, data pipeline runs, or evaluation jobs.
# Weekly model retraining
apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-retrain
  namespace: ml-training
spec:
  schedule: "0 2 * * 0"  # Every Sunday at 2 AM
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: retrainer
            image: myregistry/model-retrainer:v2.0
            resources:
              limits:
                nvidia.com/gpu: 2
                memory: "32Gi"
Use Forbid for ML training to prevent a new training run from starting while the previous one is still running. Replace kills the current Job and starts a new one. Allow (the default) lets them run concurrently, which wastes GPU resources.
Model Serving with Deployments
For inference (serving predictions), use Deployments with Horizontal Pod Autoscaler (HPA) to handle varying load.
# Model serving deployment with HPA
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentiment-model
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sentiment-api
  template:
    metadata:
      labels:
        app: sentiment-api
    spec:
      containers:
      - name: server
        image: myregistry/sentiment-server:v2.1
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 120
          periodSeconds: 30
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sentiment-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sentiment-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
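Large models can take minutes to load, and fixed initialDelaySeconds values are brittle if load time varies. An alternative sketch (thresholds are illustrative) uses a startup probe to gate the liveness probe until loading finishes:

```yaml
# Sketch: probe configuration for a slow-loading model server
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30   # up to 30 checks...
  periodSeconds: 5       # ...5s apart = up to 150s allowed for model loading
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 30      # only takes effect after the startup probe succeeds
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10      # controls traffic routing once the Pod is up
```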
ML Operators and Frameworks
The Kubernetes ecosystem has operators specifically designed for ML workloads.
Kubeflow Training Operator
Kubeflow provides custom resources for distributed training across frameworks:
- TFJob — Distributed TensorFlow training with parameter servers or MultiWorkerMirroredStrategy
- PyTorchJob — Distributed PyTorch training with DistributedDataParallel
- MPIJob — Training with MPI (Horovod) for multi-node GPU training
- XGBoostJob — Distributed XGBoost training
KServe (formerly KFServing)
KServe provides a standardized model serving interface on Kubernetes with features like canary deployments, autoscaling to zero, and multi-model serving.
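KServe's central resource is the InferenceService. A minimal sketch (the name, namespace, and storage URI are illustrative, and the exact schema can vary across KServe versions):

```yaml
# Sketch: KServe InferenceService for a stored model
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                                # framework of the saved model
      storageUri: gs://my-bucket/models/sentiment    # illustrative bucket path
```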
Argo Workflows
Orchestrate multi-step ML pipelines (data prep, training, evaluation, deployment) as directed acyclic graphs (DAGs) on Kubernetes.
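A pipeline like that can be expressed as a DAG of container steps. A hedged sketch (step names, image, and scripts are hypothetical):

```yaml
# Sketch: three-step Argo Workflow DAG: prep -> train -> evaluate
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: pipeline
  templates:
  - name: pipeline
    dag:
      tasks:
      - name: prep
        template: run-step
        arguments: {parameters: [{name: cmd, value: "python prep.py"}]}
      - name: train
        dependencies: [prep]          # runs only after prep succeeds
        template: run-step
        arguments: {parameters: [{name: cmd, value: "python train.py"}]}
      - name: evaluate
        dependencies: [train]
        template: run-step
        arguments: {parameters: [{name: cmd, value: "python evaluate.py"}]}
  - name: run-step
    inputs:
      parameters:
      - name: cmd
    container:
      image: myregistry/ml-pipeline:v1.0
      command: ["sh", "-c", "{{inputs.parameters.cmd}}"]
```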
Distributed Training Pattern
For large models that do not fit on a single GPU, distributed training spreads the workload across multiple Pods.
# Simplified PyTorchJob for distributed training
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-bert
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: myregistry/bert-distributed:v1.0
            resources:
              limits:
                nvidia.com/gpu: 4
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: pytorch
            image: myregistry/bert-distributed:v1.0
            resources:
              limits:
                nvidia.com/gpu: 4
Practice Questions
1. You need to run a one-time model training task to completion, with up to 2 retries on transient failures. Which Kubernetes resource should you use?
A) Deployment
B) Job
C) DaemonSet
D) CronJob
Answer:
B) Job. A Job is designed for tasks that run to completion. Set backoffLimit: 2 for retries. Deployments are for long-running services, DaemonSets run one Pod per node, and CronJobs are for scheduled recurring tasks (not a one-time training run).
2. Your model-serving Deployment must scale automatically between 2 and 10 replicas based on CPU utilization. Which resource do you need?
A) ReplicaSet
B) ResourceQuota
C) HorizontalPodAutoscaler
D) PodDisruptionBudget
Answer:
C) HorizontalPodAutoscaler. An HPA automatically adjusts the number of replicas in a Deployment based on observed metrics (CPU, memory, or custom metrics). Set minReplicas: 2, maxReplicas: 10, and target CPU utilization. ReplicaSets maintain a fixed number of replicas without autoscaling.
3. A weekly retraining CronJob sometimes runs longer than expected. Which setting prevents a new run from starting while the previous one is still active?
A) backoffLimit: 0
B) concurrencyPolicy: Forbid
C) suspend: true
D) parallelism: 1
Answer:
B) concurrencyPolicy: Forbid. This tells the CronJob controller to skip the new run if the previous one is still active. Replace would kill the running job and start a new one. Allow lets both run concurrently. suspend: true stops all future runs, and parallelism is a Job-level setting for running multiple Pods in parallel within a single Job.
4. A model-serving Pod takes around 90 seconds to load its model before it can serve traffic. Which probe configuration should you use?
A) Liveness probe with initialDelaySeconds: 90
B) Readiness probe with initialDelaySeconds: 90
C) Startup probe with failureThreshold: 30 and periodSeconds: 5
D) Both B and C
Answer:
D) Both B and C. The startup probe handles the initial long loading time (30 checks x 5 seconds = 150 seconds max), preventing the liveness probe from killing the Pod during model loading. Once the startup probe succeeds, the readiness probe takes over to manage traffic routing. This combination is the best practice for ML serving Pods with long initialization times.
5. A training Job's Pod template sets restartPolicy: OnFailure. The training script fails due to an out-of-memory error. What happens?
A) The Pod is deleted and a new Pod is created
B) The container is restarted in the same Pod
C) The Job is marked as failed
D) The Pod enters CrashLoopBackOff
Answer:
B) The container is restarted in the same Pod. With restartPolicy: OnFailure, the kubelet restarts the failed container within the same Pod. This means any local data in the container's writable layer is lost, but emptyDir volumes survive. For ML training, restartPolicy: Never is often preferred so you can inspect failed Pod logs and the Job controller creates a fresh Pod for retries.
Lilly Tech Systems