ML Infrastructure Questions
These 10 questions cover the infrastructure layer that supports production ML systems. This is where MLOps meets platform engineering — designing the systems that hundreds of data scientists and ML engineers use daily.
Q1: How do you run ML workloads on Kubernetes? What K8s resources are ML-specific?
Kubernetes is the dominant orchestration platform for ML workloads, but ML has requirements beyond standard web services:
- GPU scheduling: Use the NVIDIA device plugin for Kubernetes. Request GPUs in pod specs with `nvidia.com/gpu: 1` in resource limits. Kubernetes schedules pods to nodes with available GPUs. Use node affinity to target specific GPU types (A100 for training, T4 for inference).
- Training jobs: Use Kubernetes Jobs (not Deployments) for training because they run to completion rather than being restarted. For distributed training, use the Kubeflow Training Operator, which provides CRDs for PyTorchJob, TFJob, and MPIJob that handle multi-node coordination automatically.
- Model serving: Use Deployments with Horizontal Pod Autoscaler (HPA) for inference. Configure HPA on custom metrics (GPU utilization, queue depth) not just CPU. Use KServe or Seldon Core for ML-specific serving features (multi-model serving, canary, A/B testing).
- Persistent storage: Training needs fast storage for data loading (local NVMe or high-IOPS EBS). Model artifacts go to object storage (S3/GCS) via PersistentVolumeClaims. Use ReadWriteMany volumes for distributed training across nodes.
- Spot/preemptible nodes: Use spot instances for training jobs (60–80% cost savings). Implement checkpointing so training can resume after preemption. Use on-demand instances for inference serving to ensure availability.
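The Job-plus-GPU-request pattern above can be sketched as a manifest. This is a minimal illustration expressed as a Python dict for brevity; the job name, image, and node label are placeholders, and in practice this would be written as YAML and applied with kubectl.

```python
# Minimal Kubernetes Job for a single-GPU training run (illustrative only;
# all names and the node-label value are example placeholders).
training_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "train-example"},        # hypothetical job name
    "spec": {
        "backoffLimit": 2,                        # retry a failed pod at most twice
        "template": {
            "spec": {
                "restartPolicy": "Never",         # run to completion, do not restart
                "nodeSelector": {
                    # node-affinity-style targeting of a specific GPU type
                    "cloud.google.com/gke-accelerator": "nvidia-tesla-a100"
                },
                "containers": [{
                    "name": "trainer",
                    "image": "registry.example.com/trainer:latest",  # placeholder
                    "resources": {
                        # the GPU request: schedules onto a node with a free GPU
                        "limits": {"nvidia.com/gpu": 1}
                    },
                }],
            }
        },
    },
}
```

The key lines are `restartPolicy: Never` (a Job, not a long-running Deployment) and the `nvidia.com/gpu` limit, which the NVIDIA device plugin exposes as a schedulable resource.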
Q2: How do you manage GPU resources efficiently across multiple teams?
GPU management is one of the hardest infrastructure challenges because GPUs are expensive ($1–3/hour per GPU) and often underutilized:
- Resource quotas: Set per-namespace GPU quotas in Kubernetes so one team cannot monopolize the cluster. Example: Team A gets 8 GPUs max, Team B gets 16 GPUs. Track utilization to adjust quotas quarterly.
- GPU sharing: Use NVIDIA Multi-Instance GPU (MIG) to partition a single A100 into up to 7 isolated instances. Each instance has dedicated compute and memory. Perfect for inference workloads that do not need a full GPU. Alternatively, use NVIDIA GPU time-slicing for less isolation but more flexibility.
- Priority classes: Define Kubernetes PriorityClasses: production inference (highest), scheduled training (medium), ad-hoc experiments (lowest). When the cluster is full, low-priority pods are preempted to make room for high-priority workloads.
- Autoscaling: Use Cluster Autoscaler to add GPU nodes when demand exceeds capacity and remove them when idle. Set scale-down delays (10–15 minutes) to avoid thrashing. For bursty training workloads, use Karpenter (AWS) for faster node provisioning.
- Utilization monitoring: Track GPU utilization per team and per workload using DCGM (Data Center GPU Manager) metrics exported to Prometheus. Alert when GPUs are allocated but utilization is below 30% — this means someone reserved a GPU but is not using it.
- Cost allocation: Tag GPU usage by team and project. Generate monthly showback/chargeback reports so teams understand their costs and are incentivized to release GPUs when not needed.
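The "allocated but idle" alert can be illustrated with a small sketch. In production the utilization samples would come from DCGM metrics in Prometheus; the data and function name below are made up for illustration.

```python
# Flag GPUs whose average utilization falls below a threshold (default 30%),
# i.e. GPUs someone reserved but is not actually using.
def idle_gpu_alerts(samples, threshold=30.0):
    """samples: {(team, gpu_id): [utilization %, ...]} -> sorted alert list."""
    alerts = []
    for (team, gpu_id), utils in samples.items():
        avg = sum(utils) / len(utils)
        if avg < threshold:
            alerts.append((team, gpu_id, round(avg, 1)))
    return sorted(alerts)

# Hypothetical utilization samples per (team, GPU):
samples = {
    ("team-a", "gpu-0"): [85.0, 92.0, 78.0],   # busy
    ("team-a", "gpu-1"): [2.0, 0.0, 5.0],      # allocated but idle
    ("team-b", "gpu-0"): [40.0, 55.0, 61.0],
}
print(idle_gpu_alerts(samples))  # flags team-a's idle gpu-1
```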
Q3: What is a feature store and why is it important for production ML?
A feature store is a centralized platform for storing, managing, and serving features (engineered data inputs) for ML models. It solves three critical problems:
- Training-serving consistency: The same feature computation code is used for both training (batch) and serving (online). Without a feature store, teams often have separate Spark jobs for training and Java services for serving that compute the same feature differently, causing training-serving skew.
- Feature reuse: In large organizations, multiple models often need the same features (e.g., "user's 7-day purchase count"). Without a feature store, each team computes it independently, wasting engineering time and compute resources. A feature store computes once, serves many.
- Point-in-time correctness: When building training datasets, you need features as they were at the time of each training example, not as they are today. Feature stores handle time travel queries to prevent data leakage (using future data to predict the past).
Architecture:
- Offline store: Data warehouse (BigQuery, Snowflake) for batch feature retrieval during training. High throughput, high latency (seconds).
- Online store: Low-latency key-value store (Redis, DynamoDB) for real-time feature retrieval during inference. Low latency (1–5 ms), lower throughput.
- Feature registry: Metadata catalog that documents each feature: owner, description, data source, freshness SLA, and which models use it.
Tools: Feast (open-source, flexible), Tecton (managed, enterprise), Vertex AI Feature Store (GCP-native), SageMaker Feature Store (AWS-native).
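The point-in-time correctness rule can be made concrete with a small sketch: for each training example, take the latest feature value recorded at or before the example's timestamp, never a later one. This is pure Python for illustration, not any feature store's actual API.

```python
from bisect import bisect_right

def point_in_time_lookup(feature_history, event_time):
    """feature_history: list of (timestamp, value), sorted by timestamp.
    Returns the latest value with timestamp <= event_time, avoiding leakage."""
    times = [t for t, _ in feature_history]
    i = bisect_right(times, event_time)
    return feature_history[i - 1][1] if i > 0 else None

# Hypothetical history of a "user's 7-day purchase count" feature:
history = [(100, 2), (200, 5), (300, 9)]

assert point_in_time_lookup(history, 250) == 5    # sees the value at t=200, not t=300
assert point_in_time_lookup(history, 300) == 9    # inclusive at the exact timestamp
assert point_in_time_lookup(history, 50) is None  # feature did not exist yet
```

Using the value at t=300 for an example that occurred at t=250 would be exactly the "using future data to predict the past" leakage the text describes.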
Q4: How do you design an experiment tracking system?
An experiment tracking system records everything about each ML experiment so results are reproducible and comparable:
- What to track per experiment: Hyperparameters (learning rate, batch size, architecture choices), metrics (training loss, validation accuracy, test set results), artifacts (model weights, plots, confusion matrices), code version (Git SHA), data version (DVC hash or dataset ID), environment (Docker image, GPU type), and run metadata (who ran it, when, why).
- Organization: Group experiments into projects (one per model or task). Within a project, each run is a single training execution. Allow tags and notes for human context ("trying larger learning rate after discussion with team").
- Comparison UI: Enable side-by-side comparison of runs: parameter differences, metric curves (training loss over epochs), and artifact comparison. This is where experiments become actionable — seeing that run 47 beat run 46 because of a specific hyperparameter change.
- Integration with model registry: Promote the best experiment run to a registered model version with one click. The model registry links back to the experiment that produced it, maintaining full lineage.
Tools: MLflow (open-source, self-hosted or managed), Weights & Biases (best UI, SaaS), Neptune.ai (good for large teams), Comet ML (strong comparison features). For most teams, MLflow is the right starting point because it is free, integrates with everything, and can be self-hosted for data privacy.
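The record-and-compare loop these tools implement can be sketched with a toy in-memory tracker. All names here are illustrative, not any real tool's API.

```python
class ExperimentTracker:
    """Toy tracker: records runs (params, metrics, code version, notes)
    and finds the best run by a chosen metric."""
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, git_sha, tags=None):
        run = {"id": len(self.runs), "params": params, "metrics": metrics,
               "git_sha": git_sha, "tags": tags or []}
        self.runs.append(run)
        return run["id"]

    def best_run(self, metric, higher_is_better=True):
        pick = max if higher_is_better else min
        return pick(self.runs, key=lambda r: r["metrics"][metric])

tracker = ExperimentTracker()
tracker.log_run({"lr": 1e-3}, {"val_acc": 0.91}, git_sha="abc123")
tracker.log_run({"lr": 3e-3}, {"val_acc": 0.94}, git_sha="def456",
                tags=["larger lr after team discussion"])

best = tracker.best_run("val_acc")
print(best["params"])  # the run with lr=3e-3 wins on val_acc
```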
Q5: What is a model registry and how does it fit into the ML lifecycle?
A model registry is a versioned repository that stores model artifacts along with metadata, stage transitions, and lineage information. It is the single source of truth for "which model is deployed where."
Key capabilities:
- Model versioning: Store multiple versions of each model with immutable artifacts. Each version records: training metrics, evaluation results, the experiment that produced it, and the data version used.
- Stage management: Models progress through stages: None → Staging → Production → Archived. Stage transitions can require approval (human review) or automated validation gates. Only models in "Production" stage are served to users.
- Lineage tracking: For any model version, answer: "What data was it trained on? What code produced it? What experiment run generated it? Who approved it for production?" This is essential for debugging, auditing, and regulatory compliance.
- Deployment integration: Serving infrastructure reads from the registry. When a model is promoted to Production stage, the serving system automatically picks it up and starts serving it (via polling or webhooks).
- Access control: Role-based access: data scientists can register models, ML engineers can promote to staging, only senior engineers or automated gates can promote to production.
MLflow Model Registry is the most common choice. It provides all the above features with a Python API and UI. For cloud-native teams: Vertex AI Model Registry (GCP), SageMaker Model Registry (AWS), or Azure ML Registry.
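The stage-management rules above amount to a small state machine: only certain transitions are legal, and promotion to Production requires approval. A minimal sketch (class and stage names are illustrative, loosely following MLflow's stage vocabulary):

```python
# Allowed stage transitions: None -> Staging -> Production -> Archived.
ALLOWED = {
    "None": {"Staging"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}

class ModelVersion:
    def __init__(self, name, version):
        self.name, self.version, self.stage = name, version, "None"

    def transition(self, target, approved_by=None):
        if target not in ALLOWED[self.stage]:
            raise ValueError(f"{self.stage} -> {target} not allowed")
        if target == "Production" and approved_by is None:
            raise ValueError("promotion to Production requires approval")
        self.stage = target

mv = ModelVersion("fraud-detector", 3)   # hypothetical model name
mv.transition("Staging")
mv.transition("Production", approved_by="senior-engineer")
print(mv.stage)  # -> Production
```

A serving system polling the registry would now pick up version 3; promoting without `approved_by`, or skipping Staging, raises an error, which is the "validation gate" in miniature.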
Q6: How do you optimize ML infrastructure costs?
ML infrastructure is expensive because of GPU compute. Cost optimization is a continuous process, not a one-time exercise:
- Right-size instances: Many training jobs use A100 GPUs when a T4 would suffice. Profile your workload's GPU memory and compute requirements, then select the smallest instance that fits. An A100 costs 10x more than a T4 — using it for a small model is waste.
- Spot/preemptible instances: Use spot instances for training (60–80% savings). Implement checkpointing every 30 minutes so training resumes after preemption without starting over. Do not use spot for inference serving.
- Autoscaling: Scale inference endpoints to zero during off-peak hours if traffic allows. Use Knative or KEDA for scale-to-zero. For batch training, scale the cluster down when no jobs are queued.
- Model optimization: Quantize models from FP32 to FP16 or INT8. This halves or quarters GPU memory requirements, allowing smaller instances. Distill large models into smaller ones for inference. Use ONNX Runtime for 2–3x inference speedup.
- Reserved capacity: For steady-state inference workloads, commit to 1-year reserved instances (30–40% savings over on-demand). Only reserve what you consistently use; handle peaks with on-demand.
- Multi-model serving: Pack multiple small models onto a single GPU instance using Triton's concurrent model execution. One GPU serving 5 models is cheaper than 5 separate GPU instances.
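The FP32-to-INT8 idea can be illustrated with a toy symmetric quantizer (pure Python for clarity, not a real runtime's kernel): weights map to 8-bit integers via a single scale, and the round-trip error stays within half a quantization step.

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: one scale for the whole tensor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

assert all(-128 <= x <= 127 for x in q)     # fits in 8 bits: 4x smaller than FP32
assert max_err <= scale / 2 + 1e-9          # error bounded by half a step
```

The 4x memory reduction is what lets a quantized model fit on a smaller (cheaper) GPU; the accuracy cost is the bounded rounding error, which must still be validated on your evaluation set.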
Track unit economics: Compute cost per prediction and cost per training run. Report these metrics to leadership alongside model performance. A model that costs $0.001 per prediction and generates $0.50 in revenue per prediction is a great investment.
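The cost-per-prediction arithmetic is simple but worth writing down; all numbers below are hypothetical.

```python
def cost_per_prediction(instance_cost_per_hour, predictions_per_second):
    """Unit economics: instance cost divided by hourly prediction volume."""
    predictions_per_hour = predictions_per_second * 3600
    return instance_cost_per_hour / predictions_per_hour

# A $1.20/hour GPU instance serving 500 predictions/second:
cpp = cost_per_prediction(1.20, 500)
print(f"${cpp:.8f} per prediction")  # well under a hundredth of a cent
```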
Q7: How do you handle distributed training across multiple GPUs/nodes?
Distributed training is necessary when a model or dataset is too large for a single GPU. Two main parallelism strategies:
- Data parallelism: Each GPU holds a complete copy of the model. The dataset is split across GPUs. Each GPU computes gradients on its shard, then gradients are averaged (AllReduce) across all GPUs before updating weights. Works for most models that fit in a single GPU's memory. Use PyTorch DDP (DistributedDataParallel) or DeepSpeed ZeRO.
- Model parallelism: The model is split across GPUs because it does not fit in one GPU's memory. Pipeline parallelism splits by layers (GPU 1 gets layers 1–12, GPU 2 gets layers 13–24). Tensor parallelism splits individual layers across GPUs. Required for LLMs with billions of parameters. Use Megatron-LM or DeepSpeed.
Infrastructure requirements:
- Network: High-bandwidth, low-latency interconnect between GPUs. NVLink for intra-node (900 GB/s), InfiniBand for inter-node (200–400 Gbps). Standard Ethernet (25 Gbps) creates a bottleneck for gradient synchronization.
- Storage: Fast shared storage for data loading. All nodes must read the same dataset. Use parallel file systems (Lustre, GPFS) or stream from object storage with prefetching.
- Fault tolerance: In a 64-GPU training job running for 3 days, the probability of at least one GPU failure is significant. Implement periodic checkpointing (every 30–60 minutes) and automatic restart from the latest checkpoint.
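The data-parallel loop can be simulated in miniature, with the AllReduce step reduced to a plain average. This is a pure-Python illustration of the algorithm, not PyTorch DDP: each "worker" computes a gradient on its own data shard, the gradients are averaged across workers, and every replica applies the identical update, so the weights never diverge.

```python
def local_gradient(shard, weight):
    # toy gradient of mean squared error for the model y = w * x
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    # stand-in for NCCL AllReduce: average across all workers
    return sum(values) / len(values)

weight = 0.0
shards = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)]]  # 3 workers, 1 shard each
for _ in range(50):
    grads = [local_gradient(s, weight) for s in shards]
    g = all_reduce_mean(grads)   # identical result on every worker
    weight -= 0.05 * g           # so every replica stays in sync

print(round(weight, 3))  # converges toward the true slope, 2.0
```

In real DDP the averaging runs over NVLink/InfiniBand, which is why the interconnect bandwidth in the list above dominates scaling behavior.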
Q8: What is Infrastructure as Code (IaC) for ML and why does it matter?
Infrastructure as Code means managing all ML infrastructure (compute, storage, networking, ML services) through version-controlled configuration files rather than manual console clicks:
- Terraform: Define cloud resources (GPU instances, VPCs, IAM roles, S3 buckets, SageMaker endpoints) in HCL files. Apply changes with `terraform apply`. Track state to detect drift between desired and actual infrastructure.
- Kubernetes manifests: Define ML workloads (training jobs, inference deployments, feature store services) as YAML manifests. Use Helm charts for templating and Kustomize for environment-specific overrides (dev vs staging vs prod).
- GitOps: Store all infrastructure configs in Git. Use ArgoCD or Flux to automatically sync Kubernetes state with the Git repository. Every infrastructure change goes through a PR, gets reviewed, and is auditable.
Why it matters for ML specifically:
- Reproducibility: If a production model works but the training environment is gone, you cannot retrain. IaC lets you recreate identical training environments on demand.
- Environment parity: Dev, staging, and prod environments are defined from the same templates with different parameters. This prevents "works on my machine" problems.
- Disaster recovery: If a region goes down, spin up the entire ML platform in another region from the IaC definitions. Without IaC, disaster recovery is manual and slow.
- Compliance: Auditors want to see who changed what infrastructure, when, and why. IaC provides a complete Git history of every change.
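As a concrete flavor of "infrastructure in a version-controlled file": Terraform also accepts JSON-syntax configuration (`.tf.json`), which the sketch below generates from a Python dict. The resource name, AMI, and tags are placeholders, not real values.

```python
import json

# Declarative description of a GPU training instance (Terraform JSON syntax).
config = {
    "resource": {
        "aws_instance": {
            "gpu_trainer": {                       # hypothetical resource name
                "ami": "ami-0123456789abcdef0",    # placeholder AMI ID
                "instance_type": "g4dn.xlarge",    # T4 GPU instance
                "tags": {"team": "ml-platform", "purpose": "training"},
            }
        }
    }
}

# Committing this file to Git (instead of clicking in a console) is the point:
hcl_json = json.dumps(config, indent=2)
print(hcl_json)
```

Reviewed in a PR and applied by CI, this gives the reproducibility, parity, and audit trail listed above.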
Q9: How do you design a multi-tenant ML platform?
A multi-tenant ML platform serves multiple teams (tenants) on shared infrastructure while providing isolation, fairness, and self-service:
- Namespace isolation: Each team gets a Kubernetes namespace with resource quotas (CPU, memory, GPU limits). Network policies prevent cross-namespace communication unless explicitly allowed.
- Compute fairness: Use Kubernetes ResourceQuotas and LimitRanges to prevent one team from monopolizing shared resources. Implement priority-based scheduling so production workloads always get resources over development experiments.
- Data isolation: Each team's data is stored in separate buckets or database schemas. IAM policies ensure Team A cannot access Team B's data. This is critical for compliance (GDPR, HIPAA) and intellectual property protection.
- Self-service with guardrails: Provide a web UI or CLI where data scientists can launch training jobs, deploy models, and view experiments without filing tickets. But enforce guardrails: maximum training duration, maximum GPU count, approved Docker base images, mandatory logging.
- Shared services: Feature store, model registry, experiment tracker, and monitoring stack are shared across tenants. Each tenant sees only their own data but benefits from the shared infrastructure investment.
- Cost allocation: Tag all resources by team and project. Generate monthly cost reports per tenant. Implement chargeback or showback so teams are accountable for their infrastructure spending.
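The quota-enforcement logic a ResourceQuota admission check performs can be sketched in a few lines (team names and limits below are made up):

```python
# Per-tenant GPU quotas and current usage (hypothetical numbers).
quotas = {"team-a": 8, "team-b": 16}
usage = {"team-a": 6, "team-b": 16}

def admit(team, requested_gpus):
    """Admit a GPU request only if usage + request stays within quota."""
    used = usage.get(team, 0)
    if used + requested_gpus > quotas.get(team, 0):
        return False                      # reject: would exceed the team's quota
    usage[team] = used + requested_gpus   # admit and record the allocation
    return True

assert admit("team-a", 2) is True    # 6 + 2 == 8: exactly at quota
assert admit("team-a", 1) is False   # would exceed quota
assert admit("team-b", 1) is False   # already saturated
```

Kubernetes does this check at admission time per namespace, which is what prevents one tenant from starving the others.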
Q10: How do you handle ML model security in production?
ML model security is an emerging field with threats unique to ML systems:
- Model theft: An attacker queries your API thousands of times to build a copy of your model (model extraction attack). Mitigation: rate limiting, monitoring query patterns for extraction signatures, watermarking model outputs.
- Adversarial inputs: Specially crafted inputs designed to make the model produce wrong outputs. Example: slightly modified images that fool classifiers. Mitigation: adversarial training, input validation, anomaly detection on input distributions.
- Data poisoning: An attacker injects malicious data into the training pipeline to manipulate model behavior. Mitigation: data provenance tracking, anomaly detection on training data, robust training methods.
- Model artifact security: Model files can contain embedded code (pickle files execute arbitrary Python). Mitigation: use safe serialization formats (ONNX, SavedModel), scan model artifacts for malware, sign model artifacts with cryptographic hashes.
- Inference endpoint security: Standard API security: authentication (API keys, OAuth), authorization (RBAC), encryption (TLS), input validation (schema enforcement, size limits), and audit logging.
- Supply chain security: Pre-trained models from HuggingFace or other hubs can contain backdoors. Mitigation: scan downloaded models, verify checksums, test on known inputs before deploying, maintain an approved model registry.
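The rate-limiting mitigation against model extraction can be sketched as a per-client sliding-window limiter (a simplified illustration; production systems would use a shared store like Redis rather than in-process state):

```python
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most max_requests per client within a sliding time window."""
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.history = defaultdict(deque)   # client_id -> recent request times

    def allow(self, client_id, now):
        q = self.history[client_id]
        while q and q[0] <= now - self.window:
            q.popleft()                     # drop requests outside the window
        if len(q) >= self.max_requests:
            return False                    # throttle: window is full
        q.append(now)
        return True

rl = RateLimiter(max_requests=3, window_seconds=60)
results = [rl.allow("client-1", t) for t in (0, 10, 20, 30)]
print(results)  # the fourth request within the window is rejected
assert rl.allow("client-1", 75) is True   # old requests have aged out
```

An extraction attack needs thousands of queries; capping per-client volume and alerting on clients that constantly hit the limit raises the attack's cost substantially.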
Framework: OWASP Machine Learning Security Top 10 provides a structured approach to ML security threats. Integrate ML security reviews into your deployment pipeline just like code security reviews.
Lilly Tech Systems