Cloud AI Services Questions
Cloud AI platforms are central to most enterprise ML infrastructure. These 10 questions cover the major cloud AI services, managed vs self-hosted trade-offs, cost analysis, and architecture patterns that interviewers expect you to discuss with depth and nuance.
Q1: Compare AWS SageMaker, GCP Vertex AI, and Azure ML. What are the strengths and weaknesses of each?
| Feature | AWS SageMaker | GCP Vertex AI | Azure ML |
|---|---|---|---|
| Strengths | Most mature, broadest feature set, SageMaker Studio IDE, built-in model monitoring, largest ecosystem | Best TPU integration, Vertex AI Pipelines (Kubeflow-based), strong AutoML, tight BigQuery integration | Best enterprise integration (Active Directory, Azure DevOps), responsible AI toolkit, strong for .NET shops |
| Weaknesses | Complex pricing, vendor lock-in with SageMaker-specific APIs, steep learning curve | Smaller ecosystem, less third-party integration, documentation gaps | Slower feature releases, smaller ML community, less mature managed training |
| GPU Options | P4d (A100), P5 (H100), Trn1 (Trainium), Inf2 (Inferentia) | A100, H100, TPU v4/v5p/v5e | A100, H100, ND-series (InfiniBand-connected) |
| Model Serving | SageMaker Endpoints (real-time, async, serverless) | Vertex AI Endpoints, with traffic splitting and auto-scaling | Azure ML Managed Online Endpoints, AKS integration |
| Training | SageMaker Training Jobs, distributed training with built-in data parallelism | Vertex AI Custom Training, TPU pod slices for large-scale training | Azure ML Compute Clusters, InfiniBand support for distributed training |
| Cost | Premium pricing, Trainium/Inferentia cheaper alternatives | Competitive, committed use discounts, TPUs cost-effective at scale | Competitive, reserved capacity, enterprise agreements |
Interview tip: Do not just list features. Explain when you would choose each: SageMaker for teams already on AWS wanting a one-stop shop, Vertex AI for TPU-heavy training or organizations using BigQuery, Azure ML for enterprises with existing Microsoft infrastructure. Always mention that the underlying ML concepts are cloud-agnostic.
Q2: When should you use managed ML services vs self-hosted infrastructure?
Answer: This is a nuanced decision that depends on team size, workload characteristics, and organizational maturity:
Choose managed services when:
- Small team (1–5 ML engineers): Cannot afford the operational overhead of managing Kubernetes, GPU drivers, networking, and monitoring. Managed services let them focus on ML, not infrastructure.
- Standard workloads: Training standard models (fine-tuning, transfer learning), serving at moderate scale. Managed services handle this well out of the box.
- Quick iteration: Startups and research teams that need to move fast. SageMaker notebook instances or Vertex AI Workbench eliminate infrastructure setup time.
- Compliance requirements: Managed services often come with built-in security, logging, and compliance certifications (SOC2, HIPAA) that would be expensive to implement on self-hosted infrastructure.
Choose self-hosted when:
- Large scale (100+ GPUs): Managed services charge premium margins (30–50% over raw compute cost). At scale, self-hosted Kubernetes on cloud VMs or bare metal is significantly cheaper.
- Custom requirements: Non-standard GPU topologies, custom NCCL configurations, specific InfiniBand setups, or novel distributed training patterns that managed services do not support.
- Multi-cloud or hybrid: Self-hosted Kubernetes is portable across clouds. Managed services create deep vendor lock-in.
- Large platform team (5+ infra engineers): The team has the expertise to operate and optimize infrastructure, and the cost savings justify the operational investment.
Hybrid approach (most common at mid-size companies): Use managed services for model serving and experiment tracking. Use self-hosted Kubernetes for large-scale training. Use cloud storage (S3/GCS) for data and checkpoints.
Q3: How do you estimate and optimize cloud costs for ML workloads?
Answer: ML cloud costs are dominated by GPU compute (60–80%), with storage (10–20%) and networking (5–10%) as secondary costs.
Cost estimation framework:
- Training cost: (GPU count) x (hours) x (GPU price/hr). Factor in GPU utilization — if GPUs are 50% utilized, effective cost doubles.
- Inference cost: (GPU count) x (hours/month) x (GPU price/hr). Key metric: cost per 1,000 predictions. Compare against the business value of each prediction.
- Storage cost: Datasets + checkpoints + model artifacts. Training checkpoints for a 70B model can be 500+ GB per checkpoint, 100+ checkpoints per run = 50+ TB.
- Data transfer: Cross-region or cross-cloud data movement. 1 TB egress from AWS = ~$90. Can be significant for large datasets.
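The estimation framework above can be sketched as a couple of helper functions. All prices, counts, and utilization figures below are illustrative assumptions, not published rates:

```python
# Hypothetical cost-estimation helpers for the framework above.
# Prices and workload sizes are illustrative assumptions.

def training_cost(gpu_count, hours, price_per_gpu_hour, utilization=1.0):
    """Raw GPU-hours priced out; low utilization inflates effective cost."""
    return gpu_count * hours * price_per_gpu_hour / utilization

def cost_per_1k_predictions(gpu_count, hours_per_month, price_per_gpu_hour,
                            predictions_per_month):
    """Key serving metric: what each thousand predictions costs to serve."""
    monthly = gpu_count * hours_per_month * price_per_gpu_hour
    return monthly / (predictions_per_month / 1000)

# 64 GPUs for one week at an assumed $4/GPU-hr, but only 50% utilized:
run = training_cost(64, 24 * 7, 4.0, utilization=0.5)
print(f"training: ${run:,.0f}")

# One always-on GPU at an assumed $3/hr serving 10M predictions/month:
serving = cost_per_1k_predictions(1, 730, 3.0, 10_000_000)
print(f"serving: ${serving:.3f} per 1k predictions")
```

Note how halving utilization doubles the effective training cost, which is why utilization is usually the first thing to audit.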
Optimization strategies (in order of impact):
- Spot/preemptible instances (60–70% savings): Use for all training workloads with checkpointing. Spot interruption rate for GPU instances is typically 5–15%.
- Reserved capacity (30–40% savings): For predictable workloads running 24/7 (inference endpoints, continuous training).
- Right-size GPU selection: Do not use H100s for inference workloads that fit on T4s. Profile model requirements before selecting instance type.
- Auto-scaling inference: Scale down during off-peak hours. Scale to zero for batch inference endpoints.
- Alternative accelerators: AWS Inferentia2 is 50% cheaper than A10G for supported model architectures. Google TPUs are cost-effective for large-batch training.
- Eliminate waste: Audit for idle GPU instances, forgotten notebooks, orphaned storage volumes. Implement auto-shutdown policies.
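The "eliminate waste" item above is easy to automate. Below is a minimal sketch of an idle-GPU auto-shutdown policy; the instance records and thresholds are assumptions, and in practice the utilization data would come from your cloud monitoring API:

```python
# Minimal sketch of an idle-GPU auto-shutdown policy. Instance records
# and thresholds are assumptions; real data would come from monitoring.
from datetime import datetime, timedelta

IDLE_UTIL_THRESHOLD = 0.05   # <5% GPU utilization counts as idle
IDLE_GRACE = timedelta(hours=2)

def instances_to_stop(instances, now):
    """Return IDs of instances idle beyond the grace period."""
    stop = []
    for inst in instances:
        if (inst["gpu_util"] < IDLE_UTIL_THRESHOLD
                and now - inst["last_active"] > IDLE_GRACE):
            stop.append(inst["id"])
    return stop

now = datetime(2024, 1, 1, 12, 0)
fleet = [
    {"id": "nb-1", "gpu_util": 0.01, "last_active": now - timedelta(hours=5)},
    {"id": "train-1", "gpu_util": 0.92, "last_active": now},
    {"id": "nb-2", "gpu_util": 0.02, "last_active": now - timedelta(minutes=30)},
]
print(instances_to_stop(fleet, now))  # only the long-idle notebook qualifies
```

The grace period matters: a notebook idle for 30 minutes may just be a researcher at lunch, so only sustained idleness triggers shutdown.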
Q4: How does AWS Trainium compare to NVIDIA GPUs for ML training?
Answer: AWS Trainium (Trn1/Trn2 instances) is Amazon's custom AI training chip, competing directly with NVIDIA GPUs on price-performance for supported workloads.
Key specs (Trn1.32xlarge): 16 Trainium chips, 512 GB HBM, up to 800 Gbps inter-chip bandwidth via NeuronLink. ~40% cheaper per TFLOP than comparable NVIDIA instances on AWS.
Advantages:
- Significantly lower cost per training hour vs p5.48xlarge (H100)
- Purpose-built for training with high memory bandwidth and efficient collective operations
- Neuron SDK integrates with PyTorch via torch-neuronx (XLA-based compilation)
- Available on-demand with better availability than H100 instances
Limitations:
- Model compatibility: Not all model architectures and operations are supported. Custom CUDA kernels do not work. Models must be compilable via XLA/Neuron compiler.
- Ecosystem: Much smaller ecosystem than CUDA. Fewer debugging tools, profilers, and community resources.
- Compiler overhead: First compilation can take hours for large models. Graph breaks cause performance degradation.
- Vendor lock-in: Only available on AWS. No portability to other clouds or on-premises.
When to use: Standard transformer architectures (LLM fine-tuning, BERT, GPT-style models) where cost optimization is the priority and your team can handle the Neuron SDK learning curve. When not to use: novel architectures, models with custom operators, or when portability matters.
Q5: How do you design a multi-region or multi-cloud AI infrastructure?
Answer: Multi-region/multi-cloud AI infrastructure is driven by three needs: GPU availability, data residency requirements, and disaster recovery.
Multi-region architecture:
- Training: Run in the region with best GPU availability and lowest cost. Training is batch and can tolerate latency. Use one primary region with failover to secondary.
- Inference: Deploy model replicas in each region close to users. Each region serves its own traffic independently. Model artifacts synced from a central model registry.
- Data: Store training data in the region where it is generated or where regulations require (GDPR for EU data). Replicate to training region. Use data versioning (DVC, LakeFS) for consistency.
Multi-cloud architecture:
- Why: GPU shortages on one cloud, best-of-breed services (TPUs only on GCP, Trainium only on AWS), reduce vendor dependency, negotiate better pricing.
- Abstraction layer: Use Kubernetes as the common orchestration layer across clouds. Use Terraform/Pulumi for infrastructure-as-code that targets multiple clouds.
- Model portability: Standardize on ONNX or framework-native formats (TorchScript) rather than cloud-specific formats. Use open-source serving (Triton, vLLM) instead of cloud-specific serving.
- Data sync: Use cloud-agnostic storage abstractions (fsspec, smart_open) or data mesh patterns. Minimize cross-cloud data transfers — train where data lives.
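The storage-abstraction idea can be illustrated with a toy scheme-to-backend dispatcher in the spirit of fsspec/smart_open. The scheme mapping below is an assumption for illustration; real code would dispatch to the actual SDK clients (boto3, google-cloud-storage, etc.):

```python
# Toy sketch of a cloud-agnostic storage-path abstraction. The
# scheme-to-backend mapping is illustrative; real code dispatches
# to actual SDK clients.
from urllib.parse import urlparse

BACKENDS = {"s3": "aws", "gs": "gcp", "az": "azure", "file": "local"}

def resolve_backend(uri):
    """Map a storage URI to the cloud backend that should handle it."""
    scheme = urlparse(uri).scheme or "file"
    try:
        return BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"unsupported storage scheme: {scheme!r}")

print(resolve_backend("s3://datasets/train.parquet"))
print(resolve_backend("gs://checkpoints/step-1000/"))
```

Keeping all data access behind URIs like these is what makes "train where data lives" a config change rather than a code change.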
Interview insight: Multi-cloud adds significant operational complexity. Only recommend it when there is a clear business justification (GPU availability, regulatory, negotiation leverage). Most companies are better served by a single cloud with multi-region deployment.
Q6: How do you set up model serving infrastructure for production?
Answer: Production model serving infrastructure must handle reliability, latency, throughput, and cost optimization:
Serving stack options:
- NVIDIA Triton Inference Server: Multi-framework (PyTorch, TensorFlow, ONNX, TensorRT), dynamic batching, model ensemble, concurrent model execution. Best for GPU inference with multiple models.
- vLLM: Purpose-built for LLM serving. PagedAttention for efficient GPU memory management, continuous batching, tensor parallelism. Best for LLM inference.
- TensorRT-LLM: NVIDIA's optimized LLM inference with quantization, kernel fusion, and inflight batching. Highest throughput for NVIDIA GPUs but requires model-specific compilation.
- Managed services: SageMaker Endpoints, Vertex AI Endpoints. Higher cost but lower operational burden.
Production architecture:
- Load balancer: Route requests across multiple model server replicas. Health check endpoints to remove unhealthy replicas.
- Request queue: Buffer requests during traffic spikes. Prevents overloading model servers and enables backpressure.
- Auto-scaling: Scale based on GPU utilization, request queue depth, or latency P99. Maintain warm replicas for immediate scaling.
- Model versioning: Blue-green or canary deployment. Route percentage of traffic to new model version. Roll back automatically if error rate increases.
- Monitoring: Latency (P50, P95, P99), throughput (requests/sec), GPU utilization, error rate, model prediction distribution.
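The canary-deployment logic above can be sketched as two small decisions: how to split traffic, and when to roll back. The error-rate margin here is an assumed threshold, not a standard value:

```python
# Minimal sketch of canary routing and rollback decisions.
# The 1% error-rate margin is an assumed threshold.
import random

def route(canary_fraction, rng=random.random):
    """Route a request to 'canary' or 'stable' by traffic fraction."""
    return "canary" if rng() < canary_fraction else "stable"

def should_rollback(stable_errors, stable_total,
                    canary_errors, canary_total, margin=0.01):
    """Roll back if canary error rate exceeds stable's by more than margin."""
    if canary_total == 0:
        return False  # not enough canary traffic to judge yet
    canary_rate = canary_errors / canary_total
    stable_rate = stable_errors / stable_total
    return canary_rate > stable_rate + margin

print(route(0.1, rng=lambda: 0.05))          # falls in the canary bucket
print(should_rollback(50, 10_000, 9, 500))   # 1.8% vs 0.5% baseline
```

In production this comparison would run continuously against a sliding window of requests, with the rollback wired into the deployment controller.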
Q7: What is serverless GPU inference and when does it make sense?
Answer: Serverless GPU inference provisions GPU resources on-demand for each request, scaling to zero when idle. Examples: AWS SageMaker Serverless Inference, Modal, Replicate, Banana.dev.
When it makes sense:
- Sporadic traffic: API that receives 100 requests/day. A dedicated GPU instance at $3/hour costs $2,160/month. Serverless might cost $5–10/month.
- Prototyping: Quick deployment without managing infrastructure. Deploy a model endpoint in minutes.
- Batch processing with variable load: Process a queue of items where volume varies 100x day-to-day.
When it does NOT make sense:
- High-throughput, steady traffic: Dedicated instances are cheaper when GPUs are utilized >30% of the time.
- Latency-sensitive: Cold start (loading model into GPU memory) takes 30–120 seconds for large models. Unacceptable for real-time APIs.
- Large models: Models >10 GB have long cold starts and high per-request cost due to memory allocation.
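The dedicated-vs-serverless trade-off above is just utilization math. The sketch below uses assumed prices (including a hypothetical per-GPU-second serverless rate) purely to show the shape of the comparison:

```python
# Back-of-the-envelope serverless vs dedicated comparison.
# All prices are illustrative assumptions.

def dedicated_monthly(price_per_hour, hours=730):
    """Cost of an always-on instance for a ~730-hour month."""
    return price_per_hour * hours

def serverless_monthly(requests_per_month, gpu_seconds_per_request,
                       price_per_gpu_second):
    """Pay only for GPU-seconds actually consumed."""
    return requests_per_month * gpu_seconds_per_request * price_per_gpu_second

# 100 requests/day, 2 GPU-seconds each, at an assumed $0.0012/GPU-second:
sporadic = serverless_monthly(100 * 30, 2, 0.0012)
dedicated = dedicated_monthly(3.0)   # assumed $3/hr always-on instance
print(f"serverless ${sporadic:.2f}/mo vs dedicated ${dedicated:.2f}/mo")
```

At this traffic level serverless is hundreds of times cheaper; as request volume grows the lines cross, which is where the ">30% utilization" rule of thumb comes from.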
Emerging solutions: Platforms like Modal and Replicate use container snapshots to reduce cold start to 1–5 seconds by pre-loading model weights into a snapshot that can be restored quickly. This makes serverless viable for a broader range of use cases.
Q8: How do you handle GPU instance availability issues in the cloud?
Answer: GPU instance shortages have been persistent since 2023 due to AI demand. Strategies to secure capacity:
- Reserved instances / Committed Use Discounts: 1–3 year commitments guarantee capacity at 30–60% discount. Best for predictable baseline workloads.
- Capacity reservations: AWS on-demand capacity reservations, GCP reservations. Guarantee specific GPU instances in specific availability zones without commitment discount.
- Multi-region / multi-AZ: Spread workloads across regions. GPU availability varies significantly by region. Build infrastructure that can failover between regions.
- Multi-cloud: If AWS H100s are unavailable, try GCP or Azure. Requires portable infrastructure (Kubernetes, Terraform).
- Alternative GPU types: If H100s are unavailable, can you use A100s with more GPUs? If p5 instances are full, try p4d. Quantize models to fit on smaller GPUs.
- Spot instances with fallback: Try spot first (cheapest). If spot is unavailable, fall back to on-demand. If on-demand is unavailable, try a different region or GPU type.
- GPU cloud providers: CoreWeave, Lambda Labs, Together AI offer dedicated GPU capacity with better availability for AI workloads than hyperscalers.
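The spot-with-fallback strategy above amounts to walking an ordered preference list until a provisioning attempt succeeds. In this sketch the provisioner is a stand-in for your cloud API, and the region/instance names are illustrative:

```python
# Minimal sketch of a capacity fallback chain: try options in order of
# preference. The provisioner callable stands in for a real cloud API.

FALLBACK_CHAIN = [
    ("us-east-1", "p5.48xlarge", "spot"),        # cheapest first
    ("us-east-1", "p5.48xlarge", "on-demand"),
    ("us-west-2", "p5.48xlarge", "on-demand"),   # then other regions
    ("us-east-1", "p4d.24xlarge", "on-demand"),  # older GPU as last resort
]

def acquire_capacity(provision, chain=FALLBACK_CHAIN):
    """Return the first (region, instance, market) the provisioner accepts."""
    for region, instance, market in chain:
        if provision(region, instance, market):
            return (region, instance, market)
    raise RuntimeError("no GPU capacity available in any fallback option")

# Simulate: spot exhausted and us-east-1 on-demand full; us-west-2 has room.
available = {("us-west-2", "p5.48xlarge", "on-demand")}
got = acquire_capacity(lambda r, i, m: (r, i, m) in available)
print(got)
```

Ordering the chain by cost first and capability second keeps the common case cheap while still degrading gracefully during shortages.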
Proactive capacity planning: Monitor GPU utilization trends. Forecast demand 3–6 months out. Request capacity increases from cloud providers well in advance. Maintain relationships with your cloud account team for priority access during shortages.
Q9: How do you implement infrastructure-as-code for ML infrastructure?
Answer: ML infrastructure has unique IaC requirements beyond typical cloud infrastructure:
Terraform (most common):
- Define GPU node pools, VPC/networking, IAM roles, storage buckets, and managed ML services
- Use modules for reusable patterns: "training cluster" module, "inference endpoint" module
- State management: use remote state with locking (S3 + DynamoDB, GCS) for team collaboration
- Workspace or Terragrunt for managing dev/staging/production environments
Kubernetes manifests (Helm, Kustomize):
- Define GPU device plugins, operators, training jobs, serving deployments as code
- Helm charts for templated deployments (parameterize GPU count, model name, resource limits)
- Kustomize overlays for environment-specific configurations (dev: 1 GPU, prod: 8 GPUs)
- GitOps with ArgoCD or Flux: all K8s changes go through Git, auto-synced to clusters
ML-specific IaC patterns:
- Training job templates: Parameterized templates for launching training jobs (model config, data path, GPU count, hyperparameters). Researchers submit jobs without knowing K8s.
- Experiment environments: Self-service GPU notebook environments provisioned via IaC. Auto-shutdown after idle timeout.
- Model serving: Automated endpoint provisioning triggered by model registry events. New model version pushed = Terraform creates/updates serving endpoint.
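The training-job-template pattern above can be sketched with plain string templating: researchers supply a few parameters, and the platform renders a Kubernetes Job manifest. The field names, image, and defaults here are illustrative assumptions:

```python
# Sketch of a parameterized training-job template rendered to a K8s Job
# manifest. Image names, paths, and defaults are illustrative assumptions.
import string

JOB_TEMPLATE = string.Template("""\
apiVersion: batch/v1
kind: Job
metadata:
  name: train-$model
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: $image
        args: ["--data", "$data_path", "--lr", "$lr"]
        resources:
          limits:
            nvidia.com/gpu: "$gpus"
      restartPolicy: Never
""")

def render_training_job(model, image, data_path, gpus=1, lr=3e-4):
    """Fill the template; researchers never touch raw Kubernetes YAML."""
    return JOB_TEMPLATE.substitute(model=model, image=image,
                                   data_path=data_path, gpus=gpus, lr=lr)

manifest = render_training_job("bert-ft", "registry.local/trainer:latest",
                               "s3://datasets/corpus", gpus=8)
print(manifest)
```

Real platforms usually layer this behind a CLI or web form and add validation (quota checks, allowed GPU counts) before the manifest reaches the cluster.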
Q10: How do you handle data gravity in cloud AI architecture?
Answer: Data gravity is the principle that applications and services tend to migrate toward large data stores because moving data is expensive and slow. This is especially relevant for ML where training datasets can be petabytes.
Implications for AI infrastructure:
- Train where data lives: Moving 100 TB of training data across clouds costs ~$9,000 in egress fees and takes days. It is cheaper and faster to provision GPUs in the same cloud/region as your data.
- Data preprocessing locality: Run feature engineering and data augmentation in the same region as raw data. Only move processed, compressed datasets if cross-region training is necessary.
- Model serving can be distributed: Model artifacts are typically 1–100 GB, easily replicated across regions. Serve models close to users regardless of where training happened.
Strategies to manage data gravity:
- Data lakehouse architecture: Centralize data in a cloud data lake (Delta Lake, Iceberg). Run training on compute in the same cloud. Use data lake features (partitioning, Z-ordering) to efficiently read only the data needed for each training run.
- Edge caching: For multi-region training, cache frequently accessed datasets in each region. Use delta sync to keep caches updated.
- Federated learning: When data cannot move (privacy, regulation), bring the model to the data. Train models locally and aggregate updates centrally.
- Data sampling: For development and experimentation, use representative data samples (1–10%) that can be easily moved. Only use full datasets for final training runs.
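The data-gravity trade-off is ultimately arithmetic: what does moving the dataset cost, in dollars and in time, versus provisioning GPUs where it already lives? The $0.09/GB egress rate below matches the ~$9,000-per-100-TB figure above; the network assumptions are illustrative:

```python
# Back-of-the-envelope data-gravity math. The $0.09/GB egress rate
# matches the figure above; link speed is an illustrative assumption.

EGRESS_PER_GB = 0.09

def egress_cost(dataset_tb):
    """Dollar cost to move a dataset out of the cloud (1 TB = 1000 GB)."""
    return dataset_tb * 1000 * EGRESS_PER_GB

def transfer_days(dataset_tb, gbps=1):
    """Days to move the dataset over a sustained network link."""
    bits = dataset_tb * 1000 * 8e9
    return bits / (gbps * 1e9) / 86_400

print(f"100 TB egress: ${egress_cost(100):,.0f}")
print(f"at 1 Gbps sustained: {transfer_days(100):.1f} days")
```

Running these numbers before an architecture review makes the "train where data lives" recommendation concrete rather than rhetorical.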