Model Deployment Questions
These 12 questions cover the most frequently asked model deployment topics in MLOps interviews. Deployment is where most ML projects fail — interviewers want to know you can get models into production reliably.
Q1: How do you containerize an ML model for production deployment?
Containerization packages a model with all its dependencies into a portable unit. The standard approach uses Docker with a multi-stage build:
- Base image: Start with a slim Python image (python:3.11-slim) or a GPU-enabled base (nvidia/cuda:12.1-runtime) if GPU inference is needed. Never use full Ubuntu images — they are 2–3x larger and increase cold start time.
- Dependencies: Copy requirements.txt first, then install. This leverages Docker layer caching so rebuilds only re-install when requirements change. Pin all versions for reproducibility.
- Model artifacts: Copy the serialized model (ONNX, TorchScript, SavedModel) into the container. For large models (multi-GB), consider downloading from a model registry at startup to keep the container image small.
- Entrypoint: Use a production ASGI server like uvicorn with FastAPI, or a dedicated serving framework like TorchServe or Triton. Never use Flask's development server in production.
- Health checks: Include /health and /ready endpoints. Health checks verify the process is running; readiness checks verify the model is loaded and can serve requests.
Key trade-off: Baking the model into the image gives fast startup but creates large images (5–20 GB for LLMs). Loading from S3/GCS at startup keeps images small but adds 30–120 seconds to cold start. Choose based on your scaling pattern and tolerance for cold starts.
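The liveness/readiness distinction above trips up many candidates. A minimal Python sketch (the lambda stands in for a real deserialized model; names are illustrative, not a specific framework's API):

```python
class ModelServer:
    """Minimal sketch of the liveness vs. readiness split described above;
    load_model() stands in for deserializing a real model artifact."""

    def __init__(self):
        self._model = None  # not loaded yet: process is alive but not ready

    def load_model(self):
        # Pre-load at container startup, not lazily on the first request.
        self._model = lambda features: sum(features)

    def health(self):
        # Liveness probe: the process is running and can answer at all.
        return 200

    def ready(self):
        # Readiness probe: pass only once the model can actually serve.
        return 200 if self._model is not None else 503

server = ModelServer()
assert server.health() == 200   # alive immediately
assert server.ready() == 503    # but not yet ready to receive traffic
server.load_model()
assert server.ready() == 200    # load balancer may now route requests here
```

In Kubernetes terms, the orchestrator restarts the pod on a failed liveness probe but merely withholds traffic on a failed readiness probe, which is exactly what you want while a multi-GB model is still downloading.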
Q2: Compare TensorFlow Serving, Triton Inference Server, and BentoML. When would you use each?
| Feature | TF Serving | Triton | BentoML |
|---|---|---|---|
| Framework Support | TensorFlow only | TF, PyTorch, ONNX, TensorRT, custom | Any Python framework |
| Protocol | gRPC, REST | gRPC, REST, HTTP/2 | REST, gRPC |
| Dynamic Batching | Yes | Yes (advanced) | Yes (adaptive) |
| Model Ensemble | No | Yes (pipeline models) | Yes (service graph) |
| GPU Optimization | Good | Excellent (TensorRT, concurrent model execution) | Basic |
| Complexity | Low | High | Medium |
Use TF Serving when your stack is pure TensorFlow and you want simplicity. Use Triton when you need maximum GPU utilization, multi-framework support, or complex model pipelines (e.g., preprocessing model chained with inference model). Use BentoML when you want Python-native packaging with less infrastructure overhead, or when your team is small and needs to move fast.
Q3: What is the difference between online inference, batch inference, and streaming inference?
- Online (real-time) inference: Single request, synchronous response. Latency target: 10–500 ms. Examples: fraud detection at payment time, search ranking, chatbot responses. Requires always-on serving infrastructure (load balancer, autoscaler, GPU instances).
- Batch inference: Process large datasets on a schedule. Latency target: minutes to hours. Examples: nightly product recommendations, weekly churn predictions, daily content moderation sweeps. Run on ephemeral compute (Spark jobs, Kubernetes batch jobs) to save cost. Results stored in a database or feature store.
- Streaming inference: Process events as they arrive from a stream (Kafka, Kinesis). Latency target: 100 ms – 5 seconds. Examples: real-time anomaly detection in IoT sensor data, live content moderation, dynamic pricing. Requires stream processing (Flink, Kafka Streams) integrated with model serving.
Decision framework: Choose batch when latency tolerance is high and cost must be low. Choose online when user-facing latency matters. Choose streaming when you need near-real-time on continuous data without the infrastructure cost of always-on endpoints. Many production systems combine all three — batch for bulk, online for user-facing, streaming for event-driven.
Q4: How do you design a REST API for model inference? What should the request/response contract look like?
A well-designed ML API should be versioned, documented, and include metadata for debugging:
POST /v1/models/fraud-detector/predict

Request:

```json
{
  "request_id": "uuid-123",
  "inputs": {
    "transaction_amount": 499.99,
    "merchant_category": "electronics",
    "user_country": "US",
    "time_since_last_transaction_seconds": 120
  },
  "parameters": {
    "threshold": 0.7,
    "return_explanations": true
  }
}
```

Response:

```json
{
  "request_id": "uuid-123",
  "model_name": "fraud-detector",
  "model_version": "v2.3.1",
  "predictions": {
    "fraud_probability": 0.82,
    "is_fraud": true,
    "explanations": {
      "transaction_amount": 0.35,
      "time_since_last_transaction_seconds": 0.28,
      "merchant_category": 0.19
    }
  },
  "metadata": {
    "latency_ms": 12,
    "timestamp": "2026-03-21T10:30:00Z"
  }
}
```
Key design principles: (1) Include request_id for tracing across services, (2) Return model_version so you can correlate predictions with specific model artifacts, (3) Support configurable thresholds so business logic does not require redeployment, (4) Include latency in the response for client-side monitoring, (5) Use semantic versioning in the URL path (/v1/, /v2/) for backward compatibility.
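A handler implementing this contract can be sketched in a few lines of plain Python (the scoring function and model constants are stand-ins, not a real fraud model):

```python
import time
import uuid

MODEL_NAME = "fraud-detector"   # hypothetical names matching the example above
MODEL_VERSION = "v2.3.1"

def predict_fraud(features):
    # Stand-in scoring function; a real service would call the loaded model.
    return min(features["transaction_amount"] / 600.0, 1.0)

def handle_request(payload):
    start = time.perf_counter()
    # Configurable threshold: business logic changes without redeployment.
    threshold = payload.get("parameters", {}).get("threshold", 0.5)
    score = predict_fraud(payload["inputs"])
    return {
        # Echo request_id so the prediction can be traced across services.
        "request_id": payload.get("request_id", str(uuid.uuid4())),
        "model_name": MODEL_NAME,
        # Return the version so predictions correlate with a registry artifact.
        "model_version": MODEL_VERSION,
        "predictions": {
            "fraud_probability": round(score, 2),
            "is_fraud": score >= threshold,
        },
        # Report server-side latency for client-side monitoring.
        "metadata": {"latency_ms": round((time.perf_counter() - start) * 1000, 2)},
    }

resp = handle_request({
    "request_id": "uuid-123",
    "inputs": {"transaction_amount": 499.99},
    "parameters": {"threshold": 0.7},
})
assert resp["request_id"] == "uuid-123"
assert resp["predictions"]["is_fraud"] is True
```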
Q5: Explain blue-green deployment for ML models. How does it differ from canary deployment?
Blue-green deployment: Run two identical production environments. "Blue" serves current traffic. Deploy the new model to "Green." Run validation tests against Green. When satisfied, switch the load balancer to route all traffic from Blue to Green instantly. If something goes wrong, switch back to Blue in seconds.
Canary deployment: Gradually route a small percentage of traffic (1%, then 5%, then 25%, then 100%) to the new model while monitoring key metrics. If metrics degrade at any step, roll back automatically. More gradual than blue-green but catches issues that only appear under real production traffic patterns.
| Aspect | Blue-Green | Canary |
|---|---|---|
| Traffic Switch | All-at-once | Gradual (1% → 5% → 25% → 100%) |
| Rollback Speed | Instant (switch LB) | Fast (route back to old) |
| Blast Radius | All users at switch | Only canary percentage |
| Cost | 2x infrastructure | 1x + canary overhead |
| Best For | High-confidence releases, simple models | Risky model changes, A/B testing needed |
ML-specific consideration: For ML models, canary is usually preferred because model behavior can differ significantly between test data and real production traffic. A model that passes all offline tests might still produce poor predictions on edge cases only seen in production. Canary lets you catch these before all users are affected.
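The canary traffic split can be sketched as a weighted router (in production this lives in the load balancer or service mesh config, not application code; ramping the fraction from 1% to 100% happens out of band):

```python
import random

def route(request, canary_fraction, old_model, new_model):
    # Send the request to the canary with probability `canary_fraction`;
    # everything else continues to hit the current production model.
    model = new_model if random.random() < canary_fraction else old_model
    return model(request)

old = lambda r: ("v1", r)   # current production model (stand-in)
new = lambda r: ("v2", r)   # canary candidate (stand-in)

random.seed(0)
versions = [route(i, 0.05, old, new)[0] for i in range(1000)]
canary_share = versions.count("v2") / len(versions)
assert 0.02 < canary_share < 0.10  # roughly 5% of traffic hits the canary
```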
Q6: How do you handle model versioning in production?
Model versioning requires tracking three things: the model artifact, the code that trained it, and the data it was trained on. All three must be versioned together because changing any one can change model behavior.
- Model artifacts: Store in a model registry (MLflow, Weights & Biases, Vertex AI Model Registry) with semantic versioning. Each version records: training metrics, evaluation results, training data hash, code commit SHA, and hyperparameters.
- Code versioning: Tag the Git commit that produced each model version. The training script, preprocessing code, and feature engineering logic must all be traceable to a specific commit.
- Data versioning: Use DVC (Data Version Control) or Delta Lake versioning. Store a hash of the training dataset alongside the model. This enables reproducing any model version exactly.
- Deployment versioning: Maintain a deployment manifest that maps each serving endpoint to a specific model version. Use GitOps (ArgoCD, Flux) to manage these manifests so all deployment changes are tracked in Git.
Anti-pattern to avoid: Naming models by date ("model_20260321") or incrementing integers ("model_v47"). These do not convey semantic meaning about what changed. Use semantic versioning: major (breaking changes to API/features), minor (retraining with new data), patch (bug fixes or minor tuning).
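The "version all three together" rule can be made concrete with a small manifest sketch: hash the training data and record the code commit alongside the version (the field names and SHA are illustrative, not a specific registry's schema):

```python
import hashlib
import json

def make_manifest(version, code_sha, train_data, metrics, hyperparams):
    # Hash the training data so any model version can be reproduced exactly.
    data_hash = hashlib.sha256(
        json.dumps(train_data, sort_keys=True).encode()
    ).hexdigest()
    return {
        "model_version": version,          # semantic version: major.minor.patch
        "code_commit": code_sha,           # Git SHA of the training code
        "training_data_sha256": data_hash, # ties the model to its exact data
        "metrics": metrics,
        "hyperparameters": hyperparams,
    }

manifest = make_manifest(
    "2.3.1",
    "a1b2c3d",                             # hypothetical commit SHA
    [{"amount": 10.0, "label": 0}],
    {"auc": 0.94},
    {"max_depth": 6},
)
assert len(manifest["training_data_sha256"]) == 64  # SHA-256 hex digest
```

A registry like MLflow stores equivalent metadata for you; the point is that given any deployed version, all three inputs are recoverable.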
Q7: What is model serialization? Compare pickle, ONNX, TorchScript, and SavedModel formats.
Model serialization converts a trained model from its in-memory representation to a format that can be stored, transferred, and loaded for inference.
| Format | Framework | Pros | Cons |
|---|---|---|---|
| Pickle (.pkl) | Any Python | Simple, preserves entire object | Security risk (arbitrary code execution), Python-version dependent, not optimized for inference |
| ONNX (.onnx) | Cross-framework | Framework-agnostic, hardware optimization, wide ecosystem | Not all operations supported, conversion can fail for custom layers |
| TorchScript (.pt) | PyTorch | No Python dependency at runtime, JIT compilation, mobile support | PyTorch-only, tracing can miss dynamic control flow |
| SavedModel | TensorFlow | Full graph + weights, TF Serving compatible, TF Lite conversion | TensorFlow-only, large file size |
Production recommendation: Use ONNX when you need framework flexibility or plan to run on diverse hardware (CPU, GPU, edge). Use TorchScript when staying in PyTorch ecosystem and need C++ inference. Never use pickle in production — it is a security vulnerability and breaks across Python versions.
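The pickle security risk is worth being able to demonstrate: unpickling calls whatever `__reduce__` specifies, so loading an untrusted `.pkl` file can execute arbitrary code. A minimal illustration (using a harmless `eval` where a real attack would use `os.system` or similar):

```python
import pickle

class LooksLikeAModel:
    # pickle records __reduce__'s (callable, args) and calls it on load;
    # nothing restricts what that callable is.
    def __reduce__(self):
        return (eval, ("6 * 7",))  # stand-in for os.system("..."), etc.

payload = pickle.dumps(LooksLikeAModel())
result = pickle.loads(payload)  # attacker-controlled code runs right here
assert result == 42
```

This is why model registries and serving frameworks increasingly default to safer formats (ONNX, safetensors) for artifacts that cross trust boundaries.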
Q8: How does dynamic batching work and why is it critical for GPU inference?
Dynamic batching collects individual inference requests over a short time window (e.g., 5–50 ms) and combines them into a single batch before sending to the GPU. This is critical because:
- GPU parallelism: GPUs are designed for parallel computation. Processing batch_size=1 uses maybe 5% of the GPU's compute capacity. Processing batch_size=32 might use 80%. Without batching, you pay for a $10K GPU but use 5% of it.
- Throughput improvement: A model that takes 10 ms for a single request might take 15 ms for a batch of 32. That is 32 predictions in 15 ms vs. 32 predictions in 320 ms — a 21x throughput improvement.
- Cost efficiency: Higher throughput means fewer GPU instances needed. If you can serve 10x more requests per GPU, you need 10x fewer GPUs, saving thousands of dollars per month.
Configuration trade-offs: A larger batch size increases throughput but also increases latency (requests wait longer to fill the batch). Set a maximum wait time (e.g., 10 ms) and a maximum batch size (e.g., 64). Whichever limit is hit first triggers the batch. Triton Inference Server has excellent built-in dynamic batching with configurable preferred batch sizes and max queue delay.
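The two-limit flush policy (max batch size OR max wait, whichever hits first) can be sketched with simulated arrival timestamps. Real servers flush time-triggered batches from a timer thread; this event-driven simulation checks the wait budget only when a new request arrives:

```python
from collections import deque

def dynamic_batcher(requests, max_batch_size=4, max_wait_ms=10):
    """Group (arrival_ms, request) pairs into batches: flush when the batch
    is full OR the oldest queued request has waited max_wait_ms."""
    batches, queue = [], deque()
    for arrival_ms, req in requests:
        # Time limit: flush if the oldest queued request has waited too long.
        if queue and arrival_ms - queue[0][0] >= max_wait_ms:
            batches.append([r for _, r in queue])
            queue.clear()
        queue.append((arrival_ms, req))
        # Size limit: flush as soon as the batch is full.
        if len(queue) >= max_batch_size:
            batches.append([r for _, r in queue])
            queue.clear()
    if queue:  # flush whatever is left at the end
        batches.append([r for _, r in queue])
    return batches

# Four requests arrive close together (size-triggered batch), then one
# straggler that gets flushed by the time limit when "f" arrives.
reqs = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e"), (30, "f")]
assert dynamic_batcher(reqs) == [["a", "b", "c", "d"], ["e"], ["f"]]
```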
Q9: What is shadow deployment (shadow mode) and when should you use it?
Shadow deployment runs a new model alongside the production model, sending it the same live traffic, but the shadow model's predictions are only logged — never returned to users. The production model continues serving all actual responses.
Use cases:
- Validating a new model on production traffic without any risk to users. You can compare shadow vs. production predictions offline.
- Measuring latency under real load before promoting to production. Synthetic load tests often miss real-world traffic patterns.
- Collecting ground truth labels for the new model's predictions. In systems where you get delayed feedback (e.g., fraud detected days later), shadow mode lets you accumulate labeled data.
- Regulatory compliance: In healthcare or finance, you may need to demonstrate that a new model performs at least as well as the current model on real data before switching.
Implementation: Use an Envoy or Istio service mesh to mirror traffic. The shadow receives a copy of every request but its response is discarded. Log shadow predictions to a data warehouse for offline comparison. Watch for: doubled infrastructure cost and doubled downstream calls (if the model calls other services).
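At the application level, the shadow pattern reduces to "always return production, log the shadow, and never let the shadow break the request path." A minimal sketch (the lambdas stand in for real models; a real system would mirror traffic in the mesh and log asynchronously):

```python
shadow_log = []

def serve_with_shadow(request, production_model, shadow_model):
    # The user always receives the production model's answer.
    response = production_model(request)
    try:
        # The shadow sees identical input; its output is logged, never returned.
        shadow_log.append({
            "request": request,
            "production": response,
            "shadow": shadow_model(request),
        })
    except Exception:
        pass  # shadow failures must never affect the user-facing path
    return response

prod = lambda x: x["amount"] > 500     # current production model (stand-in)
shadow = lambda x: x["amount"] > 400   # candidate under evaluation (stand-in)

result = serve_with_shadow({"amount": 450}, prod, shadow)
assert result is False                 # user got the production prediction
assert shadow_log[0]["shadow"] is True # disagreement captured for analysis
```

The logged production/shadow pairs feed the offline comparison mentioned above.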
Q10: How do you reduce model inference latency in production?
Latency optimization is a multi-layer problem. Start with the highest-impact, lowest-effort changes:
- Model optimization (highest impact): Quantize from FP32 to INT8 (2–4x speedup with 1–2% accuracy loss). Use ONNX Runtime or TensorRT for graph optimization. Distill a large model into a smaller one. Prune unnecessary weights.
- Serving optimization: Enable dynamic batching. Use gRPC instead of REST (2–3x faster for small payloads). Implement request/response compression. Pre-load models at container startup, not on first request.
- Infrastructure optimization: Use GPU instances matched to your model size. Place inference servers in the same region/zone as callers. Use connection pooling. Consider edge deployment for latency-critical applications.
- Architecture optimization: Cache frequent predictions (LRU cache for repeated inputs). Pre-compute embeddings offline. Use a smaller model for easy cases and route only hard cases to the large model (cascade architecture).
- Feature computation: Pre-compute features in a feature store instead of computing at inference time. This is often the largest source of latency — feature lookup from a cache takes 1–5 ms vs. 50–200 ms to compute on the fly.
Always measure before optimizing. Profile your inference pipeline end-to-end: network latency, feature computation, model inference, post-processing. Optimize the bottleneck, not the part you find most interesting.
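The prediction-caching idea from the architecture bullet is often a one-decorator change, provided inputs are hashable (a tuple of features rather than a dict). A sketch with a call counter standing in for model inference cost:

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=10_000)
def predict(features):
    # `features` must be hashable (e.g. a tuple); repeated inputs are
    # answered from the cache and skip model inference entirely.
    calls["count"] += 1
    return sum(features) > 1.0  # stand-in for real model inference

predict((0.2, 0.9))
predict((0.2, 0.9))   # cache hit: no second model call
predict((0.1, 0.1))
assert calls["count"] == 2
```

Caching only pays off when inputs repeat (popular search queries, hot items); for high-cardinality inputs like raw transactions the hit rate is near zero and the cache is wasted memory.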
Q11: What is A/B testing for ML models and how does it differ from canary deployment?
A/B testing splits traffic between two or more model variants to measure which performs better on a business metric. Users are randomly assigned to a group and stay in that group for the duration of the experiment. Statistical significance determines the winner.
Canary deployment gradually rolls out a new model version to catch bugs and performance regressions. The goal is risk mitigation, not experimentation. Traffic split increases over time until the new version handles 100%.
| Aspect | A/B Testing | Canary Deployment |
|---|---|---|
| Goal | Measure which model is better | Safely roll out a known-good model |
| Traffic Split | Fixed (50/50 or custom) | Gradual (1% → 100%) |
| Duration | Days to weeks (for significance) | Hours to days |
| User Assignment | Sticky (same user, same model) | Random per request |
| Success Metric | Business KPI (CTR, revenue) | System health (latency, errors) |
Best practice: Use canary first to validate the new model is healthy, then run an A/B test to measure its business impact. They are complementary, not alternatives.
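Sticky assignment is usually implemented by hashing the user ID with the experiment name, so the same user always lands in the same group and assignments are independent across experiments. A sketch (experiment name and split are illustrative):

```python
import hashlib

def assign_variant(user_id, experiment="fraud-model-ab", treatment_pct=50):
    # Hash user_id + experiment name: deterministic (sticky) per user,
    # and uncorrelated with buckets from other experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Stickiness: the same user gets the same variant on every request.
assert assign_variant("user-42") == assign_variant("user-42")

# A 50/50 split lands roughly half of users in each group.
groups = [assign_variant(f"user-{i}") for i in range(2000)]
share = groups.count("treatment") / len(groups)
assert 0.4 < share < 0.6
```

This is also why the table lists canary assignment as "random per request": a canary cares about aggregate system health, while an A/B test needs per-user consistency for valid measurement.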
Q12: How do you handle model rollback in production when something goes wrong?
Model rollback must be fast, automated, and well-rehearsed. Here is a production-grade rollback strategy:
- Automated rollback triggers: Monitor key metrics (latency p99, error rate, prediction distribution shift). If any metric breaches a threshold for more than N minutes, automatically revert to the previous model version. Define these thresholds before deployment, not during an incident.
- Model registry as source of truth: Every deployed model version is stored in the registry (MLflow, Vertex AI). Rollback means updating the serving endpoint to point to the previous version. The previous model artifact is always available — never delete it after a new deployment.
- Blue-green for instant rollback: Keep the previous model version running on standby infrastructure. Rollback is a load balancer config change that takes seconds. More expensive but essential for critical systems (payment fraud, healthcare).
- GitOps rollback: If deployments are managed via GitOps (ArgoCD), rollback means reverting the Git commit that changed the model version. This provides an audit trail and can be done in one command: `git revert <commit>`.
- Runbook and drills: Document the rollback procedure step by step. Practice it quarterly. During an incident is not the time to figure out how rollback works.
Anti-pattern: Rolling forward under pressure. If the new model is causing problems, do not try to fix it live. Roll back first, restore service, then debug the issue offline. Every minute of degraded service costs user trust.
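The automated-trigger bullet can be sketched as a consecutive-breach check: fire the rollback only after the metric exceeds its pre-defined threshold for N monitoring windows in a row, so a single noisy spike does not revert a healthy deployment (window length and thresholds are illustrative):

```python
def should_roll_back(metric_windows, threshold, breach_windows=3):
    """Return True when the metric breaches its threshold for
    `breach_windows` consecutive monitoring windows."""
    consecutive = 0
    for value in metric_windows:
        consecutive = consecutive + 1 if value > threshold else 0
        if consecutive >= breach_windows:
            return True
    return False

# p99 latency (ms) per 1-minute window; 200 ms threshold set before deploy.
assert should_roll_back([150, 250, 180, 250, 240], threshold=200) is False
assert should_roll_back([150, 250, 260, 240, 190], threshold=200) is True
```

In practice this logic lives in the monitoring stack (e.g. an alerting rule that calls the deployment API), not in the serving code itself.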
Lilly Tech Systems