Best Practices & Checklist
Everything you need to review before, during, and after launching an ML inference system in production. This lesson consolidates the course into actionable checklists, cost calculations, SLA templates, and answers to the most common questions engineers have about production ML serving.
Production Inference Checklist
Use this checklist before every model deployment. Each item maps back to a lesson in this course.
Before Deployment
- Model optimization: Have you applied FP16/BF16 at minimum? Consider INT8 or INT4 quantization for production workloads. (Lesson 3)
- Benchmark on target hardware: Measure latency at P50, P95, P99 with expected batch sizes on the actual GPU you will deploy to. Lab numbers are meaningless.
- Model warm-up configured: Use Triton `model_warmup` or equivalent to avoid cold-start latency on the first request. (Lesson 2)
- Dynamic batching tuned: Set `max_batch_size` and `max_queue_delay` based on your latency SLA. (Lesson 4)
- Health checks configured: Liveness, readiness, and startup probes with appropriate timeouts for model loading. (Lesson 2)
- Resource requests/limits set: GPU count, memory, CPU in Kubernetes manifests. Never run without resource limits.
- Rollback plan documented: Know exactly how to revert to the previous model version in under 60 seconds. (Lesson 6)
During Deployment
- Shadow deploy first: Run the new model alongside the old one for 24-72 hours. Compare predictions. (Lesson 6)
- Canary with monitoring: Start at 1-5% traffic. Define automatic rollback triggers for error rate, latency, and model quality. (Lesson 6)
- Watch P99, not P50: Average latency hides tail latency problems. Your SLA should be defined at P99 or P99.9.
- Monitor GPU memory: Check for memory leaks. GPU OOM kills are silent and catastrophic.
- Monitor model quality: Track online metrics (CTR, accuracy, error rate) alongside system metrics.
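The "watch P99, not P50" advice is easy to demonstrate on synthetic data. A minimal sketch (plain Python, no dependencies; the latency numbers are invented) showing how the mean and median can look healthy while the tail breaches a 100ms SLA:

```python
# Sketch: why average latency hides tail problems.
# 980 fast requests plus 20 slow ones keep the mean and P50 low
# while P99 blows past a 100 ms SLA.

def percentile(samples, q):
    """Nearest-rank percentile (q in [0, 100]) of a list of numbers."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(q / 100 * len(ordered)) - 1))
    return ordered[rank]

latencies_ms = [20.0] * 980 + [800.0] * 20  # 2% of requests are pathological

mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean:.1f}ms  p50={p50:.1f}ms  p99={p99:.1f}ms")
# mean (35.6 ms) and P50 (20 ms) look fine; P99 (800 ms) breaches the SLA
```

A dashboard alerting on the mean would never fire here, which is exactly why the SLA targets below are defined at P99.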
After Deployment
- Cost per request tracking: Calculate and log the actual cost per inference request (see calculator below).
- Autoscaling validated: Confirm that scale-up triggers fire within your acceptable cold-start window. Validate scale-down does not remove pods prematurely.
- Alert thresholds set: Configure alerts for latency SLA breaches, GPU utilization spikes, queue depth, error rate, and model staleness.
- Runbook written: Document troubleshooting steps for common failure modes: GPU OOM, model loading failures, latency spikes, traffic overload.
Cost Per Request Calculator
Understanding your cost per inference request is critical for capacity planning and pricing decisions.
```python
# Cost per request calculator
def calculate_cost_per_request(
    gpu_hourly_cost: float,       # $/hour for GPU instance
    num_gpus: int,                # Number of GPUs in cluster
    requests_per_second: float,   # Average QPS
    utilization: float = 0.7,     # Average GPU utilization
) -> dict:
    """Calculate the cost per inference request."""
    # Total cost per second
    cost_per_second = (gpu_hourly_cost * num_gpus) / 3600

    # Cost per request
    cost_per_request = cost_per_second / requests_per_second

    # Cost per 1M requests
    cost_per_million = cost_per_request * 1_000_000

    # Monthly cost
    monthly_hours = 730  # avg hours per month
    monthly_cost = gpu_hourly_cost * num_gpus * monthly_hours

    # Cost efficiency (requests per dollar)
    requests_per_dollar = 1 / cost_per_request

    return {
        "cost_per_request": f"${cost_per_request:.6f}",
        "cost_per_1M_requests": f"${cost_per_million:.2f}",
        "monthly_gpu_cost": f"${monthly_cost:,.0f}",
        "requests_per_dollar": f"{requests_per_dollar:,.0f}",
        "effective_cost_at_utilization": f"${cost_per_request / utilization:.6f}",
    }

# Example scenarios:
print("=== Small model on A10G ===")
print(calculate_cost_per_request(
    gpu_hourly_cost=1.01,    # g5.xlarge on-demand
    num_gpus=2,
    requests_per_second=500,
))
# cost_per_request: $0.000001
# cost_per_1M_requests: $1.12
# monthly_gpu_cost: $1,475

print("\n=== LLM on A100 (70B model) ===")
print(calculate_cost_per_request(
    gpu_hourly_cost=3.67,    # a2-highgpu-1g
    num_gpus=4,
    requests_per_second=10,  # 10 QPS for 70B model
))
# cost_per_request: $0.000408
# cost_per_1M_requests: $407.78
# monthly_gpu_cost: $10,716

print("\n=== LLM on A100 with spot instances (70% savings) ===")
print(calculate_cost_per_request(
    gpu_hourly_cost=1.10,    # spot pricing
    num_gpus=4,
    requests_per_second=10,
))
# cost_per_request: $0.000122
# cost_per_1M_requests: $122.22
# monthly_gpu_cost: $3,212
```
SLA Design Template
Every production inference system needs a clearly defined SLA. Use this template:
| SLA Dimension | Target | Measurement | Consequence of Breach |
|---|---|---|---|
| Availability | 99.9% (8.7 hrs downtime/year) | Successful responses / Total requests, rolling 30 days | Page on-call engineer |
| Latency (P50) | < 30ms | Histogram of response times, measured at client | Investigate, optimize batch config |
| Latency (P99) | < 100ms | 99th percentile response time, rolling 5 min window | Auto-scale up, alert on-call |
| Throughput | > 1000 QPS sustained | Requests per second, measured at load balancer | Scale GPU cluster |
| Error Rate | < 0.1% | 5xx responses / Total responses, rolling 15 min | Rollback if new model, page if infrastructure |
| Model Freshness | < 24 hours | Time since last model update deployed | Alert ML team, check training pipeline |
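The availability row translates directly into an error budget. A small sketch of that arithmetic (plain Python), which is where the "8.7 hrs downtime/year" figure in the table comes from:

```python
# Sketch: translate an availability target into an error budget.
# 99.9% over a year allows roughly 8.8 hours of total downtime,
# matching the ~8.7 hrs/year cited in the SLA table above.

def error_budget_hours(availability: float, days: float = 365.25) -> float:
    """Allowed downtime (hours) for a given availability target."""
    return (1.0 - availability) * days * 24

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {error_budget_hours(target):.2f} h/year")
# 99.00% -> 87.66 h/year
# 99.90% -> 8.77 h/year
# 99.99% -> 0.88 h/year
```

Note how each extra nine cuts the budget tenfold: the gap between 99.9% and 99.99% is the difference between a long maintenance window and under an hour per year.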
Essential Monitoring
These are the metrics you must have dashboards and alerts for in any production inference system:
```
# Prometheus metrics to expose from your inference service

# 1. Request metrics
inference_request_total{model, version, status}        # Counter
inference_request_duration_seconds{model, version, le} # Histogram (le = bucket upper bound)
inference_request_queue_depth{model}                   # Gauge

# 2. GPU metrics (from DCGM exporter)
DCGM_FI_DEV_GPU_UTIL{gpu, instance}  # GPU utilization %
DCGM_FI_DEV_FB_USED{gpu, instance}   # GPU memory used (MB)
DCGM_FI_DEV_FB_FREE{gpu, instance}   # GPU memory free (MB)
DCGM_FI_DEV_GPU_TEMP{gpu, instance}  # GPU temperature

# 3. Model metrics
model_prediction_value{model, version}      # Histogram of outputs
model_prediction_confidence{model, version} # Avg confidence score
model_input_token_count{model}              # For LLMs

# 4. Business metrics
model_click_through_rate{model, version}   # CTR for ranking models
model_conversion_rate{model, version}      # For recommendation models
model_false_positive_rate{model, version}  # For classification models
```
```yaml
# Alert rules (Prometheus alerting rules)
groups:
  - name: inference-alerts
    rules:
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(inference_request_duration_seconds_bucket[5m])
          ) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency > 100ms for {{ $labels.model }}"

      - alert: HighErrorRate
        expr: |
          rate(inference_request_total{status="error"}[5m])
            / rate(inference_request_total[5m]) > 0.001
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 0.1% for {{ $labels.model }}"

      - alert: GPUMemoryNearFull
        expr: |
          DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
            > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory > 95% on {{ $labels.instance }}"

      - alert: QueueDepthHigh
        expr: inference_request_queue_depth > 50
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Inference queue depth > 50 for {{ $labels.model }}"
```
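The HighLatency rule above relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counters by interpolating linearly inside the bucket that contains the target rank. A rough pure-Python sketch of that estimation (the bucket bounds and counts below are invented for illustration):

```python
# Sketch of how Prometheus-style histogram_quantile estimates a
# percentile from cumulative bucket counts: find the bucket containing
# the target rank, then interpolate linearly within it.

def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound_seconds, cumulative_count), sorted
    by bound; the last bound should be float('inf'), like the +Inf bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into +Inf
            # linear interpolation within this bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# 1000 requests: cumulative counts per latency bucket (seconds)
buckets = [(0.025, 500), (0.05, 800), (0.1, 950), (0.25, 995), (float("inf"), 1000)]
print(f"p99 ~ {histogram_quantile(0.99, buckets) * 1000:.0f} ms")
# p99 ~ 233 ms
```

One practical consequence: the estimate is only as good as your bucket layout. If your SLA is 100ms, make sure a bucket boundary sits at or near 0.1s, otherwise the interpolated P99 can be badly off.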
Frequently Asked Questions
How do I choose between Triton, vLLM, and TorchServe?
Use vLLM if you are serving LLMs (text generation) — it has the best throughput due to PagedAttention and continuous batching. Use Triton if you serve multiple models, need multi-framework support, or serve non-LLM models (vision, embeddings, ranking). Use TorchServe if you are a PyTorch-only shop and want the simplest setup. For most teams in 2025, the answer is vLLM for LLMs and Triton for everything else.
How much does quantization actually hurt model quality?
FP16 has near-zero quality loss and should always be your baseline. INT8 post-training quantization typically costs 0.2-1% accuracy on classification tasks. 4-bit quantization (AWQ, GPTQ) for LLMs typically shows 1-3% degradation on benchmarks like MMLU, but is often imperceptible in production use. Always benchmark on your specific task — some models are more sensitive than others. If quality is critical, use INT8 with quantization-aware training for <0.3% loss.
What is the minimum GPU I need for serving a 70B parameter LLM?
In FP16, a 70B model needs ~140GB GPU memory, requiring 2x A100 80GB. With AWQ 4-bit quantization, it fits on 1x A100 80GB (~35GB for weights + KV cache). For highest throughput, use 2x A100 with tensor parallelism even with quantization. On a budget, an A6000 48GB can run a 70B model in 4-bit with limited concurrency. For development/testing, use 2x RTX 4090 24GB with GPTQ 4-bit.
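The memory arithmetic behind these sizing rules can be sketched. The architecture numbers below assume a Llama-2-70B-like model (80 layers, grouped-query attention with 8 KV heads of dimension 128) and are illustrative; substitute your actual model's shape:

```python
# Sketch: back-of-envelope GPU memory for serving a 70B LLM.
# Assumes a Llama-2-70B-like shape (80 layers, GQA with 8 KV heads
# of dim 128). GB here means 10^9 bytes, matching the figures above.

GB = 1e9

def estimate_serving_memory_gb(
    params_b: float,          # parameters in billions
    bytes_per_param: float,   # 2.0 for FP16, ~0.5 for 4-bit
    n_layers: int = 80,
    n_kv_heads: int = 8,      # GQA: far fewer KV heads than query heads
    head_dim: int = 128,
    kv_bytes: int = 2,        # FP16 KV cache
    seq_len: int = 4096,
    batch_size: int = 8,
):
    weights = params_b * 1e9 * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, per token, per sequence
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    kv_cache = kv_per_token * seq_len * batch_size
    return weights / GB, kv_cache / GB

for label, bpp in (("FP16", 2.0), ("4-bit", 0.5)):
    w, kv = estimate_serving_memory_gb(70, bpp)
    print(f"{label}: weights ~{w:.0f} GB + KV cache ~{kv:.1f} GB")
# FP16: weights ~140 GB + KV cache ~10.7 GB
# 4-bit: weights ~35 GB + KV cache ~10.7 GB
```

This reproduces the ~140GB FP16 and ~35GB 4-bit figures above, and shows why the KV cache, not just the weights, determines how much concurrency fits on a single 80GB card.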
How do I handle cold starts with GPU autoscaling?
The three proven approaches: (1) Warm pool — keep pre-loaded standby pods that can be promoted instantly by flipping a label. (2) Local model cache — use a DaemonSet to pre-cache model weights on node-local SSDs, cutting startup from 3-5 minutes to 15-45 seconds. (3) Predictive scaling — use historical traffic patterns to scale up 15 minutes before predicted traffic spikes. Most teams use a combination of all three.
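The predictive-scaling idea reduces to simple arithmetic: scale for the traffic you forecast 15 minutes out, not the traffic you see now, so new pods finish loading before the spike. A minimal sketch (the forecast and per-pod capacity numbers are invented):

```python
# Sketch of predictive scaling: size the deployment for forecast
# traffic 15 minutes ahead so pods are warm before the spike arrives.
import math

def desired_replicas(forecast_qps: float, qps_per_replica: float,
                     headroom: float = 0.2, min_replicas: int = 2) -> int:
    """Replicas needed for forecast traffic plus a safety headroom."""
    needed = forecast_qps * (1 + headroom) / qps_per_replica
    return max(min_replicas, math.ceil(needed))

# e.g. a lookup of historical QPS by 15-minute bucket
forecast = {"08:45": 120.0, "09:00": 480.0}  # spike expected at 09:00

# At 08:45, scale for the 09:00 forecast, not the current 120 QPS:
print(desired_replicas(forecast["09:00"], qps_per_replica=60.0))
# 10
```

In practice the forecast comes from historical traffic (same 15-minute bucket on previous weekdays), and the result feeds a scheduled scaler or an HPA minimum rather than replacing reactive autoscaling.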
When should I use batch inference vs real-time inference?
Use batch inference when (a) the user is not waiting for the result, (b) you can tolerate predictions that are hours old, and (c) you want to minimize cost. Use real-time inference when the prediction is on the user-facing critical path. Many systems use both: batch-precompute recommendations for all users nightly, but re-rank in real-time when a user opens the app. Start with batch (simpler, cheaper) and only add real-time when product requirements demand it.
How do I know if my A/B test has enough data?
Use a sample size calculator before starting. For a two-sided test with 80% power and a 5% significance level, detecting a 1% relative lift in CTR (10.0% → 10.1%) requires roughly 1.4 million observations per variant; a 10% relative lift (10% → 11%) needs only about 15,000. The smaller the effect, the more data you need, and the growth is quadratic. As a rule of thumb: run your A/B test for at least one full business cycle (7 days) even if you hit statistical significance earlier, to account for day-of-week effects.
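The sample-size arithmetic can be sketched with the standard two-proportion power formula (normal approximation; the z-values are hardcoded for a two-sided α = 0.05 and 80% power):

```python
# Sketch: two-proportion sample size per variant (normal approximation).
import math

def sample_size_per_variant(p1: float, p2: float,
                            z_alpha: float = 1.96,    # two-sided alpha = 0.05
                            z_beta: float = 0.8416):  # power = 0.80
    """Observations per variant to detect p1 -> p2."""
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil(num ** 2 / (p2 - p1) ** 2)

# 1% relative lift on a 10% CTR baseline: 10.0% -> 10.1%
print(f"{sample_size_per_variant(0.100, 0.101):,} per variant")  # roughly 1.4 million

# 10% relative lift: 10% -> 11%
print(f"{sample_size_per_variant(0.10, 0.11):,} per variant")    # roughly 15 thousand
```

The denominator is the squared absolute effect, which is why halving the detectable lift quadruples the required sample.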
What is the best way to reduce inference costs by 50% or more?
In order of impact: (1) Model routing — send 70% of requests to a small model, 30% to the large model (60-80% savings). (2) Spot instances — use spot for burst capacity (60-70% per-instance savings). (3) Quantization — 4-bit quantization halves your GPU count. (4) Batching optimization — better batching doubles throughput per GPU. (5) Scale-to-zero for low-traffic models. Combining all five can reduce costs by 80-90%.
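The routing math in item (1) is worth making concrete. A minimal sketch of the blended cost per request under a 70/30 split (the per-request costs are illustrative, in the spirit of the calculator earlier in this lesson):

```python
# Sketch: blended cost per request under model routing
# (70% of traffic to a cheap small model, 30% to the large one).

def blended_cost(small_cost: float, large_cost: float,
                 small_share: float = 0.7) -> float:
    """Weighted average cost per request across the two models."""
    return small_share * small_cost + (1 - small_share) * large_cost

large_only = 0.000408  # large model handles all traffic
routed = blended_cost(0.000010, 0.000408)
print(f"routed=${routed:.6f}  savings={1 - routed / large_only:.0%}")
# roughly 68% savings, consistent with the 60-80% range above
```

The savings depend almost entirely on what fraction of traffic the small model can absorb without hurting quality, which is why routing policy (and its evaluation) matters more than the exact price ratio.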
Should I use gRPC or HTTP for inference requests?
Use gRPC for service-to-service communication where you control both client and server. gRPC is 2-5x faster than HTTP/JSON due to binary serialization (protobuf), HTTP/2 multiplexing, and streaming support. Use HTTP/REST for public-facing APIs, browser clients, or when you need maximum compatibility. Triton and most model servers support both. For LLM streaming, use gRPC or Server-Sent Events (SSE) over HTTP.
Course Summary
You have completed the Designing Real-Time ML Inference course. Here is a recap of the key concepts from each lesson:
| Lesson | Key Takeaway |
|---|---|
| 1. Architecture Overview | Choose batch, near-real-time, or real-time based on whether the user is waiting. Design for P99 latency, not P50. |
| 2. Model Server Design | Use Triton for multi-model, vLLM for LLMs. Always configure warm-up and health checks. Estimate GPU memory before deployment. |
| 3. Optimization | Start with FP16, add torch.compile for free speedup, then TensorRT for maximum performance. AWQ 4-bit is the standard for LLM deployment. |
| 4. Batching & Routing | Dynamic batching gives 5-10x throughput. Continuous batching is essential for LLMs. Model routing cuts costs by 60-80%. |
| 5. Autoscaling | Scale on queue depth and GPU utilization, not CPU. Mitigate cold starts with warm pools and model caching. Use spot for burst capacity. |
| 6. A/B Testing | Shadow deploy first, then canary at 1-5%, then A/B test for statistical validation. Always keep the old model loaded for instant rollback. |
| 7. Best Practices | Track cost per request, define SLAs at P99, monitor GPU memory and model quality, and use the deployment checklist every time. |
Lilly Tech Systems