Best Practices & Checklist
Everything you need to review before, during, and after launching an ML inference system in production. This lesson consolidates the course into actionable checklists, cost calculations, SLA templates, and answers to the most common questions engineers have about production ML serving.
Production Inference Checklist
Use this checklist before every model deployment. Each item maps back to a lesson in this course.
Before Deployment
- Model optimization: Have you applied FP16/BF16 at minimum? Consider INT8 or INT4 quantization for production workloads. (Lesson 3)
- Benchmark on target hardware: Measure latency at P50, P95, P99 with expected batch sizes on the actual GPU you will deploy to. Lab numbers are meaningless.
- Model warm-up configured: Use Triton `model_warmup` or equivalent to avoid cold-start latency on the first request. (Lesson 2)
- Dynamic batching tuned: Set `max_batch_size` and `max_queue_delay` based on your latency SLA. (Lesson 4)
- Health checks configured: Liveness, readiness, and startup probes with appropriate timeouts for model loading. (Lesson 2)
- Resource requests/limits set: GPU count, memory, CPU in Kubernetes manifests. Never run without resource limits.
- Rollback plan documented: Know exactly how to revert to the previous model version in under 60 seconds. (Lesson 6)
During Deployment
- Shadow deploy first: Run the new model alongside the old one for 24-72 hours. Compare predictions. (Lesson 6)
- Canary with monitoring: Start at 1-5% traffic. Define automatic rollback triggers for error rate, latency, and model quality. (Lesson 6)
- Watch P99, not P50: Average latency hides tail latency problems. Your SLA should be defined at P99 or P99.9.
- Monitor GPU memory: Check for memory leaks. GPU OOM kills are silent and catastrophic.
- Monitor model quality: Track online metrics (CTR, accuracy, error rate) alongside system metrics.
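The "watch P99, not P50" advice is easy to demonstrate on synthetic data. A minimal sketch (plain Python, no dependencies; the latency numbers are invented) showing how the mean and median can look healthy while the tail breaches a 100ms SLA:

```python
# Sketch: why average latency hides tail problems.
# 980 fast requests plus 20 slow ones keep the mean and P50 low
# while P99 blows past a 100 ms SLA.

def percentile(samples, q):
    """Nearest-rank percentile (q in [0, 100]) of a list of numbers."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(q / 100 * len(ordered)) - 1))
    return ordered[rank]

latencies_ms = [20.0] * 980 + [800.0] * 20  # 2% of requests are pathological

mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean:.1f}ms  p50={p50:.1f}ms  p99={p99:.1f}ms")
# mean (35.6 ms) and P50 (20 ms) look fine; P99 (800 ms) breaches the SLA
```

A dashboard alerting on the mean would never fire here, which is exactly why the SLA targets below are defined at P99.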
After Deployment
- Cost per request tracking: Calculate and log the actual cost per inference request (see calculator below).
- Autoscaling validated: Confirm that scale-up triggers fire within your acceptable cold-start window. Validate scale-down does not remove pods prematurely.
- Alert thresholds set: Configure alerts for latency SLA breaches, GPU utilization spikes, queue depth, error rate, and model staleness.
- Runbook written: Document troubleshooting steps for common failure modes: GPU OOM, model loading failures, latency spikes, traffic overload.
Cost Per Request Calculator
Understanding your cost per inference request is critical for capacity planning and pricing decisions.
```python
# Cost per request calculator
def calculate_cost_per_request(
    gpu_hourly_cost: float,       # $/hour for GPU instance
    num_gpus: int,                # Number of GPUs in cluster
    requests_per_second: float,   # Average QPS
    utilization: float = 0.7,     # Average GPU utilization
) -> dict:
    """Calculate the cost per inference request."""
    # Total cost per second
    cost_per_second = (gpu_hourly_cost * num_gpus) / 3600

    # Cost per request
    cost_per_request = cost_per_second / requests_per_second

    # Cost per 1M requests
    cost_per_million = cost_per_request * 1_000_000

    # Monthly cost
    monthly_hours = 730  # avg hours per month
    monthly_cost = gpu_hourly_cost * num_gpus * monthly_hours

    # Cost efficiency (requests per dollar)
    requests_per_dollar = 1 / cost_per_request

    return {
        "cost_per_request": f"${cost_per_request:.6f}",
        "cost_per_1M_requests": f"${cost_per_million:.2f}",
        "monthly_gpu_cost": f"${monthly_cost:,.0f}",
        "requests_per_dollar": f"{requests_per_dollar:,.0f}",
        "effective_cost_at_utilization": f"${cost_per_request / utilization:.6f}",
    }

# Example scenarios:
print("=== Small model on A10G ===")
print(calculate_cost_per_request(
    gpu_hourly_cost=1.01,    # g5.xlarge on-demand
    num_gpus=2,
    requests_per_second=500,
))
# cost_per_request: $0.000001
# cost_per_1M_requests: $1.12
# monthly_gpu_cost: $1,475

print("\n=== LLM on A100 (70B model) ===")
print(calculate_cost_per_request(
    gpu_hourly_cost=3.67,    # a2-highgpu-1g
    num_gpus=4,
    requests_per_second=10,  # 10 QPS for 70B model
))
# cost_per_request: $0.000408
# cost_per_1M_requests: $407.78
# monthly_gpu_cost: $10,716

print("\n=== LLM on A100 with spot instances (70% savings) ===")
print(calculate_cost_per_request(
    gpu_hourly_cost=1.10,    # spot pricing
    num_gpus=4,
    requests_per_second=10,
))
# cost_per_request: $0.000122
# cost_per_1M_requests: $122.22
# monthly_gpu_cost: $3,212
```
SLA Design Template
Every production inference system needs a clearly defined SLA. Use this template:
| SLA Dimension | Target | Measurement | Consequence of Breach |
|---|---|---|---|
| Availability | 99.9% (8.7 hrs downtime/year) | Successful responses / Total requests, rolling 30 days | Page on-call engineer |
| Latency (P50) | < 30ms | Histogram of response times, measured at client | Investigate, optimize batch config |
| Latency (P99) | < 100ms | 99th percentile response time, rolling 5 min window | Auto-scale up, alert on-call |
| Throughput | > 1000 QPS sustained | Requests per second, measured at load balancer | Scale GPU cluster |
| Error Rate | < 0.1% | 5xx responses / Total responses, rolling 15 min | Rollback if new model, page if infrastructure |
| Model Freshness | < 24 hours | Time since last model update deployed | Alert ML team, check training pipeline |
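The availability row translates directly into an error budget. A small sketch of that arithmetic (plain Python), which is where the "8.7 hrs downtime/year" figure in the table comes from:

```python
# Sketch: translate an availability target into an error budget.
# 99.9% over a year allows roughly 8.8 hours of total downtime,
# matching the ~8.7 hrs/year cited in the SLA table above.

def error_budget_hours(availability: float, days: float = 365.25) -> float:
    """Allowed downtime (hours) for a given availability target."""
    return (1.0 - availability) * days * 24

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {error_budget_hours(target):.2f} h/year")
# 99.00% -> 87.66 h/year
# 99.90% -> 8.77 h/year
# 99.99% -> 0.88 h/year
```

Note how each extra nine cuts the budget tenfold: the gap between 99.9% and 99.99% is the difference between a long maintenance window and under an hour per year.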
Essential Monitoring
These are the metrics you must have dashboards and alerts for in any production inference system:
```
# Prometheus metrics to expose from your inference service

# 1. Request metrics
inference_request_total{model, version, status}        # Counter
inference_request_duration_seconds{model, version, le} # Histogram (le = bucket upper bound)
inference_request_queue_depth{model}                   # Gauge

# 2. GPU metrics (from DCGM exporter)
DCGM_FI_DEV_GPU_UTIL{gpu, instance}  # GPU utilization %
DCGM_FI_DEV_FB_USED{gpu, instance}   # GPU memory used (MB)
DCGM_FI_DEV_FB_FREE{gpu, instance}   # GPU memory free (MB)
DCGM_FI_DEV_GPU_TEMP{gpu, instance}  # GPU temperature

# 3. Model metrics
model_prediction_value{model, version}      # Histogram of outputs
model_prediction_confidence{model, version} # Avg confidence score
model_input_token_count{model}              # For LLMs

# 4. Business metrics
model_click_through_rate{model, version}   # CTR for ranking models
model_conversion_rate{model, version}      # For recommendation models
model_false_positive_rate{model, version}  # For classification models
```
```yaml
# Alert rules (Prometheus alerting rules)
groups:
  - name: inference-alerts
    rules:
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(inference_request_duration_seconds_bucket[5m])
          ) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency > 100ms for {{ $labels.model }}"

      - alert: HighErrorRate
        expr: |
          rate(inference_request_total{status="error"}[5m])
            / rate(inference_request_total[5m]) > 0.001
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 0.1% for {{ $labels.model }}"

      - alert: GPUMemoryNearFull
        expr: |
          DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
            > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory > 95% on {{ $labels.instance }}"

      - alert: QueueDepthHigh
        expr: inference_request_queue_depth > 50
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Inference queue depth > 50 for {{ $labels.model }}"
```
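The HighLatency rule above relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counters by interpolating linearly inside the bucket that contains the target rank. A rough pure-Python sketch of that estimation (the bucket bounds and counts below are invented for illustration):

```python
# Sketch of how Prometheus-style histogram_quantile estimates a
# percentile from cumulative bucket counts: find the bucket containing
# the target rank, then interpolate linearly within it.

def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound_seconds, cumulative_count), sorted
    by bound; the last bound should be float('inf'), like the +Inf bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into +Inf
            # linear interpolation within this bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# 1000 requests: cumulative counts per latency bucket (seconds)
buckets = [(0.025, 500), (0.05, 800), (0.1, 950), (0.25, 995), (float("inf"), 1000)]
print(f"p99 ~ {histogram_quantile(0.99, buckets) * 1000:.0f} ms")
# p99 ~ 233 ms
```

One practical consequence: the estimate is only as good as your bucket layout. If your SLA is 100ms, make sure a bucket boundary sits at or near 0.1s, otherwise the interpolated P99 can be badly off.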
Frequently Asked Questions
How do I choose between Triton, vLLM, and TorchServe?
Use vLLM if you are serving LLMs (text generation) — it has the best throughput due to PagedAttention and continuous batching. Use Triton if you serve multiple models, need multi-framework support, or serve non-LLM models (vision, embeddings, ranking). Use TorchServe if you are a PyTorch-only shop and want the simplest setup. For most teams in 2025, the answer is vLLM for LLMs and Triton for everything else.
How much does quantization actually hurt model quality?
FP16 has near-zero quality loss and should always be your baseline. INT8 post-training quantization typically costs 0.2-1% accuracy on classification tasks. 4-bit quantization (AWQ, GPTQ) for LLMs typically shows 1-3% degradation on benchmarks like MMLU, but is often imperceptible in production use. Always benchmark on your specific task — some models are more sensitive than others. If quality is critical, use INT8 with quantization-aware training for <0.3% loss.
What is the minimum GPU I need for serving a 70B parameter LLM?
In FP16, a 70B model needs ~140GB GPU memory, requiring 2x A100 80GB. With AWQ 4-bit quantization, it fits on 1x A100 80GB (~35GB for weights + KV cache). For highest throughput, use 2x A100 with tensor parallelism even with quantization. On a budget, an A6000 48GB can run a 70B model in 4-bit with limited concurrency. For development/testing, use 2x RTX 4090 24GB with GPTQ 4-bit.
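The memory arithmetic behind these sizing rules can be sketched. The architecture numbers below assume a Llama-2-70B-like model (80 layers, grouped-query attention with 8 KV heads of dimension 128) and are illustrative; substitute your actual model's shape:

```python
# Sketch: back-of-envelope GPU memory for serving a 70B LLM.
# Assumes a Llama-2-70B-like shape (80 layers, GQA with 8 KV heads
# of dim 128). GB here means 10^9 bytes, matching the figures above.

GB = 1e9

def estimate_serving_memory_gb(
    params_b: float,          # parameters in billions
    bytes_per_param: float,   # 2.0 for FP16, ~0.5 for 4-bit
    n_layers: int = 80,
    n_kv_heads: int = 8,      # GQA: far fewer KV heads than query heads
    head_dim: int = 128,
    kv_bytes: int = 2,        # FP16 KV cache
    seq_len: int = 4096,
    batch_size: int = 8,
):
    weights = params_b * 1e9 * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, per token, per sequence
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    kv_cache = kv_per_token * seq_len * batch_size
    return weights / GB, kv_cache / GB

for label, bpp in (("FP16", 2.0), ("4-bit", 0.5)):
    w, kv = estimate_serving_memory_gb(70, bpp)
    print(f"{label}: weights ~{w:.0f} GB + KV cache ~{kv:.1f} GB")
# FP16: weights ~140 GB + KV cache ~10.7 GB
# 4-bit: weights ~35 GB + KV cache ~10.7 GB
```

This reproduces the ~140GB FP16 and ~35GB 4-bit figures above, and shows why the KV cache, not just the weights, determines how much concurrency fits on a single 80GB card.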
How do I handle cold starts with GPU autoscaling?
The three proven approaches: (1) Warm pool — keep pre-loaded standby pods that can be promoted instantly by flipping a label. (2) Local model cache — use a DaemonSet to pre-cache model weights on node-local SSDs, cutting startup from 3-5 minutes to 15-45 seconds. (3) Predictive scaling — use historical traffic patterns to scale up 15 minutes before predicted traffic spikes. Most teams use a combination of all three.
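The predictive-scaling idea reduces to simple arithmetic: scale for the traffic you forecast 15 minutes out, not the traffic you see now, so new pods finish loading before the spike. A minimal sketch (the forecast and per-pod capacity numbers are invented):

```python
# Sketch of predictive scaling: size the deployment for forecast
# traffic 15 minutes ahead so pods are warm before the spike arrives.
import math

def desired_replicas(forecast_qps: float, qps_per_replica: float,
                     headroom: float = 0.2, min_replicas: int = 2) -> int:
    """Replicas needed for forecast traffic plus a safety headroom."""
    needed = forecast_qps * (1 + headroom) / qps_per_replica
    return max(min_replicas, math.ceil(needed))

# e.g. a lookup of historical QPS by 15-minute bucket
forecast = {"08:45": 120.0, "09:00": 480.0}  # spike expected at 09:00

# At 08:45, scale for the 09:00 forecast, not the current 120 QPS:
print(desired_replicas(forecast["09:00"], qps_per_replica=60.0))
# 10
```

In practice the forecast comes from historical traffic (same 15-minute bucket on previous weekdays), and the result feeds a scheduled scaler or an HPA minimum rather than replacing reactive autoscaling.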
When should I use batch inference vs real-time inference?
Use batch inference when (a) the user is not waiting for the result, (b) you can tolerate predictions that are hours old, and (c) you want to minimize cost. Use real-time inference when the prediction is on the user-facing critical path. Many systems use both: batch-precompute recommendations for all users nightly, but re-rank in real-time when a user opens the app. Start with batch (simpler, cheaper) and only add real-time when product requirements demand it.
How do I know if my A/B test has enough data?
Use a sample size calculator before starting. For a two-sided test with 80% power and a 5% significance level, detecting a 1% relative lift in CTR (10.0% → 10.1%) requires roughly 1.4 million observations per variant; a 10% relative lift (10% → 11%) needs only about 15,000. The smaller the effect, the more data you need, and the growth is quadratic. As a rule of thumb: run your A/B test for at least one full business cycle (7 days) even if you hit statistical significance earlier, to account for day-of-week effects.
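The sample-size arithmetic can be sketched with the standard two-proportion power formula (normal approximation; the z-values are hardcoded for a two-sided α = 0.05 and 80% power):

```python
# Sketch: two-proportion sample size per variant (normal approximation).
import math

def sample_size_per_variant(p1: float, p2: float,
                            z_alpha: float = 1.96,    # two-sided alpha = 0.05
                            z_beta: float = 0.8416):  # power = 0.80
    """Observations per variant to detect p1 -> p2."""
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil(num ** 2 / (p2 - p1) ** 2)

# 1% relative lift on a 10% CTR baseline: 10.0% -> 10.1%
print(f"{sample_size_per_variant(0.100, 0.101):,} per variant")  # roughly 1.4 million

# 10% relative lift: 10% -> 11%
print(f"{sample_size_per_variant(0.10, 0.11):,} per variant")    # roughly 15 thousand
```

The denominator is the squared absolute effect, which is why halving the detectable lift quadruples the required sample.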
What is the best way to reduce inference costs by 50% or more?
In order of impact: (1) Model routing — send 70% of requests to a small model, 30% to the large model (60-80% savings). (2) Spot instances — use spot for burst capacity (60-70% per-instance savings). (3) Quantization — 4-bit quantization halves your GPU count. (4) Batching optimization — better batching doubles throughput per GPU. (5) Scale-to-zero for low-traffic models. Combining all five can reduce costs by 80-90%.
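The routing math in item (1) is worth making concrete. A minimal sketch of the blended cost per request under a 70/30 split (the per-request costs are illustrative, in the spirit of the calculator earlier in this lesson):

```python
# Sketch: blended cost per request under model routing
# (70% of traffic to a cheap small model, 30% to the large one).

def blended_cost(small_cost: float, large_cost: float,
                 small_share: float = 0.7) -> float:
    """Weighted average cost per request across the two models."""
    return small_share * small_cost + (1 - small_share) * large_cost

large_only = 0.000408  # large model handles all traffic
routed = blended_cost(0.000010, 0.000408)
print(f"routed=${routed:.6f}  savings={1 - routed / large_only:.0%}")
# roughly 68% savings, consistent with the 60-80% range above
```

The savings depend almost entirely on what fraction of traffic the small model can absorb without hurting quality, which is why routing policy (and its evaluation) matters more than the exact price ratio.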
Should I use gRPC or HTTP for inference requests?
Use gRPC for service-to-service communication where you control both client and server. gRPC is 2-5x faster than HTTP/JSON due to binary serialization (protobuf), HTTP/2 multiplexing, and streaming support. Use HTTP/REST for public-facing APIs, browser clients, or when you need maximum compatibility. Triton and most model servers support both. For LLM streaming, use gRPC or Server-Sent Events (SSE) over HTTP.
Course Summary
You have completed the Designing Real-Time ML Inference course. Here is a recap of the key concepts from each lesson:
| Lesson | Key Takeaway |
|---|---|
| 1. Architecture Overview | Choose batch, near-real-time, or real-time based on whether the user is waiting. Design for P99 latency, not P50. |
| 2. Model Server Design | Use Triton for multi-model, vLLM for LLMs. Always configure warm-up and health checks. Estimate GPU memory before deployment. |
| 3. Optimization | Start with FP16, add torch.compile for free speedup, then TensorRT for maximum performance. AWQ 4-bit is the standard for LLM deployment. |
| 4. Batching & Routing | Dynamic batching gives 5-10x throughput. Continuous batching is essential for LLMs. Model routing cuts costs by 60-80%. |
| 5. Autoscaling | Scale on queue depth and GPU utilization, not CPU. Mitigate cold starts with warm pools and model caching. Use spot for burst capacity. |
| 6. A/B Testing | Shadow deploy first, then canary at 1-5%, then A/B test for statistical validation. Always keep the old model loaded for instant rollback. |
| 7. Best Practices | Track cost per request, define SLAs at P99, monitor GPU memory and model quality, and use the deployment checklist every time. |
Lilly Tech Systems