
Minimizing Cold Start Latency

Understand what causes cold starts in serverless AI inference and apply proven strategies to reduce initialization time from seconds to milliseconds.

Anatomy of an AI Cold Start

Cold starts in ML serverless functions are typically 5-30x longer than standard cold starts because of model loading. Here is the breakdown:

  1. Container/Runtime Init (1-3s)

    The platform provisions a new instance, pulls the container image, and initializes the runtime. This is the standard cold start that affects all serverless functions.

  2. Dependency Loading (2-10s)

    ML libraries like PyTorch, TensorFlow, and their dependencies are imported. These libraries are large and have complex initialization routines.

  3. Model Loading (1-30s)

    Model weights are loaded from disk or network storage into memory. Large models can take tens of seconds to deserialize and load.

  4. Warm-up Inference (0.5-2s)

    First inference is often slower due to JIT compilation, graph optimization, and memory allocation. A warm-up run brings the model to peak performance.
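A quick way to see where your own cold start goes is to time each phase at startup. The sketch below is illustrative, not a real handler: the phase names mirror the breakdown above, and the lambdas stand in for importing ML libraries, deserializing weights, and running a warm-up inference.

```python
import time

def timed_phase(timings, name, fn):
    """Run one initialization phase and record its wall-clock duration."""
    start = time.perf_counter()
    result = fn()
    timings[name] = time.perf_counter() - start
    return result

def cold_start(timings):
    # In a real function these would be: importing torch/tensorflow,
    # deserializing model weights, and a warm-up inference.
    timed_phase(timings, "dependency_import", lambda: __import__("json"))
    timed_phase(timings, "model_load", lambda: list(range(100_000)))
    timed_phase(timings, "warmup_inference", lambda: sum(range(1_000)))

timings = {}
cold_start(timings)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

Logging these numbers once per instance tells you which of the four phases dominates, and therefore which of the strategies below will pay off most.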

Reduction Strategies


Smaller Container Images

Use slim base images, multi-stage builds, and CPU-only PyTorch. A 2 GB image loads 5x faster than a 10 GB image.
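One way to get there is a multi-stage build that installs CPU-only PyTorch wheels into a slim base and copies nothing else into the final image. A sketch; the file names (`model.onnx`, `app.py`) are placeholders, and the extra index URL is the standard PyTorch CPU wheel index:

```dockerfile
# Build stage: install CPU-only wheels into an isolated prefix
FROM python:3.11-slim AS build
RUN pip install --no-cache-dir --prefix=/install \
    torch --index-url https://download.pytorch.org/whl/cpu

# Runtime stage: copy only the installed packages, no build tooling
FROM python:3.11-slim
COPY --from=build /install /usr/local
COPY model.onnx app.py /app/
CMD ["python", "/app/app.py"]
```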

ONNX Runtime

Convert models to ONNX format. ONNX Runtime loads 3-5x faster than PyTorch and has a smaller dependency footprint.


Model Quantization

Quantize models to INT8 or FP16. Reduces model size by 2-4x, which directly reduces loading time and memory usage.


Provisioned Concurrency

Keep instances warm with provisioned concurrency (Lambda) or minimum instances (Cloud Run). Eliminates cold starts at the cost of always-on billing.

Cold Start Benchmarks

| Strategy           | Cold Start | Warm Latency | Cost Impact |
|--------------------|------------|--------------|-------------|
| PyTorch (full)     | 15-30s     | 50-200ms     | Baseline    |
| ONNX Runtime       | 3-8s       | 20-100ms     | Same        |
| ONNX + Quantized   | 2-5s       | 15-80ms      | Same        |
| Provisioned (warm) | 0s         | 20-100ms     | +60-200%    |
Python - Model Quantization
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize ONNX model to INT8 for faster loading
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_quantized.onnx",
    weight_type=QuantType.QInt8
)

# Original: 350 MB, loads in 8s
# Quantized: 90 MB, loads in 2s
Best practice: Take a hybrid approach. Use provisioned concurrency for your baseline traffic and let auto-scaling absorb spikes with cold starts. Monitor your p99 latency and adjust provisioned concurrency to keep cold starts below your SLA threshold.
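To size the provisioned tier, a common starting point is Little's law: concurrent executions ≈ request rate × average duration. The sketch below applies it; the traffic numbers are made up, and the 20% headroom factor is a judgment call, not a platform rule.

```python
import math

def provisioned_instances(baseline_rps, avg_duration_s, headroom=1.2):
    """Concurrency needed for baseline traffic, with headroom for jitter."""
    concurrency = baseline_rps * avg_duration_s  # Little's law
    return math.ceil(concurrency * headroom)

# e.g. 50 req/s baseline at 120 ms average inference time:
# 50 * 0.12 = 6 concurrent, * 1.2 headroom -> 8 provisioned instances
print(provisioned_instances(50, 0.120))  # 8
```

Recompute this from observed traffic periodically; over-provisioning pays the +60-200% cost premium from the table above on capacity you never use.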