Intermediate

Model Deployment

Learn deployment patterns, serving frameworks, containerization, Kubernetes for ML, and safe rollout strategies.

Deployment Patterns

Batch Inference

Run predictions on large datasets periodically (hourly, daily). Best for non-time-sensitive use cases like recommendation lists, risk scores, or report generation.

Python — Batch inference pipeline
import mlflow
import pandas as pd

# Load production model
model = mlflow.pyfunc.load_model("models:/churn-model/Production")

# Load batch data
batch_data = pd.read_parquet("s3://data/daily_features.parquet")

# Generate predictions
predictions = model.predict(batch_data)

# Save results
results = batch_data.assign(churn_probability=predictions)
results.to_parquet("s3://data/predictions/daily_predictions.parquet")

Real-Time API

Serve predictions via a REST API with low-latency responses (milliseconds). Required for user-facing features like search ranking, fraud detection, or chatbots.

Python — FastAPI model serving
from fastapi import FastAPI
from pydantic import BaseModel
import mlflow
import pandas as pd

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/fraud-detector/Production")

class PredictionRequest(BaseModel):
    amount: float
    merchant_category: str
    time_since_last_txn: float
    is_foreign: bool

class PredictionResponse(BaseModel):
    fraud_probability: float
    is_fraud: bool

@app.get("/health")
async def health():
    # Used by container health checks and Kubernetes probes
    return {"status": "ok"}

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    features = pd.DataFrame([request.model_dump()])
    prob = model.predict(features)[0]
    return PredictionResponse(
        fraud_probability=float(prob),
        is_fraud=bool(prob > 0.5),
    )

Edge Deployment

Deploy models directly on devices (phones, IoT, browsers). Requires model optimization (quantization, pruning, distillation) for constrained environments.
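Quantization is the most common of these optimizations: weights are mapped from 32-bit floats to 8-bit integers via a scale and zero-point, cutting storage roughly 4x. A minimal sketch of affine int8 quantization in pure Python (the helper names are illustrative, not from any library):

Python — Affine int8 quantization sketch
```python
def quantize_int8(weights):
    """Affine (asymmetric) quantization of float weights to int8.

    Maps the observed [min, max] range onto [-128, 127] with a
    scale and zero-point, the same scheme most post-training
    int8 quantization tools use.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # avoid div-by-zero for constant weights
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

weights = [0.81, -0.42, 0.13, 0.07, -0.95, 0.66]
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)

# int8 storage is 4x smaller than float32, at the cost of a
# rounding error of at most about scale/2 per weight
print(len(q) * 1, "bytes vs", len(weights) * 4, "bytes")
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

Real toolchains (TFLite, PyTorch quantization, ONNX Runtime) apply this per tensor or per channel and calibrate activations too, but the size/precision trade-off is the same.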

Serving Frameworks

Framework   | Best For          | Key Features
TF Serving  | TensorFlow models | gRPC + REST, batching, model versioning
TorchServe  | PyTorch models    | Multi-model serving, A/B testing, metrics
Triton      | Multi-framework   | Dynamic batching, GPU optimization, concurrent models
BentoML     | Any framework     | Easy packaging, adaptive batching, cloud deploy
Seldon Core | Kubernetes-native | Advanced routing, explainability, drift detection

Containerization with Docker

Dockerfile — ML model container
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and serving code
COPY model/ ./model/
COPY serve.py .

# Expose port
EXPOSE 8000

# Health check (python:3.11-slim does not include curl, so use the stdlib)
HEALTHCHECK --interval=30s --timeout=5s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

# Run server
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8000"]

Shell — Build and run
# Build the container
docker build -t ml-model:v1.0 .

# Run locally
docker run -p 8000:8000 ml-model:v1.0

# Test
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"amount": 500, "merchant_category": "electronics", "time_since_last_txn": 3.5, "is_foreign": true}'

Kubernetes for ML

YAML — Kubernetes deployment for ML model
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-detector
  template:
    metadata:
      labels:
        app: fraud-detector
    spec:
      containers:
      - name: model
        image: ml-model:v1.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
            nvidia.com/gpu: 1  # include only if the image and model actually use a GPU
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: fraud-detector-service
spec:
  selector:
    app: fraud-detector
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer

Serverless ML

Deploy models as serverless functions for variable workloads with automatic scaling to zero:

  • AWS Lambda: Up to 10GB container images, 15-minute timeout.
  • Google Cloud Functions: Event-driven, auto-scaling.
  • Azure Functions: Integrated with Azure ML.
When to use serverless: Great for infrequent predictions, prototypes, or lightweight models. Avoid for latency-sensitive applications (cold start) or large models (memory limits).
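As a sketch, a Lambda-style handler typically loads the model once at module scope so warm invocations skip the expensive load (the main cold-start mitigation). The loader and scorer below are placeholders, not a real framework API:

Python — Serverless handler sketch (hypothetical model loader)
```python
import json

# Loaded once per container; warm invocations reuse it.
_MODEL = None

def load_model():
    # Placeholder for e.g. mlflow.pyfunc.load_model: returns a
    # trivial scorer so the sketch is self-contained.
    class Scorer:
        def predict(self, rows):
            return [0.9 if r["amount"] > 1000 else 0.1 for r in rows]
    return Scorer()

def get_model():
    global _MODEL
    if _MODEL is None:
        _MODEL = load_model()
    return _MODEL

def handler(event, context):
    """AWS Lambda-style entry point: scores one transaction per request."""
    body = json.loads(event["body"])
    prob = get_model().predict([body])[0]
    return {
        "statusCode": 200,
        "body": json.dumps({"fraud_probability": prob}),
    }
```

The module-level cache is what makes warm invocations fast; cold starts still pay the full model-load cost, which is why large models fit serverless poorly.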

Safe Rollout Strategies

A/B Testing

Route a percentage of traffic to the new model and compare metrics against the existing model. Requires statistical significance testing.
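A minimal significance check is the two-proportion z-test on conversion (or error) counts from each arm. The sketch below uses only the standard library, with illustrative traffic numbers:

Python — Two-proportion z-test for an A/B comparison
```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates
    between control model A and candidate model B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Example: 2.4% vs 2.9% conversion over 50k requests per arm
z, p = two_proportion_z_test(1200, 50000, 1450, 50000)
print(f"z = {z:.2f}, p = {p:.4f}")  # promote B only if p < your alpha (e.g. 0.05)
```

In practice you would also fix the sample size in advance (power analysis) rather than peeking at the p-value as traffic accumulates.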

Canary Deployments

Gradually increase traffic to the new model: 1% → 5% → 25% → 100%. Roll back immediately if metrics degrade.
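The control loop behind such a rollout can be sketched as below; `get_error_rate` and `set_traffic_split` stand in for hooks into your metrics store and traffic router, which vary by platform:

Python — Canary rollout loop sketch
```python
STAGES = [1, 5, 25, 100]  # percent of traffic sent to the canary

def run_canary(get_error_rate, set_traffic_split, baseline_error, tolerance=0.1):
    """Walk the canary through increasing traffic stages, rolling back
    if its error rate degrades beyond `tolerance` (relative) versus
    the baseline model. Returns True on full promotion."""
    for pct in STAGES:
        set_traffic_split(canary_pct=pct)
        error = get_error_rate()
        if error > baseline_error * (1 + tolerance):
            set_traffic_split(canary_pct=0)  # immediate rollback
            return False
        # a real controller would wait and observe between stages here
    return True
```

The same loop works with any health signal (latency, business KPI) in place of the error rate; the key property is that rollback is automatic and happens at the first degraded stage.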

Blue/Green Deployments

Maintain two identical environments. Switch all traffic at once from blue (current) to green (new). Instant rollback by switching back.

Shadow Deployments

Send a copy of live traffic to the new model while the current model continues to serve every response. The shadow model's predictions are only logged and compared offline, so users are never affected.

Strategy    | Risk     | Rollback Speed | Best For
A/B Testing | Low      | Instant        | Measuring business impact
Canary      | Very low | Instant        | Gradual confidence building
Blue/Green  | Medium   | Instant        | Simple, full-switch deployments
Shadow      | None     | N/A            | Testing without user impact
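Shadow mode from the table above can be sketched as a wrapper that serves the primary model's answer while logging the shadow model's for offline comparison; the model objects are assumed to expose a `predict` method:

Python — Shadow deployment wrapper sketch
```python
import logging

logger = logging.getLogger("shadow")

def predict_with_shadow(primary_model, shadow_model, features):
    """Return the primary model's prediction; log the shadow model's
    output for offline comparison. A shadow failure must never
    affect the user-facing response."""
    result = primary_model.predict(features)
    try:
        shadow_result = shadow_model.predict(features)
        logger.info("shadow_compare primary=%s shadow=%s", result, shadow_result)
    except Exception:
        logger.exception("shadow model failed")  # swallow: users only see `result`
    return result
```

In a real service the shadow call would run asynchronously (background task or queue) so it adds no latency to the primary path.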