Intermediate

Model Deployment

Learn deployment patterns, serving frameworks, containerization, Kubernetes for ML, and safe rollout strategies.

Deployment Patterns

Batch Inference

Run predictions on large datasets periodically (hourly, daily). Best for non-time-sensitive use cases like recommendation lists, risk scores, or report generation.

Python — Batch inference pipeline
import mlflow
import pandas as pd

# Load production model
model = mlflow.pyfunc.load_model("models:/churn-model/Production")

# Load batch data
batch_data = pd.read_parquet("s3://data/daily_features.parquet")

# Generate predictions
predictions = model.predict(batch_data)

# Save results
results = batch_data.assign(churn_probability=predictions)
results.to_parquet("s3://data/predictions/daily_predictions.parquet")

Real-Time API

Serve predictions via a REST API with low-latency responses (milliseconds). Required for user-facing features like search ranking, fraud detection, or chatbots.

Python — FastAPI model serving
from fastapi import FastAPI
from pydantic import BaseModel
import mlflow
import pandas as pd

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/fraud-detector/Production")

class PredictionRequest(BaseModel):
    amount: float
    merchant_category: str
    time_since_last_txn: float
    is_foreign: bool

class PredictionResponse(BaseModel):
    fraud_probability: float
    is_fraud: bool

@app.get("/health")
async def health():
    # Used by container health checks and Kubernetes probes
    return {"status": "ok"}

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    features = pd.DataFrame([request.model_dump()])
    prob = model.predict(features)[0]
    return PredictionResponse(
        fraud_probability=float(prob),
        is_fraud=bool(prob > 0.5),
    )

Edge Deployment

Deploy models directly on devices (phones, IoT, browsers). Requires model optimization (quantization, pruning, distillation) for constrained environments.
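Quantization is the most common of these optimizations: weights are mapped from 32-bit floats to 8-bit integers via a scale and zero-point, cutting storage roughly 4x. A minimal sketch of affine int8 quantization in pure Python (the helper names are illustrative, not from any library):

Python — Affine int8 quantization sketch
```python
def quantize_int8(weights):
    """Affine (asymmetric) quantization of float weights to int8.

    Maps the observed [min, max] range onto [-128, 127] with a
    scale and zero-point, the same scheme most post-training
    int8 quantization tools use.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # avoid div-by-zero for constant weights
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

weights = [0.81, -0.42, 0.13, 0.07, -0.95, 0.66]
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)

# int8 storage is 4x smaller than float32, at the cost of a
# rounding error of at most about scale/2 per weight
print(len(q) * 1, "bytes vs", len(weights) * 4, "bytes")
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

Real toolchains (TFLite, PyTorch quantization, ONNX Runtime) apply this per tensor or per channel and calibrate activations too, but the size/precision trade-off is the same.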

Serving Frameworks

Framework   | Best For          | Key Features
TF Serving  | TensorFlow models | gRPC + REST, batching, model versioning
TorchServe  | PyTorch models    | Multi-model serving, A/B testing, metrics
Triton      | Multi-framework   | Dynamic batching, GPU optimization, concurrent models
BentoML     | Any framework     | Easy packaging, adaptive batching, cloud deploy
Seldon Core | Kubernetes-native | Advanced routing, explainability, drift detection

Containerization with Docker

Dockerfile — ML model container
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and serving code
COPY model/ ./model/
COPY serve.py .

# Expose port
EXPOSE 8000

# Health check (python:3.11-slim does not include curl, so use the stdlib)
HEALTHCHECK --interval=30s --timeout=5s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

# Run server
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8000"]

Shell — Build and run
# Build the container
docker build -t ml-model:v1.0 .

# Run locally
docker run -p 8000:8000 ml-model:v1.0

# Test
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"amount": 500, "merchant_category": "electronics", "time_since_last_txn": 3.5, "is_foreign": true}'

Kubernetes for ML

YAML — Kubernetes deployment for ML model
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-detector
  template:
    metadata:
      labels:
        app: fraud-detector
    spec:
      containers:
      - name: model
        image: ml-model:v1.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
            nvidia.com/gpu: 1  # include only if the image and model actually use a GPU
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: fraud-detector-service
spec:
  selector:
    app: fraud-detector
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer

Serverless ML

Deploy models as serverless functions for variable workloads with automatic scaling to zero:

  • AWS Lambda: Up to 10GB container images, 15-minute timeout.
  • Google Cloud Functions: Event-driven, auto-scaling.
  • Azure Functions: Integrated with Azure ML.
When to use serverless: Great for infrequent predictions, prototypes, or lightweight models. Avoid for latency-sensitive applications (cold start) or large models (memory limits).
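As a sketch, a Lambda-style handler typically loads the model once at module scope so warm invocations skip the expensive load (the main cold-start mitigation). The loader and scorer below are placeholders, not a real framework API:

Python — Serverless handler sketch (hypothetical model loader)
```python
import json

# Loaded once per container; warm invocations reuse it.
_MODEL = None

def load_model():
    # Placeholder for e.g. mlflow.pyfunc.load_model: returns a
    # trivial scorer so the sketch is self-contained.
    class Scorer:
        def predict(self, rows):
            return [0.9 if r["amount"] > 1000 else 0.1 for r in rows]
    return Scorer()

def get_model():
    global _MODEL
    if _MODEL is None:
        _MODEL = load_model()
    return _MODEL

def handler(event, context):
    """AWS Lambda-style entry point: scores one transaction per request."""
    body = json.loads(event["body"])
    prob = get_model().predict([body])[0]
    return {
        "statusCode": 200,
        "body": json.dumps({"fraud_probability": prob}),
    }
```

The module-level cache is what makes warm invocations fast; cold starts still pay the full model-load cost, which is why large models fit serverless poorly.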

Safe Rollout Strategies

A/B Testing

Route a percentage of traffic to the new model and compare metrics against the existing model. Requires statistical significance testing.
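A minimal significance check is the two-proportion z-test on conversion (or error) counts from each arm. The sketch below uses only the standard library, with illustrative traffic numbers:

Python — Two-proportion z-test for an A/B comparison
```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates
    between control model A and candidate model B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Example: 2.4% vs 2.9% conversion over 50k requests per arm
z, p = two_proportion_z_test(1200, 50000, 1450, 50000)
print(f"z = {z:.2f}, p = {p:.4f}")  # promote B only if p < your alpha (e.g. 0.05)
```

In practice you would also fix the sample size in advance (power analysis) rather than peeking at the p-value as traffic accumulates.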

Canary Deployments

Gradually increase traffic to the new model: 1% → 5% → 25% → 100%. Roll back immediately if metrics degrade.
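The control loop behind such a rollout can be sketched as below; `get_error_rate` and `set_traffic_split` stand in for hooks into your metrics store and traffic router, which vary by platform:

Python — Canary rollout loop sketch
```python
STAGES = [1, 5, 25, 100]  # percent of traffic sent to the canary

def run_canary(get_error_rate, set_traffic_split, baseline_error, tolerance=0.1):
    """Walk the canary through increasing traffic stages, rolling back
    if its error rate degrades beyond `tolerance` (relative) versus
    the baseline model. Returns True on full promotion."""
    for pct in STAGES:
        set_traffic_split(canary_pct=pct)
        error = get_error_rate()
        if error > baseline_error * (1 + tolerance):
            set_traffic_split(canary_pct=0)  # immediate rollback
            return False
        # a real controller would wait and observe between stages here
    return True
```

The same loop works with any health signal (latency, business KPI) in place of the error rate; the key property is that rollback is automatic and happens at the first degraded stage.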

Blue/Green Deployments

Maintain two identical environments. Switch all traffic at once from blue (current) to green (new). Instant rollback by switching back.

Shadow Deployments

Send a copy of live traffic to the new model while the current model continues to serve every response. The shadow model's predictions are only logged and compared offline, so users are never affected.

Strategy    | Risk     | Rollback Speed | Best For
A/B Testing | Low      | Instant        | Measuring business impact
Canary      | Very low | Instant        | Gradual confidence building
Blue/Green  | Medium   | Instant        | Simple, full-switch deployments
Shadow      | None     | N/A            | Testing without user impact
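Shadow mode from the table above can be sketched as a wrapper that serves the primary model's answer while logging the shadow model's for offline comparison; the model objects are assumed to expose a `predict` method:

Python — Shadow deployment wrapper sketch
```python
import logging

logger = logging.getLogger("shadow")

def predict_with_shadow(primary_model, shadow_model, features):
    """Return the primary model's prediction; log the shadow model's
    output for offline comparison. A shadow failure must never
    affect the user-facing response."""
    result = primary_model.predict(features)
    try:
        shadow_result = shadow_model.predict(features)
        logger.info("shadow_compare primary=%s shadow=%s", result, shadow_result)
    except Exception:
        logger.exception("shadow model failed")  # swallow: users only see `result`
    return result
```

In a real service the shadow call would run asynchronously (background task or queue) so it adds no latency to the primary path.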