Model Deployment
Learn deployment patterns, serving frameworks, containerization, Kubernetes for ML, and safe rollout strategies.
Deployment Patterns
Batch Inference
Run predictions on large datasets periodically (hourly, daily). Best for non-time-sensitive use cases like recommendation lists, risk scores, or report generation.
import mlflow
import pandas as pd
# Load production model
model = mlflow.pyfunc.load_model("models:/churn-model/Production")
# Load batch data
batch_data = pd.read_parquet("s3://data/daily_features.parquet")
# Generate predictions
predictions = model.predict(batch_data)
# Save results
results = batch_data.assign(churn_probability=predictions)
results.to_parquet("s3://data/predictions/daily_predictions.parquet")
Real-Time API
Serve predictions via a REST API with low-latency responses (milliseconds). Required for user-facing features like search ranking, fraud detection, or chatbots.
from fastapi import FastAPI
from pydantic import BaseModel
import mlflow
import pandas as pd

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/fraud-detector/Production")

class PredictionRequest(BaseModel):
    amount: float
    merchant_category: str
    time_since_last_txn: float
    is_foreign: bool

class PredictionResponse(BaseModel):
    fraud_probability: float
    is_fraud: bool

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    features = pd.DataFrame([request.model_dump()])
    prob = model.predict(features)[0]
    return PredictionResponse(
        fraud_probability=float(prob),
        is_fraud=bool(prob > 0.5),
    )
Edge Deployment
Deploy models directly on devices (phones, IoT, browsers). Requires model optimization (quantization, pruning, distillation) for constrained environments.
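The core idea behind quantization can be shown in a few lines. This is a minimal sketch of symmetric, per-tensor 8-bit quantization; real toolchains (TensorFlow Lite, PyTorch, ONNX Runtime) quantize per layer with calibration data, but the mechanics are the same.

```python
def quantize_int8(weights):
    """Map float weights to int8 values plus a single scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [x * scale for x in q]

weights = [0.91, -0.42, 0.07, -1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight differs from the original by at most scale / 2,
# while storage drops from 32 bits to 8 bits per weight.
```

This trades a bounded rounding error for a 4x reduction in model size, which is usually what makes on-device inference feasible.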
Serving Frameworks
| Framework | Best For | Key Features |
|---|---|---|
| TF Serving | TensorFlow models | gRPC + REST, batching, model versioning |
| TorchServe | PyTorch models | Multi-model serving, A/B testing, metrics |
| Triton | Multi-framework | Dynamic batching, GPU optimization, concurrent models |
| BentoML | Any framework | Easy packaging, adaptive batching, cloud deploy |
| Seldon Core | Kubernetes-native | Advanced routing, explainability, drift detection |
Containerization with Docker
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy model and serving code
COPY model/ ./model/
COPY serve.py .
# Expose port
EXPOSE 8000
# Health check (python:3.11-slim does not include curl, so use Python itself)
HEALTHCHECK --interval=30s --timeout=5s \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
# Run server
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8000"]
# Build the container
docker build -t ml-model:v1.0 .
# Run locally
docker run -p 8000:8000 ml-model:v1.0
# Test
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"amount": 500, "merchant_category": "electronics", "time_since_last_txn": 3.5, "is_foreign": true}'
Kubernetes for ML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-detector
  template:
    metadata:
      labels:
        app: fraud-detector
    spec:
      containers:
        - name: model
          image: ml-model:v1.0
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
              nvidia.com/gpu: 1
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: fraud-detector-service
spec:
  selector:
    app: fraud-detector
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer
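Fixed replica counts waste resources under variable load. A HorizontalPodAutoscaler can scale the Deployment above on CPU utilization; this is a sketch (the name `fraud-detector-hpa` and the 70% target are illustrative choices, not part of the manifest above):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fraud-detector-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fraud-detector
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

For GPU-bound models, CPU utilization is a poor scaling signal; custom metrics such as request queue depth or latency are usually a better fit.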
Serverless ML
Deploy models as serverless functions for variable workloads with automatic scaling to zero:
- AWS Lambda: Up to 10GB container images, 15-minute timeout.
- Google Cloud Functions: Event-driven, auto-scaling.
- Azure Functions: Integrated with Azure ML.
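A serverless model endpoint typically loads the model once at module scope (so warm invocations reuse it) and keeps the handler thin. The sketch below follows the AWS Lambda handler convention; `score` is a placeholder standing in for a real model's `predict` call.

```python
import json

# In a real function the model is loaded here, at module scope, so it is
# reused across warm invocations instead of reloaded on every request.
def score(features):
    """Placeholder for model.predict; returns a fraud probability."""
    return 0.9 if features.get("amount", 0) > 1000 else 0.1

def lambda_handler(event, context):
    features = json.loads(event["body"])
    prob = score(features)
    return {
        "statusCode": 200,
        "body": json.dumps({"fraud_probability": prob, "is_fraud": prob > 0.5}),
    }
```

Cold starts are the main cost of this pattern: the first invocation after scale-to-zero pays for container startup plus model load, which is why large models often stay on always-on serving instead.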
Safe Rollout Strategies
A/B Testing
Route a percentage of traffic to the new model and compare metrics against the existing model. Requires statistical significance testing.
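The significance test can be as simple as a two-proportion z-test on a binary outcome such as conversion. A minimal stdlib-only sketch (the counts below are made-up example numbers):

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Z-test for a difference in conversion rates between two model variants."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Control model A: 120 conversions / 10,000 users; candidate B: 165 / 10,000
z, p = two_proportion_z_test(conv_a=120, n_a=10_000, conv_b=165, n_b=10_000)
significant = p < 0.05
```

In practice you would fix the sample size before the experiment (a power analysis) rather than peeking at p-values as traffic accumulates, which inflates false-positive rates.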
Canary Deployments
Gradually increase traffic to the new model: 1% → 5% → 25% → 100%. Roll back immediately if metrics degrade.
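Canary routing should be deterministic per user, so a given user sees a consistent model and raising the percentage only moves new users onto the canary. A hash-based bucketing sketch:

```python
import hashlib

def canary_bucket(user_id: str, canary_percent: int) -> str:
    """Deterministically route a stable fraction of users to the canary model."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if h < canary_percent else "stable"

# The same user always hashes to the same bucket, so widening the rollout
# (1 -> 5 -> 25 -> 100) keeps every existing canary user on the canary.
```

Because the bucket test is `h < percent`, the set of canary users at 5% is a strict subset of the set at 25%, which is exactly the property a gradual rollout needs.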
Blue/Green Deployments
Maintain two identical environments. Switch all traffic at once from blue (current) to green (new). Instant rollback by switching back.
Shadow Deployments
Send a copy of production traffic to the new model while the current model continues to serve every response. Compare the two models' outputs offline with zero user impact; no rollback is needed because the shadow model never serves users.
| Strategy | Risk | Rollback Speed | Best For |
|---|---|---|---|
| A/B Testing | Low | Instant | Measuring business impact |
| Canary | Very low | Instant | Gradual confidence building |
| Blue/Green | Medium | Instant | Simple, full-switch deployments |
| Shadow | None | N/A | Testing without user impact |
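The shadow strategy from the table can be sketched as a wrapper that serves the live model's response while scoring the same request on the candidate in the background. `live_model` and `shadow_model` are placeholder callables, not a real serving API:

```python
import concurrent.futures

shadow_log = []  # in production: a metrics store, not an in-memory list

def live_model(features):
    return 0.2  # placeholder for the current production model

def shadow_model(features):
    return 0.3  # placeholder for the candidate model

executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)

def predict(features):
    # Fire-and-forget shadow call; its result is logged for offline
    # comparison and never reaches the user.
    executor.submit(lambda: shadow_log.append(shadow_model(features)))
    return live_model(features)
```

The key properties are that the shadow call adds no user-facing latency (it runs off the request path) and that a shadow-model crash cannot affect the served response.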