Deploy & Optimize Models (20-25%) Advanced
Deploying models to production is where data science meets engineering. This domain covers managed online endpoints for real-time inference, batch endpoints for large-scale scoring, MLflow model management, blue-green deployments, and monitoring deployed models.
Deployment Options Comparison
| Option | Use Case | Latency | Scaling | Cost Model |
|---|---|---|---|---|
| Managed Online Endpoint | Real-time predictions (API) | Low (ms-sec) | Auto-scale by traffic | Pay per VM uptime |
| Managed Batch Endpoint | Large dataset scoring | Minutes-hours | Parallel compute | Pay per job |
| Kubernetes (AKS) | Custom networking, GPU inference | Low | K8s auto-scaling | AKS cluster cost |
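The table above can be framed as a rough decision heuristic in code. This is purely illustrative (not an Azure ML API), assuming latency requirements and networking needs are the deciding factors:

```python
def choose_deployment(needs_realtime: bool, needs_custom_networking: bool = False) -> str:
    """Illustrative heuristic for picking a deployment target."""
    if needs_custom_networking:
        return "kubernetes"       # AKS: custom networking, GPU inference
    if needs_realtime:
        return "managed_online"   # low-latency API, pay per VM uptime
    return "managed_batch"        # large-scale scoring, pay per job

print(choose_deployment(needs_realtime=True))   # managed_online
print(choose_deployment(needs_realtime=False))  # managed_batch
```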
Managed Online Endpoints (Real-Time)
The most common deployment target for real-time scoring. Know how to create endpoints, deploy models, and configure traffic splitting.
# Step 1: Register a model with MLflow
import mlflow
# Register model from a training run
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "churn-prediction-model")
# Or register from local files
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes
model = Model(
    path="./model/",
    name="churn-prediction-model",
    description="Gradient boosting churn classifier",
    type=AssetTypes.MLFLOW_MODEL  # MLflow format (recommended)
)
ml_client.models.create_or_update(model)
# Step 2: Create managed online endpoint
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    CodeConfiguration,
    Environment
)
# Create endpoint (the URL/entry point)
endpoint = ManagedOnlineEndpoint(
    name="churn-endpoint",
    description="Real-time churn prediction",
    auth_mode="key",  # "key" or "aml_token"
    tags={"model": "gradient-boosting", "version": "1"}
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()  # wait for completion
# Step 3: Create deployment (the model + compute behind the endpoint)
blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="churn-endpoint",
    model="azureml:churn-prediction-model:1",
    instance_type="Standard_DS3_v2",
    instance_count=1,
    # For MLflow models, no scoring script is needed.
    # For custom models, specify:
    # code_configuration=CodeConfiguration(
    #     code="./score/",
    #     scoring_script="score.py"
    # ),
    # environment="azureml:dp100-custom-env:1"
)
ml_client.online_deployments.begin_create_or_update(blue_deployment).result()
# Route 100% of traffic to the blue deployment
endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
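Once traffic is routed, the endpoint can be called over REST. The sketch below builds a request payload; the `"input"` schema (a list of feature rows) is an assumption and must match the model's signature, and the scoring URI and key come from the endpoint itself:

```python
import json

# Hypothetical feature rows -- the schema must match the model's signature
payload = json.dumps({"input": [[0.2, 35, 1, 0.8], [0.7, 52, 0, 0.1]]})

# With a live endpoint you would POST this payload, e.g.:
# import requests
# scoring_uri = ml_client.online_endpoints.get("churn-endpoint").scoring_uri
# key = ml_client.online_endpoints.get_keys(name="churn-endpoint").primary_key
# resp = requests.post(scoring_uri, data=payload,
#                      headers={"Authorization": f"Bearer {key}",
#                               "Content-Type": "application/json"})
print(json.loads(payload)["input"][0])  # [0.2, 35, 1, 0.8]
```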
Blue-Green Deployments
Blue-green deployment lets you test new model versions with a percentage of production traffic before full rollout.
# Deploy new model version as "green"
green_deployment = ManagedOnlineDeployment(
    name="green",
    endpoint_name="churn-endpoint",
    model="azureml:churn-prediction-model:2",  # New version
    instance_type="Standard_DS3_v2",
    instance_count=1
)
ml_client.online_deployments.begin_create_or_update(green_deployment).result()
# Send 10% of traffic to green for testing
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
# After validation, shift all traffic to green
endpoint.traffic = {"blue": 0, "green": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
# Clean up old deployment
ml_client.online_deployments.begin_delete(
    name="blue",
    endpoint_name="churn-endpoint"
).result()
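The gradual shift can be scripted rather than updated by hand. Below is a minimal sketch; `rollout_steps` is a hypothetical helper, and in a real rollout each traffic dict would be assigned to `endpoint.traffic` and pushed with `begin_create_or_update`, with validation between steps:

```python
def rollout_steps(green_shares=(10, 25, 50, 100)):
    """Yield blue/green traffic splits that always sum to 100."""
    for green in green_shares:
        yield {"blue": 100 - green, "green": green}

for traffic in rollout_steps():
    # In production: endpoint.traffic = traffic
    #                ml_client.online_endpoints.begin_create_or_update(endpoint).result()
    #                ...then validate latency/accuracy before the next step
    print(traffic)
```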
Batch Endpoints
Use batch endpoints when you need to score large datasets on a schedule or on-demand, without maintaining always-on infrastructure.
# Create batch endpoint and deployment
from azure.ai.ml.entities import (
    BatchEndpoint,
    BatchDeployment,
    BatchRetrySettings
)
# Create batch endpoint
batch_endpoint = BatchEndpoint(
    name="churn-batch-endpoint",
    description="Batch scoring for churn predictions"
)
ml_client.batch_endpoints.begin_create_or_update(batch_endpoint).result()
# Create batch deployment
batch_deployment = BatchDeployment(
    name="batch-v1",
    endpoint_name="churn-batch-endpoint",
    model="azureml:churn-prediction-model:1",
    compute="dp100-cluster",
    instance_count=2,
    max_concurrency_per_instance=4,
    mini_batch_size=100,
    output_action="append_row",  # or "summary_only"
    output_file_name="predictions.csv",
    retry_settings=BatchRetrySettings(
        max_retries=3,
        timeout=300  # seconds per mini-batch
    )
)
ml_client.batch_deployments.begin_create_or_update(batch_deployment).result()
# Invoke batch scoring
from azure.ai.ml import Input
job = ml_client.batch_endpoints.invoke(
    endpoint_name="churn-batch-endpoint",
    input=Input(
        type="uri_folder",
        path="azureml://datastores/workspaceblobstore/paths/batch-input/"
    )
)
print(f"Batch job: {job.name}")
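With `output_action="append_row"`, each mini-batch's predictions are appended as rows to a single `predictions.csv`. A local sketch of reading such a file once downloaded; the column layout used here (source file, row index, prediction) is an assumption, so inspect the real output first:

```python
import csv
import io

# Stand-in for a downloaded predictions.csv
# (assumed columns: input file, row index, prediction)
sample = "customers_0.csv,0,1\ncustomers_0.csv,1,0\ncustomers_1.csv,0,1\n"

predictions = [int(row[2]) for row in csv.reader(io.StringIO(sample))]
print(predictions)  # [1, 0, 1]
```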
MLflow Model Registry
MLflow is the standard model management framework in Azure ML. Know the model lifecycle stages.
| Stage | Purpose | Who Uses It |
|---|---|---|
| None | Initial registration, experimental | Data scientists during development |
| Staging | Testing and validation | ML engineers validating before production |
| Production | Live serving | Production endpoints |
| Archived | Retired models kept for audit | Compliance and audit teams |
# MLflow model registry operations
import mlflow
from mlflow.tracking import MlflowClient
client = MlflowClient()
# Transition model to Staging
client.transition_model_version_stage(
    name="churn-prediction-model",
    version="2",  # version is passed as a string
    stage="Staging"
)
# After testing, promote to Production
client.transition_model_version_stage(
    name="churn-prediction-model",
    version="2",
    stage="Production"
)
# Archive the old version
client.transition_model_version_stage(
    name="churn-prediction-model",
    version="1",
    stage="Archived"
)
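The lifecycle table above implies an ordering. A hypothetical guard (not an MLflow API, and stricter than what MLflow itself enforces) that a promotion script could run before calling `transition_model_version_stage`:

```python
# Allowed next stages for each current stage (assumed policy)
ALLOWED = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}

def can_transition(current: str, target: str) -> bool:
    """Return True if moving current -> target follows the assumed policy."""
    return target in ALLOWED.get(current, set())

print(can_transition("Staging", "Production"))  # True
print(can_transition("Archived", "Staging"))    # False
```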
Model Monitoring
Once deployed, models need continuous monitoring for performance degradation, data drift, and operational health.
- Application Insights — Request/response logging, latency, error rates, custom telemetry
- Data drift monitoring — Compare input distributions to training data (covered in Prepare Data lesson)
- Prediction quality monitoring — Track accuracy when ground truth labels become available
- Infrastructure monitoring — CPU/memory utilization, instance health, auto-scaling events
# Enable Application Insights logging on endpoint
# In the scoring script (score.py) for custom deployments:
import os
import json
import logging
import joblib
import numpy as np

def init():
    global model
    # AZUREML_MODEL_DIR points at the mounted model directory
    model_path = os.getenv("AZUREML_MODEL_DIR")
    model = joblib.load(os.path.join(model_path, "model.pkl"))
    logging.info("Model loaded successfully")

def run(raw_data):
    try:
        data = json.loads(raw_data)
        predictions = model.predict(data["input"])
        # Custom telemetry logged to Application Insights
        logging.info(f"Scored {len(predictions)} records")
        logging.info(f"Prediction distribution: {dict(zip(*np.unique(predictions, return_counts=True)))}")
        return json.dumps({"predictions": predictions.tolist()})
    except Exception as e:
        logging.error(f"Scoring error: {str(e)}")
        raise
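Because `run()` is plain Python, it can be exercised locally with a stub model before deployment. A sketch, where `StubModel` stands in for the real joblib artifact and the payload schema matches the scoring script above:

```python
import json
import numpy as np

class StubModel:
    """Stands in for the real joblib-loaded classifier."""
    def predict(self, rows):
        return np.zeros(len(rows), dtype=int)  # always predicts class 0

model = StubModel()

def run(raw_data):
    data = json.loads(raw_data)
    predictions = model.predict(data["input"])
    return json.dumps({"predictions": predictions.tolist()})

result = json.loads(run(json.dumps({"input": [[0.1, 2], [0.3, 4]]})))
print(result)  # {'predictions': [0, 0]}
```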
Practice Questions
You have registered a model in MLflow format and want to deploy it to a managed online endpoint with the least effort. What should you do?
A. Write a custom scoring script and create a custom Docker image
B. Deploy the MLflow model directly without a scoring script (no-code deployment)
C. Convert the model to ONNX format first
D. Export the model as a pickle file and write a Flask API
Show Answer
B. Deploy the MLflow model directly without a scoring script (no-code deployment). MLflow models registered in Azure ML can be deployed to managed online endpoints without writing a scoring script or specifying an environment. Azure ML automatically generates the inference server based on the MLflow model's signature and requirements. This is the simplest and recommended approach.
You need to roll out a new model version to an endpoint serving production traffic while minimizing risk and validating it on a small share of requests. What should you do?
A. Delete the old deployment and create a new one
B. Use blue-green deployment with traffic splitting
C. Deploy to a separate endpoint and switch DNS
D. Use A/B testing with Azure Front Door
Show Answer
B. Use blue-green deployment with traffic splitting. Managed online endpoints support multiple deployments (blue/green) with configurable traffic percentages. You deploy the new model as a second deployment, route 5% traffic to it, validate performance, then gradually shift traffic. This is built into Azure ML and requires no external infrastructure.
You need to score millions of records on a nightly schedule and want to minimize cost. Which deployment option should you choose?
A. Managed Online Endpoint with auto-scaling
B. Managed Batch Endpoint with compute cluster
C. Azure Kubernetes Service (AKS) with scheduled scaling
D. Azure Functions with HTTP trigger
Show Answer
B. Managed Batch Endpoint with compute cluster. Batch endpoints are designed for large-scale offline scoring. They spin up compute only for the job duration, process data in parallel across multiple nodes, and automatically shut down when complete. This is far more cost-effective than keeping an online endpoint running 24/7 or managing AKS for batch workloads.
Latency on a deployed online endpoint has increased even though the model has not changed. What should you do first?
A. Retrain the model with fewer features
B. Check Application Insights for resource utilization and request queue depth
C. Redeploy the model with a different framework
D. Switch from managed endpoint to AKS
Show Answer
B. Check Application Insights for resource utilization and request queue depth. Increased latency without model changes typically indicates infrastructure issues: CPU/memory saturation, request queuing due to high traffic, or instance health problems. Application Insights provides the telemetry needed to diagnose the root cause. Solutions might include scaling up instance count or instance type.
You must be able to roll back to the previous model version within minutes if a newly deployed version underperforms. What should you do?
A. Keep the previous model version registered in the model registry and maintain a blue/green deployment
B. Store the model weights in Azure Blob Storage
C. Use Azure DevOps pipelines with manual approval gates
D. Create snapshots of the compute instances daily
Show Answer
A. Keep the previous model version registered in the model registry and maintain a blue/green deployment. With blue-green deployments, the previous model version remains deployed (with 0% traffic). To roll back, you simply shift traffic back to the old deployment. Combined with the model registry tracking all versions, this enables rollback in seconds by updating traffic rules, without any redeployment.