Deploy & Optimize Models (20-25%) Advanced
Deploying models to production is where data science meets engineering. This domain covers managed online endpoints for real-time inference, batch endpoints for large-scale scoring, MLflow model management, blue-green deployments, and monitoring deployed models.
Deployment Options Comparison
| Option | Use Case | Latency | Scaling | Cost Model |
|---|---|---|---|---|
| Managed Online Endpoint | Real-time predictions (API) | Low (ms-sec) | Auto-scale by traffic | Pay per VM uptime |
| Managed Batch Endpoint | Large dataset scoring | Minutes-hours | Parallel compute | Pay per job |
| Kubernetes (AKS) | Custom networking, GPU inference | Low | K8s auto-scaling | AKS cluster cost |
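The table above can be framed as a rough decision heuristic in code. This is purely illustrative (not an Azure ML API), assuming latency requirements and networking needs are the deciding factors:

```python
def choose_deployment(needs_realtime: bool, needs_custom_networking: bool = False) -> str:
    """Illustrative heuristic for picking a deployment target."""
    if needs_custom_networking:
        return "kubernetes"       # AKS: custom networking, GPU inference
    if needs_realtime:
        return "managed_online"   # low-latency API, pay per VM uptime
    return "managed_batch"        # large-scale scoring, pay per job

print(choose_deployment(needs_realtime=True))   # managed_online
print(choose_deployment(needs_realtime=False))  # managed_batch
```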
Managed Online Endpoints (Real-Time)
The most common deployment target for real-time scoring. Know how to create endpoints, deploy models, and configure traffic splitting.
# Step 1: Register a model with MLflow
import mlflow
# Register model from a training run
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "churn-prediction-model")
# Or register from local files
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes
model = Model(
    path="./model/",
    name="churn-prediction-model",
    description="Gradient boosting churn classifier",
    type=AssetTypes.MLFLOW_MODEL  # MLflow format (recommended)
)
ml_client.models.create_or_update(model)
# Step 2: Create managed online endpoint
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    CodeConfiguration,
    Environment
)
# Create endpoint (the URL/entry point)
endpoint = ManagedOnlineEndpoint(
    name="churn-endpoint",
    description="Real-time churn prediction",
    auth_mode="key",  # "key" or "aml_token"
    tags={"model": "gradient-boosting", "version": "1"}
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()  # wait for completion
# Step 3: Create deployment (the model + compute behind the endpoint)
blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="churn-endpoint",
    model="azureml:churn-prediction-model:1",
    instance_type="Standard_DS3_v2",
    instance_count=1,
    # For MLflow models, no scoring script is needed.
    # For custom models, specify:
    # code_configuration=CodeConfiguration(
    #     code="./score/",
    #     scoring_script="score.py"
    # ),
    # environment="azureml:dp100-custom-env:1"
)
ml_client.online_deployments.begin_create_or_update(blue_deployment).result()
# Route 100% of traffic to the blue deployment
endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
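Once traffic is routed, the endpoint can be called over REST. The sketch below builds a request payload; the `"input"` schema (a list of feature rows) is an assumption and must match the model's signature, and the scoring URI and key come from the endpoint itself:

```python
import json

# Hypothetical feature rows -- the schema must match the model's signature
payload = json.dumps({"input": [[0.2, 35, 1, 0.8], [0.7, 52, 0, 0.1]]})

# With a live endpoint you would POST this payload, e.g.:
# import requests
# scoring_uri = ml_client.online_endpoints.get("churn-endpoint").scoring_uri
# key = ml_client.online_endpoints.get_keys(name="churn-endpoint").primary_key
# resp = requests.post(scoring_uri, data=payload,
#                      headers={"Authorization": f"Bearer {key}",
#                               "Content-Type": "application/json"})
print(json.loads(payload)["input"][0])  # [0.2, 35, 1, 0.8]
```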
Blue-Green Deployments
Blue-green deployment lets you test new model versions with a percentage of production traffic before full rollout.
# Deploy new model version as "green"
green_deployment = ManagedOnlineDeployment(
    name="green",
    endpoint_name="churn-endpoint",
    model="azureml:churn-prediction-model:2",  # New version
    instance_type="Standard_DS3_v2",
    instance_count=1
)
ml_client.online_deployments.begin_create_or_update(green_deployment).result()
# Send 10% of traffic to green for testing
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
# After validation, shift all traffic to green
endpoint.traffic = {"blue": 0, "green": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
# Clean up old deployment
ml_client.online_deployments.begin_delete(
    name="blue",
    endpoint_name="churn-endpoint"
).result()
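The gradual shift can be scripted rather than updated by hand. Below is a minimal sketch; `rollout_steps` is a hypothetical helper, and in a real rollout each traffic dict would be assigned to `endpoint.traffic` and pushed with `begin_create_or_update`, with validation between steps:

```python
def rollout_steps(green_shares=(10, 25, 50, 100)):
    """Yield blue/green traffic splits that always sum to 100."""
    for green in green_shares:
        yield {"blue": 100 - green, "green": green}

for traffic in rollout_steps():
    # In production: endpoint.traffic = traffic
    #                ml_client.online_endpoints.begin_create_or_update(endpoint).result()
    #                ...then validate latency/accuracy before the next step
    print(traffic)
```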
Batch Endpoints
Use batch endpoints when you need to score large datasets on a schedule or on-demand, without maintaining always-on infrastructure.
# Create batch endpoint and deployment
from azure.ai.ml.entities import (
    BatchEndpoint,
    BatchDeployment,
    BatchRetrySettings
)
# Create batch endpoint
batch_endpoint = BatchEndpoint(
    name="churn-batch-endpoint",
    description="Batch scoring for churn predictions"
)
ml_client.batch_endpoints.begin_create_or_update(batch_endpoint).result()
# Create batch deployment
batch_deployment = BatchDeployment(
    name="batch-v1",
    endpoint_name="churn-batch-endpoint",
    model="azureml:churn-prediction-model:1",
    compute="dp100-cluster",
    instance_count=2,
    max_concurrency_per_instance=4,
    mini_batch_size=100,
    output_action="append_row",  # or "summary_only"
    output_file_name="predictions.csv",
    retry_settings=BatchRetrySettings(
        max_retries=3,
        timeout=300  # seconds per mini-batch
    )
)
ml_client.batch_deployments.begin_create_or_update(batch_deployment).result()
# Invoke batch scoring
from azure.ai.ml import Input
job = ml_client.batch_endpoints.invoke(
    endpoint_name="churn-batch-endpoint",
    input=Input(
        type="uri_folder",
        path="azureml://datastores/workspaceblobstore/paths/batch-input/"
    )
)
print(f"Batch job: {job.name}")
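With `output_action="append_row"`, each mini-batch's predictions are appended as rows to a single `predictions.csv`. A local sketch of reading such a file once downloaded; the column layout used here (source file, row index, prediction) is an assumption, so inspect the real output first:

```python
import csv
import io

# Stand-in for a downloaded predictions.csv
# (assumed columns: input file, row index, prediction)
sample = "customers_0.csv,0,1\ncustomers_0.csv,1,0\ncustomers_1.csv,0,1\n"

predictions = [int(row[2]) for row in csv.reader(io.StringIO(sample))]
print(predictions)  # [1, 0, 1]
```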
MLflow Model Registry
MLflow is the standard model management framework in Azure ML. Know the model lifecycle stages.
| Stage | Purpose | Who Uses It |
|---|---|---|
| None | Initial registration, experimental | Data scientists during development |
| Staging | Testing and validation | ML engineers validating before production |
| Production | Live serving | Production endpoints |
| Archived | Retired models kept for audit | Compliance and audit teams |
# MLflow model registry operations
import mlflow
from mlflow.tracking import MlflowClient
client = MlflowClient()
# Transition model to Staging
client.transition_model_version_stage(
    name="churn-prediction-model",
    version="2",  # version is passed as a string
    stage="Staging"
)
# After testing, promote to Production
client.transition_model_version_stage(
    name="churn-prediction-model",
    version="2",
    stage="Production"
)
# Archive the old version
client.transition_model_version_stage(
    name="churn-prediction-model",
    version="1",
    stage="Archived"
)
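The lifecycle table above implies an ordering. A hypothetical guard (not an MLflow API, and stricter than what MLflow itself enforces) that a promotion script could run before calling `transition_model_version_stage`:

```python
# Allowed next stages for each current stage (assumed policy)
ALLOWED = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}

def can_transition(current: str, target: str) -> bool:
    """Return True if moving current -> target follows the assumed policy."""
    return target in ALLOWED.get(current, set())

print(can_transition("Staging", "Production"))  # True
print(can_transition("Archived", "Staging"))    # False
```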
Model Monitoring
Once deployed, models need continuous monitoring for performance degradation, data drift, and operational health.
- Application Insights — Request/response logging, latency, error rates, custom telemetry
- Data drift monitoring — Compare input distributions to training data (covered in Prepare Data lesson)
- Prediction quality monitoring — Track accuracy when ground truth labels become available
- Infrastructure monitoring — CPU/memory utilization, instance health, auto-scaling events
# Enable Application Insights logging on endpoint
# In the scoring script (score.py) for custom deployments:
import os
import json
import logging
import joblib
import numpy as np

def init():
    global model
    # AZUREML_MODEL_DIR points at the mounted model directory
    model_path = os.getenv("AZUREML_MODEL_DIR")
    model = joblib.load(os.path.join(model_path, "model.pkl"))
    logging.info("Model loaded successfully")

def run(raw_data):
    try:
        data = json.loads(raw_data)
        predictions = model.predict(data["input"])
        # Custom telemetry logged to Application Insights
        logging.info(f"Scored {len(predictions)} records")
        logging.info(f"Prediction distribution: {dict(zip(*np.unique(predictions, return_counts=True)))}")
        return json.dumps({"predictions": predictions.tolist()})
    except Exception as e:
        logging.error(f"Scoring error: {str(e)}")
        raise
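Because `run()` is plain Python, it can be exercised locally with a stub model before deployment. A sketch, where `StubModel` stands in for the real joblib artifact and the payload schema matches the scoring script above:

```python
import json
import numpy as np

class StubModel:
    """Stands in for the real joblib-loaded classifier."""
    def predict(self, rows):
        return np.zeros(len(rows), dtype=int)  # always predicts class 0

model = StubModel()

def run(raw_data):
    data = json.loads(raw_data)
    predictions = model.predict(data["input"])
    return json.dumps({"predictions": predictions.tolist()})

result = json.loads(run(json.dumps({"input": [[0.1, 2], [0.3, 4]]})))
print(result)  # {'predictions': [0, 0]}
```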
Practice Questions
You have registered a model in MLflow format and want to deploy it to a managed online endpoint with the least effort. What should you do?
A. Write a custom scoring script and create a custom Docker image
B. Deploy the MLflow model directly without a scoring script (no-code deployment)
C. Convert the model to ONNX format first
D. Export the model as a pickle file and write a Flask API
Show Answer
B. Deploy the MLflow model directly without a scoring script (no-code deployment). MLflow models registered in Azure ML can be deployed to managed online endpoints without writing a scoring script or specifying an environment. Azure ML automatically generates the inference server based on the MLflow model's signature and requirements. This is the simplest and recommended approach.
You need to roll out a new model version to an endpoint serving production traffic while minimizing risk and validating it on a small share of requests. What should you do?
A. Delete the old deployment and create a new one
B. Use blue-green deployment with traffic splitting
C. Deploy to a separate endpoint and switch DNS
D. Use A/B testing with Azure Front Door
Show Answer
B. Use blue-green deployment with traffic splitting. Managed online endpoints support multiple deployments (blue/green) with configurable traffic percentages. You deploy the new model as a second deployment, route 5% traffic to it, validate performance, then gradually shift traffic. This is built into Azure ML and requires no external infrastructure.
You need to score millions of records on a nightly schedule and want to minimize cost. Which deployment option should you choose?
A. Managed Online Endpoint with auto-scaling
B. Managed Batch Endpoint with compute cluster
C. Azure Kubernetes Service (AKS) with scheduled scaling
D. Azure Functions with HTTP trigger
Show Answer
B. Managed Batch Endpoint with compute cluster. Batch endpoints are designed for large-scale offline scoring. They spin up compute only for the job duration, process data in parallel across multiple nodes, and automatically shut down when complete. This is far more cost-effective than keeping an online endpoint running 24/7 or managing AKS for batch workloads.
Latency on a deployed online endpoint has increased even though the model has not changed. What should you do first?
A. Retrain the model with fewer features
B. Check Application Insights for resource utilization and request queue depth
C. Redeploy the model with a different framework
D. Switch from managed endpoint to AKS
Show Answer
B. Check Application Insights for resource utilization and request queue depth. Increased latency without model changes typically indicates infrastructure issues: CPU/memory saturation, request queuing due to high traffic, or instance health problems. Application Insights provides the telemetry needed to diagnose the root cause. Solutions might include scaling up instance count or instance type.
You must be able to roll back to the previous model version within minutes if a newly deployed version underperforms. What should you do?
A. Keep the previous model version registered in the model registry and maintain a blue/green deployment
B. Store the model weights in Azure Blob Storage
C. Use Azure DevOps pipelines with manual approval gates
D. Create snapshots of the compute instances daily
Show Answer
A. Keep the previous model version registered in the model registry and maintain a blue/green deployment. With blue-green deployments, the previous model version remains deployed (with 0% traffic). To roll back, you simply shift traffic back to the old deployment. Combined with the model registry tracking all versions, this enables rollback in seconds by updating traffic rules, without any redeployment.