Advanced

Model Monitoring

Learn how to detect data drift, concept drift, and performance degradation in production ML systems.

Why Monitor ML Models?

Unlike traditional software that either works or crashes, ML models degrade silently. A model can continue returning predictions while its accuracy deteriorates, leading to poor business outcomes without any error logs.

Types of Drift

Data Drift (Covariate Shift)

The distribution of input features changes from what the model was trained on. For example, if a model was trained on data from users aged 25-45, but starts receiving requests from teenagers, the input distribution has shifted.

Concept Drift

The relationship between inputs and outputs changes. For example, the meaning of "spam" evolves over time as spammers adapt their tactics. The same features now map to different labels.

Prediction Drift

The distribution of model outputs changes. Even if the inputs look similar, the model may start predicting differently because of subtle feature interactions or upstream pipeline changes.
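A minimal sketch of detecting prediction drift with a chi-squared test over binned outputs — the bin count and significance level here are arbitrary choices, and the sketch assumes predictions are continuous scores:

```python
import numpy as np
from scipy.stats import chisquare

def prediction_drift_chi2(reference_preds, current_preds, bins=10, alpha=0.05):
    """Chi-squared test comparing binned prediction distributions."""
    # Bin both samples on edges derived from the reference predictions
    edges = np.histogram_bin_edges(reference_preds, bins=bins)
    cur_counts = np.histogram(np.clip(current_preds, edges[0], edges[-1]), bins=edges)[0]
    ref_counts = np.histogram(reference_preds, bins=edges)[0]
    # Laplace smoothing avoids empty expected bins; rescale so totals match
    ref_prop = (ref_counts + 1) / (ref_counts.sum() + bins)
    expected = ref_prop * cur_counts.sum()
    stat, p_value = chisquare(cur_counts, f_exp=expected)
    return p_value < alpha, p_value

rng = np.random.default_rng(0)
reference = rng.beta(2, 5, 5000)   # stand-in for training-time scores
drifted, p = prediction_drift_chi2(reference, rng.beta(5, 2, 5000))
```

Binning on the reference edges (and clipping production scores into that range) keeps the two histograms comparable even when production values fall outside the training range.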

Drift Type        What Changes          Detection Method                      Example
Data Drift        Input features P(X)   KS test, PSI, Jensen-Shannon          User demographics shift
Concept Drift     P(Y|X) relationship   Performance monitoring, ADWIN         Fraud patterns change
Prediction Drift  Output P(Y)           Chi-squared, distribution comparison  Model predicts more positives
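PSI, one of the detection methods listed above, takes only a few lines of numpy. A common rule of thumb (an assumption here — thresholds vary by team) treats PSI below 0.1 as stable, 0.1 to 0.25 as moderate, and above 0.25 as significant drift:

```python
import numpy as np

def population_stability_index(reference, current, bins=10):
    """PSI between reference (training) and current (production) samples of one feature."""
    # Decile edges from the reference distribution
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Clip production values so out-of-range points land in the outer bins
    current = np.clip(current, edges[0], edges[-1])
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Small floor keeps the log well-defined when a bin is empty
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(42)
stable = population_stability_index(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
shifted = population_stability_index(rng.normal(0, 1, 10_000), rng.normal(1, 1, 10_000))
```

Quantile-based edges give every reference bin roughly equal mass, which makes the per-bin comparison more stable than fixed-width bins.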

Detecting Data Drift

Python — Data drift detection with Evidently
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# Compare reference (training) data with current (production) data
report = Report(metrics=[
    DataDriftPreset(),
    TargetDriftPreset(),
])

report.run(
    reference_data=train_df,
    current_data=production_df,
)

# Save HTML report
report.save_html("drift_report.html")

# Get results programmatically
result = report.as_dict()
drift_detected = result["metrics"][0]["result"]["dataset_drift"]
print(f"Dataset drift detected: {drift_detected}")

Performance Degradation Detection

Monitor model performance metrics over time with sliding windows:

Python — Performance monitoring
import numpy as np
from sklearn.metrics import accuracy_score

class ModelMonitor:
    def __init__(self, window_size=1000, threshold=0.05):
        self.window_size = window_size
        self.threshold = threshold
        self.baseline_accuracy = None
        self.predictions = []
        self.actuals = []

    def set_baseline(self, accuracy):
        self.baseline_accuracy = accuracy

    def log_prediction(self, predicted, actual):
        self.predictions.append(predicted)
        self.actuals.append(actual)

        # Keep memory bounded: only the most recent window is ever scored
        if len(self.predictions) > self.window_size:
            del self.predictions[:-self.window_size]
            del self.actuals[:-self.window_size]

        if self.baseline_accuracy is not None and len(self.predictions) >= self.window_size:
            current_accuracy = accuracy_score(
                self.actuals[-self.window_size:],
                self.predictions[-self.window_size:]
            )
            degradation = self.baseline_accuracy - current_accuracy

            if degradation > self.threshold:
                self.trigger_alert(current_accuracy, degradation)

    def trigger_alert(self, current_accuracy, degradation):
        print(f"ALERT: Model degradation detected!")
        print(f"  Baseline: {self.baseline_accuracy:.4f}")
        print(f"  Current:  {current_accuracy:.4f}")
        print(f"  Drop:     {degradation:.4f}")

Feature Importance Monitoring

Track how feature importance changes over time. Significant shifts can indicate data pipeline issues or concept drift:

  • Compare SHAP values between training and production data.
  • Monitor feature contribution distributions.
  • Alert when feature rankings change significantly.
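The first and third bullets can be sketched by comparing mean |SHAP| rankings with a rank correlation. This assumes per-sample SHAP values have already been computed elsewhere (e.g. with a SHAP explainer) as 2-D arrays; the 0.8 correlation floor is an illustrative choice:

```python
import numpy as np
from scipy.stats import spearmanr

def importance_rank_shift(ref_shap, prod_shap, feature_names, min_corr=0.8):
    """Compare mean |SHAP| feature rankings between reference and production.

    Both inputs are (n_samples, n_features) arrays of precomputed SHAP values.
    Returns the Spearman rank correlation of the two importance vectors.
    """
    ref_imp = np.abs(ref_shap).mean(axis=0)
    prod_imp = np.abs(prod_shap).mean(axis=0)
    corr, _ = spearmanr(ref_imp, prod_imp)
    if corr < min_corr:
        top_ref = [feature_names[i] for i in np.argsort(ref_imp)[::-1][:3]]
        top_prod = [feature_names[i] for i in np.argsort(prod_imp)[::-1][:3]]
        print(f"Ranking shift (rho={corr:.2f}): top features {top_ref} -> {top_prod}")
    return corr

# Synthetic SHAP arrays: mean |SHAP| is [3, 2, 1] vs [1, 2, 3]
names = ["age", "income", "tenure"]
ref = np.array([[3.0, -2.0, 1.0]] * 50)
prod = np.array([[1.0, 2.0, -3.0]] * 50)
corr = importance_rank_shift(ref, prod, names)
```

Rank correlation is deliberately insensitive to the absolute magnitude of SHAP values, which can shift for benign reasons (e.g. retraining on more data); only reordering triggers the alert.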

Alert Systems

Set up alerts for different severity levels:

Critical Alerts

Model accuracy below minimum threshold. Immediate page to on-call engineer. Auto-rollback to previous model version.

Warning Alerts

Drift detected but performance still acceptable. Slack notification to ML team. Schedule investigation.

Info Alerts

Minor distribution shifts. Weekly summary email. Track in monitoring dashboard.
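The three severity levels above can be wired through a single dispatcher. This is a minimal sketch — the thresholds and the notifier callables (pager, Slack webhook, dashboard writer) are placeholders to be replaced with real integrations:

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # page on-call, consider auto-rollback
    WARNING = "warning"    # notify ML team channel
    INFO = "info"          # aggregate into weekly summary

def classify_alert(baseline_acc, current_acc, drift_score,
                   acc_floor=0.80, drift_threshold=0.25):
    """Map monitoring signals to a severity level (thresholds are illustrative)."""
    if current_acc < acc_floor:
        return Severity.CRITICAL
    if drift_score > drift_threshold:
        return Severity.WARNING
    if baseline_acc - current_acc > 0.01 or drift_score > 0.1:
        return Severity.INFO
    return None  # nothing to report

def route_alert(severity, message, notifiers):
    """Dispatch to the handler registered for this severity, if any."""
    handler = notifiers.get(severity)
    if handler:
        handler(message)

notifiers = {
    Severity.CRITICAL: lambda msg: print(f"PAGE: {msg}"),
    Severity.WARNING: lambda msg: print(f"SLACK: {msg}"),
    Severity.INFO: lambda msg: print(f"DIGEST: {msg}"),
}
severity = classify_alert(baseline_acc=0.92, current_acc=0.75, drift_score=0.05)
route_alert(severity, "accuracy below floor", notifiers)
```

Checking the accuracy floor before the drift score encodes the priority order: a broken model always outranks a drifting one.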

Retraining Triggers

  • Scheduled: Retrain on a fixed schedule (daily, weekly, monthly).
  • Performance-triggered: Retrain when accuracy drops below a threshold.
  • Drift-triggered: Retrain when significant data drift is detected.
  • Data volume-triggered: Retrain when enough new labeled data is available.
Best practice: Combine multiple triggers. Use scheduled retraining as a baseline, with performance and drift triggers for urgent updates. Always validate the new model before promoting it.
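The four trigger types above can be combined in one decision function. A sketch, with illustrative thresholds (the schedule, accuracy floor, drift cutoff, and row count are all assumptions to tune per model):

```python
from datetime import datetime, timedelta, timezone

def should_retrain(last_trained, current_accuracy, drift_score, new_labeled_rows,
                   schedule=timedelta(days=7), acc_threshold=0.85,
                   drift_threshold=0.25, min_new_rows=50_000):
    """Evaluate all four trigger types; returns (decision, firing reasons)."""
    reasons = []
    if datetime.now(timezone.utc) - last_trained >= schedule:
        reasons.append("scheduled")
    if current_accuracy < acc_threshold:
        reasons.append("performance")
    if drift_score > drift_threshold:
        reasons.append("drift")
    if new_labeled_rows >= min_new_rows:
        reasons.append("data_volume")
    return bool(reasons), reasons

recent = datetime.now(timezone.utc) - timedelta(days=1)
stale = datetime.now(timezone.utc) - timedelta(days=8)
```

Returning the list of firing reasons, not just a boolean, makes the decision auditable — the reasons can be logged alongside the retraining job.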

Monitoring Tools

Tool                  Type         Key Features
Evidently AI          Open-source  Data drift, model quality, reports + dashboards
WhyLabs               SaaS         Real-time monitoring, anomaly detection, data profiling
Arize                 SaaS         Embedding drift, performance tracing, LLM monitoring
Fiddler               SaaS         Explainability, fairness monitoring, model audit
Grafana + Prometheus  Open-source  Custom metrics, alerting, dashboards (general-purpose)

Logging and Observability

Python — Structured logging for ML predictions
import logging
import json
from datetime import datetime, timezone

logger = logging.getLogger("ml_predictions")

def log_prediction(request_id, features, prediction, latency_ms):
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "features": features,
        "prediction": prediction,
        "confidence": float(prediction["confidence"]),
        "latency_ms": latency_ms,
        "model_version": "v2.3.1",
    }
    logger.info(json.dumps(log_entry))

# Log every prediction for audit trail and monitoring
# Store in a queryable format (BigQuery, Elasticsearch, etc.)