Advanced

Model Monitoring

Learn how to detect data drift, concept drift, and performance degradation in production ML systems.

Why Monitor ML Models?

Unlike traditional software that either works or crashes, ML models degrade silently. A model can continue returning predictions while its accuracy deteriorates, leading to poor business outcomes without any error logs.

Types of Drift

Data Drift (Covariate Shift)

The distribution of input features changes from what the model was trained on. For example, if a model was trained on data from users aged 25-45, but starts receiving requests from teenagers, the input distribution has shifted.

Concept Drift

The relationship between inputs and outputs changes. For example, the meaning of "spam" evolves over time as spammers adapt their tactics. The same features now map to different labels.

Prediction Drift

The distribution of model outputs changes. Even if the inputs look similar, the model may start predicting differently because of subtle feature interactions or upstream pipeline changes.
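A minimal sketch of detecting prediction drift with a chi-squared test over binned outputs — the bin count and significance level here are arbitrary choices, and the sketch assumes predictions are continuous scores:

```python
import numpy as np
from scipy.stats import chisquare

def prediction_drift_chi2(reference_preds, current_preds, bins=10, alpha=0.05):
    """Chi-squared test comparing binned prediction distributions."""
    # Bin both samples on edges derived from the reference predictions
    edges = np.histogram_bin_edges(reference_preds, bins=bins)
    cur_counts = np.histogram(np.clip(current_preds, edges[0], edges[-1]), bins=edges)[0]
    ref_counts = np.histogram(reference_preds, bins=edges)[0]
    # Laplace smoothing avoids empty expected bins; rescale so totals match
    ref_prop = (ref_counts + 1) / (ref_counts.sum() + bins)
    expected = ref_prop * cur_counts.sum()
    stat, p_value = chisquare(cur_counts, f_exp=expected)
    return p_value < alpha, p_value

rng = np.random.default_rng(0)
reference = rng.beta(2, 5, 5000)   # stand-in for training-time scores
drifted, p = prediction_drift_chi2(reference, rng.beta(5, 2, 5000))
```

Binning on the reference edges (and clipping production scores into that range) keeps the two histograms comparable even when production values fall outside the training range.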

Drift Type        What Changes          Detection Method                      Example
Data Drift        Input features P(X)   KS test, PSI, Jensen-Shannon          User demographics shift
Concept Drift     P(Y|X) relationship   Performance monitoring, ADWIN         Fraud patterns change
Prediction Drift  Output P(Y)           Chi-squared, distribution comparison  Model predicts more positives
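PSI, one of the detection methods listed above, takes only a few lines of numpy. A common rule of thumb (an assumption here — thresholds vary by team) treats PSI below 0.1 as stable, 0.1 to 0.25 as moderate, and above 0.25 as significant drift:

```python
import numpy as np

def population_stability_index(reference, current, bins=10):
    """PSI between reference (training) and current (production) samples of one feature."""
    # Decile edges from the reference distribution
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Clip production values so out-of-range points land in the outer bins
    current = np.clip(current, edges[0], edges[-1])
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Small floor keeps the log well-defined when a bin is empty
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(42)
stable = population_stability_index(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
shifted = population_stability_index(rng.normal(0, 1, 10_000), rng.normal(1, 1, 10_000))
```

Quantile-based edges give every reference bin roughly equal mass, which makes the per-bin comparison more stable than fixed-width bins.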

Detecting Data Drift

Python — Data drift detection with Evidently
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# Compare reference (training) data with current (production) data
report = Report(metrics=[
    DataDriftPreset(),
    TargetDriftPreset(),
])

report.run(
    reference_data=train_df,
    current_data=production_df,
)

# Save HTML report
report.save_html("drift_report.html")

# Get results programmatically
result = report.as_dict()
drift_detected = result["metrics"][0]["result"]["dataset_drift"]
print(f"Dataset drift detected: {drift_detected}")

Performance Degradation Detection

Monitor model performance metrics over time with sliding windows:

Python — Performance monitoring
import numpy as np
from sklearn.metrics import accuracy_score

class ModelMonitor:
    def __init__(self, window_size=1000, threshold=0.05):
        self.window_size = window_size
        self.threshold = threshold
        self.baseline_accuracy = None
        self.predictions = []
        self.actuals = []

    def set_baseline(self, accuracy):
        self.baseline_accuracy = accuracy

    def log_prediction(self, predicted, actual):
        self.predictions.append(predicted)
        self.actuals.append(actual)

        # Keep memory bounded: only the most recent window is ever scored
        if len(self.predictions) > self.window_size:
            del self.predictions[:-self.window_size]
            del self.actuals[:-self.window_size]

        if self.baseline_accuracy is not None and len(self.predictions) >= self.window_size:
            current_accuracy = accuracy_score(
                self.actuals[-self.window_size:],
                self.predictions[-self.window_size:]
            )
            degradation = self.baseline_accuracy - current_accuracy

            if degradation > self.threshold:
                self.trigger_alert(current_accuracy, degradation)

    def trigger_alert(self, current_accuracy, degradation):
        print(f"ALERT: Model degradation detected!")
        print(f"  Baseline: {self.baseline_accuracy:.4f}")
        print(f"  Current:  {current_accuracy:.4f}")
        print(f"  Drop:     {degradation:.4f}")

Feature Importance Monitoring

Track how feature importance changes over time. Significant shifts can indicate data pipeline issues or concept drift:

  • Compare SHAP values between training and production data.
  • Monitor feature contribution distributions.
  • Alert when feature rankings change significantly.
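The first and third bullets can be sketched by comparing mean |SHAP| rankings with a rank correlation. This assumes per-sample SHAP values have already been computed elsewhere (e.g. with a SHAP explainer) as 2-D arrays; the 0.8 correlation floor is an illustrative choice:

```python
import numpy as np
from scipy.stats import spearmanr

def importance_rank_shift(ref_shap, prod_shap, feature_names, min_corr=0.8):
    """Compare mean |SHAP| feature rankings between reference and production.

    Both inputs are (n_samples, n_features) arrays of precomputed SHAP values.
    Returns the Spearman rank correlation of the two importance vectors.
    """
    ref_imp = np.abs(ref_shap).mean(axis=0)
    prod_imp = np.abs(prod_shap).mean(axis=0)
    corr, _ = spearmanr(ref_imp, prod_imp)
    if corr < min_corr:
        top_ref = [feature_names[i] for i in np.argsort(ref_imp)[::-1][:3]]
        top_prod = [feature_names[i] for i in np.argsort(prod_imp)[::-1][:3]]
        print(f"Ranking shift (rho={corr:.2f}): top features {top_ref} -> {top_prod}")
    return corr

# Synthetic SHAP arrays: mean |SHAP| is [3, 2, 1] vs [1, 2, 3]
names = ["age", "income", "tenure"]
ref = np.array([[3.0, -2.0, 1.0]] * 50)
prod = np.array([[1.0, 2.0, -3.0]] * 50)
corr = importance_rank_shift(ref, prod, names)
```

Rank correlation is deliberately insensitive to the absolute magnitude of SHAP values, which can shift for benign reasons (e.g. retraining on more data); only reordering triggers the alert.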

Alert Systems

Set up alerts for different severity levels:

Critical Alerts

Model accuracy below minimum threshold. Immediate page to on-call engineer. Auto-rollback to previous model version.

Warning Alerts

Drift detected but performance still acceptable. Slack notification to ML team. Schedule investigation.

Info Alerts

Minor distribution shifts. Weekly summary email. Track in monitoring dashboard.
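The three severity levels above can be wired through a single dispatcher. This is a minimal sketch — the thresholds and the notifier callables (pager, Slack webhook, dashboard writer) are placeholders to be replaced with real integrations:

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # page on-call, consider auto-rollback
    WARNING = "warning"    # notify ML team channel
    INFO = "info"          # aggregate into weekly summary

def classify_alert(baseline_acc, current_acc, drift_score,
                   acc_floor=0.80, drift_threshold=0.25):
    """Map monitoring signals to a severity level (thresholds are illustrative)."""
    if current_acc < acc_floor:
        return Severity.CRITICAL
    if drift_score > drift_threshold:
        return Severity.WARNING
    if baseline_acc - current_acc > 0.01 or drift_score > 0.1:
        return Severity.INFO
    return None  # nothing to report

def route_alert(severity, message, notifiers):
    """Dispatch to the handler registered for this severity, if any."""
    handler = notifiers.get(severity)
    if handler:
        handler(message)

notifiers = {
    Severity.CRITICAL: lambda msg: print(f"PAGE: {msg}"),
    Severity.WARNING: lambda msg: print(f"SLACK: {msg}"),
    Severity.INFO: lambda msg: print(f"DIGEST: {msg}"),
}
severity = classify_alert(baseline_acc=0.92, current_acc=0.75, drift_score=0.05)
route_alert(severity, "accuracy below floor", notifiers)
```

Checking the accuracy floor before the drift score encodes the priority order: a broken model always outranks a drifting one.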

Retraining Triggers

  • Scheduled: Retrain on a fixed schedule (daily, weekly, monthly).
  • Performance-triggered: Retrain when accuracy drops below a threshold.
  • Drift-triggered: Retrain when significant data drift is detected.
  • Data volume-triggered: Retrain when enough new labeled data is available.
Best practice: Combine multiple triggers. Use scheduled retraining as a baseline, with performance and drift triggers for urgent updates. Always validate the new model before promoting it.
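The four trigger types above can be combined in one decision function. A sketch, with illustrative thresholds (the schedule, accuracy floor, drift cutoff, and row count are all assumptions to tune per model):

```python
from datetime import datetime, timedelta, timezone

def should_retrain(last_trained, current_accuracy, drift_score, new_labeled_rows,
                   schedule=timedelta(days=7), acc_threshold=0.85,
                   drift_threshold=0.25, min_new_rows=50_000):
    """Evaluate all four trigger types; returns (decision, firing reasons)."""
    reasons = []
    if datetime.now(timezone.utc) - last_trained >= schedule:
        reasons.append("scheduled")
    if current_accuracy < acc_threshold:
        reasons.append("performance")
    if drift_score > drift_threshold:
        reasons.append("drift")
    if new_labeled_rows >= min_new_rows:
        reasons.append("data_volume")
    return bool(reasons), reasons

recent = datetime.now(timezone.utc) - timedelta(days=1)
stale = datetime.now(timezone.utc) - timedelta(days=8)
```

Returning the list of firing reasons, not just a boolean, makes the decision auditable — the reasons can be logged alongside the retraining job.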

Monitoring Tools

Tool                  Type         Key Features
Evidently AI          Open-source  Data drift, model quality, reports + dashboards
WhyLabs               SaaS         Real-time monitoring, anomaly detection, data profiling
Arize                 SaaS         Embedding drift, performance tracing, LLM monitoring
Fiddler               SaaS         Explainability, fairness monitoring, model audit
Grafana + Prometheus  Open-source  Custom metrics, alerting, dashboards (general-purpose)

Logging and Observability

Python — Structured logging for ML predictions
import logging
import json
from datetime import datetime, timezone

logger = logging.getLogger("ml_predictions")

def log_prediction(request_id, features, prediction, latency_ms):
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "features": features,
        "prediction": prediction,
        "confidence": float(prediction["confidence"]),
        "latency_ms": latency_ms,
        "model_version": "v2.3.1",
    }
    logger.info(json.dumps(log_entry))

# Log every prediction for audit trail and monitoring
# Store in a queryable format (BigQuery, Elasticsearch, etc.)