Model Monitoring
Learn how to detect data drift, concept drift, and performance degradation in production ML systems.
Why Monitor ML Models?
Unlike traditional software that either works or crashes, ML models degrade silently. A model can continue returning predictions while its accuracy deteriorates, leading to poor business outcomes without any error logs.
Types of Drift
Data Drift (Covariate Shift)
The distribution of input features changes from what the model was trained on. For example, if a model was trained on data from users aged 25-45, but starts receiving requests from teenagers, the input distribution has shifted.
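As a minimal sketch of detecting this for a single numeric feature (assuming SciPy is available), a two-sample Kolmogorov-Smirnov test compares the production distribution against the training distribution; the age figures below are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Hypothetical feature: age, trained on adults, now skewing younger
train_ages = rng.normal(35, 5, size=5000)
prod_ages = rng.normal(22, 4, size=5000)

stat, p_value = ks_2samp(train_ages, prod_ages)
drifted = p_value < 0.01  # reject "same distribution" at 1% significance
print(f"KS statistic={stat:.3f}, p={p_value:.2e}, drift={drifted}")
```

The KS statistic is the maximum gap between the two empirical CDFs; in practice you would run one test per feature and correct for multiple comparisons.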
Concept Drift
The relationship between inputs and outputs changes. For example, the meaning of "spam" evolves over time as spammers adapt their tactics. The same features now map to different labels.
Prediction Drift
The distribution of model outputs changes. Even when each input feature looks similar in isolation, the model may start predicting differently because combinations of features have shifted or an upstream pipeline change altered the inputs it actually sees.
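A chi-squared goodness-of-fit test (one of the detection methods listed below) can flag this for a classifier by comparing predicted-class counts in a production window against the class proportions seen at training time; the counts here are illustrative:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical predicted-class counts: [negative, positive]
reference_counts = np.array([900, 100])  # 10% positive at training time
current_counts = np.array([700, 300])    # 30% positive in production

# Expected counts under the reference class proportions
expected = reference_counts / reference_counts.sum() * current_counts.sum()
stat, p_value = chisquare(current_counts, f_exp=expected)
print(f"chi2={stat:.1f}, p={p_value:.2e}, drift={p_value < 0.01}")
```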
| Drift Type | What Changes | Detection Method | Example |
|---|---|---|---|
| Data Drift | Input features P(X) | KS test, PSI, Jensen-Shannon | User demographics shift |
| Concept Drift | P(Y|X) relationship | Performance monitoring, ADWIN | Fraud patterns change |
| Prediction Drift | Output P(Y) | Chi-squared, distribution comparison | Model predicts more positives |
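The Population Stability Index (PSI) from the table can be computed with plain NumPy. A common rule of thumb (a convention, not a universal standard) treats PSI below 0.1 as stable, 0.1-0.25 as moderate shift, and above 0.25 as significant drift:

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index between two 1-D samples."""
    # Bin edges from reference quantiles, so each bin holds ~equal reference mass
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)

    # Small floor keeps the log finite when a bin is empty
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
same = psi(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
shifted = psi(rng.normal(0, 1, 10_000), rng.normal(1, 1, 10_000))
print(f"no drift: {same:.3f}, shifted: {shifted:.3f}")
```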
Detecting Data Drift
Evidently compares a reference dataset (typically the training data) against current production data and flags drifted features:

```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# Compare reference (training) data with current (production) data
report = Report(metrics=[
    DataDriftPreset(),
    TargetDriftPreset(),
])
report.run(
    reference_data=train_df,
    current_data=production_df,
)

# Save HTML report
report.save_html("drift_report.html")

# Get results programmatically
result = report.as_dict()
drift_detected = result["metrics"][0]["result"]["dataset_drift"]
print(f"Dataset drift detected: {drift_detected}")
```
Performance Degradation Detection
Monitor model performance metrics over time with sliding windows:
```python
from sklearn.metrics import accuracy_score

class ModelMonitor:
    def __init__(self, window_size=1000, threshold=0.05):
        self.window_size = window_size
        self.threshold = threshold
        self.baseline_accuracy = None
        self.predictions = []
        self.actuals = []

    def set_baseline(self, accuracy):
        self.baseline_accuracy = accuracy

    def log_prediction(self, predicted, actual):
        self.predictions.append(predicted)
        self.actuals.append(actual)
        # Evaluate only once a baseline is set and a full window is available
        if (self.baseline_accuracy is not None
                and len(self.predictions) >= self.window_size):
            current_accuracy = accuracy_score(
                self.actuals[-self.window_size:],
                self.predictions[-self.window_size:],
            )
            degradation = self.baseline_accuracy - current_accuracy
            if degradation > self.threshold:
                self.trigger_alert(current_accuracy, degradation)

    def trigger_alert(self, current_accuracy, degradation):
        print("ALERT: Model degradation detected!")
        print(f"  Baseline: {self.baseline_accuracy:.4f}")
        print(f"  Current:  {current_accuracy:.4f}")
        print(f"  Drop:     {degradation:.4f}")
```
Feature Importance Monitoring
Track how feature importance changes over time. Significant shifts can indicate data pipeline issues or concept drift:
- Compare SHAP values between training and production data.
- Monitor feature contribution distributions.
- Alert when feature rankings change significantly.
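One lightweight version of the last bullet (a sketch, not tied to any particular explainability library) compares importance rankings between a reference window and a current window with Spearman rank correlation, alerting when the ordering diverges:

```python
import numpy as np
from scipy.stats import spearmanr

def ranking_shift(ref_importances, cur_importances, min_corr=0.8):
    """Return (rank correlation, alert) for two importance vectors
    over the same ordered list of features."""
    corr, _ = spearmanr(ref_importances, cur_importances)
    return corr, corr < min_corr

# Hypothetical mean |SHAP| values for 5 features, training vs. production
ref = np.array([0.40, 0.25, 0.15, 0.12, 0.08])
stable = np.array([0.38, 0.27, 0.14, 0.13, 0.08])   # same ordering
flipped = np.array([0.10, 0.12, 0.20, 0.28, 0.30])  # ordering reversed

print(ranking_shift(ref, stable))   # high correlation, no alert
print(ranking_shift(ref, flipped))  # negative correlation, alert
```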
Alert Systems
Set up alerts for different severity levels:
Critical Alerts
Model accuracy below minimum threshold. Immediate page to on-call engineer. Auto-rollback to previous model version.
Warning Alerts
Drift detected but performance still acceptable. Slack notification to ML team. Schedule investigation.
Info Alerts
Minor distribution shifts. Weekly summary email. Track in monitoring dashboard.
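The three levels above can be wired into a single routing function; the thresholds here are illustrative assumptions, not recommended values:

```python
from enum import Enum

class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"

def classify_alert(accuracy, drift_score, min_accuracy=0.80, drift_threshold=0.25):
    """Map monitoring signals to a severity level (thresholds are illustrative)."""
    if accuracy < min_accuracy:
        return Severity.CRITICAL   # page on-call, consider auto-rollback
    if drift_score > drift_threshold:
        return Severity.WARNING    # notify ML team, schedule investigation
    return Severity.INFO           # track in dashboard / weekly summary

print(classify_alert(accuracy=0.75, drift_score=0.05))  # Severity.CRITICAL
print(classify_alert(accuracy=0.90, drift_score=0.40))  # Severity.WARNING
print(classify_alert(accuracy=0.90, drift_score=0.05))  # Severity.INFO
```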
Retraining Triggers
- Scheduled: Retrain on a fixed schedule (daily, weekly, monthly).
- Performance-triggered: Retrain when accuracy drops below a threshold.
- Drift-triggered: Retrain when significant data drift is detected.
- Data volume-triggered: Retrain when enough new labeled data is available.
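The four triggers can be combined into one periodic check; the signal names and thresholds below are assumptions for illustration:

```python
from datetime import datetime, timedelta

def should_retrain(last_trained, accuracy, drift_detected, new_labeled_rows,
                   max_age=timedelta(days=7), min_accuracy=0.85,
                   min_new_rows=10_000):
    """Return (retrain?, reason) combining the four trigger types."""
    if datetime.now() - last_trained > max_age:
        return True, "scheduled"
    if accuracy < min_accuracy:
        return True, "performance"
    if drift_detected:
        return True, "drift"
    if new_labeled_rows >= min_new_rows:
        return True, "data_volume"
    return False, "none"

print(should_retrain(datetime.now() - timedelta(days=2), 0.90, False, 500))
# -> (False, 'none')
print(should_retrain(datetime.now() - timedelta(days=10), 0.90, False, 500))
# -> (True, 'scheduled')
```

Ordering matters: the checks run in priority order, so a stale model retrains on schedule even if its measured accuracy is still acceptable.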
Monitoring Tools
| Tool | Type | Key Features |
|---|---|---|
| Evidently AI | Open-source | Data drift, model quality, reports + dashboards |
| WhyLabs | SaaS | Real-time monitoring, anomaly detection, data profiling |
| Arize | SaaS | Embedding drift, performance tracing, LLM monitoring |
| Fiddler | SaaS | Explainability, fairness monitoring, model audit |
| Grafana + Prometheus | Open-source | Custom metrics, alerting, dashboards (general-purpose) |
Logging and Observability
```python
import logging
import json
from datetime import datetime, timezone

logger = logging.getLogger("ml_predictions")

def log_prediction(request_id, features, prediction, latency_ms):
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "features": features,
        "prediction": prediction,
        "confidence": float(prediction["confidence"]),
        "latency_ms": latency_ms,
        "model_version": "v2.3.1",
    }
    logger.info(json.dumps(log_entry))

# Log every prediction for audit trail and monitoring
# Store in a queryable format (BigQuery, Elasticsearch, etc.)
```