Intermediate

Monitoring & Observability

These 12 questions cover the monitoring and observability challenges unique to ML systems. Unlike traditional software where bugs produce errors, ML models can silently degrade — making monitoring the most critical defense against production failures.

Q1: What is data drift and how do you detect it?

💡
Model Answer:

Data drift (also called covariate shift) occurs when the statistical distribution of input data in production differs from the distribution the model was trained on. The model's learned patterns may no longer apply.

Detection methods:

  • Population Stability Index (PSI): Compares the distribution of a feature between training and production. PSI < 0.1 = no drift, 0.1–0.2 = moderate drift, > 0.2 = significant drift. Simple, interpretable, works for both categorical and numerical features.
  • Kolmogorov-Smirnov (KS) test: Non-parametric test that compares two distributions. Returns a p-value; if p < 0.05, distributions are significantly different. Good for numerical features but sensitive to sample size.
  • Jensen-Shannon divergence: Symmetric measure of difference between two probability distributions. Ranges from 0 (identical) to 1 (maximally different). More robust than KL divergence because it is defined even when distributions do not overlap.
  • Multivariate drift detection: Single-feature tests miss correlations. Use Maximum Mean Discrepancy (MMD) or train a classifier to distinguish training from production data. If the classifier achieves high accuracy, distributions are different.

Practical approach: Monitor the top 10–20 most important features using PSI with a daily cadence. Set alerts at PSI > 0.15. For critical models, run multivariate drift detection weekly. Use tools like Evidently AI, WhyLabs, or NannyML for automated drift reports.
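As a concrete illustration, a minimal PSI computation might look like the following hand-rolled sketch. The bin count, the 1e-4 floor for empty buckets, and the use of the baseline's range to set bin edges are all illustrative choices, not a standard:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a production sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bin edges taken from the baseline's range
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def bucket_fracs(values):
        counts = [0] * bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # Floor at 1e-4 so empty buckets do not blow up the log term
        return [max(c / len(values), 1e-4) for c in counts]

    e, a = bucket_fracs(expected), bucket_fracs(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

With identical samples the score is 0; a clearly shifted production sample pushes it well past the 0.2 "significant drift" threshold.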

Q2: What is the difference between data drift, concept drift, and prediction drift?

💡
Model Answer:

  • Data drift: the input feature distributions P(X) change. Example: after a marketing campaign, customer demographics shift to a younger population, and the model receives inputs it has rarely seen. Detection: statistical tests on features (PSI, KS test).
  • Concept drift: the relationship between features and target P(Y|X) changes. Example: during a recession, the factors that predict customer churn change; same features, different outcomes. Detection: monitor model performance metrics over time, which requires ground truth labels.
  • Prediction drift: the output distribution P(Y_hat) changes. Example: a fraud model suddenly flags 30% of transactions instead of the usual 2%; the input distribution may look fine, but predictions have shifted. Detection: monitor the prediction distribution (mean, variance, percentiles).

Key insight: Prediction drift is the most actionable because it does not require ground truth labels and catches problems quickly. If your model's prediction distribution changes, something is wrong — even if you cannot immediately tell whether the root cause is data drift or concept drift. Monitor all three, but alert on prediction drift first.
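As an illustration of alerting on prediction drift without labels, here is a minimal Jensen-Shannon divergence over binned prediction outputs. Base-2 logs keep the value in [0, 1]; the example class-rate vectors below are assumptions matching the fraud scenario above:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete prediction distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL divergence with base-2 logs; 0 * log(0) terms are dropped
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions score 0 and fully disjoint ones score 1, so comparing the baseline fraud-rate vector [0.98, 0.02] against a shifted [0.70, 0.30] yields a clearly nonzero divergence worth alerting on.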

Q3: What metrics should you monitor for a production ML model?

💡
Model Answer:

Production ML monitoring has four layers:

  • Infrastructure metrics: CPU/GPU utilization, memory usage, disk I/O, network latency, container health, pod restarts. These are standard DevOps metrics but critical for ML because GPU OOM errors and memory leaks are common.
  • Service metrics: Request rate (QPS), latency (p50, p95, p99), error rate (4xx, 5xx), queue depth, batch size distribution. These tell you if the serving infrastructure is healthy.
  • Model quality metrics: Prediction distribution (mean, variance, percentiles), feature importance stability, confidence score distribution, prediction-to-label delay. These catch model-specific issues that infrastructure metrics miss.
  • Business metrics: Revenue impact, user engagement, conversion rate, customer satisfaction. Ultimately, a model that serves fast but produces bad predictions is worse than no model at all. Map model predictions to business outcomes.

Monitoring stack example: Prometheus for metrics collection, Grafana for dashboards, PagerDuty for alerting, Evidently AI for drift detection, custom logging to BigQuery for offline analysis.
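In practice the service-metric layer is handled by Prometheus histograms, but the percentile arithmetic itself is simple; here is a minimal in-process sketch (the nearest-rank index computation is an illustrative choice):

```python
class LatencyTracker:
    """Tracks request latencies and reports p50/p95/p99 for the service-metrics layer."""

    def __init__(self):
        self.samples_ms = []

    def record(self, latency_ms):
        self.samples_ms.append(latency_ms)

    def percentile(self, pct):
        s = sorted(self.samples_ms)
        # Nearest-rank index into the sorted samples
        idx = min(len(s) - 1, round(pct / 100 * (len(s) - 1)))
        return s[idx]
```

A real deployment would use a bounded window or a streaming sketch (e.g. t-digest) instead of keeping every sample in memory.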

Q4: How do you set up alerting for ML systems? What should trigger an alert?

💡
Model Answer:

ML alerting must balance catching real issues against avoiding alert fatigue. Use tiered severity:

  • P0 (Critical): model endpoint is down, error rate > 10%, or latency p99 > 5x SLA. Response: immediate page, wake someone up. Channel: PagerDuty, phone call.
  • P1 (High): prediction drift detected, accuracy drop > 5%, or data pipeline delayed > 2 hours. Response: investigate within 1 hour during business hours. Channel: Slack #ml-alerts, email.
  • P2 (Medium): data drift in non-critical features, model confidence decreasing, or GPU utilization > 90%. Response: investigate within 1 business day. Channel: Slack #ml-monitoring.
  • P3 (Low): minor distribution shifts, experiment tracking issues, non-critical pipeline warnings. Response: review in next sprint planning. Channel: weekly report.

Key practices: (1) Alert on symptoms (high error rate) not causes (high CPU) — causes can have many symptoms, but users feel symptoms. (2) Include runbook links in every alert so the on-call engineer knows what to do. (3) Review and prune alerts quarterly — if an alert fires frequently and is always ignored, either fix the root cause or remove the alert. (4) Use anomaly detection (not just static thresholds) for metrics with natural variation like prediction volume.
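Practice (4) above, anomaly detection instead of static thresholds, can be sketched with a simple z-score against recent history (the history window and the 3-sigma threshold are illustrative choices):

```python
import statistics

def is_anomalous(history, current, z_threshold=3.0):
    """Flag a metric value more than z_threshold standard deviations from recent history."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    if std == 0:
        return current != mean
    return abs(current - mean) / std > z_threshold
```

For a metric like hourly prediction volume, this fires on a genuine spike but tolerates the natural variation that a static threshold would either miss or over-alert on.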

Q5: How do you build an ML monitoring dashboard? What panels should it include?

💡
Model Answer:

An effective ML dashboard follows the "overview-then-detail" pattern. Start with high-level health, then drill down:

  • Top row — Service health: Traffic (QPS), error rate, latency p50/p99, model version currently serving. Green/yellow/red status indicator. This row answers "Is the service up and responsive?"
  • Second row — Prediction quality: Prediction distribution histogram (current vs. baseline), mean prediction over time, confidence score distribution, prediction volume by class. This answers "Are predictions looking normal?"
  • Third row — Data quality: Feature drift scores (PSI) for top features, missing value rates, input schema violations, data freshness (time since last training data update). This answers "Is the input data healthy?"
  • Fourth row — Model performance: Accuracy/F1/AUC over time (if ground truth is available), A/B test results, business KPI trends. This answers "Is the model delivering value?"
  • Bottom row — Infrastructure: GPU/CPU utilization, memory usage, queue depth, auto-scaling events, cost per prediction. This answers "Are we running efficiently?"

Pro tip: Build separate dashboards for different audiences. Engineers need latency percentiles and error traces. Data scientists need drift scores and accuracy trends. Business stakeholders need revenue impact and KPI dashboards. A single dashboard for everyone satisfies no one.

Q6: What are SLAs, SLOs, and SLIs for ML services?

💡
Model Answer:
  • SLI (Service Level Indicator): A quantitative measure of service behavior. Example: "99.2% of requests completed in under 100ms." SLIs are the raw metrics you measure.
  • SLO (Service Level Objective): A target value for an SLI. Example: "99.5% of requests must complete in under 100ms." SLOs are internal targets your team commits to.
  • SLA (Service Level Agreement): A contract with customers that includes consequences for missing targets. Example: "If availability drops below 99.9%, customers receive service credits." SLAs are external commitments.

ML-specific SLOs to define:

  • Availability: 99.95% uptime for the prediction endpoint
  • Latency: p99 < 100ms for online inference
  • Freshness: Model retrained within 24 hours of drift detection
  • Quality: Model accuracy stays within 2% of baseline on rolling 7-day window
  • Throughput: Support 10,000 predictions per second at peak

Error budget: If your SLO is 99.9% availability (43 minutes of downtime per month), and you have used 30 minutes this month, you have 13 minutes of error budget left. If the budget is exhausted, freeze deployments until next month. This forces teams to prioritize reliability over new features when needed.
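The error-budget arithmetic is easy to codify; a minimal sketch, assuming a 30-day window:

```python
def error_budget_minutes(slo, window_days=30):
    """Total allowed downtime, in minutes, for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

def remaining_budget_minutes(slo, downtime_so_far_min, window_days=30):
    """Budget left after the downtime already spent this window."""
    return error_budget_minutes(slo, window_days) - downtime_so_far_min
```

A 99.9% SLO over 30 days yields a 43.2-minute budget; after 30 minutes of downtime, 13.2 minutes remain.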

Q7: How do you detect model degradation when ground truth labels are delayed?

💡
Model Answer:

Many ML systems have delayed feedback. Fraud labels arrive days later. Churn happens weeks after prediction. In these cases, you cannot compute accuracy in real time. Use proxy metrics instead:

  • Prediction distribution monitoring: If the model suddenly predicts "fraud" for 15% of transactions instead of the usual 2%, something changed. Track mean, median, and percentile values of prediction scores over time.
  • Input drift as a leading indicator: If input features drift significantly, model performance is likely to follow. Monitor feature distributions and alert before accuracy degrades.
  • Confidence calibration: Track the model's confidence scores. A well-calibrated model that says "80% probability" should be correct 80% of the time. If calibration deteriorates (checked against delayed ground truth), the model is degrading.
  • Human-in-the-loop sampling: Send a random 1% of predictions for human review. This provides fast (hours, not weeks) quality signals. Expensive but essential for high-stakes models.
  • Upstream/downstream metrics: Even without labels, you can monitor downstream business metrics (conversion rate, customer complaints) that correlate with model quality.

When delayed labels finally arrive: Compute lagging accuracy metrics and plot them against the proxy metrics you tracked in real time. This lets you calibrate your proxy thresholds — "When PSI exceeded 0.2, accuracy dropped 5% three weeks later." Over time, your proxy metrics become reliable early warning systems.

Q8: How do you implement logging for ML inference in production?

💡
Model Answer:

ML inference logging must capture enough information to debug issues and compute metrics, without adding excessive latency or storage cost:

  • What to log per request: Request ID, timestamp, model version, input features (or a hash if sensitive), prediction output, confidence score, latency breakdown (feature computation, inference, post-processing), and any errors.
  • Sampling strategy: Log 100% of metadata (request ID, latency, model version, prediction). Sample input features at 1–10% to reduce storage cost. Log 100% of error cases and low-confidence predictions for debugging.
  • Storage architecture: Write logs to a fast append-only store (Kafka) for real-time monitoring, then batch into a data warehouse (BigQuery, Snowflake) for offline analysis. Keep hot data (last 7 days) in fast storage, archive older data to cold storage (S3 Glacier).
  • Privacy compliance: Never log PII (personally identifiable information) unless required and encrypted. Hash or tokenize user IDs. Implement data retention policies (delete logs after 90 days unless required for audit). GDPR right-to-deletion applies to inference logs.
# Structured logging for ML inference
import time

import structlog

logger = structlog.get_logger()

def predict(request):
    start = time.time()
    features = compute_features(request)
    prediction = model.predict(features)
    latency = (time.time() - start) * 1000

    logger.info("inference_complete",
        request_id=request.id,
        model_version="v2.3.1",
        prediction=float(prediction.score),
        confidence=float(prediction.confidence),
        latency_ms=round(latency, 2),
        feature_hash=hash_features(features),
    )
    return prediction

Q9: What is the feedback loop problem in ML monitoring?

💡
Model Answer:

The feedback loop problem occurs when a model's predictions influence the data it will be retrained on, creating a self-reinforcing cycle that can amplify biases or mask errors.

Example: A content recommendation model promotes certain articles. Users click on those articles (because they are shown). The model is retrained on click data, learning to promote those articles even more. Content that was never shown gets zero clicks and is never recommended, regardless of quality.

Mitigation strategies:

  • Exploration/exploitation: Reserve 5–10% of traffic for random exploration. Show items the model would not normally recommend to collect unbiased feedback data.
  • Counterfactual evaluation: Use inverse propensity scoring to de-bias logged data. Weight examples by the inverse probability that the model would have shown them.
  • Holdout monitoring: Keep a population that receives non-personalized (random or rule-based) recommendations. Compare their outcomes with the model-served population to detect bias amplification.
  • Diversity constraints: Force the model to maintain diversity in its predictions. If a recommendation model always shows the same 100 items to everyone, add a diversity penalty.
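The exploration/exploitation split in the first bullet can be sketched as a request-level epsilon router; the function names and the logged "arm" field are illustrative:

```python
import random

def serve(model_recommend, random_recommend, epsilon=0.05, rng=random):
    """Route an epsilon share of requests to unbiased random recommendations.

    Logging which arm served each request is what later makes
    counterfactual (inverse-propensity) evaluation possible.
    """
    if rng.random() < epsilon:
        return {"arm": "explore", "items": random_recommend()}
    return {"arm": "exploit", "items": model_recommend()}
```

With a seeded RNG, roughly 5% of simulated requests land on the explore arm, giving an unbiased feedback stream alongside the model-served traffic.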

In interviews, mention: Feedback loops are especially dangerous in high-stakes domains like criminal justice (predictive policing), lending (credit scoring), and hiring (resume screening). A biased model creates biased data, which trains a more biased model. Breaking this loop requires intentional intervention, not just better algorithms.

Q10: How do you handle on-call for ML systems? What is different from traditional software on-call?

💡
Model Answer:

ML on-call adds complexity because issues can be caused by data, models, or infrastructure — and the root cause is often not obvious:

  • Traditional on-call: Service is down or throwing errors. Root cause is usually in code or infrastructure. Fix: rollback, restart, scale up. Diagnosis: error logs, stack traces.
  • ML on-call: Service might be up and healthy but producing wrong predictions. Root cause might be upstream data quality, model drift, feature store staleness, or a combination. Diagnosis requires checking data pipelines, feature distributions, and model performance — not just error logs.

ML on-call runbook should include:

  1. Check service health (is the endpoint responding?)
  2. Check prediction distribution (are outputs normal?)
  3. Check input data quality (are features within expected ranges?)
  4. Check upstream data pipelines (is fresh data flowing?)
  5. Check model version (was a new model deployed recently?)
  6. Check feature store freshness (are cached features stale?)
  7. If model degradation is confirmed, roll back to previous model version
  8. If data pipeline issue, switch to fallback/default predictions while the pipeline is fixed

Best practice: Require ML engineers to do on-call rotations for the models they build. This creates a "you build it, you run it" culture that improves model quality and monitoring from the start.

Q11: What is a model health score and how do you compute one?

💡
Model Answer:

A model health score is a single composite metric (0–100) that summarizes the overall health of a production model. It aggregates multiple signals into one number that even non-technical stakeholders can understand.

Components and weights (example):

  • Service availability (25%): Uptime percentage over the last 24 hours. 99.9% = 25 points, 99% = 20 points, <99% = 10 points.
  • Latency compliance (20%): Percentage of requests meeting latency SLO. 99% within SLO = 20 points, 95% = 15 points, <90% = 5 points.
  • Data drift (20%): Inverse of average PSI across top features. No drift = 20 points, moderate drift = 10 points, severe drift = 0 points.
  • Prediction stability (20%): How much the prediction distribution has changed from baseline. Stable = 20 points, shifted = 10 points, dramatically different = 0 points.
  • Data freshness (15%): How recently the model was retrained or data pipelines updated. Within schedule = 15 points, overdue = 5 points, severely overdue = 0 points.

Usage: Display on a dashboard as a traffic light. Green (>80), Yellow (50–80), Red (<50). Teams review red models in daily standups. This makes ML monitoring accessible to product managers and executives who do not understand PSI or KS statistics.
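A minimal scorer using the example weights might look like this; the exact point thresholds per component are illustrative, not a standard:

```python
def model_health_score(uptime_pct, slo_compliance_pct, avg_psi, prediction_shift, days_overdue):
    """Composite 0-100 health score using the example component weights."""
    score = 0
    # Service availability (25%)
    score += 25 if uptime_pct >= 99.9 else 20 if uptime_pct >= 99 else 10
    # Latency compliance (20%)
    score += 20 if slo_compliance_pct >= 99 else 15 if slo_compliance_pct >= 95 else 5
    # Data drift (20%): lower average PSI is healthier
    score += 20 if avg_psi < 0.1 else 10 if avg_psi < 0.2 else 0
    # Prediction stability (20%): distance of prediction distribution from baseline
    score += 20 if prediction_shift < 0.1 else 10 if prediction_shift < 0.25 else 0
    # Data freshness (15%): days past the retraining schedule
    score += 15 if days_overdue <= 0 else 5 if days_overdue <= 7 else 0
    return score

def traffic_light(score):
    """Map a health score to the dashboard traffic light."""
    return "green" if score > 80 else "yellow" if score >= 50 else "red"
```

A healthy model (99.95% uptime, 99.5% SLO compliance, PSI 0.05, stable predictions, retrained on schedule) scores 100 and shows green; a degraded one drops into red territory.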

Q12: How do you implement A/B test monitoring for ML models?

💡
Model Answer:

A/B test monitoring for ML models requires tracking both technical and business metrics across control and treatment groups:

  • Guardrail metrics: Metrics that must not degrade, regardless of the experiment goal. Examples: page load time, error rate, revenue per user. If any guardrail metric degrades by more than 1%, stop the experiment automatically.
  • Primary metric: The business metric you are trying to improve. Example: click-through rate, conversion rate, user engagement. This determines whether the new model is better.
  • Statistical significance: Do not call an experiment until you have sufficient sample size for statistical power (usually 80%). Use sequential testing (not fixed-horizon) so you can stop experiments early if results are clearly positive or negative.
  • Segment analysis: Check that improvements are consistent across user segments (new vs returning users, mobile vs desktop, different geographies). A model that improves overall CTR by 2% but decreases CTR for a minority segment by 10% may not be acceptable.
  • Novelty and primacy effects: Users may initially engage more with any change (novelty effect) or stick with familiar behavior (primacy effect). Run experiments for at least 2 weeks to let these effects wash out before making decisions.
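For the significance check, the classic fixed-horizon version is a two-proportion z-test, shown here for illustration (sequential methods, as recommended above, adjust these thresholds):

```python
import math

def two_proportion_z(conversions_a, n_a, conversions_b, n_b):
    """z statistic comparing treatment vs. control conversion rates (pooled standard error)."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

|z| > 1.96 corresponds to p < 0.05 two-sided; at 5% vs. 6% conversion with 10,000 users per arm, the lift clears that bar.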

Tools: Feature flagging services (LaunchDarkly, Statsig, Eppo) provide built-in A/B test analysis. For custom analysis, log experiment assignments and outcomes to a data warehouse and compute metrics in SQL or Python.