Intermediate

CI/CD for ML Questions

These 10 questions cover CI/CD practices adapted for machine learning systems. Traditional CI/CD handles code changes; ML CI/CD must also handle data changes, model retraining, and validation — a much harder problem.

Q1: How does CI/CD for ML differ from traditional software CI/CD?

💡
Model Answer:

Traditional CI/CD pipelines test code and deploy binaries. ML CI/CD must handle three additional dimensions:

  • Data testing: Validate that training data meets quality, schema, and distribution expectations. A code change might not break anything, but a data change can silently degrade model performance.
  • Model testing: Beyond unit tests, ML CI/CD must run model evaluation (accuracy, fairness, latency benchmarks) and compare against baseline. A model that passes all code tests might still produce terrible predictions.
  • Artifact management: Traditional CI/CD produces a binary. ML CI/CD produces a model artifact (potentially multi-GB), training metrics, evaluation reports, and data lineage — all of which must be versioned and stored.

Key insight: In traditional software, only code changes trigger the pipeline. In ML, there are three triggers: code changes (new feature engineering, model architecture), data changes (new training data available, data distribution shift), and schedule (periodic retraining to stay current). Your CI/CD system must handle all three.

Q2: What does a production ML training pipeline look like end-to-end?

💡
Model Answer:

A production training pipeline has these stages, each with validation gates:

  1. Data ingestion: Pull data from sources (data warehouse, S3, streaming). Validate schema, row counts, and freshness. Fail if data is stale or malformed.
  2. Data validation: Run statistical tests (Great Expectations, TensorFlow Data Validation). Check for missing values, outlier distributions, feature correlations. Compare against a reference dataset baseline.
  3. Feature engineering: Transform raw data into features. Compute features using the same code as the online serving path to avoid training-serving skew. Log feature statistics.
  4. Model training: Train with fixed hyperparameters or run hyperparameter tuning. Log all metrics to experiment tracker (MLflow, W&B). Set compute budget limits to prevent runaway costs.
  5. Model evaluation: Evaluate on held-out test set. Compare against the current production model baseline. Check fairness metrics across demographic groups. Run latency benchmarks.
  6. Model validation gate: Automated checks: accuracy >= baseline - 1%, latency p99 < 100ms, no fairness regression. If any check fails, the pipeline stops and alerts the team.
  7. Model registration: If validation passes, register the model artifact in the model registry with all metadata (metrics, data version, code commit).
  8. Deployment: Trigger a deployment pipeline (canary or blue-green) that promotes the registered model to production.

Tools: Kubeflow Pipelines, Apache Airflow, Vertex AI Pipelines, or AWS Step Functions for orchestration. Each step is a containerized component for reproducibility.
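The stages and gates above can be sketched as a tiny orchestration loop. This is plain illustrative Python, not a real orchestrator API; all stage names and thresholds are made up for the example.

```python
# Minimal sketch of a gated training pipeline: run stages in order,
# stop at the first stage whose validation gate fails.

def run_pipeline(stages):
    """Run each (name, stage) pair; stop at the first failed gate."""
    context = {}
    for name, stage in stages:
        if not stage(context):
            return f"pipeline stopped at: {name}"
    return "model registered and deployed"

def ingest(ctx):
    ctx["rows"] = 10_000          # pretend we pulled fresh data
    return ctx["rows"] > 0        # fail if the pull returned nothing

def validate_data(ctx):
    ctx["null_fraction"] = 0.001  # pretend statistics from the data checks
    return ctx["null_fraction"] < 0.01

def train(ctx):
    ctx["accuracy"] = 0.93        # pretend evaluation result
    return True

def evaluate_and_gate(ctx):
    baseline = 0.92               # current production model accuracy
    return ctx["accuracy"] >= baseline - 0.01   # accuracy gate from step 6

stages = [
    ("ingest", ingest),
    ("validate_data", validate_data),
    ("train", train),
    ("gate", evaluate_and_gate),
]
print(run_pipeline(stages))  # -> model registered and deployed
```

In a real orchestrator each stage would be a containerized component and the context would be passed as versioned artifacts, but the control flow is the same: every stage is a gate that can stop promotion.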

Q3: What types of tests should you run in an ML CI/CD pipeline?

💡
Model Answer:

ML testing has layers beyond traditional unit/integration tests:

| Test Type | What It Checks | When It Runs | Example |
| --- | --- | --- | --- |
| Unit tests | Individual functions work correctly | Every commit | Feature engineering function returns expected output for known input |
| Data tests | Training data meets expectations | Before training | No null values in required columns; feature distributions within 2 sigma of baseline |
| Model tests | Model quality meets thresholds | After training | Accuracy >= 0.92, F1 >= 0.88, AUC >= 0.95 on test set |
| Behavioral tests | Model handles edge cases correctly | After training | Model does not predict differently for “John Smith” vs “Juan Garcia” with identical features |
| Integration tests | End-to-end pipeline works | Every PR | Pipeline runs successfully on a small sample dataset |
| Performance tests | Latency and throughput are acceptable | Before deployment | p99 latency < 100ms, throughput > 1000 req/s |
| Regression tests | New model is not worse than current | Before deployment | New model accuracy >= production model accuracy - 0.01 |

Key practice: Run fast tests (unit, lint) on every commit. Run expensive tests (full training, model evaluation) only when training code or data changes. Use a small sample dataset for CI tests to keep pipeline time under 15 minutes.
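The two fastest layers (unit tests and data tests) can look like this. The `normalize` function and the sample records are hypothetical stand-ins for your real feature code and CI sample dataset:

```python
# Illustrative unit + data tests for a hypothetical feature-engineering
# function; both are cheap enough to run on every commit.

def normalize(values, mean, std):
    """Standardize values -- the kind of function covered by unit tests."""
    return [(v - mean) / std for v in values]

def test_normalize_known_input():
    # Unit test: known input -> expected output.
    assert normalize([10.0, 20.0], mean=10.0, std=10.0) == [0.0, 1.0]

def test_no_nulls_in_required_column():
    # Data test: required column has no nulls in the sample dataset.
    sample = [{"age": 34}, {"age": 51}, {"age": 29}]
    assert all(row["age"] is not None for row in sample)
```

Run with `pytest`; the same structure scales up to distribution checks against a stored baseline.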

Q4: How do you implement model validation gates?

💡
Model Answer:

A model validation gate is an automated checkpoint that decides whether a newly trained model is safe to deploy. It compares the candidate model against the current production model (or a fixed baseline) on multiple criteria:

  • Performance gate: Candidate accuracy must be within a threshold of the baseline (e.g., accuracy >= baseline - 0.5%). Use a holdout test set that is never used during training or hyperparameter tuning.
  • Fairness gate: Check that performance is consistent across demographic groups. If accuracy for group A is 95% but group B is 80%, the model should not pass. Use metrics like equalized odds or demographic parity.
  • Latency gate: Run inference benchmarks on representative hardware. p99 latency must be below the SLA threshold. A more accurate model that triples latency is usually not acceptable.
  • Size gate: Model artifact must be below a size limit (e.g., 500 MB for edge deployment, 10 GB for server deployment). Prevents accidentally deploying uncompressed or non-optimized models.
  • Data integrity gate: Verify that training data lineage is complete, the data version is recorded, and no sensitive data leaked into features.
# Example validation gate in Python
def validate_model(candidate_metrics, baseline_metrics, config, alert=print):
    """Return (passed, checks); alert the team on any failed check."""
    group_accs = candidate_metrics["group_accuracies"]
    checks = {
        # Performance: within tolerance of the baseline model.
        "accuracy": candidate_metrics["accuracy"]
            >= baseline_metrics["accuracy"] - config["accuracy_tolerance"],
        # Latency: p99 must stay under the SLA threshold.
        "latency_p99": candidate_metrics["latency_p99_ms"] <= config["max_latency_p99_ms"],
        # Fairness: gap between best and worst group accuracy.
        "fairness": max(group_accs) - min(group_accs) <= config["max_fairness_gap"],
        # Size: artifact must fit the deployment target.
        "model_size": candidate_metrics["model_size_mb"] <= config["max_model_size_mb"],
    }

    passed = all(checks.values())
    failed = [name for name, ok in checks.items() if not ok]

    if not passed:
        alert(f"Model validation failed: {failed}")  # swap in your paging/Slack hook

    return passed, checks

Q5: How would you set up a GitHub Actions workflow for ML?

💡
Model Answer:

A GitHub Actions workflow for ML has multiple triggers and conditional jobs:

# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
    paths: ['src/**', 'configs/**']  # Only trigger on code changes
  schedule:
    - cron: '0 6 * * 1'  # Weekly retraining on Mondays
  workflow_dispatch:  # Manual trigger for ad-hoc retraining
    inputs:
      reason:
        description: 'Reason for retraining'
        required: true

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: pytest tests/unit/ -v
      - name: Run data validation
        run: python scripts/validate_data.py

  train:
    needs: test
    runs-on: [self-hosted, gpu]  # GPU runner for training
    timeout-minutes: 240  # hard stop so a runaway training job cannot burn budget
    outputs:
      run-id: ${{ steps.train.outputs.run-id }}
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        id: train
        # Assumes train.py appends "run-id=<id>" to $GITHUB_OUTPUT
        run: python scripts/train.py --config configs/prod.yaml
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_URI }}
      - name: Evaluate model
        run: python scripts/evaluate.py --run-id ${{ steps.train.outputs.run-id }}
      - name: Validate model
        run: python scripts/validate_gate.py --run-id ${{ steps.train.outputs.run-id }}

  deploy:
    needs: train
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Register model
        id: register
        # Assumes register_model.py appends "model-version=<v>" to $GITHUB_OUTPUT
        run: python scripts/register_model.py --run-id ${{ needs.train.outputs.run-id }}
      - name: Deploy canary
        run: python scripts/deploy_canary.py --model-version ${{ steps.register.outputs.model-version }}

Key considerations: Use self-hosted runners with GPUs for training (standard GitHub-hosted runners do not include GPUs). Store secrets (API keys, model registry credentials) in GitHub Secrets. Use path filters so non-ML code changes do not trigger expensive training jobs. Pass values like the run ID between jobs via job outputs — environment variables set in one job do not propagate to another. Set timeouts (`timeout-minutes`) to prevent runaway training from burning budget.

Q6: What is training-serving skew and how do you prevent it?

💡
Model Answer:

Training-serving skew occurs when the data or feature computation at inference time differs from what the model saw during training. It is one of the most common causes of ML system failures and one of the hardest to debug.

Three types of skew:

  • Feature skew: The same feature is computed differently in training vs. serving. Example: training uses a Pandas-based normalization; serving uses a Java-based one with slightly different floating-point behavior. The predictions differ even for identical inputs.
  • Data distribution skew: The training data does not represent the production data distribution. Example: model trained on US-only data is deployed globally. Production inputs from other regions have patterns never seen during training.
  • Label leakage skew: Training data includes information that is not available at prediction time. Example: a fraud model trained on features that include "was_flagged_by_manual_review" — this feature does not exist when making a real-time prediction.

Prevention strategies:

  • Use a feature store (Feast, Tecton) that serves the same feature computation for both training and inference. Single source of truth for feature logic.
  • Write feature engineering as a shared library used by both training and serving pipelines. Never have two implementations of the same transformation.
  • Run integration tests that send the same input through both the training and serving pipelines and verify the feature vectors are identical.
  • Monitor feature distribution in production and compare against training distributions. Alert when a feature's mean or variance drifts beyond a threshold.
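The parity-test strategy from the list above can be sketched like this. The two implementations here are trivial stand-ins for your real offline and online feature paths; in practice they would live in different services, which is exactly why the test matters:

```python
# Illustrative training-serving parity test: push the same raw input
# through both feature paths and require identical feature vectors.

def training_features(raw):
    # Offline (batch) path -- stand-in for the real training pipeline code.
    return [round((raw["amount"] - 50.0) / 10.0, 6), float(raw["n_items"])]

def serving_features(raw):
    # Online path -- should call the same shared library, never a rewrite.
    return [round((raw["amount"] - 50.0) / 10.0, 6), float(raw["n_items"])]

def test_no_feature_skew():
    raw = {"amount": 72.5, "n_items": 3}
    assert training_features(raw) == serving_features(raw)
```

If the two paths drift apart (a rounding change, a different default for missing values), this test fails in CI before the skew reaches production.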

Q7: How do you ensure reproducibility in ML experiments?

💡
Model Answer:

Reproducibility means anyone can re-run an experiment and get the same (or very similar) results. This requires versioning everything:

  • Code: Pin the exact Git commit SHA. Tag releases. Never run experiments from uncommitted code.
  • Data: Use DVC or Delta Lake versioning. Record the exact dataset hash used for each experiment. Store data snapshots, not just pointers to mutable databases.
  • Environment: Pin all dependency versions in a lock file (pip freeze, conda lock). Use Docker containers to ensure identical environments across machines.
  • Random seeds: Set seeds for all sources of randomness (Python random, NumPy, PyTorch, CUDA). Note that GPU non-determinism means exact reproduction may require CPU-only training.
  • Hyperparameters: Store all hyperparameters in a config file (YAML/JSON) versioned alongside code. Never hardcode hyperparameters in training scripts.
  • Hardware: Document the GPU type, driver version, and CUDA version. Different GPU architectures can produce slightly different results due to floating-point operation ordering.

Tools: MLflow or W&B for experiment tracking (automatically log code version, parameters, metrics, artifacts). DVC for data versioning. Docker for environment reproducibility. Together, these let you answer: "How was this model produced?" for any model in your registry.
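The seed and hyperparameter points above can be sketched in a few lines. These helpers are illustrative; the NumPy/PyTorch calls mentioned in the comments are what you would add in a real training script but are not executed here:

```python
import hashlib
import json
import random

def set_seeds(seed):
    """Seed every source of randomness in use. With NumPy or PyTorch you
    would also call np.random.seed(seed) and torch.manual_seed(seed) here,
    and enable deterministic CUDA kernels if exact GPU repro is required."""
    random.seed(seed)

def config_fingerprint(config):
    """Stable hash of the hyperparameter config, logged with each run so
    the exact settings behind any registered model can be recovered later."""
    canonical = json.dumps(config, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Same seed -> same draws; same config (any key order) -> same fingerprint.
set_seeds(42)
first = [random.random() for _ in range(3)]
set_seeds(42)
assert first == [random.random() for _ in range(3)]
assert config_fingerprint({"lr": 3e-4, "batch": 64}) == config_fingerprint({"batch": 64, "lr": 3e-4})
```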

Q8: What is the difference between Kubeflow Pipelines, Apache Airflow, and Prefect for ML workflows?

💡
Model Answer:

| Feature | Kubeflow Pipelines | Apache Airflow | Prefect |
| --- | --- | --- | --- |
| Designed for | ML-specific workflows | General data workflows | Modern data/ML workflows |
| Execution | Kubernetes-native (each step = pod) | Worker-based (Celery, K8s) | Hybrid (local, K8s, cloud) |
| ML features | Built-in (experiment tracking, artifact management) | None built-in, needs plugins | Basic, needs integrations |
| Learning curve | High (requires K8s knowledge) | Medium (DAGs in Python) | Low (Pythonic API) |
| Scalability | Excellent (K8s-native) | Good (with K8s executor) | Good (with cloud runners) |
| Best for | Teams already on K8s doing heavy ML | Mixed data + ML workloads | Small teams, rapid iteration |

Recommendation: Use Kubeflow Pipelines if your team is already on Kubernetes and runs many ML experiments. Use Airflow if you have a mix of data engineering and ML workflows and your team already knows it. Use Prefect if you want a modern, Pythonic experience with less infrastructure overhead. For managed services, consider Vertex AI Pipelines (Google) or SageMaker Pipelines (AWS) to avoid managing orchestration infrastructure entirely.

Q9: How do you handle secrets and credentials in ML pipelines?

💡
Model Answer:

Secret management in ML pipelines requires extra care because pipelines often run on shared infrastructure and produce artifacts that are distributed widely.

  • Never hardcode secrets. Not in code, not in config files, not in Dockerfiles, not in Jupyter notebooks. Even if the repo is private — credentials in Git history last forever.
  • Use a secrets manager: AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault, or Kubernetes Secrets. Inject secrets as environment variables at runtime, not at build time.
  • Scoped access: Each pipeline step should only have access to the secrets it needs. The training step needs the data warehouse credential but not the deployment credential. Use service accounts with least-privilege IAM roles.
  • Rotation: Rotate credentials automatically on a schedule (90 days is common). Your pipeline must not break when credentials rotate — use dynamic credential lookup, not cached values.
  • Audit logging: Log which pipeline, which user, and which service account accessed which secret and when. This is required for SOC 2 and HIPAA compliance.

Common mistake: Data scientists hard-coding API keys in Jupyter notebooks that get committed to Git. Set up pre-commit hooks (e.g., detect-secrets) that scan for credentials before allowing commits. Also scan Docker images for leaked secrets before pushing to registries.
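Dynamic credential lookup (the rotation point above) can be as simple as resolving the secret on every call instead of caching it at import time. This is a minimal sketch: `get_secret` and `WAREHOUSE_PASSWORD` are illustrative names, and the environment variable stands in for a secrets-manager client call (Vault, AWS Secrets Manager, etc.):

```python
import os

def get_secret(name):
    """Fetch a secret at call time, never at import time.
    Reads an environment variable injected by the runtime here; in
    production this would be a secrets-manager client call. Because
    nothing is cached at module level, a rotated credential is picked
    up on the next lookup instead of breaking the pipeline."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"secret {name!r} not available at runtime")
    return value

# Each pipeline step requests only the secrets it needs, when it needs them.
os.environ["WAREHOUSE_PASSWORD"] = "example-only"  # injected by CI in practice
connection_password = get_secret("WAREHOUSE_PASSWORD")
```

Failing loudly when a secret is missing (rather than proceeding with `None`) also makes misconfigured pipelines obvious at the first step that needs the credential.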

Q10: How do you trigger model retraining? What are the common triggers?

💡
Model Answer:

Model retraining can be triggered by several events. The right trigger depends on your domain and how quickly data distribution changes:

  • Scheduled retraining: Retrain on a fixed schedule (daily, weekly, monthly). Simple to implement and reason about. Works well when data distribution is relatively stable. Common in recommendation systems (weekly retrain with fresh user interaction data).
  • Performance-based trigger: Monitor model performance metrics (accuracy, F1, revenue impact) in production. When metrics drop below a threshold, trigger retraining. Requires ground truth labels, which may arrive with a delay (fraud labels arrive days later).
  • Drift-based trigger: Monitor input data distribution and trigger retraining when statistical tests detect significant drift (PSI > 0.2, KS test p-value < 0.05). Catches problems before performance degrades. Does not require ground truth labels.
  • Data volume trigger: Retrain when a threshold of new labeled data is available (e.g., 10,000 new labeled examples). Common in active learning systems where models request labels for uncertain predictions.
  • Event-based trigger: External events that invalidate the current model: new product launch, market crash, regulatory change, competitor action. Requires human judgment to identify but is critical for domains like finance and e-commerce.

Best practice: Combine scheduled retraining (as a safety net) with drift-based triggers (for responsiveness). If drift is detected between scheduled retrains, trigger an immediate retrain. If no drift is detected, the scheduled retrain still runs to incorporate new data gradually.
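A drift-based trigger can be sketched with a small PSI computation. This is pure Python with equal-width bins over the combined range; real monitoring systems typically use quantile bins fitted on the training sample, but the trigger logic is the same:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample (expected) and
    a production sample (actual). Rule of thumb: PSI > 0.2 = significant drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def bin_fraction(sample, b):
        left, right = lo + b * width, lo + (b + 1) * width
        in_bin = sum(1 for v in sample
                     if left <= v < right or (b == bins - 1 and v == hi))
        return max(in_bin / len(sample), 1e-6)  # floor to avoid log(0)

    return sum(
        (bin_fraction(actual, b) - bin_fraction(expected, b))
        * math.log(bin_fraction(actual, b) / bin_fraction(expected, b))
        for b in range(bins)
    )

# Identical distributions: PSI ~ 0, no retrain triggered.
reference = [i / 100 for i in range(100)]
assert psi(reference, reference) < 0.2

# Distribution shifted right: PSI well above 0.2, retraining is triggered.
shifted = [0.5 + i / 200 for i in range(100)]
assert psi(reference, shifted) > 0.2
```

In production this check runs per feature on a schedule, and crossing the threshold kicks off the retraining pipeline described in Q2.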