Test Design for ML Models

Designing effective test cases for machine learning models. Part of the AI Model Testing Fundamentals course at AI School by Lilly Tech Systems.

Principles of ML Test Design

Designing tests for machine learning models requires a different mindset than traditional software testing. In conventional software, you define exact expected outputs. In ML testing, you define behavioral expectations, statistical thresholds, and invariance properties. Your tests must account for the inherent stochasticity of ML systems while still providing meaningful quality guarantees.

Good ML test design follows several core principles: tests should be deterministic when possible (using fixed random seeds), they should test behavior rather than exact outputs, they should cover the full input distribution, and they should include edge cases that the model might encounter in production.
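The fixed-seed principle can be sketched in a few lines. This is a minimal example (assuming NumPy-style tabular inputs; the shape and seed are illustrative) showing how seeding makes test data reproducible:

```python
import numpy as np

def make_test_batch(n_samples=100, n_features=20, seed=42):
    """Generate a reproducible batch of synthetic inputs for testing."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_samples, n_features))

# Two calls with the same seed yield byte-identical inputs, so any
# change in model output between runs is attributable to the model.
batch_a = make_test_batch()
batch_b = make_test_batch()
assert np.array_equal(batch_a, batch_b)
```

Using a local `Generator` rather than the global NumPy random state keeps tests from interfering with each other when they run in the same process.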

Categories of ML Tests

Smoke Tests

Smoke tests verify that the model can load, accept input, and produce output without crashing. They are the most basic form of testing and should run first in your test suite:

import numpy as np

# load_model is assumed here to be a project-specific helper that
# deserializes a trained model from disk.

def test_model_loads_successfully():
    model = load_model("models/classifier_v2.pkl")
    assert model is not None

def test_model_accepts_valid_input():
    model = load_model("models/classifier_v2.pkl")
    sample = np.random.randn(1, 20)  # 20 features
    prediction = model.predict(sample)
    assert prediction is not None
    assert len(prediction) == 1

def test_model_handles_batch_input():
    model = load_model("models/classifier_v2.pkl")
    batch = np.random.randn(100, 20)
    predictions = model.predict(batch)
    assert len(predictions) == 100

Invariance Tests

Invariance tests check that the model's output does not change when inputs are perturbed in ways that should not affect the prediction. For example, a sentiment classifier should produce the same result regardless of whether a person's name is John or Maria:

  • Directional invariance — Increasing a feature known to positively correlate with the target should increase the prediction
  • Permutation invariance — For set-based inputs, order should not matter
  • Demographic invariance — Protected attributes should not change predictions

Minimum Functionality Tests

These tests verify that the model can correctly handle simple, unambiguous cases. If a sentiment model cannot correctly classify "This product is absolutely terrible and I want my money back" as negative, something is fundamentally wrong:

@pytest.mark.parametrize("text,expected", [
    ("This is the best product I have ever used!", "positive"),
    ("Terrible quality, broke after one day", "negative"),
    ("It works as described, nothing special", "neutral"),
])
def test_obvious_sentiment_cases(sentiment_model, text, expected):
    prediction = sentiment_model.predict(text)
    assert prediction == expected, (
        f"Failed on obvious case: '{text}' predicted as {prediction}"
    )

💡 Best practice: Create a curated set of 20-50 "golden examples" that represent clear-cut cases your model must get right. These serve as your minimum functionality test suite and catch catastrophic regressions immediately.

Test Data Management

Managing test data for ML is more complex than for traditional software. You need multiple datasets for different purposes:

  1. Holdout test set — A representative sample never seen during training, used for final model evaluation
  2. Slice-specific sets — Subsets targeting specific demographics, input types, or edge cases
  3. Adversarial examples — Inputs specifically crafted to challenge the model
  4. Regression benchmarks — Fixed datasets used to compare model versions over time
  5. Production samples — Real-world data collected from production to test against actual usage patterns
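
Slice-specific sets are only useful if you report metrics per slice rather than in aggregate. A minimal sketch of per-slice accuracy (the labels and slice names below are toy data):

```python
import numpy as np

def accuracy_by_slice(y_true, y_pred, slice_labels):
    """Compute accuracy separately for each data slice."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    slice_labels = np.asarray(slice_labels)
    results = {}
    for s in np.unique(slice_labels):
        mask = slice_labels == s
        results[str(s)] = float((y_true[mask] == y_pred[mask]).mean())
    return results

scores = accuracy_by_slice(
    y_true=[1, 0, 1, 1, 0, 0],
    y_pred=[1, 0, 0, 1, 0, 1],
    slice_labels=["mobile", "mobile", "mobile", "web", "web", "web"],
)
```

A test can then assert a minimum accuracy per slice, which catches the common failure mode where strong aggregate metrics hide one badly served subgroup.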

Data Versioning for Tests

Test datasets should be versioned alongside your code. Tools like DVC (Data Version Control) help manage test data versioning. When your test data changes, your test results are no longer comparable to previous runs, so versioning is essential for reproducibility.
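
Alongside a tool like DVC, a lightweight safeguard is to pin a checksum for each test dataset and fail fast if the file drifts. A sketch (the file contents here are stand-ins; in practice you would pin the digest of each versioned dataset):

```python
import hashlib
import os
import tempfile

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on a throwaway file; a real suite would compare sha256_of(path)
# against a digest recorded when the dataset version was frozen.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"label,text\npositive,great\n")
    path = f.name
digest = sha256_of(path)
os.unlink(path)
```

Running this check at suite startup turns a silently modified test set into an immediate, explainable failure.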

Designing Tests for Different Model Types

Different model types require different testing approaches:

  • Classification models — Test accuracy, precision, recall per class, confusion matrix patterns, and threshold sensitivity
  • Regression models — Test MAE, RMSE, residual distributions, and prediction bounds
  • Ranking models — Test NDCG, MAP, pairwise ordering correctness
  • Generative models — Test output quality metrics, diversity, and coherence
  • Time series models — Test with proper temporal splits, forecast horizons, and seasonal patterns
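
As one concrete case, a regression-model test can combine an error threshold with a prediction-bounds check. A sketch with made-up benchmark values (in practice these would come from a fixed regression benchmark set):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between true and predicted values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def test_regression_error_within_bounds():
    # Toy benchmark values; replace with a loaded, versioned benchmark set.
    y_true = np.array([10.0, 12.5, 9.8, 11.2])
    y_pred = np.array([10.3, 12.0, 10.1, 11.0])
    mae = mean_absolute_error(y_true, y_pred)
    assert mae < 1.0, f"MAE {mae:.3f} exceeds threshold"
    assert np.all(y_pred > 0), "Predictions fell outside the valid range"
```

The threshold (1.0 here) should be set from a previous model version's performance, so the test doubles as a regression benchmark.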

Test Organization and Naming

Organize your ML tests into clear categories with descriptive names. A well-organized test suite makes it easy to identify what failed and why. Use a consistent naming convention that includes the test category, the property being tested, and the expected behavior. Group tests by speed: fast unit tests should run on every commit, slower integration tests on every PR, and full model evaluation tests on a schedule.
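
One way to implement the speed tiers is with pytest markers registered in `pytest.ini`; a hypothetical configuration (the marker names and descriptions are illustrative):

```ini
# pytest.ini — register one marker per speed tier
[pytest]
markers =
    smoke: fast load/predict checks, run on every commit
    invariance: perturbation-robustness checks, run on every PR
    evaluation: full benchmark runs, run on a schedule
```

With this in place, `pytest -m smoke` runs only the fast tier on each commit, while a scheduled CI job can run `pytest -m evaluation` for the full evaluation suite.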

Common mistake: Do not use your training data for testing. This is data leakage and will give you falsely optimistic test results. Always maintain strict separation between training and test data.