Why Unit Test ML Code
The case for unit testing in machine learning projects. Part of the Unit Testing for ML Pipelines course at AI School by Lilly Tech Systems.
The Testing Gap in Machine Learning
Most machine learning code is written in notebooks without a single test. Data scientists focus on model accuracy and treat the surrounding code as disposable glue. This approach works for experimentation but creates serious problems when ML systems move to production. Bugs in data preprocessing, feature engineering, and pipeline orchestration are among the most common causes of production ML failures, and they are exactly the kind of bugs that unit tests catch.
Unit testing ML code is not about testing whether your model is accurate. That is the job of model evaluation. Unit testing verifies that the individual functions and components in your ML pipeline work correctly: data transformations produce the expected output, feature engineering logic handles edge cases, and training utilities behave as specified.
What to Unit Test in ML Pipelines
Not everything in an ML project needs unit tests. Focus your testing effort on code that is deterministic and can be verified with exact assertions:
- Data loading and parsing — Verify that your data loaders correctly read CSV, JSON, Parquet, and other formats
- Data cleaning functions — Test null handling, type conversions, outlier removal, and deduplication
- Feature engineering — Verify that each feature transformation produces the correct output for known inputs
- Data validation — Test schema validation, range checks, and constraint enforcement
- Utility functions — Test helper functions for metrics calculation, data splitting, and configuration parsing
- Pipeline orchestration — Test that pipeline steps execute in the correct order with proper data flow
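As an example of the "utility functions" category, a deterministic data-splitting helper can be verified with exact assertions. The sketch below is illustrative: `split_by_ratio` is a hypothetical helper, not a function from any particular library.

```python
import pandas as pd

def split_by_ratio(df, train_frac=0.8, seed=42):
    # Deterministic split: the same seed always produces the same rows.
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    cutoff = int(len(shuffled) * train_frac)
    return shuffled.iloc[:cutoff], shuffled.iloc[cutoff:]

def test_split_is_deterministic_and_disjoint():
    df = pd.DataFrame({'user_id': range(10), 'feature': range(10)})
    train_a, test_a = split_by_ratio(df)
    train_b, test_b = split_by_ratio(df)
    # Same seed -> identical splits across runs
    assert train_a.equals(train_b) and test_a.equals(test_b)
    # No row appears in both sets, and no row is lost
    assert set(train_a['user_id']).isdisjoint(set(test_a['user_id']))
    assert len(train_a) + len(test_a) == len(df)
```

Because the split is seeded, the test can assert exact properties (determinism, disjointness, completeness) rather than statistical ones.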
What NOT to Unit Test
Some aspects of ML do not belong in unit tests:
- Model accuracy (use model evaluation and integration tests instead)
- Training convergence (use training monitoring and smoke tests)
- Hyperparameter choices (use experiment tracking and cross-validation)
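The training-convergence boundary can still be guarded cheaply with a smoke test rather than a unit test: run a handful of training steps and assert only that the loss is finite and trending down, never that it reaches an accuracy target. A minimal sketch using plain NumPy gradient descent (the model and function names here are illustrative, not from the course codebase):

```python
import numpy as np

def train_steps(X, y, lr=0.1, steps=5):
    # A few gradient-descent steps of linear regression; returns per-step MSE.
    w = np.zeros(X.shape[1])
    losses = []
    for _ in range(steps):
        pred = X @ w
        losses.append(float(np.mean((pred - y) ** 2)))
        w -= lr * 2 * X.T @ (pred - y) / len(y)
    return losses

def test_training_smoke():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    losses = train_steps(X, y)
    # Smoke check only: loss stays finite and decreases over the first steps.
    assert all(np.isfinite(losses))
    assert losses[-1] < losses[0]
```

The point of the smoke test is speed: it catches a broken training loop (NaNs, exploding loss, wiring bugs) in milliseconds, while real convergence questions stay with training monitoring.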
The ROI of ML Unit Tests
Teams that adopt unit testing for their ML code report several concrete benefits:
- Faster debugging — When a pipeline fails, unit tests isolate the failing component immediately instead of hunting through logs
- Safer refactoring — You can confidently restructure code knowing that tests will catch regressions
- Better collaboration — Tests serve as documentation for how functions should behave, making it easier for new team members to understand the codebase
- Fewer production incidents — Catching data processing bugs before deployment prevents costly production failures
# Example: Testing a data cleaning function
import pandas as pd
import pytest

def clean_user_data(df):
    """Clean raw user data for model training."""
    df = df.copy()
    df['age'] = df['age'].clip(0, 120)
    df['email'] = df['email'].str.lower().str.strip()
    df = df.dropna(subset=['user_id'])
    df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
    return df

def test_clean_user_data_clips_age():
    df = pd.DataFrame({'user_id': [1, 2], 'age': [-5, 200],
                       'email': ['A@B.com', 'c@d.com'],
                       'signup_date': ['2024-01-01', '2024-01-02']})
    result = clean_user_data(df)
    assert result['age'].min() >= 0
    assert result['age'].max() <= 120

def test_clean_user_data_lowercases_email():
    df = pd.DataFrame({'user_id': [1], 'age': [25],
                       'email': [' Test@Example.COM '],
                       'signup_date': ['2024-01-01']})
    result = clean_user_data(df)
    assert result['email'].iloc[0] == 'test@example.com'

def test_clean_user_data_drops_null_user_id():
    df = pd.DataFrame({'user_id': [1, None], 'age': [25, 30],
                       'email': ['a@b.com', 'c@d.com'],
                       'signup_date': ['2024-01-01', '2024-01-02']})
    result = clean_user_data(df)
    assert len(result) == 1
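Once these basic tests pass, pytest's `parametrize` makes it cheap to cover many edge cases with a single test. The sketch below reuses the same cleaning function (repeated here so the snippet stands alone); the email inputs are illustrative:

```python
import pandas as pd
import pytest

def clean_user_data(df):
    # Same cleaning function as the example above, repeated to run standalone.
    df = df.copy()
    df['age'] = df['age'].clip(0, 120)
    df['email'] = df['email'].str.lower().str.strip()
    df = df.dropna(subset=['user_id'])
    df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
    return df

@pytest.mark.parametrize('raw_email, expected', [
    (' Test@Example.COM ', 'test@example.com'),
    ('ALL.CAPS@MAIL.COM', 'all.caps@mail.com'),
    ('already@lower.com', 'already@lower.com'),
])
def test_email_normalization(raw_email, expected):
    # One test body, one row per edge case in the parametrize list.
    df = pd.DataFrame({'user_id': [1], 'age': [30],
                       'email': [raw_email],
                       'signup_date': ['2024-01-01']})
    assert clean_user_data(df)['email'].iloc[0] == expected
```

Each parametrized case shows up as a separate test in pytest's output, so a failing edge case is pinpointed immediately.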
Overcoming Common Objections
Data scientists often resist unit testing with arguments like "ML is experimental" or "tests slow me down." The reality is that unit tests make experimentation faster by catching bugs early. The time spent writing tests is repaid many times over in reduced debugging time. Start small: add tests for the functions that have caused production issues in the past, then expand coverage gradually.