Testing Data Transformations
Writing tests for data preprocessing and transformation logic. Part of the Unit Testing for ML Pipelines course at AI School by Lilly Tech Systems.
Why Data Transformations Need Tests
Data transformations are the backbone of every ML pipeline. They convert raw data into the format your model expects. A bug in a transformation function can silently corrupt your training data, producing a model that appears to work but makes systematically wrong predictions. These bugs are notoriously hard to detect without explicit tests because the model still produces output — just the wrong output.
Testing Numerical Transformations
Numerical transformations include scaling, normalization, binning, and mathematical operations. Test these by verifying output properties:
import pandas as pd
import pytest

def normalize_column(series, method='minmax'):
    """Normalize a pandas Series using the specified method."""
    if method == 'minmax':
        min_val = series.min()
        max_val = series.max()
        if max_val == min_val:
            # A constant column carries no information; map it to all zeros
            return pd.Series(0.0, index=series.index)
        return (series - min_val) / (max_val - min_val)
    elif method == 'zscore':
        mean = series.mean()
        std = series.std()
        if std == 0:
            return pd.Series(0.0, index=series.index)
        return (series - mean) / std
    raise ValueError(f"Unknown method: {method}")
class TestNormalizeColumn:
    def test_minmax_range(self):
        s = pd.Series([10, 20, 30, 40, 50])
        result = normalize_column(s, method='minmax')
        assert result.min() == pytest.approx(0.0)
        assert result.max() == pytest.approx(1.0)

    def test_minmax_preserves_order(self):
        s = pd.Series([5, 1, 9, 3])
        result = normalize_column(s, method='minmax')
        assert list(result.argsort()) == list(s.argsort())

    def test_minmax_constant_column(self):
        s = pd.Series([7, 7, 7, 7])
        result = normalize_column(s, method='minmax')
        assert (result == 0.0).all()

    def test_zscore_mean_and_std(self):
        s = pd.Series([10, 20, 30, 40, 50])
        result = normalize_column(s, method='zscore')
        assert result.mean() == pytest.approx(0.0, abs=1e-10)
        assert result.std() == pytest.approx(1.0, abs=1e-10)

    def test_invalid_method_raises(self):
        s = pd.Series([1, 2, 3])
        with pytest.raises(ValueError, match="Unknown method"):
            normalize_column(s, method='invalid')
Testing Categorical Transformations
Categorical encoding is a common source of subtle bugs, especially when new categories appear in production that were not seen during training:
def encode_categories(df, column, known_categories):
    """One-hot encode a column, mapping unseen values to 'unknown'."""
    df = df.copy()
    df[column] = df[column].where(df[column].isin(known_categories), 'unknown')
    dummies = pd.get_dummies(df[column], prefix=column)
    # Ensure all expected columns exist even if a category is absent
    for cat in known_categories + ['unknown']:
        col_name = f"{column}_{cat}"
        if col_name not in dummies.columns:
            dummies[col_name] = 0
    return dummies

def test_encode_handles_unknown_category():
    df = pd.DataFrame({'color': ['red', 'green', 'purple']})
    known = ['red', 'green', 'blue']
    result = encode_categories(df, 'color', known)
    assert 'color_unknown' in result.columns
    assert result['color_unknown'].sum() == 1  # 'purple' mapped to unknown

def test_encode_creates_all_expected_columns():
    df = pd.DataFrame({'color': ['red']})
    known = ['red', 'green', 'blue']
    result = encode_categories(df, 'color', known)
    for cat in known + ['unknown']:
        assert f"color_{cat}" in result.columns
Testing Date and Time Transformations
Date transformations are error-prone due to timezone handling, format variations, and edge cases like daylight saving time transitions and leap years:
- Test timezone conversion between UTC and local time
- Test date parsing with multiple input formats
- Test feature extraction (day of week, month, hour, is_weekend)
- Test handling of null and invalid dates
- Test leap year and end-of-month edge cases
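A minimal sketch of the null/invalid-date and leap-year checks, assuming a hypothetical extract_date_features helper (the function name and its output columns are illustrative, not from any specific library):

```python
import pandas as pd

def extract_date_features(series):
    # Hypothetical helper: parse dates, coercing invalid values to NaT
    dt = pd.to_datetime(series, errors='coerce', utc=True)
    return pd.DataFrame({
        'day_of_week': dt.dt.dayofweek,
        'month': dt.dt.month,
        'is_weekend': dt.dt.dayofweek >= 5,
    })

def test_leap_day_weekend_and_invalid_dates():
    s = pd.Series(['2024-02-29', '2024-03-02', 'not-a-date'])
    result = extract_date_features(s)
    assert result.loc[0, 'month'] == 2            # leap day parses correctly
    assert result.loc[1, 'is_weekend']            # 2024-03-02 is a Saturday
    assert pd.isna(result.loc[2, 'day_of_week'])  # invalid input becomes NaN
```

Coercing invalid values rather than raising makes the failure mode explicit and testable; if your pipeline should reject bad dates instead, test for the exception.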
Testing Text Transformations
Text preprocessing for NLP models requires careful testing of tokenization, lowercasing, stop word removal, and special character handling. Test with diverse inputs including Unicode characters, empty strings, very long texts, and texts in unexpected encodings.
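A minimal sketch of such edge-case tests, assuming a hypothetical clean_text helper (the function is illustrative):

```python
import unicodedata

def clean_text(text):
    # Hypothetical helper: normalize Unicode, lowercase, collapse whitespace
    text = unicodedata.normalize('NFKC', text)
    return ' '.join(text.lower().split())

def test_clean_text_edge_cases():
    assert clean_text('') == ''                         # empty string survives
    assert clean_text('  Hello\tWorld ') == 'hello world'
    assert clean_text('Caf\u00e9 MENU') == 'caf\u00e9 menu'  # Unicode kept, lowercased
    assert clean_text('x' * 100_000)                    # very long input does not crash
```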
Property-Based Testing for Transformations
Property-based testing with Hypothesis generates random inputs to test that your transformations maintain expected properties across many different inputs, catching edge cases you might not think to test manually:
from hypothesis import given, strategies as st

@given(st.lists(st.floats(min_value=-1e6, max_value=1e6,
                          allow_nan=False, allow_infinity=False),
                min_size=2))
def test_minmax_always_in_range(values):
    s = pd.Series(values)
    if s.nunique() > 1:  # Skip constant series
        result = normalize_column(s, method='minmax')
        assert result.min() >= -1e-10  # Allow tiny floating-point error
        assert result.max() <= 1.0 + 1e-10
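The same idea extends to categorical encoding. This sketch repeats the encode_categories function from earlier so the snippet stands alone, and asserts the one-hot property that every encoded row activates exactly one column:

```python
import pandas as pd
from hypothesis import given, strategies as st

def encode_categories(df, column, known_categories):
    # Same function as defined earlier, repeated for a self-contained example
    df = df.copy()
    df[column] = df[column].where(df[column].isin(known_categories), 'unknown')
    dummies = pd.get_dummies(df[column], prefix=column)
    for cat in known_categories + ['unknown']:
        col_name = f"{column}_{cat}"
        if col_name not in dummies.columns:
            dummies[col_name] = 0
    return dummies

@given(st.lists(st.sampled_from(['red', 'green', 'blue', 'purple', 'teal']),
                min_size=1, max_size=20))
def test_each_row_is_one_hot(colors):
    result = encode_categories(pd.DataFrame({'color': colors}), 'color',
                               ['red', 'green', 'blue'])
    # Exactly one active column per row, no matter which categories appear
    assert (result.sum(axis=1) == 1).all()
```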
Testing for Side Effects
Transformation functions should never mutate their inputs. Call df.copy() at the start of transformation functions, and test that the original data is unchanged after the call.
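A sketch of such an immutability test, using a hypothetical clip_outliers transformation as the subject:

```python
import pandas as pd

def clip_outliers(df, column, lower, upper):
    # Hypothetical transformation: copy first, then clip values into [lower, upper]
    out = df.copy()
    out[column] = out[column].clip(lower=lower, upper=upper)
    return out

def test_clip_does_not_mutate_input():
    df = pd.DataFrame({'value': [-5, 0, 10, 99]})
    original = df.copy(deep=True)
    clip_outliers(df, 'value', 0, 50)
    # The caller's DataFrame must be identical afterwards
    pd.testing.assert_frame_equal(df, original)
```

If someone later "optimizes" the function by dropping the copy, this test fails immediately instead of letting the mutation corrupt downstream steps.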