Testing Data Transformations
Writing tests for data preprocessing and transformation logic. Part of the Unit Testing for ML Pipelines course at AI School by Lilly Tech Systems.
Why Data Transformations Need Tests
Data transformations are the backbone of every ML pipeline. They convert raw data into the format your model expects. A bug in a transformation function can silently corrupt your training data, producing a model that appears to work but makes systematically wrong predictions. These bugs are notoriously hard to detect without explicit tests because the model still produces output — just the wrong output.
Testing Numerical Transformations
Numerical transformations include scaling, normalization, binning, and mathematical operations. Test these by verifying output properties:
import pandas as pd
import pytest

def normalize_column(series, method='minmax'):
    """Normalize a pandas Series using the specified method."""
    if method == 'minmax':
        min_val = series.min()
        max_val = series.max()
        if max_val == min_val:
            # A constant column carries no information; map it to all zeros
            return pd.Series(0.0, index=series.index)
        return (series - min_val) / (max_val - min_val)
    elif method == 'zscore':
        mean = series.mean()
        std = series.std()
        if std == 0:
            return pd.Series(0.0, index=series.index)
        return (series - mean) / std
    raise ValueError(f"Unknown method: {method}")
class TestNormalizeColumn:
    def test_minmax_range(self):
        s = pd.Series([10, 20, 30, 40, 50])
        result = normalize_column(s, method='minmax')
        assert result.min() == pytest.approx(0.0)
        assert result.max() == pytest.approx(1.0)

    def test_minmax_preserves_order(self):
        s = pd.Series([5, 1, 9, 3])
        result = normalize_column(s, method='minmax')
        assert list(result.argsort()) == list(s.argsort())

    def test_minmax_constant_column(self):
        s = pd.Series([7, 7, 7, 7])
        result = normalize_column(s, method='minmax')
        assert (result == 0.0).all()

    def test_zscore_mean_and_std(self):
        s = pd.Series([10, 20, 30, 40, 50])
        result = normalize_column(s, method='zscore')
        assert result.mean() == pytest.approx(0.0, abs=1e-10)
        assert result.std() == pytest.approx(1.0, abs=1e-10)

    def test_invalid_method_raises(self):
        s = pd.Series([1, 2, 3])
        with pytest.raises(ValueError, match="Unknown method"):
            normalize_column(s, method='invalid')
Testing Categorical Transformations
Categorical encoding is a common source of subtle bugs, especially when new categories appear in production that were not seen during training:
def encode_categories(df, column, known_categories):
    """One-hot encode a column, mapping unseen values to 'unknown'."""
    df = df.copy()
    df[column] = df[column].where(df[column].isin(known_categories), 'unknown')
    dummies = pd.get_dummies(df[column], prefix=column)
    # Ensure all expected columns exist even if a category is absent
    for cat in known_categories + ['unknown']:
        col_name = f"{column}_{cat}"
        if col_name not in dummies.columns:
            dummies[col_name] = 0
    return dummies

def test_encode_handles_unknown_category():
    df = pd.DataFrame({'color': ['red', 'green', 'purple']})
    known = ['red', 'green', 'blue']
    result = encode_categories(df, 'color', known)
    assert 'color_unknown' in result.columns
    assert result['color_unknown'].sum() == 1  # 'purple' mapped to unknown

def test_encode_creates_all_expected_columns():
    df = pd.DataFrame({'color': ['red']})
    known = ['red', 'green', 'blue']
    result = encode_categories(df, 'color', known)
    for cat in known + ['unknown']:
        assert f"color_{cat}" in result.columns
Testing Date and Time Transformations
Date transformations are error-prone due to timezone handling, format variations, and edge cases like daylight saving time transitions and leap years:
- Test timezone conversion between UTC and local time
- Test date parsing with multiple input formats
- Test feature extraction (day of week, month, hour, is_weekend)
- Test handling of null and invalid dates
- Test leap year and end-of-month edge cases
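A minimal sketch of the null/invalid-date and leap-year checks, assuming a hypothetical extract_date_features helper (the function name and its output columns are illustrative, not from any specific library):

```python
import pandas as pd

def extract_date_features(series):
    # Hypothetical helper: parse dates, coercing invalid values to NaT
    dt = pd.to_datetime(series, errors='coerce', utc=True)
    return pd.DataFrame({
        'day_of_week': dt.dt.dayofweek,
        'month': dt.dt.month,
        'is_weekend': dt.dt.dayofweek >= 5,
    })

def test_leap_day_weekend_and_invalid_dates():
    s = pd.Series(['2024-02-29', '2024-03-02', 'not-a-date'])
    result = extract_date_features(s)
    assert result.loc[0, 'month'] == 2            # leap day parses correctly
    assert result.loc[1, 'is_weekend']            # 2024-03-02 is a Saturday
    assert pd.isna(result.loc[2, 'day_of_week'])  # invalid input becomes NaN
```

Coercing invalid values rather than raising makes the failure mode explicit and testable; if your pipeline should reject bad dates instead, test for the exception.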
Testing Text Transformations
Text preprocessing for NLP models requires careful testing of tokenization, lowercasing, stop word removal, and special character handling. Test with diverse inputs including Unicode characters, empty strings, very long texts, and texts in unexpected encodings.
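A minimal sketch of such edge-case tests, assuming a hypothetical clean_text helper (the function is illustrative):

```python
import unicodedata

def clean_text(text):
    # Hypothetical helper: normalize Unicode, lowercase, collapse whitespace
    text = unicodedata.normalize('NFKC', text)
    return ' '.join(text.lower().split())

def test_clean_text_edge_cases():
    assert clean_text('') == ''                         # empty string survives
    assert clean_text('  Hello\tWorld ') == 'hello world'
    assert clean_text('Caf\u00e9 MENU') == 'caf\u00e9 menu'  # Unicode kept, lowercased
    assert clean_text('x' * 100_000)                    # very long input does not crash
```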
Property-Based Testing for Transformations
Property-based testing with Hypothesis generates random inputs to test that your transformations maintain expected properties across many different inputs, catching edge cases you might not think to test manually:
from hypothesis import given, strategies as st

@given(st.lists(st.floats(min_value=-1e6, max_value=1e6,
                          allow_nan=False, allow_infinity=False),
                min_size=2))
def test_minmax_always_in_range(values):
    s = pd.Series(values)
    if s.nunique() > 1:  # Skip constant series
        result = normalize_column(s, method='minmax')
        assert result.min() >= -1e-10  # Allow tiny floating-point error
        assert result.max() <= 1.0 + 1e-10
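The same idea extends to categorical encoding. This sketch repeats the encode_categories function from earlier so the snippet stands alone, and asserts the one-hot property that every encoded row activates exactly one column:

```python
import pandas as pd
from hypothesis import given, strategies as st

def encode_categories(df, column, known_categories):
    # Same function as defined earlier, repeated for a self-contained example
    df = df.copy()
    df[column] = df[column].where(df[column].isin(known_categories), 'unknown')
    dummies = pd.get_dummies(df[column], prefix=column)
    for cat in known_categories + ['unknown']:
        col_name = f"{column}_{cat}"
        if col_name not in dummies.columns:
            dummies[col_name] = 0
    return dummies

@given(st.lists(st.sampled_from(['red', 'green', 'blue', 'purple', 'teal']),
                min_size=1, max_size=20))
def test_each_row_is_one_hot(colors):
    result = encode_categories(pd.DataFrame({'color': colors}), 'color',
                               ['red', 'green', 'blue'])
    # Exactly one active column per row, no matter which categories appear
    assert (result.sum(axis=1) == 1).all()
```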
Testing for Side Effects
Transformation functions should never mutate their inputs. Call df.copy() at the start of transformation functions, and test that the original data is unchanged after the call.
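A sketch of such an immutability test, using a hypothetical clip_outliers transformation as the subject:

```python
import pandas as pd

def clip_outliers(df, column, lower, upper):
    # Hypothetical transformation: copy first, then clip values into [lower, upper]
    out = df.copy()
    out[column] = out[column].clip(lower=lower, upper=upper)
    return out

def test_clip_does_not_mutate_input():
    df = pd.DataFrame({'value': [-5, 0, 10, 99]})
    original = df.copy(deep=True)
    clip_outliers(df, 'value', 0, 50)
    # The caller's DataFrame must be identical afterwards
    pd.testing.assert_frame_equal(df, original)
```

If someone later "optimizes" the function by dropping the copy, this test fails immediately instead of letting the mutation corrupt downstream steps.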