Testing Feature Engineering
Verifying feature engineering logic with automated tests. Part of the Unit Testing for ML Pipelines course at AI School by Lilly Tech Systems.
Feature Engineering: The Most Bug-Prone ML Code
Feature engineering transforms raw data into features the model can learn from. It typically involves complex business logic, mathematical transformations, and aggregations. These functions are where the majority of data processing bugs hide. A miscalculated feature can cause model performance to drop significantly, and the cause is often extremely hard to trace without good tests.
Testing Individual Feature Functions
Each feature engineering function should have its own test suite. Test with known inputs and verify exact outputs:
import pandas as pd
import numpy as np
import pytest

def compute_recency_score(last_purchase_date, reference_date):
    """Compute a recency score from 0 to 1 based on days since last purchase."""
    days_since = (reference_date - last_purchase_date).dt.days
    max_days = 365
    score = 1.0 - (days_since.clip(0, max_days) / max_days)
    return score

def compute_frequency_features(transactions_df, user_id_col='user_id'):
    """Compute purchase frequency features per user."""
    freq = transactions_df.groupby(user_id_col).agg(
        total_purchases=('transaction_id', 'count'),
        avg_purchase_value=('amount', 'mean'),
        total_spend=('amount', 'sum'),
        purchase_std=('amount', 'std')
    ).reset_index()
    freq['purchase_std'] = freq['purchase_std'].fillna(0)
    return freq
class TestRecencyScore:
    def test_recent_purchase_scores_high(self):
        dates = pd.Series(pd.to_datetime(['2024-12-01']))
        ref = pd.Timestamp('2024-12-05')
        score = compute_recency_score(dates, ref)
        assert score.iloc[0] > 0.95

    def test_old_purchase_scores_low(self):
        dates = pd.Series(pd.to_datetime(['2023-01-01']))
        ref = pd.Timestamp('2024-12-05')
        score = compute_recency_score(dates, ref)
        assert score.iloc[0] < 0.1

    def test_score_range(self):
        dates = pd.Series(pd.to_datetime(['2024-01-01', '2024-06-01', '2024-12-01']))
        ref = pd.Timestamp('2024-12-31')
        scores = compute_recency_score(dates, ref)
        assert scores.between(0, 1).all()

class TestFrequencyFeatures:
    def test_correct_aggregation(self):
        df = pd.DataFrame({
            'user_id': [1, 1, 1, 2, 2],
            'transaction_id': ['t1', 't2', 't3', 't4', 't5'],
            'amount': [100, 200, 300, 50, 150]
        })
        result = compute_frequency_features(df)
        user1 = result[result['user_id'] == 1].iloc[0]
        assert user1['total_purchases'] == 3
        assert user1['avg_purchase_value'] == pytest.approx(200.0)
        assert user1['total_spend'] == pytest.approx(600.0)

    def test_single_purchase_std_is_zero(self):
        df = pd.DataFrame({
            'user_id': [1],
            'transaction_id': ['t1'],
            'amount': [100]
        })
        result = compute_frequency_features(df)
        assert result['purchase_std'].iloc[0] == 0
Use pytest.approx() for floating-point comparisons. Exact equality checks on floats fail because of rounding in floating-point arithmetic: assert 0.1 + 0.2 == pytest.approx(0.3) passes, while assert 0.1 + 0.2 == 0.3 fails.
Testing Feature Interactions
When features depend on each other or must be computed in a specific order, test the interactions explicitly. Create integration-style tests that run the full feature engineering pipeline on sample data and verify the output schema, data types, and value ranges:
def test_full_feature_pipeline_output_schema():
    raw_data = create_sample_raw_data()
    features = run_feature_pipeline(raw_data)
    expected_columns = [
        'user_id', 'recency_score', 'frequency_score',
        'monetary_score', 'avg_session_duration',
        'days_since_signup', 'is_premium'
    ]
    for col in expected_columns:
        assert col in features.columns, f"Missing column: {col}"
    assert features['recency_score'].between(0, 1).all()
    assert features['frequency_score'].ge(0).all()
    assert features['is_premium'].isin([0, 1]).all()
    assert not features.isnull().any().any(), "Features contain null values"
Testing Feature Consistency
Feature values should be consistent between training and serving. A common bug is computing features differently in the training pipeline versus the serving pipeline. Create tests that verify feature computation produces identical results in both contexts.
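One way to sketch such a test: compute the same feature through a vectorized training-path function and a scalar serving-path function, then assert the two agree row by row. The function names below are illustrative, not part of the course codebase.

```python
import math
import pandas as pd

def compute_recency_score_batch(last_purchase_date, reference_date):
    # Training path: vectorized over a Series of timestamps.
    days_since = (reference_date - last_purchase_date).dt.days
    return 1.0 - (days_since.clip(0, 365) / 365)

def compute_recency_score_single(last_purchase_date, reference_date):
    # Serving path: scalar computation for one user at request time.
    days_since = (reference_date - last_purchase_date).days
    return 1.0 - (min(max(days_since, 0), 365) / 365)

def test_batch_and_single_paths_agree():
    dates = pd.Series(pd.to_datetime(['2024-01-15', '2024-06-30', '2024-12-01']))
    ref = pd.Timestamp('2024-12-31')
    batch_scores = compute_recency_score_batch(dates, ref)
    # Every row computed through the serving path must match the batch result.
    for i, d in enumerate(dates):
        assert math.isclose(batch_scores.iloc[i],
                            compute_recency_score_single(d, ref))
```

A stronger version of this test shares a single implementation between the two pipelines and only verifies the thin adapters around it, so the logic cannot diverge in the first place.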
Snapshot Testing for Features
Snapshot testing captures the expected output for a fixed input and compares against it in future runs. This catches unintended changes to feature computation. Store snapshot data as JSON or CSV files in your test directory and compare computed features against them.
Testing for Feature Leakage
Feature leakage occurs when information from the target variable or future data sneaks into features. Test for this by verifying that features computed for time T only use data available at time T, and that no feature has suspiciously high correlation with the target.
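The first of these checks can be sketched as a point-in-time test: compute a feature as of time T, append a transaction dated after T, recompute, and assert nothing changed. The feature function below is a simplified illustration, not the course pipeline.

```python
import pandas as pd

def compute_total_spend_as_of(transactions, as_of):
    # Point-in-time feature: only transactions at or before `as_of` count.
    visible = transactions[transactions['timestamp'] <= as_of]
    return visible.groupby('user_id')['amount'].sum()

def test_future_data_does_not_change_features():
    past = pd.DataFrame({
        'user_id': [1, 1],
        'amount': [100.0, 50.0],
        'timestamp': pd.to_datetime(['2024-05-01', '2024-05-20']),
    })
    future_row = pd.DataFrame({
        'user_id': [1],
        'amount': [999.0],
        'timestamp': pd.to_datetime(['2024-07-01']),
    })
    as_of = pd.Timestamp('2024-06-01')
    before = compute_total_spend_as_of(past, as_of)
    after = compute_total_spend_as_of(pd.concat([past, future_row]), as_of)
    # A feature computed as of time T must ignore data dated after T.
    assert before.equals(after)
```

The correlation check is a separate guard: compute each feature's correlation with the target on training data and flag values near 1.0 for manual review, since a near-perfect correlation usually means the target leaked into the feature.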