Intermediate

Testing Feature Engineering

Verifying feature engineering logic with automated tests. Part of the Unit Testing for ML Pipelines course at AI School by Lilly Tech Systems.

Feature Engineering: The Most Bug-Prone ML Code

Feature engineering transforms raw data into features the model can learn from. It typically involves complex business logic, mathematical transformations, and aggregations. These functions are where the majority of data processing bugs hide. A miscalculated feature can cause model performance to drop significantly, and the cause is often extremely hard to trace without good tests.

Testing Individual Feature Functions

Each feature engineering function should have its own test suite. Test with known inputs and verify exact outputs:

import pandas as pd
import numpy as np
import pytest

def compute_recency_score(last_purchase_date, reference_date):
    """Compute a recency score from 0 to 1 based on days since last purchase."""
    days_since = (reference_date - last_purchase_date).dt.days
    max_days = 365
    score = 1.0 - (days_since.clip(0, max_days) / max_days)
    return score

def compute_frequency_features(transactions_df, user_id_col='user_id'):
    """Compute purchase frequency features per user."""
    freq = transactions_df.groupby(user_id_col).agg(
        total_purchases=('transaction_id', 'count'),
        avg_purchase_value=('amount', 'mean'),
        total_spend=('amount', 'sum'),
        purchase_std=('amount', 'std')
    ).reset_index()
    freq['purchase_std'] = freq['purchase_std'].fillna(0)
    return freq

class TestRecencyScore:
    def test_recent_purchase_scores_high(self):
        dates = pd.Series(pd.to_datetime(['2024-12-01']))
        ref = pd.Timestamp('2024-12-05')
        score = compute_recency_score(dates, ref)
        assert score.iloc[0] > 0.95

    def test_old_purchase_scores_low(self):
        dates = pd.Series(pd.to_datetime(['2023-01-01']))
        ref = pd.Timestamp('2024-12-05')
        score = compute_recency_score(dates, ref)
        assert score.iloc[0] < 0.1

    def test_score_range(self):
        dates = pd.Series(pd.to_datetime(['2024-01-01', '2024-06-01', '2024-12-01']))
        ref = pd.Timestamp('2024-12-31')
        scores = compute_recency_score(dates, ref)
        assert scores.between(0, 1).all()

class TestFrequencyFeatures:
    def test_correct_aggregation(self):
        df = pd.DataFrame({
            'user_id': [1, 1, 1, 2, 2],
            'transaction_id': ['t1', 't2', 't3', 't4', 't5'],
            'amount': [100, 200, 300, 50, 150]
        })
        result = compute_frequency_features(df)
        user1 = result[result['user_id'] == 1].iloc[0]
        assert user1['total_purchases'] == 3
        assert user1['avg_purchase_value'] == pytest.approx(200.0)
        assert user1['total_spend'] == pytest.approx(600.0)

    def test_single_purchase_std_is_zero(self):
        df = pd.DataFrame({
            'user_id': [1],
            'transaction_id': ['t1'],
            'amount': [100]
        })
        result = compute_frequency_features(df)
        assert result['purchase_std'].iloc[0] == 0
💡 Testing tip: Use pytest.approx() for floating-point comparisons. Exact equality checks on floats will fail due to floating-point arithmetic. For example, assert 0.1 + 0.2 == pytest.approx(0.3) passes while assert 0.1 + 0.2 == 0.3 fails.

Testing Feature Interactions

When features depend on each other or must be computed in a specific order, test the interactions explicitly. Create integration-style tests that run the full feature engineering pipeline on sample data and verify the output schema, data types, and value ranges:

def test_full_feature_pipeline_output_schema():
    raw_data = create_sample_raw_data()
    features = run_feature_pipeline(raw_data)

    expected_columns = [
        'user_id', 'recency_score', 'frequency_score',
        'monetary_score', 'avg_session_duration',
        'days_since_signup', 'is_premium'
    ]
    for col in expected_columns:
        assert col in features.columns, f"Missing column: {col}"

    assert features['recency_score'].between(0, 1).all()
    assert features['frequency_score'].ge(0).all()
    assert features['is_premium'].isin([0, 1]).all()
    assert not features.isnull().any().any(), "Features contain null values"

Testing Feature Consistency

Feature values should be consistent between training and serving. A common bug is computing features differently in the training pipeline versus the serving pipeline. Create tests that verify feature computation produces identical results in both contexts.
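One way to test this is to run both code paths on the same records and assert they agree. The sketch below assumes a hypothetical split where training uses a vectorized batch function and serving uses a per-record Python function; the function names are illustrative, not from the course material:

```python
import pandas as pd
import pytest

def compute_recency_batch(df, ref):
    # Training path: vectorized recency score over a whole DataFrame.
    days = (ref - df['last_purchase_date']).dt.days
    return 1.0 - days.clip(0, 365) / 365

def compute_recency_single(last_purchase_date, ref):
    # Serving path: one record at a time, plain Python arithmetic.
    days = (ref - last_purchase_date).days
    return 1.0 - min(max(days, 0), 365) / 365

def test_training_and_serving_paths_agree():
    ref = pd.Timestamp('2024-12-31')
    df = pd.DataFrame({'last_purchase_date': pd.to_datetime(
        ['2024-01-15', '2024-06-30', '2024-12-30'])})
    batch = compute_recency_batch(df, ref)
    for i, row in df.iterrows():
        single = compute_recency_single(row['last_purchase_date'], ref)
        # Both implementations must produce the same score for the same record.
        assert batch.iloc[i] == pytest.approx(single)
```

Run this test against every feature that has two implementations; any divergence between the batch and per-record code paths fails immediately instead of silently degrading predictions in production.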

Snapshot Testing for Features

Snapshot testing captures the expected output for a fixed input and compares against it in future runs. This catches unintended changes to feature computation. Store snapshot data as JSON or CSV files in your test directory and compare computed features against them.

Testing for Feature Leakage

Feature leakage occurs when information from the target variable or future data sneaks into features. Test for this by verifying that features computed for time T only use data available at time T, and that no feature has suspiciously high correlation with the target.
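The point-in-time property can be tested directly: compute a feature as of a cutoff, append data from after the cutoff, recompute, and assert nothing changed. The feature function below is a simplified stand-in, not from the course material:

```python
import pandas as pd

def total_spend_as_of(transactions, cutoff):
    # Point-in-time feature: only transactions at or before the cutoff count.
    eligible = transactions[transactions['timestamp'] <= cutoff]
    return eligible.groupby('user_id')['amount'].sum()

def test_future_data_does_not_change_past_features():
    past = pd.DataFrame({
        'user_id': [1, 1],
        'timestamp': pd.to_datetime(['2024-03-01', '2024-05-01']),
        'amount': [100.0, 50.0],
    })
    cutoff = pd.Timestamp('2024-06-01')
    before = total_spend_as_of(past, cutoff)

    # Inject a transaction from after the cutoff and recompute.
    future_row = pd.DataFrame({
        'user_id': [1],
        'timestamp': pd.to_datetime(['2024-07-01']),
        'amount': [999.0],
    })
    combined = pd.concat([past, future_row], ignore_index=True)
    after = total_spend_as_of(combined, cutoff)

    # Adding future data must not change any feature value as of the cutoff.
    assert before.equals(after)
```

The correlation check is a complementary guard: after building the training set, flag any feature whose absolute correlation with the target exceeds a threshold you consider implausible for your domain, and investigate it before training.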

Warning: Feature engineering bugs are among the hardest to detect because the pipeline still runs and the model still produces predictions. The only reliable way to catch them is through explicit, targeted tests. Do not rely on model performance metrics alone to detect feature bugs.