Best Practices

Building a model is only half the battle. Selecting the right model, validating it properly, engineering good features, and interpreting results correctly are what separate good modeling from great modeling.

Model Selection Criteria

When comparing models, you need objective criteria that balance fit quality with model complexity.

Criterion overview (lower AIC/BIC is better; k = number of estimated parameters, n = sample size):

  • AIC (Akaike Information Criterion): -2 × log-likelihood + 2k. Use when comparing models fit to the same dataset and prediction is the goal; favors predictive accuracy.
  • BIC (Bayesian Information Criterion): -2 × log-likelihood + k × ln(n). Like AIC but penalizes complexity more heavily, so it prefers simpler models.
  • Adjusted R²: R² adjusted for the number of predictors. Use when comparing regression models with different numbers of predictors.
  • Cross-validation score: average performance across K folds. Use to estimate how well a model generalizes to unseen data.
Python
import statsmodels.api as sm

# Compare nested models using AIC and BIC
# (df and y are assumed from the earlier regression examples)
X1 = sm.add_constant(df[['experience']])
X2 = sm.add_constant(df[['experience', 'education']])
X3 = sm.add_constant(df[['experience', 'education', 'age']])

for i, X in enumerate([X1, X2, X3], 1):
    model = sm.OLS(y, X).fit()
    print(f"Model {i}: AIC={model.aic:.1f}, BIC={model.bic:.1f}, "
          f"Adj R²={model.rsquared_adj:.4f}")

Overfitting vs Underfitting

📈

Overfitting

Model is too complex — it memorizes training data (including noise) but performs poorly on new data. High training accuracy, low test accuracy.

📉

Underfitting

Model is too simple — it fails to capture important patterns. Low accuracy on both training and test data.

Good Fit

Model captures the true pattern without memorizing noise. Similar performance on training and test data.

Signs of overfitting:

  • A large gap between training and test performance
  • Very high R² on the training data (e.g. >0.99)
  • Unreasonably large model coefficients
  • Performance that degrades noticeably on new data

How to prevent overfitting:

  • Use cross-validation instead of a single train/test split
  • Apply regularization (Ridge, Lasso, ElasticNet)
  • Reduce the number of features (feature selection)
  • Collect more training data
  • Set max_depth and min_samples for tree-based models
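
The first two remedies can be sketched together. The snippet below (synthetic data; scikit-learn assumed to be available) fits a deliberately over-flexible degree-10 polynomial with and without Ridge regularization and compares their cross-validated R²:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: a simple linear trend plus noise
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.5 * X[:, 0] + rng.normal(0, 1, size=60)

# A degree-10 polynomial has plenty of capacity to memorize noise;
# Ridge shrinks the coefficients and limits that capacity
models = {
    "unregularized": make_pipeline(PolynomialFeatures(10), StandardScaler(),
                                   LinearRegression()),
    "ridge": make_pipeline(PolynomialFeatures(10), StandardScaler(),
                           Ridge(alpha=10.0)),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean CV R² = {cv_scores[name]:.3f}")
```

Because the scores come from cross-validation rather than the training set, the overfit model's memorization of noise is penalized rather than rewarded.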

Feature Engineering

Feature engineering is the art of creating informative features from raw data. Good features can matter more than choosing the right algorithm.

Python
import pandas as pd
import numpy as np

# Create new features from existing ones
df['income_per_member'] = df['income'] / df['family_size']
df['debt_to_income'] = df['total_debt'] / df['income']

# Extract from datetime
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Binning continuous variables
df['age_group'] = pd.cut(df['age'],
    bins=[0, 25, 35, 50, 65, 100],
    labels=['18-25', '26-35', '36-50', '51-65', '65+'])

# Log transform skewed features
df['log_income'] = np.log1p(df['income'])

# One-hot encoding for categorical variables
df_encoded = pd.get_dummies(df, columns=['department'], drop_first=True)

Model Validation

  1. Train/Test Split

    The simplest approach: hold out 20-30% of data for testing. Quick but can be unreliable with small datasets.

  2. K-Fold Cross-Validation

    Split data into K folds, train on K-1 folds, test on the remaining fold. Repeat K times. More reliable than a single split.

  3. Stratified K-Fold

    Ensures each fold has the same class distribution as the full dataset. Essential for imbalanced classification problems.

  4. Time Series Split

    For time series data, always use temporal splits — train on past data, test on future data. Never shuffle time series data.
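
All four strategies above have ready-made splitters in scikit-learn. A minimal sketch on synthetic imbalanced data (80% class 0, 20% class 1):

```python
import numpy as np
from sklearn.model_selection import (KFold, StratifiedKFold,
                                     TimeSeriesSplit, train_test_split)

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)   # imbalanced: 80% class 0, 20% class 1

# 1. Simple hold-out split (stratify keeps the class ratio in both parts)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# 2. K-fold: five rotating train/test partitions
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# 3. Stratified K-fold: every test fold keeps the 80/20 class ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, test_idx in skf.split(X, y):
    print(f"positives in fold: {y[test_idx].sum()}")  # 4 in every fold

# 4. Time series split: training indices always precede test indices
tss = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tss.split(X):
    assert train_idx.max() < test_idx.min()
```

Note that `TimeSeriesSplit` never shuffles; each fold trains on a longer prefix of the data and tests on the block that follows it.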

Interpreting Results

A model is only useful if you can explain what it tells you. Key principles:

  • Report effect sizes, not just p-values. A statistically significant result can be practically meaningless if the effect is tiny.
  • Use confidence intervals to show the range of plausible values, not just point estimates.
  • Check assumptions before trusting results. Violated assumptions invalidate conclusions.
  • Consider practical significance. A model that improves prediction by 0.1% may not justify its complexity.
  • Be transparent about limitations. Every model has them — acknowledging them builds trust.

Common Pitfalls

  • Data leakage: Information from outside the training dataset "leaks" into the model during training, artificially inflating performance that then collapses in production. Common causes: using future data to predict the past, including the target variable (or a proxy for it) as a feature, and fitting scalers on the full dataset before splitting.
  • p-hacking: Testing many hypotheses until you find a significant result by chance. Always correct for multiple comparisons.
  • Ignoring class imbalance: Using accuracy on imbalanced datasets. Use precision, recall, F1, or AUC instead.
  • Correlation as causation: Two variables can be correlated without one causing the other. Experimental design is needed for causal claims.
  • Extrapolation: Applying a model outside the range of data it was trained on. Linear models can produce absurd predictions when extrapolated.
  • Ignoring domain knowledge: A model that contradicts well-established domain knowledge likely has a problem, even if its metrics look good.
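
The scaler form of data leakage is worth a concrete sketch (synthetic data, scikit-learn assumed). The fix is simply to compute the scaling statistics on the training split only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(1).normal(loc=50, scale=10, size=(200, 3))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=1)

# Leaky: StandardScaler().fit(X) would bake test-set statistics into training
# Correct: fit the scaler on the training split only, then apply to both
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(6))  # ~0: centered on training stats only
```

In practice, wrapping the scaler and estimator in a scikit-learn `Pipeline` enforces this automatically, even inside cross-validation.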

Frequently Asked Questions

Should I use AIC or BIC?

AIC tends to select more complex models and is better for prediction. BIC penalizes complexity more and tends to select simpler models, making it better for identifying the "true" model. If your goal is prediction, use AIC. If your goal is understanding which variables matter, use BIC. When in doubt, report both.

What counts as a good R²?

It depends entirely on the field. In physics, R² > 0.99 is expected. In social sciences, R² > 0.3 can be excellent. In finance, even R² > 0.05 can be profitable. Focus on whether your model is useful for its intended purpose rather than chasing a specific R² threshold.

When should I use a statistical model versus machine learning?

Use statistical models (regression, ANOVA, etc.) when you need to understand and explain relationships, test specific hypotheses, or quantify uncertainty with confidence intervals. Use machine learning when prediction accuracy is the primary goal, you have large datasets, or the relationships are too complex for traditional models. Many modern approaches blend both.

How many features should my model have?

A common rule of thumb is at least 10-20 observations per feature for linear models. With too many features relative to observations, you risk overfitting. Use feature selection techniques (Lasso, forward/backward selection, mutual information) and domain knowledge to keep only the most informative features.
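
Lasso-based selection can be sketched in a few lines (synthetic data; scikit-learn assumed). Here only two of ten candidate features actually drive the target, and cross-validated Lasso zeroes out most of the rest:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# 100 observations, 10 candidate features, but only two truly matter
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=100)

# LassoCV picks the regularization strength by cross-validation;
# features with zero coefficients are effectively dropped
lasso = LassoCV(cv=5, random_state=2).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print("features kept:", selected)
```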

What if my model's assumptions are violated?

Options include: transforming variables (log, square root), using a different model that does not make those assumptions (e.g., tree-based models do not assume normality), using robust standard errors, or using non-parametric tests. The choice depends on which assumption is violated and how severely.