Intermediate

CatBoost

Explore Yandex's CatBoost framework with ordered boosting to prevent target leakage, best-in-class categorical feature handling, and strong defaults that require minimal tuning.

Why CatBoost?

CatBoost (Categorical Boosting) was developed by Yandex with two key innovations: ordered boosting to prevent target leakage during training, and a sophisticated categorical encoding scheme that avoids the pitfalls of one-hot or label encoding.

Basic Usage

Python
from catboost import CatBoostClassifier

cat_features = ["city", "product_type", "os"]

model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    cat_features=cat_features,
    eval_metric="AUC",
    early_stopping_rounds=50,
    verbose=100
)

model.fit(
    X_train, y_train,
    eval_set=(X_test, y_test)
)

# No need to encode categoricals - CatBoost handles them!
# Note: model.score() returns accuracy; use get_best_score() for the eval metric
print(f"Test AUC: {model.get_best_score()['validation']['AUC']:.4f}")

Ordered Boosting

Traditional gradient boosting has a subtle target leakage problem: the residual for each training example is computed with a model that was itself fit on that example, so the example's own target leaks into its gradient estimate. CatBoost solves this with ordered boosting:

  • Data is randomly permuted at the start of training.
  • For each example, residuals are computed using only the examples that appear before it in the permutation.
  • This prevents the model from "seeing" the target value of the current example during residual computation.

Less Overfitting: Ordered boosting makes CatBoost more robust to overfitting out of the box, especially on small datasets. It often performs well with default parameters.
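The steps above can be illustrated with a toy sketch in plain NumPy. Here the "model" is simply the running mean of the preceding targets, a stand-in for the partially built ensemble; `ordered_residuals` is an illustrative helper, not a CatBoost API:

```python
import numpy as np

def ordered_residuals(y):
    """Toy illustration of ordered boosting's residual rule:
    the residual for example i is computed from a model fit only
    on examples that precede i in the permutation, so example i
    never contributes to its own prediction.
    The "model" here is just the running mean of preceding targets.
    """
    residuals = np.empty(len(y), dtype=float)
    for i in range(len(y)):
        # prediction uses only examples before i; 0.0 when none exist
        pred = np.mean(y[:i]) if i > 0 else 0.0
        residuals[i] = y[i] - pred
    return residuals

y = np.array([1.0, 3.0, 2.0])  # assume the data is already permuted
print(ordered_residuals(y))    # [1. 2. 0.]
```

In plain (unordered) boosting, the prediction for example i would come from a model fit on all examples, including i itself; that is exactly the leakage the permutation removes.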

Categorical Feature Handling

Python
# CatBoost uses "ordered target statistics" for categoricals
# For each category value, it computes:
# encoding = (count_in_class + prior) / (total_count + 1)
# But uses only preceding examples (ordered) to prevent leakage

# Supports combinations of categorical features automatically
model = CatBoostClassifier(
    cat_features=[0, 3, 7],  # Can use column indices
    max_ctr_complexity=2,   # Max categorical feature combinations
)

# Works with string values directly - no preprocessing needed!
# X can contain: ["New York", "Premium", "iOS"]
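The ordered target statistic described in the comments above can be reproduced in a few lines of plain Python. This is a simplified sketch of the idea (one permutation, binary targets), not CatBoost's actual implementation; `ordered_target_statistics` is an illustrative name:

```python
def ordered_target_statistics(categories, targets, prior=0.5):
    """Encode a categorical column with ordered target statistics.

    For each row, the encoding uses only rows that appear *before*
    it in the (already permuted) order, mirroring
        (count_in_class + prior) / (total_count + 1)
    while never looking at the current row's own target.
    """
    counts = {}      # category -> number of preceding occurrences
    in_class = {}    # category -> preceding occurrences with target == 1
    encoded = []
    for cat, y in zip(categories, targets):
        n = counts.get(cat, 0)
        k = in_class.get(cat, 0)
        encoded.append((k + prior) / (n + 1))  # current target excluded
        counts[cat] = n + 1
        in_class[cat] = k + y
    return encoded

cats = ["NY", "SF", "NY", "NY", "SF"]
ys   = [1,    0,    1,    0,    1]
print(ordered_target_statistics(cats, ys))
# first "NY" row has no history  -> (0 + 0.5) / (0 + 1) = 0.5
# third "NY" row saw one NY, y=1 -> (1 + 0.5) / (1 + 1) = 0.75
```

Because each row sees only its predecessors, the same category can map to different numbers at different positions, which is what blocks the target from leaking into its own encoding.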

Framework Comparison

| Feature                | XGBoost         | LightGBM      | CatBoost             |
|------------------------|-----------------|---------------|----------------------|
| Speed (large data)     | Good            | Fastest       | Good                 |
| Categorical handling   | Manual encoding | Good (native) | Best (ordered)       |
| Default performance    | Needs tuning    | Needs tuning  | Good out of box      |
| Overfitting resistance | Regularization  | Regularization| Ordered boosting     |
| GPU support            | Yes             | Yes           | Excellent            |
| Ranking tasks          | Good            | Good          | Excellent (YetiRank) |

Next: Tuning

Learn advanced hyperparameter tuning techniques including Bayesian optimization with Optuna.
