Intermediate
CatBoost
Explore Yandex's CatBoost framework with ordered boosting to prevent target leakage, best-in-class categorical feature handling, and strong defaults that require minimal tuning.
Why CatBoost?
CatBoost (Categorical Boosting) was developed by Yandex with two key innovations: ordered boosting to prevent target leakage during training, and a sophisticated categorical encoding scheme that avoids the pitfalls of one-hot or label encoding.
Basic Usage
Python
from catboost import CatBoostClassifier

cat_features = ["city", "product_type", "os"]

model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    cat_features=cat_features,
    eval_metric="AUC",
    early_stopping_rounds=50,
    verbose=100,
)

model.fit(
    X_train, y_train,
    eval_set=(X_test, y_test),
)

# No need to encode categoricals - CatBoost handles them!
# Note: model.score() reports accuracy, so read the AUC from the
# recorded eval metrics instead.
print(f"Best test AUC: {model.get_best_score()['validation']['AUC']:.4f}")
Ordered Boosting
Traditional gradient boosting has a subtle target leakage problem: the same data used to compute residuals is also used to fit trees. CatBoost solves this with ordered boosting:
- Data is randomly permuted at the start of training.
- For each example, residuals are computed using only the examples that appear before it in the permutation.
- This prevents the model from "seeing" the target value of the current example during residual computation.
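The steps above can be illustrated with a toy sketch. This is not CatBoost's actual implementation (which maintains multiple permutations and real tree ensembles); here the per-prefix "model" is just a running mean of the target, and all names are assumptions for illustration:

```python
import random

def ordered_residuals(y, seed=42):
    """Compute residuals where each example's prediction uses only
    the examples preceding it in a random permutation."""
    rng = random.Random(seed)
    perm = list(range(len(y)))
    rng.shuffle(perm)

    residuals = [0.0] * len(y)
    running_sum, seen = 0.0, 0
    for idx in perm:
        # Prediction from preceding examples only; a prior of 0.5
        # stands in for the first example, which has no history.
        pred = running_sum / seen if seen else 0.5
        residuals[idx] = y[idx] - pred
        running_sum += y[idx]
        seen += 1
    return residuals

y = [1, 0, 1, 1, 0, 1]
print(ordered_residuals(y))
```

Because no example's own target enters its prediction, the residuals are unbiased in the sense ordered boosting targets; a conventional boosting round would instead fit residuals of a model that has already seen every target.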
Less Overfitting: Ordered boosting makes CatBoost more robust to overfitting out of the box, especially on small datasets. It often performs well with default parameters.
Categorical Feature Handling
Python
from catboost import CatBoostClassifier

# CatBoost uses "ordered target statistics" for categoricals.
# For each category value, it computes:
#     encoding = (count_in_class + prior) / (total_count + 1)
# but uses only preceding examples (ordered) to prevent leakage.
# It also builds combinations of categorical features automatically.

model = CatBoostClassifier(
    cat_features=[0, 3, 7],   # column indices also work
    max_ctr_complexity=2,     # max categorical feature combinations
)

# Works with string values directly - no preprocessing needed!
# X can contain: ["New York", "Premium", "iOS"]
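The encoding formula in the comments can be traced by hand with a small pure-Python sketch. This is a simplification (CatBoost averages over several permutations and uses configurable priors); the function name and prior value are assumptions:

```python
def ordered_target_stats(categories, y, prior=0.5):
    """Encode each categorical value via
    (count_in_class + prior) / (total_count + 1),
    counting only the examples that precede it in the data order.
    Simplified sketch of CatBoost's ordered target statistics."""
    totals = {}     # category -> occurrences seen so far
    positives = {}  # category -> sum of targets seen so far
    encoded = []
    for cat, target in zip(categories, y):
        total = totals.get(cat, 0)
        pos = positives.get(cat, 0)
        encoded.append((pos + prior) / (total + 1))
        totals[cat] = total + 1
        positives[cat] = pos + target
    return encoded

cities = ["NY", "NY", "LA", "NY", "LA"]
labels = [1, 0, 1, 1, 0]
print(ordered_target_stats(cities, labels))
# → [0.5, 0.75, 0.5, 0.5, 0.75]
```

Note how the first occurrence of every category gets the prior-based value 0.5, and later occurrences shift toward the category's observed target rate, without ever using the current row's own label.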
Framework Comparison
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Speed (large data) | Good | Fastest | Good |
| Categorical handling | Manual encoding | Good (native) | Best (ordered) |
| Default performance | Needs tuning | Needs tuning | Good out of box |
| Overfitting resistance | Regularization | Regularization | Ordered boosting |
| GPU support | Yes | Yes | Excellent |
| Ranking tasks | Good | Good | Excellent (YetiRank) |
Next: Tuning
Learn advanced hyperparameter tuning techniques including Bayesian optimization with Optuna.