Intermediate
Random Forest Deep Dive
Harness the "wisdom of the crowd" — learn how combining hundreds of diverse decision trees creates one of the most robust and versatile ML algorithms.
Ensemble Learning and Bagging
Ensemble learning is the idea that combining multiple weak models creates a strong model. Random Forest uses a specific ensemble technique called bagging (Bootstrap AGGregatING):
Bagging Process:
1. Create N bootstrap samples (random samples WITH replacement)
- Each sample is ~63.2% of the original data (some rows repeated, some missing)
2. Train one decision tree on each bootstrap sample
3. Aggregate predictions:
- Classification: majority vote across all trees
- Regression: average prediction across all trees
Why it works:
- Each tree sees different data → different errors
- Averaging reduces variance (noise) dramatically
- Individual trees overfit, but the ensemble generalizes
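The bagging recipe above can be sketched by hand with NumPy and a handful of sklearn trees (a minimal sketch; the synthetic dataset and the 25-tree count are illustrative choices, not part of the algorithm). It also verifies the ~63.2% figure: sampling n rows with replacement leaves any given row out with probability (1 - 1/n)^n ≈ 1/e ≈ 36.8%.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
n = len(X)

# One bootstrap sample contains ~63.2% unique rows: 1 - (1 - 1/n)^n -> 1 - 1/e
idx = rng.integers(0, n, size=n)            # sample WITH replacement
unique_frac = len(np.unique(idx)) / n
print(f"unique rows in one bootstrap sample: {unique_frac:.3f}")

# Bagging: train one full-depth tree per bootstrap sample, then majority-vote
trees = []
for _ in range(25):
    idx = rng.integers(0, n, size=n)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

votes = np.stack([t.predict(X) for t in trees])      # shape (25, n)
majority = (votes.mean(axis=0) >= 0.5).astype(int)   # classification: majority vote
print(f"ensemble training accuracy: {(majority == y).mean():.3f}")
```

For regression, the only change in the aggregation step is replacing the majority vote with `votes.mean(axis=0)`.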
How Random Forest Works
Random Forest adds an extra layer of randomness on top of bagging: random feature subsets. At each split, only a random subset of features is considered:
Random Forest Algorithm:
1. For each of N trees (n_estimators):
a. Draw a bootstrap sample from training data
b. Grow a decision tree, but at EACH SPLIT:
- Randomly select m features out of total M features
- Find the best split among only those m features
- Split the node
c. Grow tree to full depth (no pruning usually)
2. Make predictions:
- Classification: each tree votes, majority wins
- Regression: average all tree predictions
Default m values (max_features):
- Classification: m = sqrt(M)
- Regression: m = M/3
This decorrelates the trees → reduces variance further
Why random feature subsets? Without this, if one feature is very strong, every tree would split on it first. All trees would be similar (correlated), and averaging them wouldn't help much. By forcing each split to consider a random subset, trees become diverse, and the ensemble benefits from true "wisdom of the crowd."
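This decorrelation effect can be measured directly by comparing how often individual trees in a forest agree with each other, with and without random feature subsets (a rough sketch; the synthetic dataset, 20 trees, and agreement-on-training-data metric are illustrative assumptions):

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           random_state=0)

def mean_tree_agreement(max_features):
    """Average pairwise agreement between the individual trees' predictions."""
    rf = RandomForestClassifier(n_estimators=20, max_features=max_features,
                                random_state=0).fit(X, y)
    preds = np.stack([t.predict(X) for t in rf.estimators_])
    pairs = combinations(range(len(preds)), 2)
    return np.mean([(preds[i] == preds[j]).mean() for i, j in pairs])

# max_features=1.0: every split sees all features -> trees latch onto the
# same strong features and agree more. 'sqrt' forces diversity.
a_full = mean_tree_agreement(1.0)
a_sqrt = mean_tree_agreement('sqrt')
print(f"agreement, max_features=1.0:    {a_full:.3f}")
print(f"agreement, max_features='sqrt': {a_sqrt:.3f}")
```

Exact numbers vary with the data, but the forest built with `max_features='sqrt'` should show lower inter-tree agreement, i.e. more diverse trees for the ensemble to average over.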
Key Hyperparameters
| Parameter | Description | Default | Tuning Guide |
|---|---|---|---|
| n_estimators | Number of trees in the forest | 100 | More is better (diminishing returns after 200-500); increase until performance plateaus. |
| max_depth | Maximum depth of each tree | None (unlimited) | Start with None; reduce if overfitting (try 10-30). |
| max_features | Features considered per split | 'sqrt' (clf) / 1.0 (reg) | 'sqrt', 'log2', or a float (0.3-0.8); lower = more diversity. |
| min_samples_split | Min samples to split a node | 2 | Increase to reduce overfitting (try 5-20). |
| min_samples_leaf | Min samples in a leaf | 1 | Increase for smoother predictions (try 2-10). |
| bootstrap | Use bootstrap sampling | True | Keep True; False means each tree uses all the data. |
| n_jobs | Parallel CPU cores | None (1 core) | Set to -1 to use all cores. |
Feature Importance
Random Forest provides built-in feature importance based on how much each feature decreases impurity across all trees:
Mean Decrease in Impurity (MDI):
- For each feature, sum the Gini/entropy reduction across all splits in all trees where that feature is used
- Normalize so importances sum to 1
- Available via: model.feature_importances_
Permutation Importance (more reliable):
- For each feature, shuffle its values and measure accuracy drop
- Features that cause large accuracy drops are important
- Less biased toward high-cardinality features
- Available via: sklearn.inspection.permutation_importance()
Out-of-Bag (OOB) Score
Each bootstrap sample leaves out ~36.8% of the data. These "out-of-bag" samples provide a free validation set:
OOB Score:
- For each sample, only trees that did NOT include it in
their bootstrap sample make predictions
- Average these predictions → OOB prediction
- Compare to actual values → OOB score
Benefits:
- No need for a separate validation set
- Nearly as accurate as cross-validation
- Saves computation time
Usage:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(oob_score=True)
model.fit(X_train, y_train)
print(model.oob_score_)  # e.g., 0.965
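The claim that OOB is "nearly as accurate as cross-validation" is easy to spot-check. A minimal sketch (the breast-cancer dataset and 200-tree forest are illustrative choices, not requirements):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# OOB score: a free validation estimate from the ~36.8% left out per tree
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=42, n_jobs=-1).fit(X, y)

# 5-fold cross-validation for comparison (retrains the forest 5 times)
cv_acc = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1),
    X, y, cv=5).mean()

print(f"OOB score: {rf.oob_score_:.3f}")
print(f"5-fold CV: {cv_acc:.3f}")
```

The two estimates typically land within a couple of percentage points of each other, while OOB comes essentially for free from a single fit.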
Advantages Over a Single Decision Tree
| Aspect | Single Decision Tree | Random Forest |
|---|---|---|
| Overfitting | Severe (memorizes noise) | Minimal (averaging cancels noise) |
| Stability | Unstable (small data change = different tree) | Stable (robust to perturbations) |
| Accuracy | Moderate | High (averaging typically improves test accuracy) |
| Variance | High | Low (averaging reduces variance) |
| Bias | Low (deep trees) | Low (inherits from deep trees) |
| Interpretability | High (can visualize) | Medium (feature importance only) |
| Training Speed | Fast | Slower (N trees, but parallelizable) |
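The overfitting and accuracy rows of this table can be checked with a side-by-side fit (a quick sketch using the same breast-cancer dataset and split as the implementation below; exact scores depend on the data and seed):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=42,
                                n_jobs=-1).fit(X_train, y_train)

# A full-depth single tree fits the training set perfectly (memorizes it)
# but generalizes worse; the forest gives up nothing important in training
# fit and scores higher on held-out data.
print(f"tree   train/test: {tree.score(X_train, y_train):.3f} / "
      f"{tree.score(X_test, y_test):.3f}")
print(f"forest train/test: {forest.score(X_train, y_train):.3f} / "
      f"{forest.score(X_test, y_test):.3f}")
```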
Python Implementation with Feature Importance
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.inspection import permutation_importance
from sklearn.datasets import load_breast_cancer
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# --- Train Random Forest ---
rf = RandomForestClassifier(
n_estimators=200, # 200 trees
max_depth=None, # unlimited depth
max_features='sqrt', # sqrt(n_features) per split
min_samples_split=5,
min_samples_leaf=2,
oob_score=True, # out-of-bag score
n_jobs=-1, # use all CPU cores
random_state=42
)
rf.fit(X_train, y_train)
# --- Evaluate ---
y_pred = rf.predict(X_test)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"OOB Score: {rf.oob_score_:.4f}")
print(f"\n{classification_report(y_test, y_pred, target_names=data.target_names)}")
# --- Feature Importance (MDI) ---
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1][:15] # top 15
plt.figure(figsize=(10, 6))
plt.bar(range(len(indices)), importances[indices])
plt.xticks(range(len(indices)),
[feature_names[i] for i in indices], rotation=45, ha='right')
plt.title("Random Forest - Top 15 Feature Importances (MDI)")
plt.ylabel("Mean Decrease in Impurity")
plt.tight_layout()
plt.show()
# --- Permutation Importance (more reliable) ---
perm_imp = permutation_importance(rf, X_test, y_test, n_repeats=10,
random_state=42, n_jobs=-1)
perm_indices = np.argsort(perm_imp.importances_mean)[::-1][:15]
print("\nPermutation Importance (Top 15):")
for i, idx in enumerate(perm_indices):
print(f" {i+1}. {feature_names[idx]:<30} "
f"{perm_imp.importances_mean[idx]:.4f} "
f"+/- {perm_imp.importances_std[idx]:.4f}")
# --- Hyperparameter Tuning with GridSearchCV ---
param_grid = {
'n_estimators': [100, 200, 500],
'max_depth': [None, 10, 20, 30],
'max_features': ['sqrt', 'log2', 0.5],
'min_samples_split': [2, 5, 10],
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42, n_jobs=-1),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best CV Accuracy: {grid_search.best_score_:.4f}")
print(f"Test Accuracy: {grid_search.score(X_test, y_test):.4f}")
# --- Effect of n_estimators ---
n_trees = [1, 5, 10, 25, 50, 100, 200, 500]
train_scores, test_scores = [], []
for n in n_trees:
model = RandomForestClassifier(n_estimators=n, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)
train_scores.append(accuracy_score(y_train, model.predict(X_train)))
test_scores.append(accuracy_score(y_test, model.predict(X_test)))
plt.figure(figsize=(8, 5))
plt.plot(n_trees, train_scores, 'b-o', label='Train Accuracy')
plt.plot(n_trees, test_scores, 'r-o', label='Test Accuracy')
plt.xlabel('Number of Trees (n_estimators)')
plt.ylabel('Accuracy')
plt.title('Random Forest: Accuracy vs Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Tuning guide: Start with n_estimators=200 and default hyperparameters. Random Forest works surprisingly well out of the box. Only tune if you need that last 1-2% accuracy. Focus on: (1) n_estimators (more is better, up to diminishing returns), (2) max_features (lower values = more diversity), (3) min_samples_leaf (prevents noise in leaves).
Lilly Tech Systems