Intermediate

Random Forest Deep Dive

Harness the "wisdom of the crowd" — learn how combining hundreds of diverse decision trees creates one of the most robust and versatile ML algorithms.

Ensemble Learning and Bagging

Ensemble learning is the idea that combining multiple weak models creates a strong model. Random Forest uses a specific ensemble technique called bagging (Bootstrap AGGregatING):

Bagging Process:
1. Create N bootstrap samples (random samples WITH replacement)
   - Each sample is ~63.2% of the original data (some rows repeated, some missing)
2. Train one decision tree on each bootstrap sample
3. Aggregate predictions:
   - Classification: majority vote across all trees
   - Regression: average prediction across all trees

Why it works:
  - Each tree sees different data → different errors
  - Averaging reduces variance (noise) dramatically
  - Individual trees overfit, but the ensemble generalizes
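The ~63.2% figure comes from the fact that each row has probability 1 - (1 - 1/n)^n ≈ 1 - 1/e of being drawn at least once in n draws with replacement. A quick sketch in plain NumPy (not part of scikit-learn's API) confirms it:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000

# One bootstrap sample: n draws WITH replacement from row indices 0..n-1
bootstrap_idx = rng.integers(0, n, size=n)

# Fraction of distinct rows that made it into the sample
unique_fraction = len(np.unique(bootstrap_idx)) / n
print(f"Unique rows in bootstrap sample: {unique_fraction:.3f}")  # ≈ 0.632
print(f"Out-of-bag fraction:             {1 - unique_fraction:.3f}")  # ≈ 0.368
```

The out-of-bag fraction shown here is exactly the ~36.8% that the OOB score (covered later) exploits as a free validation set.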

How Random Forest Works

Random Forest adds an extra layer of randomness on top of bagging: random feature subsets. At each split, only a random subset of features is considered:

Random Forest Algorithm:
1. For each of N trees (n_estimators):
   a. Draw a bootstrap sample from training data
   b. Grow a decision tree, but at EACH SPLIT:
      - Randomly select m features out of total M features
      - Find the best split among only those m features
      - Split the node
   c. Grow tree to full depth (no pruning usually)
2. Make predictions:
   - Classification: each tree votes, majority wins
   - Regression: average all tree predictions

Default m values (max_features):
  - Classification: m = sqrt(M)
  - Regression: m = M/3 (Breiman's classic recommendation; scikit-learn's
    default is m = M, i.e. max_features=1.0)

This decorrelates the trees → reduces variance further
💡 Why random feature subsets? Without this, if one feature is very strong, every tree would split on it first. All trees would be similar (correlated), and averaging them wouldn't help much. By forcing each split to consider a random subset, trees become diverse, and the ensemble benefits from true "wisdom of the crowd."
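One way to see this decorrelation directly is to measure how often pairs of trees in a forest disagree on held-out points. The sketch below (using scikit-learn's built-in breast cancer data, chosen here just for illustration) compares forests that differ only in max_features; higher disagreement means more diverse, less correlated trees:

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def mean_pairwise_disagreement(max_features):
    """Train a small forest; return how often pairs of trees disagree."""
    rf = RandomForestClassifier(n_estimators=25, max_features=max_features,
                                random_state=0).fit(X_tr, y_tr)
    # Each row: one individual tree's predictions on the held-out set
    preds = np.array([tree.predict(X_te) for tree in rf.estimators_])
    return float(np.mean([np.mean(preds[i] != preds[j])
                          for i, j in combinations(range(len(preds)), 2)]))

d_all = mean_pairwise_disagreement(1.0)      # every split sees all M features
d_sqrt = mean_pairwise_disagreement('sqrt')  # every split sees sqrt(M) features

print(f"Mean pairwise disagreement, max_features=1.0:    {d_all:.3f}")
print(f"Mean pairwise disagreement, max_features='sqrt': {d_sqrt:.3f}")
```

With all features available at every split, the trees differ only through their bootstrap samples; restricting each split to a random subset typically pushes the disagreement noticeably higher.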

Key Hyperparameters

| Parameter | Description | Default | Tuning Guide |
|---|---|---|---|
| n_estimators | Number of trees in the forest | 100 | More is better (diminishing returns after 200-500). Increase until performance plateaus. |
| max_depth | Maximum depth of each tree | None (unlimited) | Start with None. Reduce if overfitting (try 10-30). |
| max_features | Features per split | 'sqrt' (clf) / 1.0 (reg) | 'sqrt', 'log2', or float (0.3-0.8). Lower = more diversity. |
| min_samples_split | Min samples to split a node | 2 | Increase to reduce overfitting (try 5-20). |
| min_samples_leaf | Min samples in a leaf | 1 | Increase for smoother predictions (try 2-10). |
| bootstrap | Use bootstrap sampling | True | Keep True. False = each tree uses all data. |
| n_jobs | Parallel CPU cores | None (1 core) | Set to -1 to use all cores. |

Feature Importance

Random Forest provides built-in feature importance based on how much each feature decreases impurity across all trees:

Mean Decrease in Impurity (MDI):
  - For each feature, sum the Gini/entropy reduction across all splits
    in all trees where that feature is used
  - Normalize so importances sum to 1
  - Available via: model.feature_importances_

Permutation Importance (more reliable):
  - For each feature, shuffle its values and measure accuracy drop
  - Features that cause large accuracy drops are important
  - Less biased toward high-cardinality features
  - Available via: sklearn.inspection.permutation_importance()

Out-of-Bag (OOB) Score

Each bootstrap sample leaves out ~36.8% of the data. These "out-of-bag" samples provide a free validation set:

OOB Score:
  - For each sample, only trees that did NOT include it in
    their bootstrap sample make predictions
  - Average these predictions → OOB prediction
  - Compare to actual values → OOB score

Benefits:
  - No need for a separate validation set
  - Nearly as accurate as cross-validation
  - Saves computation time

Usage:
  from sklearn.ensemble import RandomForestClassifier

  model = RandomForestClassifier(oob_score=True)
  model.fit(X_train, y_train)
  print(model.oob_score_)  # e.g., 0.965

Advantages Over a Single Decision Tree

| Aspect | Single Decision Tree | Random Forest |
|---|---|---|
| Overfitting | Severe (memorizes noise) | Minimal (averaging cancels noise) |
| Stability | Unstable (small data change = different tree) | Stable (robust to perturbations) |
| Accuracy | Moderate | High (typically 5-15% better) |
| Variance | High | Low (averaging reduces variance) |
| Bias | Low (deep trees) | Low (inherits from deep trees) |
| Interpretability | High (can visualize) | Medium (feature importance only) |
| Training Speed | Fast | Slower (N trees, but parallelizable) |
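The exact accuracy gap depends on the dataset, but it is easy to check on a small example. This sketch compares a single unpruned tree against a forest on scikit-learn's breast cancer data; the precise numbers will vary with the split and random seed:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)

# One deep, unpruned tree vs. an ensemble of 200 such trees
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=42,
                                n_jobs=-1).fit(X_tr, y_tr)

acc_tree = tree.score(X_te, y_te)
acc_forest = forest.score(X_te, y_te)
print(f"Single tree test accuracy:   {acc_tree:.4f}")
print(f"Random forest test accuracy: {acc_forest:.4f}")
```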

Python Implementation with Feature Importance

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.inspection import permutation_importance
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- Train Random Forest ---
rf = RandomForestClassifier(
    n_estimators=200,       # 200 trees
    max_depth=None,         # unlimited depth
    max_features='sqrt',    # sqrt(n_features) per split
    min_samples_split=5,
    min_samples_leaf=2,
    oob_score=True,         # out-of-bag score
    n_jobs=-1,              # use all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)

# --- Evaluate ---
y_pred = rf.predict(X_test)
print(f"Test Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"OOB Score:      {rf.oob_score_:.4f}")
print(f"\n{classification_report(y_test, y_pred, target_names=data.target_names)}")

# --- Feature Importance (MDI) ---
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1][:15]  # top 15

plt.figure(figsize=(10, 6))
plt.bar(range(len(indices)), importances[indices])
plt.xticks(range(len(indices)),
           [feature_names[i] for i in indices], rotation=45, ha='right')
plt.title("Random Forest - Top 15 Feature Importances (MDI)")
plt.ylabel("Mean Decrease in Impurity")
plt.tight_layout()
plt.show()

# --- Permutation Importance (more reliable) ---
perm_imp = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                   random_state=42, n_jobs=-1)
perm_indices = np.argsort(perm_imp.importances_mean)[::-1][:15]

print("\nPermutation Importance (Top 15):")
for i, idx in enumerate(perm_indices):
    print(f"  {i+1}. {feature_names[idx]:<30} "
          f"{perm_imp.importances_mean[idx]:.4f} "
          f"+/- {perm_imp.importances_std[idx]:.4f}")

# --- Hyperparameter Tuning with GridSearchCV ---
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'max_features': ['sqrt', 'log2', 0.5],
    'min_samples_split': [2, 5, 10],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)

print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best CV Accuracy: {grid_search.best_score_:.4f}")
print(f"Test Accuracy:    {grid_search.score(X_test, y_test):.4f}")

# --- Effect of n_estimators ---
n_trees = [1, 5, 10, 25, 50, 100, 200, 500]
train_scores, test_scores = [], []

for n in n_trees:
    model = RandomForestClassifier(n_estimators=n, random_state=42, n_jobs=-1)
    model.fit(X_train, y_train)
    train_scores.append(accuracy_score(y_train, model.predict(X_train)))
    test_scores.append(accuracy_score(y_test, model.predict(X_test)))

plt.figure(figsize=(8, 5))
plt.plot(n_trees, train_scores, 'b-o', label='Train Accuracy')
plt.plot(n_trees, test_scores, 'r-o', label='Test Accuracy')
plt.xlabel('Number of Trees (n_estimators)')
plt.ylabel('Accuracy')
plt.title('Random Forest: Accuracy vs Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Tuning guide: Start with n_estimators=200 and default hyperparameters. Random Forest works surprisingly well out of the box. Only tune if you need that last 1-2% accuracy. Focus on: (1) n_estimators (more is better, up to diminishing returns), (2) max_features (lower values = more diversity), (3) min_samples_leaf (prevents noise in leaves).