Intermediate

Random Forest Deep Dive

Harness the "wisdom of the crowd" — learn how combining hundreds of diverse decision trees creates one of the most robust and versatile ML algorithms.

Ensemble Learning and Bagging

Ensemble learning is the idea that combining multiple weak models creates a strong model. Random Forest uses a specific ensemble technique called bagging (Bootstrap AGGregatING):

Bagging Process:
1. Create N bootstrap samples (random samples WITH replacement)
   - Each sample is ~63.2% of the original data (some rows repeated, some missing)
2. Train one decision tree on each bootstrap sample
3. Aggregate predictions:
   - Classification: majority vote across all trees
   - Regression: average prediction across all trees

Why it works:
  - Each tree sees different data → different errors
  - Averaging reduces variance (noise) dramatically
  - Individual trees overfit, but the ensemble generalizes
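The ~63.2% figure comes from the fact that each row has probability 1 - (1 - 1/n)^n ≈ 1 - 1/e of being drawn at least once in n draws with replacement. A quick sketch in plain NumPy (not part of scikit-learn's API) confirms it:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000

# One bootstrap sample: n draws WITH replacement from row indices 0..n-1
bootstrap_idx = rng.integers(0, n, size=n)

# Fraction of distinct rows that made it into the sample
unique_fraction = len(np.unique(bootstrap_idx)) / n
print(f"Unique rows in bootstrap sample: {unique_fraction:.3f}")  # ≈ 0.632
print(f"Out-of-bag fraction:             {1 - unique_fraction:.3f}")  # ≈ 0.368
```

The out-of-bag fraction shown here is exactly the ~36.8% that the OOB score (covered later) exploits as a free validation set.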

How Random Forest Works

Random Forest adds an extra layer of randomness on top of bagging: random feature subsets. At each split, only a random subset of features is considered:

Random Forest Algorithm:
1. For each of N trees (n_estimators):
   a. Draw a bootstrap sample from training data
   b. Grow a decision tree, but at EACH SPLIT:
      - Randomly select m features out of total M features
      - Find the best split among only those m features
      - Split the node
   c. Grow tree to full depth (no pruning usually)
2. Make predictions:
   - Classification: each tree votes, majority wins
   - Regression: average all tree predictions

Default m values (max_features):
  - Classification: m = sqrt(M)
  - Regression: m = M/3 (Breiman's classic recommendation; scikit-learn's
    default is m = M, i.e. max_features=1.0)

This decorrelates the trees → reduces variance further
💡 Why random feature subsets? Without this, if one feature is very strong, every tree would split on it first. All trees would be similar (correlated), and averaging them wouldn't help much. By forcing each split to consider a random subset, trees become diverse, and the ensemble benefits from true "wisdom of the crowd."
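One way to see this decorrelation directly is to measure how often pairs of trees in a forest disagree on held-out points. The sketch below (using scikit-learn's built-in breast cancer data, chosen here just for illustration) compares forests that differ only in max_features; higher disagreement means more diverse, less correlated trees:

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def mean_pairwise_disagreement(max_features):
    """Train a small forest; return how often pairs of trees disagree."""
    rf = RandomForestClassifier(n_estimators=25, max_features=max_features,
                                random_state=0).fit(X_tr, y_tr)
    # Each row: one individual tree's predictions on the held-out set
    preds = np.array([tree.predict(X_te) for tree in rf.estimators_])
    return float(np.mean([np.mean(preds[i] != preds[j])
                          for i, j in combinations(range(len(preds)), 2)]))

d_all = mean_pairwise_disagreement(1.0)      # every split sees all M features
d_sqrt = mean_pairwise_disagreement('sqrt')  # every split sees sqrt(M) features

print(f"Mean pairwise disagreement, max_features=1.0:    {d_all:.3f}")
print(f"Mean pairwise disagreement, max_features='sqrt': {d_sqrt:.3f}")
```

With all features available at every split, the trees differ only through their bootstrap samples; restricting each split to a random subset typically pushes the disagreement noticeably higher.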

Key Hyperparameters

| Parameter | Description | Default | Tuning Guide |
|---|---|---|---|
| n_estimators | Number of trees in the forest | 100 | More is better (diminishing returns after 200-500). Increase until performance plateaus. |
| max_depth | Maximum depth of each tree | None (unlimited) | Start with None. Reduce if overfitting (try 10-30). |
| max_features | Features per split | 'sqrt' (clf) / 1.0 (reg) | 'sqrt', 'log2', or float (0.3-0.8). Lower = more diversity. |
| min_samples_split | Min samples to split a node | 2 | Increase to reduce overfitting (try 5-20). |
| min_samples_leaf | Min samples in a leaf | 1 | Increase for smoother predictions (try 2-10). |
| bootstrap | Use bootstrap sampling | True | Keep True. False = each tree uses all data. |
| n_jobs | Parallel CPU cores | None (1 core) | Set to -1 to use all cores. |

Feature Importance

Random Forest provides built-in feature importance based on how much each feature decreases impurity across all trees:

Mean Decrease in Impurity (MDI):
  - For each feature, sum the Gini/entropy reduction across all splits
    in all trees where that feature is used
  - Normalize so importances sum to 1
  - Available via: model.feature_importances_

Permutation Importance (more reliable):
  - For each feature, shuffle its values and measure accuracy drop
  - Features that cause large accuracy drops are important
  - Less biased toward high-cardinality features
  - Available via: sklearn.inspection.permutation_importance()

Out-of-Bag (OOB) Score

Each bootstrap sample leaves out ~36.8% of the data. These "out-of-bag" samples provide a free validation set:

OOB Score:
  - For each sample, only trees that did NOT include it in
    their bootstrap sample make predictions
  - Average these predictions → OOB prediction
  - Compare to actual values → OOB score

Benefits:
  - No need for a separate validation set
  - Nearly as accurate as cross-validation
  - Saves computation time

Usage:
  from sklearn.ensemble import RandomForestClassifier

  model = RandomForestClassifier(oob_score=True)
  model.fit(X_train, y_train)
  print(model.oob_score_)  # e.g., 0.965

Advantages Over a Single Decision Tree

| Aspect | Single Decision Tree | Random Forest |
|---|---|---|
| Overfitting | Severe (memorizes noise) | Minimal (averaging cancels noise) |
| Stability | Unstable (small data change = different tree) | Stable (robust to perturbations) |
| Accuracy | Moderate | High (typically 5-15% better) |
| Variance | High | Low (averaging reduces variance) |
| Bias | Low (deep trees) | Low (inherits from deep trees) |
| Interpretability | High (can visualize) | Medium (feature importance only) |
| Training Speed | Fast | Slower (N trees, but parallelizable) |
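The exact accuracy gap depends on the dataset, but it is easy to check on a small example. This sketch compares a single unpruned tree against a forest on scikit-learn's breast cancer data; the precise numbers will vary with the split and random seed:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)

# One deep, unpruned tree vs. an ensemble of 200 such trees
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=42,
                                n_jobs=-1).fit(X_tr, y_tr)

acc_tree = tree.score(X_te, y_te)
acc_forest = forest.score(X_te, y_te)
print(f"Single tree test accuracy:   {acc_tree:.4f}")
print(f"Random forest test accuracy: {acc_forest:.4f}")
```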

Python Implementation with Feature Importance

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.inspection import permutation_importance
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- Train Random Forest ---
rf = RandomForestClassifier(
    n_estimators=200,       # 200 trees
    max_depth=None,         # unlimited depth
    max_features='sqrt',    # sqrt(n_features) per split
    min_samples_split=5,
    min_samples_leaf=2,
    oob_score=True,         # out-of-bag score
    n_jobs=-1,              # use all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)

# --- Evaluate ---
y_pred = rf.predict(X_test)
print(f"Test Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"OOB Score:      {rf.oob_score_:.4f}")
print(f"\n{classification_report(y_test, y_pred, target_names=data.target_names)}")

# --- Feature Importance (MDI) ---
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1][:15]  # top 15

plt.figure(figsize=(10, 6))
plt.bar(range(len(indices)), importances[indices])
plt.xticks(range(len(indices)),
           [feature_names[i] for i in indices], rotation=45, ha='right')
plt.title("Random Forest - Top 15 Feature Importances (MDI)")
plt.ylabel("Mean Decrease in Impurity")
plt.tight_layout()
plt.show()

# --- Permutation Importance (more reliable) ---
perm_imp = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                   random_state=42, n_jobs=-1)
perm_indices = np.argsort(perm_imp.importances_mean)[::-1][:15]

print("\nPermutation Importance (Top 15):")
for i, idx in enumerate(perm_indices):
    print(f"  {i+1}. {feature_names[idx]:<30} "
          f"{perm_imp.importances_mean[idx]:.4f} "
          f"+/- {perm_imp.importances_std[idx]:.4f}")

# --- Hyperparameter Tuning with GridSearchCV ---
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'max_features': ['sqrt', 'log2', 0.5],
    'min_samples_split': [2, 5, 10],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)

print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best CV Accuracy: {grid_search.best_score_:.4f}")
print(f"Test Accuracy:    {grid_search.score(X_test, y_test):.4f}")

# --- Effect of n_estimators ---
n_trees = [1, 5, 10, 25, 50, 100, 200, 500]
train_scores, test_scores = [], []

for n in n_trees:
    model = RandomForestClassifier(n_estimators=n, random_state=42, n_jobs=-1)
    model.fit(X_train, y_train)
    train_scores.append(accuracy_score(y_train, model.predict(X_train)))
    test_scores.append(accuracy_score(y_test, model.predict(X_test)))

plt.figure(figsize=(8, 5))
plt.plot(n_trees, train_scores, 'b-o', label='Train Accuracy')
plt.plot(n_trees, test_scores, 'r-o', label='Test Accuracy')
plt.xlabel('Number of Trees (n_estimators)')
plt.ylabel('Accuracy')
plt.title('Random Forest: Accuracy vs Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Tuning guide: Start with n_estimators=200 and default hyperparameters. Random Forest works surprisingly well out of the box. Only tune if you need that last 1-2% accuracy. Focus on: (1) n_estimators (more is better, up to diminishing returns), (2) max_features (lower values = more diversity), (3) min_samples_leaf (prevents noise in leaves).