Intermediate
Implement Decision Trees & Random Forest
Decision trees are among the most commonly asked ML coding questions, second only to linear models. Build a decision tree classifier from scratch using information gain and Gini impurity, then extend it to a random forest ensemble.
Interview Question #1: “Implement a decision tree classifier from scratch. It should support both Gini impurity and information gain (entropy) as splitting criteria. Include fit and predict methods.”
Understanding the Splitting Criteria
Gini Impurity
Gini impurity measures how often a randomly chosen element would be misclassified. For a node with K classes:
Gini(node) = 1 - sum(p_k^2) for k in 1..K
# Example: node with 40 class-0 and 60 class-1 samples
# Gini = 1 - (0.4^2 + 0.6^2) = 1 - (0.16 + 0.36) = 0.48
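The worked example above is easy to verify directly. A minimal NumPy sketch (the `gini` helper here is standalone, not part of the class built later):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum of squared class probabilities."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Node with 40 class-0 and 60 class-1 samples
y = np.array([0] * 40 + [1] * 60)
print(gini(y))  # 0.48, matching the hand calculation above
```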
Information Gain (Entropy)
Entropy measures the disorder in a node. Information gain is the reduction in entropy after a split:
Entropy(node) = -sum(p_k * log2(p_k)) for k in 1..K
Information_Gain = Entropy(parent) - weighted_avg(Entropy(children))
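These two formulas can also be checked numerically. The sketch below (standalone `entropy` helper, hypothetical 50/50 parent node) shows that a perfectly separating split on a balanced binary parent yields the maximum possible gain of 1 bit:

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Parent: 50/50 class balance -> entropy = 1.0 bit
parent = np.array([0] * 50 + [1] * 50)
# A split that sends all class-0 left and all class-1 right gives pure children
left, right = np.array([0] * 50), np.array([1] * 50)
gain = entropy(parent) - (0.5 * entropy(left) + 0.5 * entropy(right))
print(gain)  # 1.0 -- the maximum gain for a balanced binary parent
```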
Decision Tree: Complete Solution
import numpy as np
from collections import Counter


class Node:
    """A node in the decision tree."""
    def __init__(self, feature=None, threshold=None, left=None,
                 right=None, value=None):
        self.feature = feature      # Index of feature to split on
        self.threshold = threshold  # Threshold value for the split
        self.left = left            # Left subtree (feature <= threshold)
        self.right = right          # Right subtree (feature > threshold)
        self.value = value          # Class label (for leaf nodes)

    def is_leaf(self):
        return self.value is not None


class DecisionTreeClassifier:
    """
    Decision tree classifier built from scratch.
    Supports Gini impurity and entropy splitting criteria.
    """
    def __init__(self, max_depth=10, min_samples_split=2,
                 criterion='gini'):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.criterion = criterion
        self.root = None

    def _gini(self, y):
        """Compute Gini impurity."""
        counts = np.bincount(y)
        probabilities = counts / len(y)
        return 1.0 - np.sum(probabilities ** 2)

    def _entropy(self, y):
        """Compute entropy."""
        counts = np.bincount(y)
        probabilities = counts / len(y)
        # Filter out zero probabilities to avoid log(0)
        probabilities = probabilities[probabilities > 0]
        return -np.sum(probabilities * np.log2(probabilities))

    def _impurity(self, y):
        """Compute impurity based on the chosen criterion."""
        if self.criterion == 'gini':
            return self._gini(y)
        return self._entropy(y)

    def _information_gain(self, y, left_idx, right_idx):
        """Compute information gain from a split."""
        parent_impurity = self._impurity(y)
        n = len(y)
        n_left, n_right = len(left_idx), len(right_idx)
        if n_left == 0 or n_right == 0:
            return 0
        child_impurity = (
            (n_left / n) * self._impurity(y[left_idx]) +
            (n_right / n) * self._impurity(y[right_idx])
        )
        return parent_impurity - child_impurity

    def _best_split(self, X, y):
        """Find the best feature and threshold to split on."""
        best_gain = -1
        best_feature = None
        best_threshold = None
        n_features = X.shape[1]
        for feature_idx in range(n_features):
            # Candidate thresholds: midpoints between sorted unique values
            values = np.sort(np.unique(X[:, feature_idx]))
            thresholds = (values[:-1] + values[1:]) / 2
            for threshold in thresholds:
                left_idx = np.where(X[:, feature_idx] <= threshold)[0]
                right_idx = np.where(X[:, feature_idx] > threshold)[0]
                gain = self._information_gain(y, left_idx, right_idx)
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold
        return best_feature, best_threshold, best_gain

    def _build_tree(self, X, y, depth=0):
        """Recursively build the decision tree."""
        n_samples = len(y)
        n_classes = len(np.unique(y))
        # Stopping conditions
        if (depth >= self.max_depth or
                n_classes == 1 or
                n_samples < self.min_samples_split):
            # Return a leaf with the majority class
            most_common = Counter(y).most_common(1)[0][0]
            return Node(value=most_common)
        # Find the best split
        feature, threshold, gain = self._best_split(X, y)
        if gain <= 0:
            most_common = Counter(y).most_common(1)[0][0]
            return Node(value=most_common)
        # Split the data
        left_idx = np.where(X[:, feature] <= threshold)[0]
        right_idx = np.where(X[:, feature] > threshold)[0]
        # Recursively build subtrees
        left = self._build_tree(X[left_idx], y[left_idx], depth + 1)
        right = self._build_tree(X[right_idx], y[right_idx], depth + 1)
        return Node(feature=feature, threshold=threshold,
                    left=left, right=right)

    def fit(self, X, y):
        """Build the decision tree from training data."""
        self.root = self._build_tree(X, y.astype(int))
        return self

    def _predict_single(self, x, node):
        """Traverse the tree for a single sample."""
        if node.is_leaf():
            return node.value
        if x[node.feature] <= node.threshold:
            return self._predict_single(x, node.left)
        return self._predict_single(x, node.right)

    def predict(self, X):
        """Predict class labels for all samples."""
        return np.array([self._predict_single(x, self.root) for x in X])
# ---- Test ----
np.random.seed(42)

# Create a simple dataset: XOR-like pattern
X = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1],
    [0.1, 0.1], [0.1, 0.9], [0.9, 0.1], [0.9, 0.9],
], dtype=float)
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])

tree = DecisionTreeClassifier(max_depth=5, criterion='gini')
tree.fit(X, y)
predictions = tree.predict(X)
accuracy = np.mean(predictions == y)
print(f"Training accuracy: {accuracy:.2%}")  # Should be 100%
print(f"Predictions: {predictions}")
print(f"Actual:      {y}")
What interviewers look for:
- Clean separation of the Node class and the tree-building logic
- Correct implementation of both Gini and entropy
- Proper stopping conditions (max depth, min samples, pure node, no gain)
- Using midpoints between sorted values as candidate thresholds
- Handling edge cases (empty splits, single-class nodes)
Common mistake: forgetting to handle the case where information_gain <= 0. Without this check, the tree keeps splitting even when no improvement is possible, leading to infinite recursion or nonsensical splits.
Interview Question #2: “Extend your decision tree to a random forest. Explain what bagging is and why random feature selection at each split reduces correlation between trees.”
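A useful talking point for this question: a bootstrap sample of size n contains on average only about 63.2% (1 - 1/e) of the distinct original rows, so each tree trains on a noticeably different view of the data, which is where the variance reduction of bagging comes from. A quick empirical sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
fractions = []
for _ in range(20):
    idx = rng.choice(n, size=n, replace=True)   # bootstrap: sample with replacement
    fractions.append(len(np.unique(idx)) / n)   # fraction of distinct rows drawn
print(np.mean(fractions))  # close to 1 - 1/e = 0.632
```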
Random Forest: Complete Solution
class RandomForestClassifier:
    """
    Random forest classifier using bagging and random feature subsets.
    Note: for simplicity, each tree here gets one fixed feature subset;
    scikit-learn instead resamples the feature subset at every split.
    """
    def __init__(self, n_trees=10, max_depth=10, min_samples_split=2,
                 max_features='sqrt', criterion='gini'):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features
        self.criterion = criterion
        self.trees = []
        self.feature_indices = []

    def _get_n_features(self, n_total):
        """Determine the number of features each tree considers."""
        if self.max_features == 'sqrt':
            return max(1, int(np.sqrt(n_total)))
        elif self.max_features == 'log2':
            return max(1, int(np.log2(n_total)))
        elif isinstance(self.max_features, int):
            return min(self.max_features, n_total)
        return n_total  # Use all features

    def _bootstrap_sample(self, X, y):
        """Create a bootstrap sample (sampling with replacement)."""
        n_samples = X.shape[0]
        indices = np.random.choice(n_samples, size=n_samples, replace=True)
        return X[indices], y[indices]

    def fit(self, X, y):
        """Train the random forest."""
        self.trees = []
        self.feature_indices = []
        n_total_features = X.shape[1]
        n_features = self._get_n_features(n_total_features)
        for _ in range(self.n_trees):
            # Bootstrap sample
            X_boot, y_boot = self._bootstrap_sample(X, y)
            # Random feature subset (fixed per tree; see class docstring)
            feat_idx = np.random.choice(n_total_features, size=n_features,
                                        replace=False)
            self.feature_indices.append(feat_idx)
            # Train a decision tree on the bootstrap sample with subset features
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                criterion=self.criterion,
            )
            tree.fit(X_boot[:, feat_idx], y_boot)
            self.trees.append(tree)
        return self

    def predict(self, X):
        """Predict using majority vote across all trees."""
        # Collect predictions from each tree; shape (n_trees, n_samples)
        all_preds = np.array([
            tree.predict(X[:, feat_idx])
            for tree, feat_idx in zip(self.trees, self.feature_indices)
        ])
        # Majority vote for each sample
        predictions = np.array([
            Counter(all_preds[:, i]).most_common(1)[0][0]
            for i in range(X.shape[0])
        ])
        return predictions
# ---- Test ----
np.random.seed(42)
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=5, random_state=42)

# Split manually (no sklearn train_test_split)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

rf = RandomForestClassifier(n_trees=20, max_depth=8, criterion='gini')
rf.fit(X_train, y_train)
train_acc = np.mean(rf.predict(X_train) == y_train)
test_acc = np.mean(rf.predict(X_test) == y_test)
print(f"Train accuracy: {train_acc:.2%}")
print(f"Test accuracy:  {test_acc:.2%}")
What interviewers look for:
- Understanding of bagging: bootstrap sampling with replacement reduces variance
- Understanding of random feature subsets: using sqrt(n_features) reduces correlation between trees
- Majority voting: aggregating predictions from multiple trees
- Knowing that random forests reduce overfitting compared to individual decision trees
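That last point is worth being able to demonstrate. The sketch below uses scikit-learn's reference implementations (not the from-scratch classes above, so run it in a separate session to avoid shadowing them) on noisy data, where a single unpruned tree memorizes the training set while the forest typically generalizes better:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Noisy dataset: flip_y=0.1 mislabels 10% of samples
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
split = 400
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unpruned tree
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("tree   train/test:", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("forest train/test:", forest.score(X_tr, y_tr), forest.score(X_te, y_te))
# Typically the tree fits train perfectly but the forest scores higher on test
```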
Interview Question #3: “How would you implement pruning for your decision tree? Explain the difference between pre-pruning and post-pruning.”
Pruning Strategies
Pre-Pruning (Already Implemented Above)
Pre-pruning stops tree growth early using constraints:
- max_depth — limits the depth of the tree
- min_samples_split — minimum samples required to split a node
- min_information_gain — minimum gain required to make a split
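The effect of these constraints is easy to see. The sketch below uses scikit-learn's tree as a stand-in for the from-scratch class (which takes the same parameter names), comparing an unconstrained tree with a pre-pruned one:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10,
                                random_state=0).fit(X, y)

print("unconstrained depth:", full.get_depth(), "leaves:", full.get_n_leaves())
print("pre-pruned depth:   ", pruned.get_depth(), "leaves:", pruned.get_n_leaves())
```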
Post-Pruning (Reduced Error Pruning)
def prune(self, node, X_val, y_val):
    """
    Post-pruning (reduced error pruning) using a validation set.
    Replace subtrees that don't improve validation accuracy.
    """
    if node.is_leaf():
        return node
    # Internal nodes always have a feature and threshold, so we can
    # determine which validation samples go left and which go right
    left_mask = X_val[:, node.feature] <= node.threshold
    right_mask = ~left_mask
    # Recursively prune the children first (bottom-up)
    if np.sum(left_mask) > 0:
        node.left = self.prune(node.left,
                               X_val[left_mask], y_val[left_mask])
    if np.sum(right_mask) > 0:
        node.right = self.prune(node.right,
                                X_val[right_mask], y_val[right_mask])
    # Check if replacing this subtree with a leaf improves accuracy
    if len(y_val) > 0:
        # Current subtree accuracy
        current_preds = np.array([self._predict_single(x, node)
                                  for x in X_val])
        current_acc = np.mean(current_preds == y_val)
        # Leaf accuracy (predicting the majority class)
        majority_class = Counter(y_val).most_common(1)[0][0]
        leaf_acc = np.mean(majority_class == y_val)
        # Replace with a leaf if it is at least as accurate
        if leaf_acc >= current_acc:
            return Node(value=majority_class)
    return node
Interview tip: When discussing pruning, always mention the bias-variance tradeoff. Pre-pruning is faster but may stop too early (underfitting). Post-pruning grows the full tree first and then removes branches, which is more thorough but computationally expensive. In practice, libraries such as scikit-learn use cost-complexity pruning (the ccp_alpha parameter).
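scikit-learn's cost-complexity pruning works by computing the sequence of effective alphas at which successive subtrees get pruned away (weakest-link pruning), then refitting with a chosen ccp_alpha. A minimal sketch of that API:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# cost_complexity_pruning_path returns the candidate ccp_alpha values
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Larger ccp_alpha -> more aggressive pruning -> fewer leaves
for alpha in (path.ccp_alphas[0], path.ccp_alphas[-1]):
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"ccp_alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}")
```

In practice, ccp_alpha is tuned by cross-validating over the values in path.ccp_alphas.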
Lilly Tech Systems