Implementation Tips
The difference between a passing and a failing ML coding interview often comes down to numerical stability, thorough testing, and avoiding a handful of common pitfalls. This lesson covers the most frequent implementation mistakes, debugging strategies, and a comprehensive FAQ.
Top 10 Common Mistakes
These are the mistakes that cause candidates to fail ML coding interviews most frequently:
import numpy as np
# ---- MISTAKE 1: Forgetting to clip before log ----
# BAD:
loss = -np.mean(y * np.log(probs)) # log(0) = -inf!
# GOOD:
loss = -np.mean(y * np.log(np.clip(probs, 1e-15, 1.0)))
# ---- MISTAKE 2: Unstable softmax ----
# BAD:
softmax = np.exp(z) / np.sum(np.exp(z)) # overflow for large z!
# GOOD:
z_shifted = z - np.max(z, axis=1, keepdims=True)
softmax = np.exp(z_shifted) / np.sum(np.exp(z_shifted), axis=1, keepdims=True)
# ---- MISTAKE 3: Not scaling features ----
# BAD:
model.fit(X_raw, y) # Features with different scales -> slow convergence
# GOOD:
X_scaled = (X_raw - X_raw.mean(axis=0)) / (X_raw.std(axis=0) + 1e-8)
model.fit(X_scaled, y)
# ---- MISTAKE 4: Wrong axis in aggregation ----
# For X of shape (n_samples, n_features):
mean_per_feature = X.mean(axis=0) # Shape: (n_features,)
mean_per_sample = X.mean(axis=1) # Shape: (n_samples,)
# Confusing these is a very common bug
# ---- MISTAKE 5: Forgetting to center data in PCA ----
# BAD:
cov = X.T @ X / n # Not centered!
# GOOD:
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (n - 1)
# ---- MISTAKE 6: Division by n vs n-1 in variance ----
# Population variance (biased): divide by n
# Sample variance (unbiased): divide by n-1
# NumPy default: np.var uses n (ddof=0), np.cov uses n-1
# For ML: usually doesn't matter, but know the difference
# ---- MISTAKE 7: Regularizing the bias term ----
# BAD:
loss = mse + alpha * np.sum(np.concatenate([weights, [bias]]) ** 2)
# GOOD:
loss = mse + alpha * np.sum(weights ** 2) # Don't regularize bias!
# ---- MISTAKE 8: Wrong gradient sign ----
# We MINIMIZE loss, so we SUBTRACT the gradient:
weights -= learning_rate * gradient # Correct
weights += learning_rate * gradient # WRONG - gradient ascent!
# ---- MISTAKE 9: Not handling empty clusters in K-means ----
# BAD:
centroids[k] = X[labels == k].mean(axis=0) # Error if cluster is empty!
# GOOD:
cluster_points = X[labels == k]
if len(cluster_points) > 0:
    centroids[k] = cluster_points.mean(axis=0)
else:
    centroids[k] = X[np.random.randint(len(X))]
# ---- MISTAKE 10: Mutating input arrays ----
# BAD:
def fit(self, X, y):
    X -= X.mean(axis=0)  # Modifies caller's data!
# GOOD:
def fit(self, X, y):
    X_centered = X - X.mean(axis=0)  # Create new array
Numerical Stability Cheat Sheet
| Problem | Cause | Solution |
|---|---|---|
| log(0) = -inf | Probability exactly 0 | np.log(np.clip(p, 1e-15, 1.0)) |
| exp overflow | Large input to softmax/sigmoid | Subtract max: exp(x - max(x)) |
| Division by zero | Zero variance, empty cluster | Add epsilon: x / (norm + 1e-8) |
| Vanishing gradients | Deep networks, sigmoid activation | Use ReLU, He initialization, batch norm |
| Exploding gradients | Large learning rate, bad init | Gradient clipping: np.clip(grad, -5, 5) |
| Matrix inversion | Singular or ill-conditioned matrix | Use np.linalg.solve or add regularization |
| Negative in sqrt | Float errors in distance computation | np.sqrt(np.maximum(x, 0)) |
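The first few rows of the table can be packaged as small helpers. A minimal sketch; the names `stable_sigmoid` and `stable_log` are illustrative, not from any library:

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid without exp overflow: use the exp(z) form when z < 0."""
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # exp(-z) <= 1 here
    ez = np.exp(z[~pos])                       # safe: z < 0 so exp(z) <= 1
    out[~pos] = ez / (1.0 + ez)
    return out

def stable_log(p, eps=1e-15):
    """log with clipping so p = 0 never produces -inf."""
    return np.log(np.clip(p, eps, 1.0))

z = np.array([-1000.0, 0.0, 1000.0])
s = stable_sigmoid(z)   # finite everywhere, no overflow
```

The branch on the sign of `z` ensures `np.exp` is only ever called on non-positive arguments, so it can underflow to 0 but never overflow.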
Gradient Checking
Gradient checking is the most important debugging technique for ML implementations. It verifies your analytical gradient against a numerical approximation.
def gradient_check(model, X, y, epsilon=1e-7):
    """Verify analytical gradients using finite differences.

    For each parameter theta_i:
        numerical_grad = (L(theta_i + eps) - L(theta_i - eps)) / (2 * eps)

    Compare with the analytical gradient. Relative error should be < 1e-5.
    """
    # Get analytical gradients
    y_pred = model._forward(X)
    loss, y_onehot = model._compute_loss(y_pred, y)
    model._backward(y_onehot)
    for layer_idx in range(len(model.weights)):
        W = model.weights[layer_idx]
        dW_analytical = model.dW[layer_idx]
        # Check a random subset of parameters
        n_checks = min(5, W.size)
        indices = np.random.choice(W.size, n_checks, replace=False)
        for idx in indices:
            # Unravel flat index into an array position
            pos = np.unravel_index(idx, W.shape)
            # Compute numerical gradient (central difference)
            original = W[pos]
            W[pos] = original + epsilon
            y_pred_plus = model._forward(X)
            loss_plus, _ = model._compute_loss(y_pred_plus, y)
            W[pos] = original - epsilon
            y_pred_minus = model._forward(X)
            loss_minus, _ = model._compute_loss(y_pred_minus, y)
            W[pos] = original  # Restore
            numerical = (loss_plus - loss_minus) / (2 * epsilon)
            analytical = dW_analytical[pos]
            # Relative error
            denom = max(abs(numerical), abs(analytical), 1e-8)
            rel_error = abs(numerical - analytical) / denom
            status = "PASS" if rel_error < 1e-5 else "FAIL"
            print(f"Layer {layer_idx} {pos}: "
                  f"analytical={analytical:.8f}, "
                  f"numerical={numerical:.8f}, "
                  f"rel_error={rel_error:.2e} [{status}]")
# Usage:
# gradient_check(nn, X_train[:10], y_train[:10])
Testing Your Implementations
# ---- Testing Strategy for ML Implementations ----
# 1. SANITY CHECK: Can it overfit a tiny dataset?
X_tiny = np.array([[1, 0], [0, 1], [-1, 0], [0, -1]])
y_tiny = np.array([0, 1, 0, 1])
model.fit(X_tiny, y_tiny)
assert model.accuracy(X_tiny, y_tiny) == 1.0, "Should perfectly fit 4 points"
# 2. KNOWN SOLUTION: Test against analytically known answer
# Linear regression: fit y = 3x + 2
X_known = np.array([[1], [2], [3], [4], [5]], dtype=float)
y_known = np.array([5, 8, 11, 14, 17], dtype=float) # y = 3x + 2
model.fit(X_known, y_known)
assert abs(model.weights[0] - 3.0) < 0.01, f"Weight should be ~3, got {model.weights[0]}"
assert abs(model.bias - 2.0) < 0.01, f"Bias should be ~2, got {model.bias}"
# 3. LOSS DECREASING: Training loss should monotonically decrease (for full batch)
model.fit(X_large, y_large)
for i in range(1, len(model.loss_history)):
    assert model.loss_history[i] <= model.loss_history[i-1] + 1e-10, \
        f"Loss increased at iteration {i}"
# 4. COMPARE WITH SKLEARN: Your implementation should match
from sklearn.linear_model import LinearRegression as SklearnLR
sklearn_model = SklearnLR().fit(X, y)
my_model = LinearRegressionNormal().fit(X, y)
assert np.allclose(sklearn_model.coef_, my_model.weights, atol=1e-6)
# 5. EDGE CASES: Test boundary conditions
# - Single sample
# - Single feature
# - All same labels (for classifier)
# - Perfectly correlated features
# - Very large or very small feature values
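The edge cases listed above can be exercised with a small harness. A sketch under the fit/predict convention; `check_edge_cases` and the trivial `MajorityClassifier` used to demo it are hypothetical names for illustration:

```python
import numpy as np

class MajorityClassifier:
    """Trivial baseline classifier, used here only to demo the harness."""
    def fit(self, X, y):
        vals, counts = np.unique(y, return_counts=True)
        self.majority_ = vals[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

def check_edge_cases(make_model):
    """Fit/predict on degenerate inputs; an exception means a bug was found."""
    rng = np.random.default_rng(0)
    cases = {
        "single sample": (np.array([[1.0, 2.0]]), np.array([0])),
        "single feature": (np.array([[1.0], [2.0], [3.0]]), np.array([0, 1, 0])),
        "all same label": (rng.random((10, 2)), np.zeros(10, dtype=int)),
        "huge values": (1e9 * rng.random((10, 2)), np.array([0, 1] * 5)),
    }
    for name, (X, y) in cases.items():
        model = make_model()
        model.fit(X, y)
        preds = model.predict(X)
        assert preds.shape == (len(X),), f"{name}: wrong output shape"

check_edge_cases(MajorityClassifier)
```

Passing a factory (the class itself) rather than an instance ensures each case starts from a fresh, unfitted model.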
Performance Quick Reference
# ---- Vectorization Patterns for ML ----
# Distance matrix (n x m pairwise distances)
# SLOW: double loop
# FAST:
dists = np.sqrt(np.sum((X[:, None] - Y[None]) ** 2, axis=2))
# FASTEST (for Euclidean; clamp tiny float-error negatives before sqrt):
sq_dists = (np.sum(X**2, axis=1, keepdims=True) +
            np.sum(Y**2, axis=1) - 2 * X @ Y.T)
dists = np.sqrt(np.maximum(sq_dists, 0))
# One-hot encoding
# SLOW: loop
# FAST:
one_hot = np.zeros((n, n_classes))
one_hot[np.arange(n), labels] = 1
# Batch matrix multiply
# SLOW: loop over batch
# FAST:
result = np.einsum('bij,bjk->bik', A_batch, B_batch)
# Weighted sum
# SLOW: loop
# FAST:
weighted = np.sum(weights[:, None] * values, axis=0)
# Top-k selection
# SLOW: full sort
# FAST:
top_k_idx = np.argpartition(scores, -k)[-k:] # O(n) vs O(n log n)
Frequently Asked Questions
Q: Which algorithms are asked most often?
In order of frequency: (1) logistic regression, the most commonly asked; it tests the sigmoid, cross-entropy, and gradient descent. (2) Linear regression, a simpler variant often used as a warm-up. (3) K-means clustering, which tests iterative optimization. (4) Decision trees, which test recursion and information theory. (5) KNN, which tests distance computation and vectorization. (6) Neural networks, the hardest, reserved for ML-heavy roles. (7) PCA, which tests linear algebra knowledge. Focus your preparation on logistic regression and K-means first.
Q: How long is the coding portion, and how complete should the solution be?
Typically 30-45 minutes. For simpler algorithms (linear/logistic regression, KNN), the interviewer expects a complete, working implementation. For complex algorithms (neural networks, decision trees), they may ask for key components only (e.g., "implement the backpropagation step" rather than the full network). Budget your time: 5 minutes for clarifying questions and design, 20-25 minutes for implementation, and 5-10 minutes for testing and edge cases.
Q: Should I write a class or a standalone function?
Use classes with fit() and predict() methods, following the scikit-learn API pattern. This shows the interviewer you write production-quality code. However, if the interviewer asks only for a specific function (e.g., "implement gradient descent"), a function is fine. When in doubt, ask: "Would you like a full class implementation or just the core function?" The class approach is almost always preferred because it encapsulates state (weights, biases, loss history) cleanly.
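A bare-bones version of that scikit-learn-style pattern might look like this; the class name and hyperparameter defaults are illustrative:

```python
import numpy as np

class LinearRegressionGD:
    """Minimal fit/predict estimator in the scikit-learn style."""

    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = 0.0
        self.loss_history = []

    def fit(self, X, y):
        n, d = X.shape
        self.weights = np.zeros(d)
        for _ in range(self.n_iters):
            residual = X @ self.weights + self.bias - y
            self.loss_history.append(0.5 * np.mean(residual ** 2))
            self.weights -= self.lr * (X.T @ residual) / n
            self.bias -= self.lr * residual.mean()
        return self  # enables chained model.fit(X, y).predict(X)

    def predict(self, X):
        return X @ self.weights + self.bias
```

Returning `self` from `fit` allows chained calls, matching the scikit-learn convention; keeping `loss_history` on the instance makes the "loss is decreasing" check trivial.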
Q: What if I can't remember the gradient formula?
Derive it on the spot. This actually impresses interviewers more than reciting a memorized formula. For MSE loss with linear regression: (1) write the loss L = (1/2n) sum(Xw - y)^2, (2) expand and take the derivative with respect to w, (3) you get dL/dw = (1/n) X.T(Xw - y). For cross-entropy with logistic regression the derivation is harder, but the result has the same form: dL/dw = (1/n) X.T(sigmoid(Xw) - y). If you cannot derive it, state the formula and say you would verify it with gradient checking.
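That derived formula can be sanity-checked numerically in a few lines, which is exactly the gradient-checking idea. A sketch with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

def loss(w):
    # L = (1/2n) * ||Xw - y||^2
    return 0.5 * np.mean((X @ w - y) ** 2)

# Analytical gradient from the derivation: dL/dw = (1/n) X.T (Xw - y)
analytical = X.T @ (X @ w - y) / len(y)

# Central finite differences, one coordinate at a time
eps = 1e-6
numerical = np.zeros_like(w)
for i in range(len(w)):
    e = np.zeros_like(w)
    e[i] = eps
    numerical[i] = (loss(w + e) - loss(w - e)) / (2 * eps)

max_diff = np.max(np.abs(analytical - numerical))
```

Because the loss is quadratic in w, the central difference is exact up to floating-point error, so the two gradients should agree to many decimal places.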
Q: What if I'm asked about an algorithm I haven't practiced?
Break it down into components you know. Every ML algorithm has: (1) a loss function, (2) a way to compute predictions, and (3) a way to update parameters. Ask the interviewer to clarify the loss function and prediction formula; from there, the gradient is mechanical. For tree-based algorithms, the core idea is always: find the best split by trying all features and thresholds, measure split quality with entropy or Gini, and recurse. For clustering: assign points, update centers, repeat. Most algorithms follow these patterns.
Q: When should I use the closed-form solution versus gradient descent?
Use the closed form (normal equation) when: (1) the number of features d is small (under ~10,000), (2) you want an exact solution, and (3) the matrix X.T @ X is invertible. Use gradient descent when: (1) there are many features (d > 10,000), (2) the problem has no closed-form solution (logistic regression, neural networks), or (3) the data is too large to fit in memory (use mini-batch SGD). In interviews, implement both if asked about linear regression and explain the trade-offs: the normal equation is O(d^3) for the solve but exact; gradient descent is iterative but scales to large d.
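The two approaches can be compared directly on synthetic data. A sketch; the dimensions, learning rate, and iteration count are chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

# Closed form: solve the normal equations (solve, never invert explicitly)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the same MSE loss
w_gd = np.zeros(3)
lr = 0.1
for _ in range(2000):
    w_gd -= lr * X.T @ (X @ w_gd - y) / len(y)

# On this small, well-conditioned problem the two agree closely
```

Note the use of `np.linalg.solve` rather than `np.linalg.inv`, as recommended in the stability cheat sheet.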
Q: What do interviewers look for beyond a working algorithm?
Numerical stability is the number one differentiator. Any candidate can write the basic algorithm, but handling edge cases correctly separates strong candidates from weak ones. Specifically: (1) clipping probabilities before log, (2) subtracting the max before exp in softmax, (3) adding epsilon to denominators, (4) using np.linalg.solve instead of np.linalg.inv, (5) handling empty clusters in K-means, and (6) proper weight initialization in neural networks. After stability, interviewers look for vectorized code (no loops over samples) and clear code organization (the fit/predict pattern).
Q: How can I quickly verify my implementation during the interview?
Three quick verification strategies: (1) Overfit a tiny dataset: if your model cannot perfectly fit 4-5 training points, something is wrong. (2) Check that the loss is decreasing: print the loss every 100 iterations and verify it goes down; if it increases, the learning rate is too high or there is a gradient bug. (3) Known solution: create data with a known relationship (y = 3x + 2) and verify your model recovers the coefficients. Mention gradient checking as a technique you would use for debugging, even if you do not have time to implement it during the interview.
Algorithm Quick Reference
# ---- Complete Algorithm Cheat Sheet ----
# LINEAR REGRESSION
# Loss: L = (1/2n) * ||Xw + b - y||^2
# Gradient: dw = (1/n) * X.T @ (Xw + b - y), db = mean(Xw + b - y)
# Normal eq: w = (X.T X)^-1 X.T y
# LOGISTIC REGRESSION (Binary)
# Sigmoid: s(z) = 1 / (1 + exp(-z))
# Loss: L = -(1/n) * [y*log(s) + (1-y)*log(1-s)]
# Gradient: dw = (1/n) * X.T @ (s(Xw+b) - y)
# SOFTMAX (Multi-class)
# Softmax: p_k = exp(z_k - max(z)) / sum(exp(z_j - max(z)))
# Loss: L = -(1/n) * sum(y_onehot * log(p))
# Gradient: dW = (1/n) * X.T @ (p - y_onehot)
# DECISION TREE
# Entropy: H = -sum(p_k * log2(p_k))
# Gini: G = 1 - sum(p_k^2)
# Information Gain: IG = H(parent) - weighted_avg(H(children))
# K-MEANS
# Assign: labels[i] = argmin_k ||x_i - centroid_k||^2
# Update: centroid_k = mean(x_i where labels[i] == k)
# K-means++: next centroid ~ P(x) proportional to D(x)^2
# KNN
# Distance: d(a,b) = sqrt(sum((a-b)^2)) or other metric
# Predict: majority vote of K nearest training points
# NEURAL NETWORK
# Forward: z = a_prev @ W + b, a = activation(z)
# Backward: delta = (delta_next @ W_next.T) * act'(z)
# Weight grad: dW = a_prev.T @ delta / n
# PCA
# Center: X_c = X - mean(X)
# Covariance: C = X_c.T @ X_c / (n-1)
# Eigendecompose: eigenvalues, eigenvectors = eigh(C)
# Project: X_reduced = X_c @ top_k_eigenvectors
# ADAM OPTIMIZER
# m = beta1*m + (1-beta1)*g
# v = beta2*v + (1-beta2)*g^2
# m_hat = m / (1 - beta1^t)
# v_hat = v / (1 - beta2^t)
# w -= lr * m_hat / (sqrt(v_hat) + eps)
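The Adam update lines above translate almost one-to-one into code. A minimal sketch for a single parameter array; `adam_step` is an illustrative name:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count. Returns (w, m, v)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for zero-initialized m
    v_hat = v / (1 - beta2 ** t)   # bias correction for zero-initialized v
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
# w is driven close to the optimum at 0
```

The bias-correction terms matter most in early steps, when m and v are still dominated by their zero initialization.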
Key Takeaways
- Numerical stability is the #1 differentiator: clip, shift, add epsilon
- Always verify with gradient checking (finite differences vs analytical)
- Test on tiny datasets first: overfit 4-5 points, check loss decreases
- Vectorize everything: no loops over samples, use matrix operations
- Follow the fit/predict pattern for clean, interview-ready code
- Know when to use closed-form vs gradient descent
- Never regularize the bias term
- Never mutate input arrays