ML Theory Deep Dive
These are the 30 most frequently asked ML theory questions in interviews, organized by topic. Each includes a model answer that demonstrates the depth interviewers expect — not textbook definitions, but intuitive explanations that show you truly understand the concepts.
Bias-Variance & Overfitting (Questions 1–6)
Q1: Explain the bias-variance tradeoff.
Model answer: "Bias is the error from wrong assumptions in the model — a linear model trying to fit a quadratic relationship has high bias. Variance is the error from sensitivity to small fluctuations in the training data — a 100-degree polynomial that fits every noise point has high variance. The tradeoff means that reducing one typically increases the other. A simple model (high bias, low variance) consistently makes the same mistakes. A complex model (low bias, high variance) gets the training data right but falls apart on new data. The goal is to find the sweet spot where total error (bias² + variance) is minimized."
Q2: How do you detect overfitting?
Model answer: "The primary signal is a large gap between training and validation performance. If training accuracy is 99% but validation accuracy is 75%, the model is memorizing rather than learning. I also look at learning curves: if training loss keeps decreasing but validation loss starts increasing, that is the overfitting point. In practice, I use k-fold cross-validation to get a more stable estimate of generalization performance."
Q3: What techniques reduce overfitting?
Model answer:
- More data — The most effective solution. Overfitting happens when the model has more capacity than the data can constrain.
- Regularization (L1/L2) — Penalizes large weights, forcing the model to use simpler patterns.
- Dropout — Randomly deactivates neurons during training, preventing co-adaptation.
- Early stopping — Stop training when validation loss stops improving.
- Data augmentation — Artificially expand training data through transformations.
- Reduce model complexity — Fewer layers, smaller trees, lower polynomial degree.
- Ensemble methods — Combine multiple models to average out individual overfitting.
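Of the techniques above, early stopping is the easiest to sketch in code. The following is a minimal, framework-free illustration (the loss values and the `patience=3` setting are made up for the example), not a production training loop:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the best epoch once validation loss has failed to
    improve for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has stopped improving
    return best_epoch

# Validation loss improves through epoch 3, then climbs: stop there.
print(early_stop_epoch([1.0, 0.8, 0.7, 0.65, 0.7, 0.75, 0.8]))  # → 3
```

In a real training loop you would also checkpoint the model weights at the best epoch and restore them when stopping.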
Q4: What is the difference between L1 and L2 regularization?
Model answer: "L1 (Lasso) adds the sum of absolute weights to the loss. L2 (Ridge) adds the sum of squared weights. The critical difference is that L1 drives weights to exactly zero, effectively performing feature selection — it produces sparse models. L2 shrinks weights toward zero but never makes them exactly zero. Use L1 when you suspect many features are irrelevant and want automatic feature selection. Use L2 when all features might contribute and you want to prevent any single feature from dominating. In practice, Elastic Net combines both."
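The "exactly zero vs. shrunk toward zero" distinction can be seen in one proximal-gradient step on a hypothetical weight vector (the weights and `lam` below are invented for illustration):

```python
import numpy as np

w = np.array([0.80, 0.05, -0.03, 1.20])  # hypothetical learned weights
lam = 0.10                               # regularization strength

# One L2 (ridge) step shrinks every weight multiplicatively:
# weights get smaller but never become exactly zero.
w_l2 = w / (1 + lam)

# One L1 (lasso) step soft-thresholds: any weight with magnitude
# below lam is set to exactly zero (this is the source of sparsity).
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

print(w_l2)  # all four weights remain nonzero
print(w_l1)  # the two small weights are now exactly 0.0
```

This is why L1 performs implicit feature selection: its penalty has a constant-magnitude pull that can push small weights all the way to zero, while L2's pull is proportional to the weight and vanishes as the weight shrinks.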
Q5: What is underfitting and how do you fix it?
Model answer: "Underfitting occurs when the model is too simple to capture the patterns in the data. Both training and validation performance are poor. Fixes include: increasing model complexity (more layers, higher polynomial degree), adding more features, reducing regularization strength, and training for more epochs. The learning curve for an underfitting model shows both training and validation errors converging at a high value."
Q6: Explain the bias-variance decomposition of MSE.
Model answer: "For any point, the expected MSE can be decomposed as: MSE = Bias² + Variance + Irreducible Error. Bias² is the squared difference between the expected prediction and the true value. Variance is the expected squared deviation of predictions across different training sets. Irreducible error is the noise inherent in the data that no model can eliminate. This decomposition is fundamental because it tells us exactly why our model is failing and which direction to move."
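The decomposition can be verified numerically. The sketch below uses a deliberately biased estimator (the sample mean shrunk by a made-up factor of 0.8) so that all three terms are nonzero; the specific numbers are illustrative, not canonical:

```python
import numpy as np

rng = np.random.default_rng(0)
true_y, noise_sd = 2.0, 1.0
n, trials = 5, 50_000

# Estimator: shrink the sample mean toward zero (deliberately biased).
train = rng.normal(true_y, noise_sd, size=(trials, n))
preds = 0.8 * train.mean(axis=1)

# Fresh test observations drawn from the same noisy process.
test_y = rng.normal(true_y, noise_sd, size=trials)

mse = np.mean((preds - test_y) ** 2)
bias2 = (preds.mean() - true_y) ** 2
variance = preds.var()
irreducible = noise_sd ** 2

# The three components sum to the observed MSE (up to sampling noise).
print(mse, bias2 + variance + irreducible)
```

Running this shows the measured MSE matching bias² + variance + σ² to within simulation error, which is the decomposition stated above.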
Model Selection & Evaluation (Questions 7–14)
Q7: When would you use precision vs. recall?
Model answer: "Use precision when false positives are expensive — spam detection (you do not want legitimate emails in spam). Use recall when false negatives are expensive — cancer screening (you do not want to miss a positive case). In fraud detection, I typically optimize for recall first (catch all fraud) then tune the threshold to bring precision to an acceptable level. F1 score is the harmonic mean when you need to balance both."
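The three metrics come straight from the confusion-matrix counts. A minimal sketch (the fraud counts below are invented for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)          # of flagged cases, how many were real
    recall = tp / (tp + fn)             # of real cases, how many were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical fraud model: 80 frauds caught, 20 false alarms, 20 missed
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(p, r, f1)
```

Moving the classification threshold trades one count against the other: lowering it converts false negatives into true positives (recall up) at the cost of more false positives (precision down).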
Q8: Explain AUC-ROC and when to use it.
Model answer: "AUC-ROC plots the true positive rate against the false positive rate at every classification threshold. The AUC (area under the curve) measures the probability that the model ranks a random positive example higher than a random negative example. A perfect model has AUC = 1.0, random guessing gives 0.5. Use AUC-ROC when you want to evaluate the model across all thresholds, not just at a specific operating point. However, AUC-ROC can be misleading with heavily imbalanced datasets — in those cases, precision-recall AUC (PR-AUC) is more informative."
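The ranking interpretation of AUC can be computed directly by comparing every positive/negative pair, which is a useful sanity check against library implementations (the toy labels and scores below are made up):

```python
import numpy as np

def auc_roc(y_true, scores):
    """AUC as the probability that a random positive is ranked above
    a random negative (ties count half)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # all positive/negative pairs
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

y = np.array([0, 0, 1, 1])
s = np.array([0.10, 0.40, 0.35, 0.80])
print(auc_roc(y, s))  # → 0.75  (3 of the 4 pairs are ranked correctly)
```

This O(P·N) pairwise form is only for intuition; real implementations sort the scores once and compute the same quantity in O(n log n).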
Q9: What is cross-validation and why use k-fold?
Model answer: "K-fold cross-validation splits the data into k parts, trains on k-1 folds, and validates on the remaining fold, rotating through all k combinations. The average performance across folds gives a robust estimate of generalization. Use it when data is limited and a single train/test split would be unreliable. Standard k=5 or k=10. Stratified k-fold preserves class distribution in each fold — essential for imbalanced datasets. Time series data requires time-based splits (never train on future data to predict the past)."
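The fold rotation is simple to sketch without a library (this is a plain, unstratified split; the sample count is arbitrary):

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)       # k nearly equal chunks
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

# Every sample lands in exactly one validation fold.
splits = list(kfold_indices(23, k=5))
print([len(v) for _, v in splits])  # fold sizes: [5, 5, 5, 4, 4]
```

For imbalanced data you would stratify the permutation within each class, and for time series you would replace the shuffle with expanding-window splits, as noted above.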
Q10: How do you handle imbalanced datasets?
Model answer:
- Data level: Oversampling (SMOTE), undersampling, or combination (SMOTE + Tomek links)
- Algorithm level: Class weights (class_weight='balanced'), focal loss, cost-sensitive learning
- Evaluation level: Use PR-AUC, F1, or Matthews correlation coefficient instead of accuracy
- Ensemble level: BalancedRandomForest, EasyEnsemble
"The best approach depends on the degree of imbalance. For 90/10 splits, class weights usually suffice. For 99/1 splits, I combine SMOTE with an appropriate loss function. Always evaluate on the original (not resampled) distribution."
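The class-weight approach is one line of arithmetic. This sketch reproduces the formula scikit-learn uses for class_weight='balanced' (n_samples / (n_classes × count per class)), on an invented 90/10 split:

```python
import numpy as np

def balanced_class_weights(y):
    """Weights as in scikit-learn's class_weight='balanced':
    n_samples / (n_classes * count_per_class)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# 90/10 imbalance: each minority example counts 9x as much as a
# majority example in the loss.
y = np.array([0] * 90 + [1] * 10)
print(balanced_class_weights(y))  # {0: 0.555..., 1: 5.0}
```

The weights are inversely proportional to class frequency, so each class contributes equally to the total loss without touching the data itself.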
Q11: When would you choose a random forest over gradient boosting?
Model answer: "Random forests when I need: fast training with minimal hyperparameter tuning, robustness to outliers, or parallel training. Gradient boosting (XGBoost, LightGBM) when I need: maximum predictive accuracy and I am willing to spend time tuning hyperparameters. In practice, gradient boosting wins most tabular data competitions, but random forests are more forgiving in production where data distribution shifts over time — they degrade more gracefully."
Q12: Explain the difference between generative and discriminative models.
Model answer: "Discriminative models learn the decision boundary directly — P(y|x). Examples: logistic regression, SVM, neural networks. They answer 'what class is this?' Generative models learn the joint distribution P(x,y) or P(x|y) — they model how the data was generated. Examples: Naive Bayes, GANs, VAEs. They can generate new data points and handle missing features naturally. Discriminative models typically achieve higher classification accuracy because they focus all capacity on the decision boundary."
Q13: What is the curse of dimensionality?
Model answer: "As dimensions increase, the volume of the space grows exponentially, making data increasingly sparse. This has practical consequences: distance metrics become meaningless (all points are roughly equidistant), the amount of data needed to cover the space grows exponentially, and models overfit because there are too many features relative to samples. Solutions: dimensionality reduction (PCA, t-SNE), feature selection, regularization, and domain knowledge to select relevant features."
Q14: How do you select features for a model?
Model answer: "Three approaches: (1) Filter methods: statistical tests (correlation, mutual information, chi-squared) ranked by importance — fast but ignores feature interactions. (2) Wrapper methods: forward/backward selection, recursive feature elimination — considers interactions but expensive. (3) Embedded methods: L1 regularization, tree-based feature importance — best balance of speed and quality. In practice, I start with tree-based importance to identify candidates, then validate with permutation importance on a held-out set."
Deep Learning (Questions 15–22)
Q15: How does backpropagation work?
Model answer: "Backpropagation computes the gradient of the loss with respect to each weight by applying the chain rule of calculus backwards through the network. In the forward pass, we compute the output. In the backward pass, we start from the loss and propagate gradients layer by layer, multiplying by the local gradient at each step. These gradients tell us how to adjust each weight to reduce the loss. The key insight is that intermediate gradients are reused — we compute them once and share them, making the computation efficient."
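The chain rule mechanics fit in a few lines for a toy two-weight network (the values of w1, w2, x, y are arbitrary; the finite-difference check at the end is the standard way to verify a hand-derived gradient):

```python
def forward(w1, w2, x):
    h = max(0.0, w1 * x)      # hidden unit: ReLU(w1 * x)
    return w2 * h             # output

def loss(w1, w2, x, y):
    return 0.5 * (forward(w1, w2, x) - y) ** 2

# Backward pass by the chain rule (w1 * x > 0 here, so ReLU is active).
w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
h = w1 * x
out = w2 * h
d_out = out - y               # dL/d_out
d_w2 = d_out * h              # dL/dw2 = dL/d_out * d_out/dw2
d_h = d_out * w2              # gradient flows back through the multiply
d_w1 = d_h * x                # dL/dw1 (ReLU derivative is 1 when active)

# Cross-check the analytic gradients against finite differences.
eps = 1e-6
num_w1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
num_w2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
print(d_w1, num_w1)  # analytic ≈ numeric
print(d_w2, num_w2)
```

Note how d_out is computed once and reused for both d_w2 and d_h: that reuse of intermediate gradients is the efficiency insight mentioned above, and it is what autodiff frameworks systematize.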
Q16: Why do we need activation functions?
Model answer: "Without activation functions, a neural network is just a series of linear transformations, which collapses to a single linear transformation regardless of depth. Activation functions introduce nonlinearity, allowing the network to learn complex, nonlinear patterns. ReLU is the most common because it is computationally efficient and avoids the vanishing gradient problem that plagues sigmoid and tanh in deep networks."
Q17: Explain the vanishing gradient problem.
Model answer: "In deep networks with sigmoid or tanh activations, gradients are multiplied through many layers during backpropagation. Since the derivative of sigmoid is at most 0.25, gradients shrink exponentially as they flow backward — by the time they reach early layers, they are essentially zero, and those layers stop learning. Solutions: ReLU activation (gradient is either 0 or 1), residual connections (skip connections that provide a gradient highway), batch/layer normalization, and careful weight initialization (Xavier, He)."
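The exponential shrinkage is easy to see numerically. The sketch below takes the best case for sigmoid (its derivative peaks at 0.25, at z = 0) and multiplies across a hypothetical 20-layer stack:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

depth = 20
# Best case for sigmoid: derivative at z=0 is its maximum, 0.25.
sig_grad = np.prod([sigmoid(0) * (1 - sigmoid(0))] * depth)
# ReLU in its active region passes the gradient through unchanged.
relu_grad = np.prod([1.0] * depth)

print(sig_grad)   # ~9e-13: early layers receive essentially no signal
print(relu_grad)  # 1.0
```

Even in this best case the sigmoid gradient reaching layer 1 is about 10⁻¹², which is why deep pre-ResNet networks with saturating activations were so hard to train.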
Q18: What is batch normalization and why does it help?
Model answer: "Batch normalization normalizes the inputs to each layer to have zero mean and unit variance across the mini-batch, then applies learnable scale and shift parameters. It helps by: (1) reducing internal covariate shift (each layer sees a more stable input distribution), (2) enabling higher learning rates without divergence, (3) acting as a form of regularization (the noise from mini-batch statistics has a regularizing effect). The downside: it depends on batch statistics, which can cause issues with small batches or at inference time."
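The training-time forward pass is just a per-feature standardization plus the learnable affine transform. A minimal numpy sketch (the batch shape and the shifted input distribution are arbitrary; gamma and beta are set to their identity initialization):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature across the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learnable affine transform

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # shifted, scaled batch
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # ≈ 0 for every feature
print(out.std(axis=0))   # ≈ 1 for every feature
```

The inference-time issue mentioned above comes from the `mean` and `var` lines: at serving time there is no mini-batch, so frameworks substitute running averages accumulated during training.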
Q19: Compare CNNs and Transformers for vision tasks.
Model answer: "CNNs use local receptive fields and weight sharing, giving them a strong inductive bias for spatial data — they naturally detect edges, textures, and objects regardless of position. This makes them data-efficient and fast. Transformers use global self-attention, computing relationships between all patch pairs. They have fewer inductive biases, requiring more data to learn spatial invariances, but achieve higher accuracy with large datasets. In practice: CNNs for smaller datasets and edge deployment, Vision Transformers for large-scale applications with abundant data."
Q20: Explain the attention mechanism in Transformers.
Model answer: "Attention computes a weighted sum of values, where the weights are determined by the compatibility between a query and a set of keys. Specifically: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V. The query asks 'what am I looking for?', the keys say 'what do I contain?', and the dot product between them determines how much each value contributes to the output. Multi-head attention runs this process multiple times in parallel with different learned projections, allowing the model to attend to information from different representation subspaces simultaneously."
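The formula translates almost line-for-line into numpy. This is a single head with made-up dimensions (3 queries, 5 key/value pairs, d_k = 8) and no masking:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key compatibility
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights                   # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out, w = attention(Q, K, V)
print(out.shape)         # (3, 8): one output vector per query
print(w.sum(axis=-1))    # each query's weights sum to 1
```

Multi-head attention would run this function h times with different learned projections of Q, K, and V, then concatenate the outputs; the sqrt(d_k) scaling keeps the dot products from saturating the softmax as d_k grows.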
Q21: What is transfer learning and when does it fail?
Model answer: "Transfer learning reuses a model pretrained on a large dataset for a different but related task. It works because early layers learn general features (edges, textures, language patterns) that transfer across tasks. It fails when: (1) the source and target domains are too different (ImageNet features for medical X-rays may not transfer well), (2) the target dataset is very large (training from scratch may be better), or (3) the label spaces are fundamentally different. For NLP, foundation models like BERT have largely solved this — fine-tuning pretrained language models is now the default approach."
Q22: How do you choose a learning rate?
Model answer: "Start with the learning rate finder: train for one epoch with the learning rate increasing exponentially from very small (1e-7) to very large (10). Plot loss vs. learning rate and pick the rate where loss is decreasing fastest (typically one order of magnitude before the minimum). In practice, I use learning rate schedulers: warmup for the first few epochs (linearly increase from 0 to target), then cosine annealing or reduce-on-plateau. AdamW with lr=3e-4 is a reliable default for Transformers. SGD with lr=0.1 and momentum is often better for CNNs."
Probabilistic & Statistical ML (Questions 23–30)
Q23: Explain Bayes' theorem and give a practical example.
Model answer: "Bayes' theorem computes the probability of a hypothesis given evidence: P(H|E) = P(E|H) * P(H) / P(E). Practical example: A medical test has 99% sensitivity (true positive rate) and 1% false positive rate. If 0.1% of the population has the disease, what is the probability that a positive test means you actually have the disease? P(disease|positive) = (0.99 * 0.001) / (0.99 * 0.001 + 0.01 * 0.999) ≈ 9%. Despite the test's 99% sensitivity, most positive results are false positives when the base rate is low. This is why threshold selection matters so much in ML."
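The arithmetic above can be checked in a few lines (the function name and parameters are just for this sketch):

```python
def posterior(sensitivity, false_pos_rate, prevalence):
    """P(disease | positive test) via Bayes' theorem."""
    # Total probability of a positive test: true positives + false positives
    p_positive = sensitivity * prevalence + false_pos_rate * (1 - prevalence)
    return sensitivity * prevalence / p_positive

p = posterior(sensitivity=0.99, false_pos_rate=0.01, prevalence=0.001)
print(round(p, 3))  # → 0.09
```

Varying `prevalence` makes the base-rate effect vivid: at 10% prevalence the same test yields a posterior above 90%, while at 0.1% it yields about 9%.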
Q24: What is the difference between parametric and non-parametric models?
Model answer: "Parametric models have a fixed number of parameters regardless of data size (linear regression, logistic regression, neural networks with fixed architecture). They make strong assumptions about data distribution and are faster to train. Non-parametric models grow in complexity with data size (KNN, decision trees, kernel SVMs). They make fewer assumptions and can model any function, but are slower at inference and require more data. In practice, most deep learning models are parametric but have so many parameters that they approximate non-parametric behavior."
Q25: Explain maximum likelihood estimation.
Model answer: "MLE finds the parameters that maximize the probability of observing the data we actually saw. For a set of observations, we compute the likelihood function P(data|parameters) and find the parameters that maximize it. In practice, we maximize the log-likelihood (equivalent, but numerically more stable). Example: for linear regression with Gaussian noise, MLE reduces to minimizing the mean squared error. For logistic regression, MLE leads to minimizing the binary cross-entropy loss. The loss functions we use in ML are usually derived from MLE."
Q26: What is the EM algorithm?
Model answer: "The Expectation-Maximization algorithm handles models with latent (hidden) variables. It alternates between two steps: E-step (estimate the probability of each latent variable given current parameters) and M-step (update parameters to maximize the expected log-likelihood). The classic example is fitting a Gaussian Mixture Model: E-step assigns each point a soft probability of belonging to each cluster, M-step updates the cluster means, variances, and mixing weights. EM is guaranteed to converge to a local optimum but not the global optimum."
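The GMM example can be sketched compactly in one dimension. This is a bare-bones illustration of the E-step/M-step alternation (the initialization, iteration count, and synthetic two-cluster data are all choices made for the example, and it can still land in a local optimum, as noted above):

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Minimal EM for a two-component 1-D Gaussian mixture."""
    mu = np.array([x.min(), x.max()])       # crude initialization
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: soft responsibility of each component for each point
        # (the 1/sqrt(2*pi) constant cancels in the normalization).
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibility-weighted data
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sigma, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4, 1, 500), rng.normal(4, 1, 500)])
mu, sigma, pi = em_gmm_1d(x)
print(np.sort(mu))  # recovers means near -4 and 4
```

Note that no point is ever hard-assigned to a cluster: the E-step's soft responsibilities are exactly the "estimate the latent variables" step described above.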
Q27: How do you handle missing data?
Model answer: "First, understand the missing data mechanism: MCAR (missing completely at random), MAR (missing at random given observed data), or MNAR (missing not at random). For MCAR/MAR: imputation with mean/median (simple but loses variance), KNN imputation (considers similar samples), multiple imputation (generates several imputed datasets and combines results), or model-based imputation (MICE). For MNAR: the missingness itself contains information — add a binary 'is_missing' feature. Tree-based models (XGBoost) handle missing values natively through learned split directions."
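The mean-imputation-plus-indicator pattern is a two-liner with numpy (the column values are invented; for MAR data you would reach for KNN or MICE instead, as described above):

```python
import numpy as np

def impute_with_indicator(col):
    """Mean-impute a 1-D float array and append an is_missing indicator,
    so the model can still use the missingness signal (relevant for MNAR)."""
    is_missing = np.isnan(col).astype(float)
    filled = np.where(np.isnan(col), np.nanmean(col), col)
    return np.column_stack([filled, is_missing])

col = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
print(impute_with_indicator(col))
# Missing entries become the observed mean (3.0), flagged with 1.0
# in the indicator column.
```

A subtle but important operational rule: compute the imputation mean on the training split only and reuse it at inference time, otherwise the statistic leaks information from validation data.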
Q28: What is A/B testing in the context of ML models?
Model answer: "A/B testing compares a new model (treatment) against the current model (control) on live traffic. Key steps: define a primary metric (e.g., click-through rate), calculate the required sample size for statistical significance (power analysis), randomly split users into control and treatment groups, run the experiment for a predetermined duration, and analyze results with a statistical test (typically a two-sample t-test or chi-squared test). Pitfalls: peeking at results too early (inflates false positive rate), network effects (users in different groups influencing each other), and novelty effects (initial lift that fades)."
Q29: Explain the difference between correlation and causation.
Model answer: "Correlation measures the statistical association between two variables. Causation means one variable directly influences the other. The distinction matters in ML because models learn correlations, not causation. A model might learn that ice cream sales predict drowning deaths (both are caused by hot weather, not by each other). In production, this means: (1) models can break when the underlying causal structure changes, (2) feature importance does not imply causal effect, and (3) making business decisions based on model features requires causal reasoning (A/B testing, instrumental variables, do-calculus)."
Q30: How do you detect and handle data drift?
Model answer: "Data drift is when the distribution of production data diverges from training data. I monitor three types: (1) Feature drift — track distribution statistics (mean, variance, percentiles) of input features and alert when they exceed thresholds (KS-test, PSI score). (2) Prediction drift — monitor the distribution of model predictions. (3) Concept drift — the relationship between features and target changes (hardest to detect without labels). When drift is detected: investigate the root cause, retrain with recent data, or trigger an automated retraining pipeline. In critical systems, I implement shadow models that are continuously retrained and compared against the production model."
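The PSI score mentioned above is straightforward to sketch: bin the training distribution into quantiles, compare the production sample's bin frequencies, and sum the weighted log-ratios. The bin count, epsilon smoothing, and synthetic mean-shifted data below are choices made for this example:

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference (training) sample
    and a production sample of the same feature."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf       # catch out-of-range values
    p = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    q = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
same = rng.normal(0, 1, 10_000)
shifted = rng.normal(0.5, 1, 10_000)   # half-sigma mean shift in production
print(psi(train, same))     # ≈ 0: no drift
print(psi(train, shifted))  # clearly elevated: drift alert
```

A common rule of thumb is PSI < 0.1 stable, 0.1–0.25 investigate, > 0.25 significant drift, though the thresholds should be calibrated per feature.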
Lilly Tech Systems