Intermediate
Core ML Fundamentals
15 essential interview questions and model answers on foundational machine learning concepts. These questions appear in nearly every ML interview.
Q1: What is the bias-variance tradeoff?
Model Answer: The bias-variance tradeoff describes the tension between two sources of prediction error. Bias is the error from overly simplistic assumptions in the model — a high-bias model consistently misses the true pattern (underfitting). Variance is the error from excessive sensitivity to small fluctuations in the training data — a high-variance model fits noise rather than signal (overfitting). The total expected error equals bias² + variance + irreducible noise. The goal is to find the sweet spot: a model complex enough to capture the true pattern but not so complex that it memorizes training noise. In practice, we manage this tradeoff through regularization, cross-validation, and choosing appropriate model complexity.
Q2: What is overfitting and how do you detect it?
Model Answer: Overfitting occurs when a model learns the training data too well, including its noise and random fluctuations, leading to poor generalization on unseen data. You detect it when training accuracy is significantly higher than validation accuracy — the gap between training and validation loss is the key signal. Common remedies include: (1) collecting more training data, (2) applying regularization (L1, L2, dropout), (3) reducing model complexity (fewer parameters, shallower trees), (4) early stopping during training, and (5) using data augmentation. Cross-validation provides a more robust estimate of generalization performance than a single train/validation split.
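The train/validation gap described above is easy to demonstrate. In this sketch (a toy quadratic target, with degrees and sample sizes invented for the example), the over-capacity model drives training error down while its validation gap widens:

```python
import numpy as np

rng = np.random.default_rng(1)

# Quadratic ground truth with a little noise; small training set, larger held-out set
x_tr = rng.uniform(-1, 1, 20)
y_tr = x_tr**2 + rng.normal(0, 0.1, 20)
x_va = rng.uniform(-1, 1, 200)
y_va = x_va**2 + rng.normal(0, 0.1, 200)

def train_val_mse(degree):
    coef = np.polyfit(x_tr, y_tr, degree)
    tr = float(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
    va = float(np.mean((np.polyval(coef, x_va) - y_va) ** 2))
    return tr, va

tr_simple, va_simple = train_val_mse(2)    # matches the true pattern
tr_complex, va_complex = train_val_mse(9)  # enough capacity to memorize noise

gap_simple = va_simple - tr_simple
gap_complex = va_complex - tr_complex      # the key overfitting signal
```

Watching `gap_complex` grow while `tr_complex` shrinks is exactly the detection signal the answer describes.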
Q3: What is underfitting and when does it happen?
Model Answer: Underfitting occurs when a model is too simple to capture the underlying structure of the data. Both training and validation performance are poor. It happens when: (1) the model has insufficient capacity (e.g., using linear regression for a nonlinear problem), (2) important features are missing, (3) there is too much regularization, or (4) training is stopped too early. To fix underfitting, you can increase model complexity, add more relevant features, reduce regularization strength, train longer, or use feature engineering to create more expressive inputs.
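Underfitting shows the opposite signature: both errors are high. A quick numpy sketch (toy data, illustrative numbers) of case (1) above, a linear model on a quadratic target:

```python
import numpy as np

rng = np.random.default_rng(2)
x_tr = rng.uniform(-2, 2, 100)
y_tr = x_tr**2 + rng.normal(0, 0.1, 100)
x_va = rng.uniform(-2, 2, 100)
y_va = x_va**2 + rng.normal(0, 0.1, 100)

def mse(coef, x, y):
    return float(np.mean((np.polyval(coef, x) - y) ** 2))

lin = np.polyfit(x_tr, y_tr, 1)   # insufficient capacity for a quadratic target
quad = np.polyfit(x_tr, y_tr, 2)  # matches the underlying structure

lin_tr, lin_va = mse(lin, x_tr, y_tr), mse(lin, x_va, y_va)
quad_tr, quad_va = mse(quad, x_tr, y_tr), mse(quad, x_va, y_va)
```

Unlike the overfitting case, the linear model's training error is already poor, so adding data would not help; increasing capacity does.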
Q4: Explain the curse of dimensionality.
Model Answer: The curse of dimensionality refers to the problems that arise when working with data in high-dimensional spaces. As the number of features grows, the volume of the feature space increases exponentially, making data increasingly sparse. This causes several issues: (1) distance metrics become less meaningful because all points become roughly equidistant, breaking algorithms like KNN, (2) the amount of data needed to maintain the same sampling density grows exponentially with the number of features, (3) models become more prone to overfitting because there are more parameters to fit with relatively fewer data points. Solutions include dimensionality reduction (PCA, feature selection), regularization, and domain knowledge to select only the most relevant features.
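Point (1), distance concentration, can be shown in a few lines. This sketch (dimensions and sample sizes picked for illustration) measures the relative contrast between the farthest and nearest neighbor of a random query point:

```python
import numpy as np

rng = np.random.default_rng(3)

def distance_contrast(d, n=500):
    """Relative spread (max - min) / min of distances from a query to n random points."""
    X = rng.random((n, d))               # n points in the d-dimensional unit cube
    q = rng.random(d)                    # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    return float((dist.max() - dist.min()) / dist.min())

low = distance_contrast(2)      # low dimension: near and far are very different
high = distance_contrast(1000)  # high dimension: everything is roughly equidistant
```

In 2-D the contrast is large, so "nearest neighbor" is meaningful; in 1000-D it collapses toward zero, which is why KNN degrades.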
Q5: What is the difference between parametric and non-parametric models?
Model Answer: Parametric models assume a fixed functional form and have a fixed number of parameters regardless of training data size. Examples: linear regression, logistic regression, naive Bayes. They are fast to train, require less data, but may underfit if the assumption is wrong. Non-parametric models do not assume a fixed form and their complexity grows with data size. Examples: KNN, decision trees, kernel SVM. They are more flexible and can capture complex patterns, but require more data, are slower to predict, and are more prone to overfitting. The choice depends on data size, dimensionality, and whether the underlying relationship matches the parametric assumption.
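The contrast can be made concrete with a two-parameter linear model versus a KNN whose "model" is the training set itself. A numpy sketch on a toy nonlinear problem (all settings illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x_tr = rng.uniform(-3, 3, 200)
y_tr = np.sin(x_tr) + rng.normal(0, 0.1, 200)
x_te = np.linspace(-2.5, 2.5, 100)
y_te = np.sin(x_te)

# Parametric: linear regression, exactly 2 parameters regardless of data size
coef = np.polyfit(x_tr, y_tr, 1)
lin_mse = float(np.mean((np.polyval(coef, x_te) - y_te) ** 2))

# Non-parametric: KNN keeps all 200 training points; capacity grows with the data
def knn_predict(x, k=5):
    d = np.abs(x_tr[None, :] - x[:, None])     # distances to every training point
    idx = np.argsort(d, axis=1)[:, :k]         # k nearest neighbours
    return y_tr[idx].mean(axis=1)

knn_mse = float(np.mean((knn_predict(x_te) - y_te) ** 2))
```

Here the linear assumption is wrong, so the parametric model underfits while KNN tracks the sine curve; on truly linear data the comparison would flip.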
Q6: What is feature selection and why does it matter?
Model Answer: Feature selection is the process of choosing a subset of relevant features for model training, discarding irrelevant or redundant ones. It matters because: (1) it reduces overfitting by removing noise, (2) it improves model interpretability, (3) it reduces training time and computational cost, and (4) it can improve accuracy. Three main approaches exist: Filter methods (correlation, mutual information, chi-squared test) rank features independently of the model. Wrapper methods (forward selection, backward elimination, recursive feature elimination) evaluate feature subsets using model performance. Embedded methods (L1 regularization, tree-based feature importance) perform selection during model training. In practice, I start with filter methods for initial screening, then use embedded methods like L1 or tree importance for final selection.
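A filter method from the list above can be sketched in plain numpy: rank features by absolute Pearson correlation with the target and keep the top k. The data-generating setup (which features are informative, coefficient values) is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
X = rng.normal(size=(n, 10))
# Only features 0 and 3 actually drive the target; the other 8 are pure noise
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(0, 0.5, n)

# Filter method: score each feature independently of any model
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(10)])
top2 = set(np.argsort(corr)[-2:])   # indices of the two highest-scoring features
```

Note the filter's limitation: it scores features one at a time, so it can miss features that are only informative in combination, which is where wrapper and embedded methods earn their extra cost.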
Q7: Explain the purpose of train, validation, and test splits.
Model Answer: The three-way split serves distinct purposes. The training set (typically 60-80%) is used to fit model parameters. The validation set (10-20%) is used during development to tune hyperparameters and make model selection decisions; because you tune against it repeatedly, its performance estimate gradually becomes optimistic. The test set (10-20%) is held out completely and used only once at the end to provide a final unbiased estimate of model performance on unseen data. The critical rule is: never use the test set to make any decision during model development, or it becomes another validation set and your performance estimate becomes optimistic. A common split is 70/15/15 or 80/10/10, but the exact ratio depends on dataset size.
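A 70/15/15 split reduces to shuffling indices once and slicing, which guarantees the three sets are disjoint and cover all the data. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
idx = rng.permutation(n)               # shuffle once, then slice

n_tr = int(0.70 * n)
n_va = int(0.15 * n)
train_idx = idx[:n_tr]                 # fit parameters here
val_idx = idx[n_tr:n_tr + n_va]        # tune hyperparameters here
test_idx = idx[n_tr + n_va:]           # touch exactly once, at the end
```

Slicing a single permutation (rather than sampling each set independently) is what makes the disjointness automatic.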
Q8: What is cross-validation and when would you use it?
Model Answer: Cross-validation is a technique for estimating model generalization performance by repeatedly splitting data into training and validation folds. In k-fold CV, the data is split into k equal parts; the model is trained on k-1 folds and validated on the remaining fold, repeating k times. The results are averaged across all folds. This gives a more reliable estimate than a single train/validation split, especially with limited data. Use it when: (1) you have limited data and cannot afford a large validation set, (2) you need a robust estimate for hyperparameter tuning, (3) you want to detect if model performance is sensitive to data splitting. Common values are k=5 or k=10. Stratified k-fold preserves class proportions in each fold, which is important for imbalanced datasets. Leave-one-out CV (k=N) is computationally expensive but useful for very small datasets.
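The k-fold procedure is short enough to write by hand, which also makes the mechanics clear: every sample is validated exactly once. A numpy sketch using a toy quadratic problem to compare two model choices (data and degrees are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, 100)
y = x**2 + rng.normal(0, 0.1, 100)

def kfold_mse(degree, k=5):
    """Average validation MSE of a degree-`degree` polynomial over k folds."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        va = folds[i]                                              # held-out fold
        tr = np.concatenate([folds[j] for j in range(k) if j != i])  # remaining k-1
        coef = np.polyfit(x[tr], y[tr], degree)
        scores.append(np.mean((np.polyval(coef, x[va]) - y[va]) ** 2))
    return float(np.mean(scores))

cv_const = kfold_mse(0)   # degree-0 model: just predicts the mean
cv_quad = kfold_mse(2)    # matches the true quadratic structure
```

Comparing `cv_quad` against `cv_const` is exactly the model-selection use case: the averaged fold scores give a more stable ranking than any single split would.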
Q9: What is regularization and why is it used?
Model Answer: Regularization is a technique that adds a penalty term to the loss function to discourage overly complex models and prevent overfitting. It works by constraining or shrinking model parameters toward zero, effectively reducing model capacity. The two most common forms are: L1 (Lasso), which adds the sum of absolute parameter values and produces sparse solutions (some weights become exactly zero, performing feature selection). L2 (Ridge), which adds the sum of squared parameter values and shrinks all weights toward zero but rarely to exactly zero. ElasticNet combines both. The regularization strength (lambda) controls the penalty — too much causes underfitting, too little allows overfitting. You tune lambda using cross-validation.
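For L2 the shrinkage can be seen directly, because ridge regression has a closed form: w = (XᵀX + λI)⁻¹Xᵀy. In this sketch (synthetic data, illustrative λ values) the weight norm shrinks monotonically as λ grows:

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 50, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(0, 0.5, n)

def ridge(lam):
    """Closed-form L2-regularized least squares: (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Increasing lambda shrinks the weights toward zero (but not exactly to zero)
norms = [float(np.linalg.norm(ridge(lam))) for lam in (0.0, 1.0, 100.0)]
```

L1's sparsity has no such closed form (the penalty is non-differentiable at zero), which is why Lasso is fit with coordinate descent or similar solvers; the tuning of λ itself belongs on the validation folds, as the answer notes.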
Q10: What is the difference between generative and discriminative models?
Model Answer: Discriminative models learn the decision boundary directly by modeling P(y|x) — the probability of the label given the features. Examples: logistic regression, SVM, neural networks. They tend to perform better with sufficient data because they focus on what matters for prediction. Generative models learn the joint distribution P(x,y) or equivalently P(x|y) and P(y), then use Bayes' theorem to compute P(y|x). Examples: naive Bayes, Gaussian mixture models, hidden Markov models. They can generate new data samples, handle missing features more naturally, and work better with small datasets. Discriminative models are typically preferred for classification when you have enough labeled data; generative models are valuable when you need to model the data distribution itself or handle semi-supervised settings.
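The generative route can be sketched end to end for a 1-D Gaussian naive Bayes: model P(x|y) for each class, apply Bayes' theorem to classify, and then, because the data distribution itself was modeled, draw new samples. All numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(9)
# Two classes with Gaussian class-conditional distributions
x0 = rng.normal(-1, 1, 500)
x1 = rng.normal(1, 1, 500)
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(500), np.ones(500)])

# Generative step: estimate P(x|y) per class (equal priors P(y) here)
mu0, s0 = x0.mean(), x0.std()
mu1, s1 = x1.mean(), x1.std()

def log_lik(x, mu, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - mu) ** 2 / (2 * s**2)

# Bayes' theorem gives P(y=1|x) from the two likelihoods
post1 = 1.0 / (1.0 + np.exp(log_lik(x, mu0, s0) - log_lik(x, mu1, s1)))
acc = float(((post1 > 0.5) == y).mean())

# Because we modeled P(x|y), we can also *generate* new class-1 samples,
# something a purely discriminative P(y|x) model cannot do
new_samples = rng.normal(mu1, s1, 10)
```

A discriminative model (e.g., logistic regression) would learn essentially the same boundary here, but would have nothing to say about `new_samples`.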
Q11: What is the No Free Lunch theorem?
Model Answer: The No Free Lunch (NFL) theorem states that no single machine learning algorithm is universally best for all problems. Averaged over all possible data distributions, every algorithm performs equally well (or poorly). The practical implication is that you must match your algorithm to the problem structure. A linear model outperforms a neural network on truly linear data; a random forest may beat both on tabular data with nonlinear interactions. The NFL theorem justifies why data scientists try multiple algorithms and why understanding the assumptions behind each method is crucial. It also means that domain knowledge — understanding your data's structure — is more valuable than blindly applying the most complex model available.
Q12: What is the difference between a model parameter and a hyperparameter?
Model Answer: Parameters are learned from the training data during the optimization process. Examples: weights in a neural network, coefficients in linear regression, split points in a decision tree. Hyperparameters are set before training and control the learning process itself. Examples: learning rate, number of hidden layers, regularization strength, number of trees in a random forest. Parameters are optimized by the training algorithm (e.g., gradient descent minimizing a loss function), while hyperparameters are tuned using validation performance through methods like grid search, random search, or Bayesian optimization. A key distinction: parameters define the model, hyperparameters define how the model is trained.
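The distinction maps cleanly onto code: the inner `polyfit` call learns parameters from training data, while the outer loop searches a hyperparameter (here, polynomial degree, a stand-in for any capacity knob) against validation performance. All data and the degree grid are invented for this sketch:

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(-1, 1, 200)
y = np.sin(2 * x) + rng.normal(0, 0.1, 200)
x_tr, y_tr = x[:150], y[:150]
x_va, y_va = x[150:], y[150:]

best_deg, best_mse = None, np.inf
for degree in (1, 3, 5, 7):                   # hyperparameter: searched, never "learned"
    coef = np.polyfit(x_tr, y_tr, degree)     # parameters: fit by least squares
    mse = float(np.mean((np.polyval(coef, x_va) - y_va) ** 2))
    if mse < best_mse:
        best_deg, best_mse = degree, mse
```

This is grid search in miniature; random search and Bayesian optimization replace the `for` loop but keep the same structure: parameters optimized inside, hyperparameters chosen by validation outside.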
Q13: Explain the concept of inductive bias.
Model Answer: Inductive bias is the set of assumptions a learning algorithm makes to generalize beyond the training data to unseen inputs. Without inductive bias, a model could only memorize training examples and would have no basis for prediction. Every algorithm has inductive bias: linear regression assumes a linear relationship; decision trees assume axis-aligned splits are sufficient; KNN assumes nearby points have similar labels; neural networks with ReLU assume piecewise linear functions. The right inductive bias matches the true structure of the problem and enables better generalization. When the bias matches reality, the algorithm learns efficiently with less data. When it mismatches, the algorithm will underfit. Choosing an algorithm is essentially choosing an inductive bias.
Q14: What is data leakage and why is it dangerous?
Model Answer: Data leakage occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates that do not hold in production. Common forms include: (1) Target leakage — using features that include information about the target variable that would not be available at prediction time (e.g., using "account_closed_date" to predict churn). (2) Train-test contamination — preprocessing (like normalization or feature selection) on the full dataset before splitting, allowing test statistics to influence training. (3) Temporal leakage — using future data to predict the past in time-series problems. It is dangerous because the model appears to work perfectly in development but fails catastrophically in production. Prevention: always split data first, then preprocess each split independently; use time-based splits for temporal data; carefully audit features for target leakage.
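Form (2), train-test contamination, is the easiest to demonstrate and to fix. This sketch contrasts the leaky and correct orderings of normalization (synthetic data, illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.normal(5.0, 2.0, size=(100, 3))
X_tr, X_te = X[:70], X[70:]

# WRONG: statistics computed on the full dataset let test rows influence
# the transformation applied to training data
mu_leaky, sd_leaky = X.mean(axis=0), X.std(axis=0)

# RIGHT: fit the preprocessing on the training split only, then apply it
# unchanged to both splits (mimicking what happens at prediction time)
mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr_scaled = (X_tr - mu) / sd
X_te_scaled = (X_te - mu) / sd
```

The two sets of statistics genuinely differ, and only the second ordering matches production, where test-time data is unavailable when the scaler is fit. The same split-first rule applies to feature selection and imputation.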
Q15: What is the difference between bagging and boosting?
Model Answer: Both are ensemble methods that combine multiple weak learners, but they work differently. Bagging (Bootstrap Aggregating) trains models independently on random bootstrap samples of the data, then averages predictions (regression) or takes a majority vote (classification). It reduces variance and is effective when the base model overfits. Random Forest is the best-known bagging method. Boosting trains models sequentially, where each new model focuses on the errors made by the previous ensemble. It reduces bias and can turn weak learners into a strong learner. Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM. Key tradeoffs: bagging is easier to parallelize and more resistant to overfitting; boosting generally achieves higher accuracy but is more sensitive to noisy data and hyperparameters, and is prone to overfitting if not properly regularized.
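Both schemes can be sketched with the same weak learner, a one-split regression stump: bagging averages independent stumps fit to bootstrap samples, while boosting fits each new stump to the residual left by the ensemble so far. Everything here (target function, round counts, learning rate) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(12)
x = rng.uniform(0, 6, 300)
y = np.sin(x) + rng.normal(0, 0.2, 300)

def fit_stump(x, y):
    """Weak learner: pick the single threshold minimizing squared error."""
    best = (np.inf, 0.0, 0.0, 0.0)
    for t in np.linspace(0.5, 5.5, 40):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, lo, hi = best
    return lambda q: np.where(q <= t, lo, hi)

def mse_vs_truth(pred):
    return float(np.mean((pred - np.sin(x)) ** 2))

single_mse = mse_vs_truth(fit_stump(x, y)(x))

# Bagging: independent stumps on bootstrap samples, predictions averaged
bag_preds = []
for _ in range(50):
    i = rng.integers(0, len(x), len(x))
    bag_preds.append(fit_stump(x[i], y[i])(x))
bag_mse = mse_vs_truth(np.mean(bag_preds, axis=0))

# Boosting: each stump fits the residual of the ensemble built so far
ens = np.zeros_like(x)
for _ in range(50):
    stump = fit_stump(x, y - ens)
    ens = ens + 0.3 * stump(x)      # shrinkage (learning rate) of 0.3
boost_mse = mse_vs_truth(ens)
```

The stump is a high-bias, low-variance learner, so bagging it helps only modestly, while boosting drives the bias down round by round, matching the tradeoff described above. With a deep, high-variance base learner the comparison would favor bagging more.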
Interview Tip: For every concept you explain, try to mention (1) the intuition, (2) a practical example, and (3) when you would or would not use it. This three-part structure demonstrates both theoretical knowledge and practical experience.
Lilly Tech Systems