Intermediate
Common Distributions
10 interview questions on probability distributions that every data scientist and ML engineer must know. Each answer includes when to use the distribution and a real-world example.
Q1: What is the normal (Gaussian) distribution and why is it so important?
Model Answer: The normal distribution N(μ, σ²) is a continuous, symmetric, bell-shaped distribution defined by its mean μ and variance σ². Its PDF is: f(x) = (1/(σ√(2π))) · exp(-(x-μ)²/(2σ²)).
Why it is important: (1) The Central Limit Theorem guarantees that the mean of many independent random variables is approximately normal, regardless of the original distribution. This is why sample means, test statistics, and estimation errors tend to be normally distributed. (2) Many ML algorithms assume normality: linear regression assumes normally distributed errors, Gaussian Naive Bayes assumes Gaussian features, and Gaussian processes use normal distributions as priors. (3) The 68-95-99.7 rule: about 68% of data falls within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ. This makes it easy to reason about outliers and confidence intervals.
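The 68-95-99.7 rule can be checked directly from the normal CDF. A minimal sketch using only the Python standard library (Φ(x) = 0.5 · (1 + erf(x/√2))):

```python
# Sketch: verifying the 68-95-99.7 rule with the standard normal CDF,
# using only the standard library.
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

for k in (1, 2, 3):
    # Probability of falling within k standard deviations of the mean.
    p = normal_cdf(k) - normal_cdf(-k)
    print(f"within {k} sigma: {p:.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```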
Q2: When would you use a binomial distribution? Give an example.
Model Answer: The binomial distribution B(n, p) models the number of successes in n independent Bernoulli trials, each with success probability p. PMF: P(X = k) = C(n,k) · pᵏ · (1-p)ⁿ⁻ᵏ. Mean = np, Variance = np(1-p).
Use when: (1) fixed number of trials, (2) each trial has exactly two outcomes (success/failure), (3) trials are independent, (4) probability of success is constant across trials.
Example: A Google ad has a 2% click-through rate. Out of 1,000 impressions, what is the probability of getting exactly 25 clicks? X ~ B(1000, 0.02). E[X] = 20, Var(X) = 19.6. Since n is large, we can approximate with N(20, 19.6): P(X = 25) ≈ φ(z)/σ with z = (25 - 20)/√19.6 ≈ 1.13, giving 0.211/4.43 ≈ 0.048, close to the exact binomial value of about 0.045.
In ML: Classification accuracy on a test set follows a binomial distribution, which is why confidence intervals for accuracy use the binomial proportion formula.
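The ad-click example can be checked exactly with the binomial PMF; a standard-library sketch comparing the exact answer against the normal approximation N(20, 19.6):

```python
# Sketch: exact binomial probability for the ad-click example versus the
# normal-PDF approximation. Standard library only.
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p, k = 1000, 0.02, 25
exact = binom_pmf(k, n, p)

mu, var = n * p, n * p * (1 - p)          # 20 and 19.6
sigma = math.sqrt(var)
z = (k - mu) / sigma                       # ~1.13
approx = math.exp(-z**2 / 2) / (sigma * math.sqrt(2 * math.pi))  # normal PDF at k

print(f"exact P(X=25) = {exact:.4f}")   # 0.0446
print(f"normal approx = {approx:.4f}")  # 0.0476
```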
Q3: When would you use a Poisson distribution instead of a binomial?
Model Answer: The Poisson distribution P(λ) models the number of events in a fixed interval of time or space when events occur independently at a constant average rate λ. PMF: P(X = k) = (λᵏ · e⁻λ) / k!. Mean = λ, Variance = λ.
Use Poisson instead of Binomial when: n is very large, p is very small, and the product λ = np is moderate. The Poisson is the limiting case of the binomial as n → ∞ and p → 0 with np constant.
Example: A website receives an average of 3 server errors per hour. What is the probability of seeing 5 or more errors in a given hour?
P(X ≥ 5) = 1 - P(X ≤ 4) = 1 - [P(0) + P(1) + P(2) + P(3) + P(4)]
= 1 - e⁻³[1 + 3 + 4.5 + 4.5 + 3.375] = 1 - e⁻³ · 16.375 = 1 - 0.8153 ≈ 0.185
In ML: Poisson regression for count data, modeling rare events like fraud or equipment failure, and natural language processing (word counts in documents often follow a Poisson-like distribution).
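The server-error calculation above translates directly into code; a standard-library sketch:

```python
# Sketch: P(X >= 5) for X ~ Poisson(3), computed as 1 minus the sum of
# the first five PMF terms. Standard library only.
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 3.0
p_at_most_4 = sum(poisson_pmf(k, lam) for k in range(5))
print(f"P(X >= 5) = {1 - p_at_most_4:.4f}")  # 0.1847
```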
Q4: Explain the exponential distribution and its relationship to the Poisson.
Model Answer: The exponential distribution Exp(λ) models the time between events in a Poisson process. If events occur at rate λ per unit time (Poisson), then the waiting time between consecutive events follows Exp(λ). PDF: f(x) = λ · e⁻λx for x ≥ 0. Mean = 1/λ, Variance = 1/λ².
Key property — Memorylessness: P(X > s + t | X > s) = P(X > t). The probability of waiting another t minutes does not depend on how long you have already waited. This is the only continuous distribution with this property.
Example: If a data center sees server failures at a rate of 2 per day, the time between failures is Exp(2). The probability of no failure in the next 12 hours (0.5 days): P(X > 0.5) = e^(-2×0.5) = e⁻¹ ≈ 0.368.
In ML: Survival analysis, modeling user session durations, time-to-event prediction, and the exponential family of distributions (which underpins generalized linear models).
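The survival probability and the memorylessness property can both be checked numerically; a standard-library sketch using the data-center example (rate λ = 2 per day):

```python
# Sketch: exponential survival probabilities for the server-failure
# example, plus a numeric check of memorylessness. Standard library only.
import math

def exp_survival(t, lam):
    """P(X > t) for X ~ Exp(lam)."""
    return math.exp(-lam * t)

lam = 2.0
print(f"P(no failure in 12h) = {exp_survival(0.5, lam):.4f}")  # e^-1 ~ 0.3679

# Memorylessness: P(X > s + t | X > s) equals P(X > t).
s, t = 1.0, 0.5
lhs = exp_survival(s + t, lam) / exp_survival(s, lam)
rhs = exp_survival(t, lam)
print(f"conditional = {lhs:.4f}, unconditional = {rhs:.4f}")  # identical
```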
Q5: What is the uniform distribution and when do you encounter it in ML?
Model Answer: The continuous uniform distribution U(a, b) assigns equal probability density to all values in [a, b]. PDF: f(x) = 1/(b-a) for a ≤ x ≤ b. Mean = (a+b)/2, Variance = (b-a)²/12.
When you encounter it in ML:
• Random initialization: Neural network weights are often initialized from a uniform range, e.g. U(-1/√n, 1/√n) (the default for linear layers in some frameworks); Xavier/Glorot uniform initialization uses the range ±√6/√(n_in + n_out).
• Uninformative priors: In Bayesian statistics, a uniform prior represents "no prior knowledge" about a parameter.
• Random search: Hyperparameter optimization with random search samples from uniform distributions.
• Data augmentation: Random crop positions, rotation angles, and color jitter values are often uniformly distributed.
• Inverse transform sampling: To sample from any distribution, you start with U(0,1) and apply the inverse CDF.
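The last bullet, inverse transform sampling, is easy to demonstrate: for Exp(λ), the inverse CDF is x = -ln(1 - u)/λ. A standard-library sketch (seed and sample count are arbitrary choices):

```python
# Sketch: inverse transform sampling — drawing Exp(lam) samples from
# U(0, 1) via the inverse CDF. Standard library only.
import math
import random

def sample_exponential(lam, rng):
    u = rng.random()                  # u ~ U(0, 1)
    return -math.log(1.0 - u) / lam  # inverse CDF of Exp(lam)

rng = random.Random(42)  # fixed seed for reproducibility
lam = 2.0
samples = [sample_exponential(lam, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(f"sample mean = {mean:.3f} (theory: 1/lam = {1/lam})")
```

The same recipe works for any distribution whose inverse CDF you can evaluate, which is why U(0, 1) is the universal starting point for random-number generation.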
Q6: Explain the Central Limit Theorem and why it matters for data science.
Model Answer: The CLT states that the sampling distribution of the sample mean approaches a normal distribution as the sample size n increases, regardless of the population distribution (as long as it has finite mean μ and variance σ²). Specifically: √n · (X̄ - μ) / σ → N(0, 1) as n → ∞.
Why it matters:
• Confidence intervals: We can construct confidence intervals for any population mean using normal approximation when n is large enough (typically n ≥ 30).
• Hypothesis testing: t-tests and z-tests rely on the CLT for their validity.
• A/B testing: When comparing average metrics (revenue, time on site), the CLT justifies using normal-based tests even when the underlying data is heavily skewed.
• Error analysis: Prediction errors across many examples tend to be approximately normal, which is why MSE (assuming Gaussian errors) works well in practice.
Caveat: The CLT requires finite variance. For heavy-tailed distributions (Cauchy, some financial returns), the CLT converges very slowly or not at all.
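A quick simulation makes the CLT concrete: means of n draws from a heavily skewed Exp(1) population concentrate around μ = 1 with variance shrinking like σ²/n. A standard-library sketch (seed and repetition counts are arbitrary):

```python
# Sketch: CLT demonstration — sample means of Exp(1) draws (mu = 1,
# sigma^2 = 1) concentrate as n grows. Standard library only.
import random

rng = random.Random(0)

def sample_mean(n):
    """Mean of n i.i.d. Exp(1) draws."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

for n in (1, 5, 50):
    means = [sample_mean(n) for _ in range(5_000)]
    m = sum(means) / len(means)
    v = sum((x - m) ** 2 for x in means) / len(means)
    # Variance of the sample mean should shrink like sigma^2 / n.
    print(f"n={n:3d}: mean of means = {m:.3f}, variance = {v:.4f} (theory {1/n:.4f})")
```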
Q7: What is the difference between PDF, PMF, and CDF?
Model Answer:
• PMF (Probability Mass Function): For discrete random variables. P(X = x) gives the probability of each specific value. Example: for a fair die, P(X = 3) = 1/6. PMF values sum to 1.
• PDF (Probability Density Function): For continuous random variables. f(x) is the density at point x. The probability of an exact value is 0; instead, P(a ≤ X ≤ b) = ∫f(x)dx from a to b. PDF values can exceed 1 (e.g., U(0, 0.5) has PDF = 2), but the total area must equal 1.
• CDF (Cumulative Distribution Function): Works for both. F(x) = P(X ≤ x). It is non-decreasing, right-continuous, ranges from 0 to 1. For continuous distributions, PDF = dCDF/dx.
Interview tip: If asked "what is the probability of X = exactly 3.0 for a normal distribution?" the answer is 0. This catches candidates who confuse PDF values with probabilities.
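The three definitions above can be checked numerically in a few lines; a standard-library sketch:

```python
# Sketch: PMF values sum to 1, a PDF can exceed 1, and the CDF ties
# everything together. Standard library only.
import math

# PMF of a fair die: the six probabilities sum to 1.
die_pmf = {k: 1 / 6 for k in range(1, 7)}
print(sum(die_pmf.values()))  # 1.0 (up to float rounding)

# PDF of U(0, 0.5): density is 2 on the interval — above 1 is fine...
a, b = 0.0, 0.5
density = 1 / (b - a)
print(density)  # 2.0

# ...because the total area is still 1.
print(density * (b - a))  # 1.0

# CDF of the standard normal at x = 0 is 0.5 by symmetry.
print(0.5 * (1 + math.erf(0 / math.sqrt(2))))  # 0.5
```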
Q8: You observe that the number of customer support tickets follows a Poisson distribution with λ = 50 per day. What is the probability of receiving more than 60 tickets tomorrow?
Model Answer: For large λ, the Poisson distribution is well-approximated by a normal distribution: X ~ Poisson(50) ≈ N(50, 50) since for Poisson, mean = variance = λ.
Step 1: Standardize: z = (60 - 50) / √50 = 10/7.07 ≈ 1.414
Step 2: P(X > 60) ≈ P(Z > 1.414) ≈ 1 - Φ(1.414) ≈ 1 - 0.9213 ≈ 0.079
About 7.9% chance. In practice, you could use the exact Poisson CDF (computed numerically), but the normal approximation is what interviewers expect you to use for quick calculations. Apply a continuity correction for more accuracy: P(X > 60) = P(X ≥ 61) ≈ P(Z ≥ (60.5 - 50)/7.07) = P(Z ≥ 1.485) ≈ 0.069.
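The exact Poisson tail mentioned above is straightforward to compute directly; a standard-library sketch comparing it against the continuity-corrected normal approximation:

```python
# Sketch: the support-ticket example — exact Poisson tail versus the
# normal approximation with continuity correction. Standard library only.
import math

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

lam = 50.0
# Exact: P(X > 60) = 1 - P(X <= 60), summing PMF terms.
exact = 1.0 - sum(lam**k * math.exp(-lam) / math.factorial(k) for k in range(61))

# Normal approximation N(50, 50) with continuity correction.
sigma = math.sqrt(lam)
approx = 1.0 - normal_cdf((60.5 - lam) / sigma)

print(f"exact  = {exact:.4f}")
print(f"approx = {approx:.4f}")  # ~0.069
```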
Q9: What is the geometric distribution and when does it arise?
Model Answer: The geometric distribution models the number of Bernoulli trials needed to get the first success. PMF: P(X = k) = (1-p)ᵏ⁻¹ · p for k = 1, 2, 3, ... Mean = 1/p, Variance = (1-p)/p².
Key property — Memorylessness: It is the discrete analog of the exponential distribution. P(X > m + n | X > m) = P(X > n). If you have already failed m times, your expected remaining wait is the same as if you just started.
Example: A recruiter calls candidates with a 10% chance each agrees to an interview. How many calls are expected before the first yes? E[X] = 1/0.10 = 10 calls. P(first yes on call 5 or later) = (0.9)⁴ = 0.6561.
In ML: Modeling the number of iterations until convergence, the number of random restarts until finding a good solution, or the waiting time until a rare event in streaming data.
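The recruiter example can be verified both analytically and by simulation; a standard-library sketch (seed and repetition count are arbitrary):

```python
# Sketch: the recruiter example — expected calls until the first yes,
# and the probability the first yes comes on call 5 or later.
import random

p = 0.10
print(1 / p)  # 10.0 expected calls

# P(X >= 5) = P(first four calls all fail) = (1 - p)^4
print(f"{(1 - p) ** 4:.4f}")  # 0.6561

# Simulation check: average number of trials until first success.
rng = random.Random(1)
def calls_until_yes():
    k = 1
    while rng.random() >= p:  # call fails with probability 1 - p
        k += 1
    return k
mean = sum(calls_until_yes() for _ in range(50_000)) / 50_000
print(f"simulated mean = {mean:.2f} (theory: 10)")
```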
Q10: An interviewer asks: "Your model outputs probabilities. How do you verify that these probabilities are well-calibrated?" How does this relate to distributions?
Model Answer: A model is well-calibrated if among all predictions where it says "70% probability of class 1," approximately 70% of those cases are actually class 1.
How to check:
• Reliability diagram (calibration plot): Bin predictions by predicted probability, plot predicted vs actual frequency. A perfectly calibrated model lies on the diagonal y = x.
• Expected Calibration Error (ECE): Weighted average of |predicted - actual| across bins.
• Brier score: Mean squared error between predicted probabilities and actual binary outcomes. Decomposes into calibration + refinement.
Connection to distributions: If your model outputs P(Y=1|X) = 0.7, this implicitly says Y|X follows a Bernoulli(0.7) distribution. Calibration checks whether this distributional assumption matches reality. Poorly calibrated models often output overconfident probabilities (e.g., predicting 0.95 when the true rate is 0.7). Fixes: Platt scaling (fits a logistic regression on the model outputs), temperature scaling, or isotonic regression.
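ECE is simple enough to implement from scratch; a minimal pure-Python sketch with equal-width bins (the toy predictions and labels are hypothetical):

```python
# Sketch: Expected Calibration Error (ECE) with equal-width probability
# bins. Pure Python; the example inputs are made up for illustration.
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average of |mean predicted prob - observed frequency| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    ece, n = 0.0, len(probs)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)   # mean predicted probability
        avg_acc = sum(y for _, y in b) / len(b)    # observed positive rate
        ece += (len(b) / n) * abs(avg_conf - avg_acc)
    return ece

# Overconfident toy model: says 0.9 but only half of those are positives.
probs  = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
labels = [1,   0,   1,   0,   0,   0]
print(f"ECE = {expected_calibration_error(probs, labels):.3f}")  # 0.300
```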
Lilly Tech Systems