Intermediate Statistics Questions
15 essential statistics interview questions and model answers. These questions appear in nearly every data science interview loop at top tech companies.
Q1: What is the Central Limit Theorem and why does it matter?
Model Answer: The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the population's original distribution — provided the population has finite variance. Specifically, if you draw samples of size n from a population with mean μ and standard deviation σ, the distribution of sample means will be approximately normal with mean μ and standard deviation σ/√n for sufficiently large n (typically n ≥ 30). This matters enormously in practice because it justifies using z-tests and t-tests for hypothesis testing, constructing confidence intervals, and making inferences even when the underlying data is not normally distributed. It is the foundation of most frequentist statistical inference.
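A minimal simulation makes the CLT concrete: draw many samples from a clearly non-normal population (here Exponential(1), chosen for illustration) and check that the sample means cluster around μ with spread σ/√n.

```python
# Sketch: empirically checking the CLT with an exponential (non-normal) population.
# The sample size n and number of replications are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n = 50          # size of each sample
reps = 10_000   # number of sample means to draw

# Exponential(1) has mean 1 and standard deviation 1
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# CLT prediction: mean of sample means ≈ mu, spread ≈ sigma / sqrt(n)
print(sample_means.mean())        # close to 1.0
print(sample_means.std(ddof=1))   # close to 1/sqrt(50) ≈ 0.141
```

Plotting a histogram of `sample_means` would show the familiar bell shape even though the underlying data is heavily skewed.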
Q2: Explain the difference between a population and a sample.
Model Answer: A population is the complete set of all elements you want to study (e.g., all users of an app). A sample is a subset drawn from that population to make inferences. We use samples because collecting data from an entire population is usually impossible or impractical. Key distinctions: population parameters (like μ and σ) are fixed but unknown; sample statistics (like x̄ and s) are calculated from data and used to estimate population parameters. The quality of our inferences depends on how representative the sample is — random sampling minimizes selection bias. Standard error quantifies how much a sample statistic is expected to vary from the true population parameter.
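The parameter/statistic distinction can be sketched in a few lines; the population below is simulated, so the "true" μ and σ are known by construction.

```python
# Sketch: sample statistics estimate population parameters;
# the standard error quantifies their expected variability.
import numpy as np

rng = np.random.default_rng(1)
# Simulated population of 1M users (mu = 100, sigma = 15 by construction)
population = rng.normal(loc=100, scale=15, size=1_000_000)

sample = rng.choice(population, size=400, replace=False)  # random sample
x_bar = sample.mean()                 # sample statistic, estimates mu
s = sample.std(ddof=1)                # sample statistic, estimates sigma
se = s / np.sqrt(len(sample))         # standard error of the mean

print(x_bar, s, se)  # se ≈ 15 / sqrt(400) = 0.75
```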
Q3: What is a p-value? What does it NOT mean?
Model Answer: A p-value is the probability of observing a test statistic as extreme as (or more extreme than) the one calculated from your data, assuming the null hypothesis is true. If p < α (commonly 0.05), we reject the null hypothesis. Critically, a p-value is NOT: (1) the probability that the null hypothesis is true — that would require Bayesian reasoning with a prior, (2) the probability of making an error — that is the significance level α, (3) a measure of effect size — a tiny, practically meaningless effect can produce a very small p-value with a large enough sample. In practice, always report effect size and confidence intervals alongside p-values to give a complete picture.
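A quick sketch of computing a p-value with SciPy; the data is simulated with a real nonzero effect, so the test should (usually) reject.

```python
# Sketch: two-sided p-value from a one-sample t-test.
# The simulated data has true mean 0.5, so H0 (mean = 0) is false.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=0.5, scale=1.0, size=100)

# p-value = P(|T| >= |t_obs|) assuming H0 (mean = 0) is true
t_stat, p_value = stats.ttest_1samp(data, popmean=0.0)
print(t_stat, p_value)  # p likely well below 0.05 given the real effect
```

Note that `p_value` says nothing about the size of the effect; reporting `data.mean()` with a confidence interval completes the picture.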
Q4: Explain Type I and Type II errors with an example.
Model Answer: A Type I error (false positive) occurs when you reject a true null hypothesis. Example: concluding a new feature increases engagement when it actually has no effect. The probability of a Type I error is α (the significance level, typically 0.05). A Type II error (false negative) occurs when you fail to reject a false null hypothesis. Example: concluding a new feature has no effect when it actually does increase engagement. The probability of a Type II error is β, and power = 1 - β is the probability of correctly detecting a real effect. There is a fundamental tradeoff: lowering α reduces false positives but increases false negatives. In practice, we choose α and desired power based on the relative cost of each error type — for medical tests, a false negative (missing a disease) may be far worse than a false positive.
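Both error rates can be estimated by simulation: run many fake experiments with and without a real effect and count rejections. The effect size and group size below are illustrative.

```python
# Sketch: simulating Type I error rate and power for a two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, n, reps = 0.05, 50, 2000

def reject_rate(true_effect):
    """Fraction of simulated experiments where H0 (no difference) is rejected."""
    rejections = 0
    for _ in range(reps):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_effect, 1.0, n)
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            rejections += 1
    return rejections / reps

type1_rate = reject_rate(0.0)   # H0 true: should be close to alpha = 0.05
power = reject_rate(0.5)        # real effect of d = 0.5: this is 1 - beta
print(type1_rate, power)
```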
Q5: What is a confidence interval and how do you interpret it?
Model Answer: A 95% confidence interval is a range of values constructed from sample data such that, if you repeated the sampling procedure many times, approximately 95% of the intervals would contain the true population parameter. The correct interpretation is about the procedure, not any single interval. It is incorrect to say "there is a 95% probability the true value lies in this interval" — the true value either is or is not in the interval; the probability statement is about the long-run frequency of the method. The width of the interval depends on sample size, variability, and confidence level. Wider intervals provide more confidence but less precision. In data science, confidence intervals are more informative than p-values alone because they convey both the estimated effect size and the uncertainty around it.
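Constructing a t-based interval is a one-liner with SciPy; the data below is simulated for illustration.

```python
# Sketch: a 95% t-based confidence interval for a mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
data = rng.normal(loc=10.0, scale=2.0, size=60)

mean = data.mean()
sem = stats.sem(data)  # standard error of the mean, s / sqrt(n)
ci_low, ci_high = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(ci_low, ci_high)
```

Remember the interpretation: this particular interval either does or does not contain the true mean; the 95% describes the long-run coverage of the procedure that produced it.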
Q6: What are the key probability distributions a data scientist should know?
Model Answer: The essential distributions are: (1) Normal (Gaussian) — describes many natural phenomena and is central to CLT; parameterized by mean and variance. (2) Binomial — models the number of successes in n independent Bernoulli trials; used for conversion rates and click-through rates. (3) Poisson — models the count of events in a fixed interval; used for page views, customer arrivals, or error counts. (4) Exponential — models time between Poisson events; used for customer lifetime and time-to-event analysis. (5) Uniform — all outcomes equally likely; used in random number generation and as a prior in Bayesian analysis. (6) Beta — models probabilities; commonly used as a prior for conversion rates in Bayesian A/B testing. (7) Log-normal — models positively skewed data like income, stock prices, and session durations. Knowing when to apply each distribution is more important than memorizing their formulas.
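One way to internalize the use cases is to sample from each distribution with NumPy; all parameter values below are illustrative.

```python
# Sketch: matching distributions to typical data science use cases.
import numpy as np

rng = np.random.default_rng(5)

normal   = rng.normal(loc=0, scale=1, size=10_000)      # measurement noise
binomial = rng.binomial(n=100, p=0.03, size=10_000)     # conversions per 100 visits
poisson  = rng.poisson(lam=4.0, size=10_000)            # events per time interval
expo     = rng.exponential(scale=2.0, size=10_000)      # time between events
beta     = rng.beta(a=3, b=97, size=10_000)             # belief about a ~3% rate
lognorm  = rng.lognormal(mean=0, sigma=1, size=10_000)  # right-skewed (revenue-like)

print(binomial.mean())  # ≈ n * p = 3
print(poisson.mean())   # ≈ lambda = 4
```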
Q7: Explain hypothesis testing step by step.
Model Answer: Hypothesis testing follows these steps: (1) State the hypotheses — the null hypothesis H₀ (no effect/no difference) and the alternative hypothesis H₁ (there is an effect). (2) Choose the significance level α (commonly 0.05), which sets the maximum acceptable probability of a Type I error. (3) Select the appropriate test based on data type and assumptions — z-test for large samples with known variance, t-test for small samples or unknown variance, chi-squared for categorical data, etc. (4) Calculate the test statistic from your data. (5) Compute the p-value — the probability of seeing a result at least as extreme under H₀. (6) Make a decision — if p < α, reject H₀; otherwise, fail to reject H₀ (note: "fail to reject" is not the same as "accept"). (7) Report the results with effect size, confidence interval, and practical significance, not just the p-value.
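The steps above can be sketched end-to-end on simulated A/B data; the group sizes and the built-in lift are illustrative.

```python
# Sketch: the full hypothesis-testing procedure with Welch's t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# (1) H0: mean_B - mean_A = 0;  H1: mean_B - mean_A != 0
control   = rng.normal(10.0, 3.0, 500)  # metric in variant A
treatment = rng.normal(11.0, 3.0, 500)  # metric in variant B (real lift of 1.0)

alpha = 0.05                                  # (2) significance level
t_stat, p_value = stats.ttest_ind(            # (3)-(5) test statistic and p-value
    treatment, control, equal_var=False)
effect = treatment.mean() - control.mean()    # (7) effect size to report

decision = "reject H0" if p_value < alpha else "fail to reject H0"  # (6)
print(decision, round(effect, 2))
```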
Q8: What is the difference between parametric and non-parametric tests?
Model Answer: Parametric tests (t-test, ANOVA, linear regression) assume the data follows a specific distribution (usually normal) and operate on distribution parameters like the mean. They are more powerful when their assumptions hold. Non-parametric tests (Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis, chi-squared) make fewer assumptions about the data distribution and operate on ranks or frequencies. Use non-parametric tests when: (1) data is ordinal or heavily skewed, (2) sample size is too small to verify normality, (3) the data has significant outliers that would distort parametric results. The tradeoff is that non-parametric tests generally have less statistical power, meaning they need larger samples to detect the same effect. In data science, revenue and session duration data are often right-skewed, making non-parametric tests or log transformations common choices.
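The skewed-revenue scenario can be sketched directly: compare a raw t-test, a Mann-Whitney U test, and a t-test on log-transformed data. The log-normal parameters and group sizes are illustrative.

```python
# Sketch: parametric vs non-parametric tests on right-skewed data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated "revenue per user": heavily right-skewed in both groups,
# with a real shift between them on the log scale
group_a = rng.lognormal(mean=3.0, sigma=1.0, size=500)
group_b = rng.lognormal(mean=3.3, sigma=1.0, size=500)

_, p_t   = stats.ttest_ind(group_a, group_b, equal_var=False)   # parametric, raw
_, p_mw  = stats.mannwhitneyu(group_a, group_b)                 # rank-based
_, p_log = stats.ttest_ind(np.log(group_a), np.log(group_b))    # log transform

print(p_t, p_mw, p_log)
```

On data like this, the rank-based and log-transformed tests are typically the more reliable choices, since the raw t-test is distorted by the long right tail.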
Q9: What is the difference between Bayesian and frequentist statistics?
Model Answer: The core philosophical difference is in how they interpret probability. Frequentist statistics treats probability as the long-run frequency of events — parameters are fixed but unknown, and inference is based on sampling distributions (p-values, confidence intervals). Bayesian statistics treats probability as a degree of belief — parameters have probability distributions, and inference combines prior beliefs with observed data using Bayes' theorem to produce posterior distributions. Practical differences: Bayesian methods require specifying a prior (which can be controversial), but they provide direct probability statements about parameters ("there is a 95% probability the effect is between X and Y"). Frequentist methods are more established and computationally simpler but can only make indirect probability statements. In industry, Bayesian A/B testing is increasingly popular because its posterior statements remain interpretable when "peeking" at interim results, though repeated looks should still be planned for, rather than assumed harmless.
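A minimal Bayesian A/B sketch using Beta-Binomial conjugacy; the conversion counts and the uniform Beta(1, 1) prior are illustrative assumptions.

```python
# Sketch: Bayesian A/B test for conversion rates via Beta-Binomial conjugacy.
import numpy as np

rng = np.random.default_rng(8)

# Observed data: conversions / visitors per variant
conv_a, n_a = 120, 1000
conv_b, n_b = 150, 1000

# With a Beta(1, 1) prior, the posterior is
# Beta(1 + conversions, 1 + non-conversions) for each variant
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

# A direct probability statement about the parameters, which the
# frequentist framework cannot make:
p_b_better = (post_b > post_a).mean()
print(p_b_better)  # P(variant B's true rate exceeds A's, given the data)
```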
Q10: When would you use a t-test vs a z-test?
Model Answer: Use a z-test when: (1) the population standard deviation is known, and (2) the sample size is large (n ≥ 30). The z-test uses the standard normal distribution. Use a t-test when: (1) the population standard deviation is unknown and estimated from the sample, or (2) the sample size is small. The t-distribution has heavier tails than the normal distribution, reflecting the additional uncertainty from estimating the standard deviation. As sample size grows, the t-distribution converges to the normal distribution, so the distinction becomes negligible for large samples. In practice at tech companies, you almost always use t-tests because the true population standard deviation is rarely known, even though with millions of users the difference from a z-test is negligible. Variants include the one-sample t-test, independent two-sample t-test, and paired t-test, each suited for different experimental designs.
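The three t-test variants mentioned above map to three SciPy functions; the simulated before/after data is illustrative.

```python
# Sketch: one-sample, independent two-sample (Welch), and paired t-tests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
before = rng.normal(100, 10, 40)            # e.g. metric before a change
after = before + rng.normal(2, 5, 40)       # same users after, with a small lift
other_group = rng.normal(103, 10, 40)       # an independent group

t1, p1 = stats.ttest_1samp(before, popmean=100)                  # one-sample
t2, p2 = stats.ttest_ind(before, other_group, equal_var=False)   # independent (Welch)
t3, p3 = stats.ttest_rel(before, after)                          # paired
print(p1, p2, p3)
```

The paired test is the right choice for before/after measurements on the same users, because pairing removes between-user variance.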
Q11: What is statistical power and why does it matter?
Model Answer: Statistical power is the probability that a test correctly rejects a false null hypothesis — in other words, the probability of detecting a real effect when it exists. Power = 1 - β, where β is the Type II error rate. Convention is to aim for power ≥ 0.80, meaning an 80% chance of detecting the effect. Power depends on four factors: (1) sample size — larger samples increase power, (2) effect size — larger effects are easier to detect, (3) significance level α — higher α increases power but also increases false positive risk, (4) variance — less noisy data increases power. Power analysis should be done before an experiment to determine the required sample size. Running underpowered experiments wastes resources and can lead to false negatives, while overpowered experiments waste users and time.
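A power analysis for a two-sample comparison of means can be sketched with the standard normal-approximation formula; the small standardized effect d = 0.2 is an illustrative choice.

```python
# Sketch: required sample size per group via the normal approximation,
# for a two-sided, two-sample test at alpha = 0.05 and power = 0.80.
import math
from scipy import stats

alpha, power, d = 0.05, 0.80, 0.2

z_alpha = stats.norm.ppf(1 - alpha / 2)  # ≈ 1.96
z_beta = stats.norm.ppf(power)           # ≈ 0.84

# n per group for a standardized effect size d
n_per_group = math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)
print(n_per_group)  # 393 per group
```

Note how the formula encodes the tradeoffs from the answer above: halving the effect size quadruples the required sample size.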
Q12: What is the Law of Large Numbers?
Model Answer: The Law of Large Numbers (LLN) states that as the sample size increases, the sample mean converges to the population mean. There are two forms: the weak LLN says the sample mean converges in probability, and the strong LLN says it converges almost surely. Intuitively, the more data you collect, the closer your sample average gets to the true average. This is different from the Central Limit Theorem, which describes the shape of the sampling distribution (it becomes normal), whereas LLN describes the convergence of the sample mean to the true mean. In practice, LLN justifies why we trust averages computed from large datasets and why larger A/B tests give more reliable estimates of the true treatment effect.
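The LLN is easy to see numerically by tracking the running mean of simulated die rolls (true mean 3.5).

```python
# Sketch: the running sample mean of fair die rolls converging to 3.5 (LLN).
import numpy as np

rng = np.random.default_rng(10)
rolls = rng.integers(1, 7, size=100_000)  # fair six-sided die, true mean 3.5

running_mean = np.cumsum(rolls) / np.arange(1, len(rolls) + 1)
print(running_mean[99])   # running mean after 100 rolls
print(running_mean[-1])   # after 100,000 rolls: very close to 3.5
```

Contrast with the CLT: the LLN says `running_mean[-1]` lands near 3.5; the CLT describes the (normal) shape of its fluctuations around 3.5.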
Q13: What is Simpson's Paradox? Give an example.
Model Answer: Simpson's Paradox occurs when a trend that appears in several groups of data reverses or disappears when the groups are combined. Classic example: a new treatment may appear more effective than the standard treatment in both men and women separately, yet appear less effective when the data is combined. This happens because of a confounding variable (in this case, the proportion of men vs women differing between treatment groups). In data science, this arises frequently when analyzing A/B test results across segments. For example, a feature might increase conversion in both mobile and desktop users individually but appear to decrease overall conversion because it shifts traffic toward the lower-converting platform. The key lesson: always segment your analysis by potential confounders before drawing conclusions from aggregate data.
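The mobile/desktop example can be made concrete with hypothetical counts (all numbers below are invented for illustration): variant B wins within each platform yet loses in aggregate because most of its traffic lands on the lower-converting platform.

```python
# Sketch: a numeric Simpson's paradox with hypothetical conversion counts.
#                 (conversions, visitors)
a = {"desktop": (90, 1000), "mobile": (30, 1000)}
b = {"desktop": (20, 200),  "mobile": (60, 1800)}

def rate(conv, n):
    return conv / n

for platform in ("desktop", "mobile"):
    ra, rb = rate(*a[platform]), rate(*b[platform])
    print(platform, ra, rb, rb > ra)  # B converts better within each segment

total_a = rate(90 + 30, 2000)  # 0.06
total_b = rate(20 + 60, 2000)  # 0.04
print(total_a, total_b, total_b > total_a)  # yet B looks worse overall
```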
Q14: What is the difference between correlation and causation?
Model Answer: Correlation measures the statistical association between two variables — when one changes, the other tends to change as well. Causation means that changes in one variable directly produce changes in another. Correlation does not imply causation for three reasons: (1) confounding variables — a third variable may cause both (e.g., ice cream sales and drowning deaths both increase in summer because of heat, not because of each other), (2) reverse causality — the direction of cause may be opposite to what you assume, (3) coincidence — with enough variables, spurious correlations are inevitable. To establish causation, you need: randomized controlled experiments (A/B tests), instrumental variables, difference-in-differences, regression discontinuity, or other causal inference techniques. In data science, this distinction is critical: observational data shows correlations, but only properly designed experiments can demonstrate causation.
Q15: How do you handle multiple comparisons / the multiple testing problem?
Model Answer: When you perform multiple statistical tests simultaneously, the probability of at least one false positive increases dramatically. With 20 independent tests at α = 0.05, the probability of at least one false positive is 1 - (0.95)^20 ≈ 64%. Solutions include: (1) Bonferroni correction — divide α by the number of tests; simple but very conservative. (2) Holm-Bonferroni — a step-down procedure that is more powerful than Bonferroni while still controlling family-wise error rate. (3) Benjamini-Hochberg (FDR control) — controls the expected proportion of false discoveries among rejected hypotheses; less conservative and commonly used in genomics and large-scale A/B testing. (4) Pre-registration — specifying which metrics you will test before seeing the data reduces the temptation to cherry-pick. In practice at tech companies, if you are testing 5 metrics in an A/B test, you should apply FDR correction or designate one primary metric and treat others as exploratory.
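Bonferroni and Benjamini-Hochberg are short enough to implement directly; the p-values below are invented for illustration and are already sorted ascending.

```python
# Sketch: Bonferroni vs Benjamini-Hochberg on a set of (illustrative) p-values.
import numpy as np

p_values = np.array([0.001, 0.008, 0.012, 0.028, 0.031, 0.075, 0.09, 0.2])
alpha, m = 0.05, len(p_values)

# Bonferroni: compare every p-value to alpha / m
bonferroni_rejects = p_values < alpha / m

# Benjamini-Hochberg (step-up): find the largest k with p_(k) <= (k/m) * alpha,
# then reject hypotheses 1..k (p_values here are already sorted)
thresholds = (np.arange(1, m + 1) / m) * alpha
below = np.nonzero(p_values <= thresholds)[0]
k = below.max() + 1 if below.size else 0
bh_rejects = np.zeros(m, dtype=bool)
bh_rejects[:k] = True

print(bonferroni_rejects.sum(), bh_rejects.sum())  # 1 vs 5: BH rejects more
```

Note the step-up behavior: p = 0.028 exceeds its own threshold (0.025) yet is still rejected, because a later p-value (0.031 ≤ 0.03125) clears its threshold.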
Pro Tip: In interviews, always mention the practical implications alongside the theoretical answer. For statistics questions, give a real-world data science example to show you understand when and why a concept matters, not just what it is.
Lilly Tech Systems