Intermediate

A/B Testing Questions

12 essential A/B testing interview questions and model answers covering experiment design, sample size calculation, statistical significance, multi-armed bandits, and common pitfalls that trip up candidates.

Q1: Walk me through how you would design an A/B test from scratch.

💡
Model Answer: I follow a structured 7-step process: (1) Define the hypothesis — clearly state the change, the expected effect, and the direction (e.g., "Adding social proof badges will increase checkout conversion by at least 2%"). (2) Choose the primary metric (OEC — Overall Evaluation Criterion) and guardrail metrics that should not degrade. (3) Determine the randomization unit — typically user-level, but sometimes session or page-level depending on the feature. (4) Calculate the required sample size using power analysis (significance level, power, minimum detectable effect, baseline variance). (5) Run the experiment for the pre-determined duration, resisting the urge to peek and stop early. (6) Analyze results — check for statistical significance, practical significance, and segment effects. (7) Make a decision — ship, iterate, or kill the feature, documenting learnings regardless of outcome.

Q2: How do you calculate the required sample size for an A/B test?

💡
Model Answer: Sample size depends on four inputs: (1) Significance level α (typically 0.05), (2) Statistical power 1-β (typically 0.80), (3) Minimum detectable effect (MDE) — the smallest effect you care about detecting, and (4) Baseline variance of the metric. For a two-proportion z-test (e.g., conversion rate): n per group ≈ (Z₁₋α/₂ + Z₁₋β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₂ - p₁)². For a 5% baseline conversion rate and 5% relative MDE (absolute MDE = 0.25%), this works out to roughly 122,000 users per group (about 245,000 total). Key tradeoffs: smaller MDE requires more samples, higher power requires more samples, higher confidence requires more samples. In practice, I use a power calculator and then add 10-20% for attrition and potential variance inflation.
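The formula above is easy to run directly. A minimal stdlib-only sketch (the function name is my own, and it omits the 10-20% buffer the answer recommends adding on top):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-proportion z-test, using
    n = (z_{1-a/2} + z_{1-b})^2 * (p1*q1 + p2*q2) / (p2 - p1)^2."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = z(power)            # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 5% baseline, 5% relative MDE => p2 = 5.25%, absolute MDE = 0.25%
n = sample_size_per_group(0.05, 0.0525)
print(n)  # roughly 122,000 per group, before any attrition buffer
```

Halving the relative MDE roughly quadruples the required n, which is why the MDE choice dominates experiment duration.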

Q3: What is the problem with "peeking" at A/B test results?

💡
Model Answer: Peeking refers to repeatedly checking A/B test results before the planned sample size is reached and stopping as soon as significance is achieved. This dramatically inflates the false positive rate. With a planned α = 0.05, peeking daily over a 30-day test can push the actual false positive rate to 20-30%. This happens because each peek is essentially an additional hypothesis test, and the more times you test, the higher the chance of seeing a false positive (multiple comparisons problem). Solutions: (1) Sequential testing methods (e.g., O'Brien-Fleming boundaries) that adjust significance thresholds to account for multiple looks. (2) Bayesian A/B testing with decisions based on posterior probabilities, which is more robust to continuous monitoring (though optional stopping can still bias Bayesian decisions, so it is a mitigation, not a cure). (3) Pre-commit to a fixed sample size and do not analyze until you reach it. Most modern experimentation platforms (like Optimizely or Statsig) implement sequential testing by default to handle this.
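The inflation is easy to demonstrate with a quick Monte Carlo sketch on synthetic A/A data (no true effect); the simulation parameters here are illustrative:

```python
import random
from math import sqrt

random.seed(42)

def simulate_peeking(n_sims=5000, days=30, users_per_day=100, z_crit=1.96):
    """Estimate the false positive rate of an A/A test when checking
    significance at every daily peek vs only once at the planned end."""
    any_peek_fp = 0
    final_fp = 0
    for _ in range(n_sims):
        s_treat = s_ctrl = 0.0          # cumulative metric sums per arm
        significant_at_any_peek = False
        for day in range(1, days + 1):
            # daily total of users_per_day unit-variance observations
            s_treat += random.gauss(0, sqrt(users_per_day))
            s_ctrl += random.gauss(0, sqrt(users_per_day))
            z = (s_treat - s_ctrl) / sqrt(2 * day * users_per_day)
            if abs(z) > z_crit:
                significant_at_any_peek = True
        any_peek_fp += significant_at_any_peek
        final_fp += abs(z) > z_crit     # z from the final day only
    return any_peek_fp / n_sims, final_fp / n_sims

peek_rate, fixed_rate = simulate_peeking()
print(f"daily peeking: {peek_rate:.1%} false positives")
print(f"fixed horizon: {fixed_rate:.1%} false positives")
```

The fixed-horizon rate stays near the nominal 5%, while "significant at any of 30 peeks" lands in the 20-30% range the answer describes.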

Q4: What is the novelty effect and how do you account for it?

💡
Model Answer: The novelty effect occurs when users engage more with a new feature simply because it is new and different, not because it is actually better. This inflates early experiment results and can lead to false positives. Similarly, there is a change aversion effect where users engage less with changes simply because they are unfamiliar. To account for these: (1) Run experiments longer (at least 2-3 weeks) to let the novelty wear off and observe steady-state behavior. (2) Segment by new vs returning users — new users have no baseline to compare against, so they are less affected by novelty. (3) Plot the treatment effect over time — if the effect diminishes steadily, novelty is likely a factor. (4) Use a holdout group — keep a small percentage unexposed for long-term measurement. At companies like Netflix and LinkedIn, the standard practice is to discount early results and focus on the steady-state treatment effect.
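Point (3) can be made concrete with a small helper; the daily lift numbers below are invented purely to show the decay pattern:

```python
# hypothetical daily treatment-minus-control lift, in percentage points
daily_lift = [2.1, 1.8, 1.6, 1.3, 1.1, 0.9, 0.8, 0.7, 0.7, 0.6,
              0.6, 0.6, 0.5, 0.6, 0.5]

def early_vs_steady_state(lifts, window=5):
    """Compare the first-window average lift with the trailing-window
    average; a large gap suggests novelty rather than a durable effect."""
    early = sum(lifts[:window]) / window
    late = sum(lifts[-window:]) / window
    return early, late

early, late = early_vs_steady_state(daily_lift)
print(f"early lift {early:.2f}pp vs steady-state {late:.2f}pp")
```

Here the first-week lift is roughly triple the steady-state lift, which is exactly the signature that should make you discount the early read.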

Q5: What are network effects and how do they violate A/B testing assumptions?

💡
Model Answer: A/B testing assumes that the treatment on one user does not affect outcomes for other users (the Stable Unit Treatment Value Assumption, or SUTVA). Network effects violate this assumption. Example: if you are testing a new messaging feature, users in the treatment group may message users in the control group, exposing them to the treatment indirectly. This causes the treatment effect to "leak" into the control group, biasing your estimate toward zero. Solutions: (1) Cluster randomization — randomize at the level of social clusters or geographic regions instead of individual users, so the treatment and control groups have minimal interaction. (2) Ego-network randomization — randomize based on a user and their immediate connections. (3) Switchback experiments — alternate between treatment and control over time periods (common for marketplace experiments at Uber and Lyft). Each approach has tradeoffs between reducing interference and maintaining statistical power.
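A minimal sketch of cluster-level assignment via deterministic hashing, so every user in a cluster lands in the same arm (the experiment name and city keys are hypothetical):

```python
import hashlib

def assign_cluster(cluster_id: str, experiment: str,
                   treat_fraction: float = 0.5) -> str:
    """Deterministically assign an entire cluster (a city, a social
    community, ...) to one arm so connected users share an assignment."""
    digest = hashlib.sha256(f"{experiment}:{cluster_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "treatment" if bucket < treat_fraction else "control"

# every user in the same city sees the same variant
for city in ["austin", "boston", "denver"]:
    print(city, assign_cluster(city, "new_messaging_v1"))
```

Salting the hash with the experiment name keeps assignments independent across experiments. The cost, as noted above, is statistical power: the effective sample size is closer to the number of clusters than the number of users.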

Q6: What is a multi-arm bandit and when would you use it instead of a traditional A/B test?

💡
Model Answer: A multi-armed bandit (MAB) is an adaptive experimentation approach that dynamically allocates traffic to better-performing variants over time, balancing exploration (learning) with exploitation (maximizing reward). Unlike traditional A/B testing which splits traffic 50/50 for a fixed duration, MAB algorithms (like Thompson Sampling or Upper Confidence Bound) gradually shift traffic toward the winning variant. Use MAB when: (1) opportunity cost is high — showing a worse variant to 50% of users for weeks is expensive, (2) you have many variants — testing 10 headlines with equal splits wastes traffic, (3) the context changes over time — contextual bandits can adapt. Stick with traditional A/B tests when: (1) you need rigorous causal inference with clear statistical guarantees, (2) you need to measure long-term effects, (3) stakeholders require traditional p-values and confidence intervals. Most companies use A/B tests for product decisions and bandits for optimization problems like ad ranking or content personalization.
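A compact simulation of Bernoulli Thompson Sampling on synthetic conversion data (the rates and user count are illustrative, not from any real experiment):

```python
import random

random.seed(7)

def thompson_sampling(true_rates, n_users=20_000):
    """Bernoulli Thompson Sampling: each arm keeps a Beta(wins + 1,
    losses + 1) posterior; traffic drifts toward the best arm."""
    n_arms = len(true_rates)
    wins = [0] * n_arms
    losses = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_users):
        # sample a plausible conversion rate from each posterior, pick the max
        draws = [random.betavariate(wins[i] + 1, losses[i] + 1)
                 for i in range(n_arms)]
        i = draws.index(max(draws))
        pulls[i] += 1
        if random.random() < true_rates[i]:  # simulate this user's conversion
            wins[i] += 1
        else:
            losses[i] += 1
    return pulls

pulls = thompson_sampling([0.03, 0.04, 0.06])  # arm 2 is truly best
print(pulls)  # the last arm should end up receiving most of the traffic
```

This shows the exploration/exploitation tradeoff directly: the losing arms still get some traffic (exploration), but far less than a fixed 1/3 split would give them.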

Q7: Your A/B test shows a statistically significant result but a tiny effect size. What do you do?

💡
Model Answer: Statistical significance alone is not sufficient for a launch decision. I would evaluate practical significance by considering: (1) Business impact — even a 0.01% increase in conversion might be worth millions annually for a large platform, so compute the expected revenue impact. (2) Implementation cost — if the feature adds complexity, maintenance burden, or latency, a tiny improvement may not justify it. (3) Guardrail metrics — check if other important metrics degraded (engagement, latency, error rates). (4) Segment analysis — the overall effect might be small because it helps some users a lot while being neutral for others. (5) Long-term effects — some features have small short-term effects but compound over time (e.g., improved recommendations). My recommendation would depend on the full picture: business impact, cost, and risk. I would present the data with confidence intervals and let the product team make an informed decision.
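Point (1) is straightforward to make concrete; every number below is hypothetical:

```python
def annual_revenue_impact(abs_conversion_lift, annual_visits, avg_order_value):
    """Extra annual revenue implied by a small absolute conversion lift."""
    return annual_visits * abs_conversion_lift * avg_order_value

# a 0.01-percentage-point lift on 500M annual visits at a $60 average order
impact = annual_revenue_impact(0.0001, 500_000_000, 60)
print(f"${impact:,.0f} per year")  # a "tiny" effect worth $3M annually
```

Framing the effect in dollars (with its confidence interval similarly converted) usually settles the "is it practically significant?" question faster than arguing about percentage points.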

Q8: How do you handle multiple metrics in an A/B test?

💡
Model Answer: Multiple metrics create two challenges: (1) the multiple comparisons problem (inflated false positive rate), and (2) conflicting results across metrics. My approach: Before the test, designate one primary metric (OEC) that drives the ship/no-ship decision. Define guardrail metrics that must not degrade (e.g., latency, crash rate, revenue). List secondary metrics for deeper understanding. During analysis: apply Bonferroni or Benjamini-Hochberg correction to secondary metrics. If the primary metric is significant and no guardrails are violated, ship. If metrics conflict (e.g., engagement up but revenue down), investigate the tradeoff, consider whether it reflects a short-term vs long-term difference, and escalate to leadership with data. At companies like Microsoft and Google, the OEC is pre-registered and the experimentation platform automatically flags guardrail violations.
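The Benjamini-Hochberg step-up procedure mentioned above is short enough to sketch directly; the p-values are invented for illustration:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of metrics passing the Benjamini-Hochberg step-up
    procedure at false-discovery rate alpha."""
    m = len(p_values)
    by_p = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    # find the largest rank k whose p-value clears the sliding
    # threshold (k/m)*alpha; reject the k smallest p-values
    for rank, i in enumerate(by_p, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    return sorted(by_p[:k])

secondary_p = [0.001, 0.012, 0.021, 0.047, 0.20, 0.74]  # hypothetical
print(benjamini_hochberg(secondary_p))  # → [0, 1, 2]
```

For comparison, Bonferroni at 0.05/6 ≈ 0.0083 would keep only the first metric; BH trades a controlled false-discovery rate for substantially more power, which is why it is the usual choice for secondary metrics.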

Q9: What is Simpson's Paradox in the context of A/B testing?

💡
Model Answer: In A/B testing, Simpson's Paradox occurs when the treatment appears better overall but worse in every segment (or vice versa). This happens when the treatment changes the proportion of users in different segments. Example: a new checkout flow increases conversion for both mobile (5% to 6%) and desktop (10% to 11%) users. But if the new flow also shifts traffic toward mobile (which has lower conversion overall), the aggregate conversion might actually decrease. The paradox arises from a confounding variable (device type) that is correlated with both the treatment and the outcome. To avoid being misled: (1) always segment results by key dimensions (platform, country, user tenure), (2) check if the treatment changes segment composition, (3) use stratified analysis or regression adjustments to control for confounders. If you see conflicting aggregate vs segment results, the segment-level results are usually more trustworthy because they control for the confounder.
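The device-type example above, in code, with hypothetical counts chosen so the flip actually occurs:

```python
# users and conversions per (arm, segment); all numbers are hypothetical
data = {
    ("control",   "mobile"):  (1000,  50),   # 5.0% conversion
    ("control",   "desktop"): (1000, 100),   # 10.0%
    ("treatment", "mobile"):  (1600,  96),   # 6.0%  (better in-segment)
    ("treatment", "desktop"): ( 400,  44),   # 11.0% (better in-segment)
}

overall = {}
for arm in ("control", "treatment"):
    users = sum(u for (a, _), (u, _c) in data.items() if a == arm)
    convs = sum(c for (a, _), (_u, c) in data.items() if a == arm)
    overall[arm] = convs / users
    print(arm, f"{overall[arm]:.1%}")
# control 7.5% vs treatment 7.0%: the aggregate flips because the
# treatment shifts traffic toward the lower-converting mobile segment
```

Treatment wins inside both segments yet loses in aggregate, which is why checking whether the treatment changes segment composition is part of the standard readout.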

Q10: How do you handle an A/B test when the metric is highly skewed (e.g., revenue)?

💡
Model Answer: Revenue data is typically heavily right-skewed with a few high-value outliers that can dominate the mean and inflate variance, requiring enormous sample sizes. Strategies: (1) Winsorize or cap outliers — replace values above the 99th percentile with the 99th percentile value, which reduces variance without dropping data. (2) Log-transform the metric and compare geometric means instead of arithmetic means. (3) Use non-parametric tests like Mann-Whitney U, which compare ranks rather than means. (4) Use the delta method or ratio metrics (e.g., revenue per session) which can have lower variance. (5) Bootstrap confidence intervals rather than relying on CLT-based intervals, since the CLT convergence is slow for heavily skewed distributions. (6) CUPED (Controlled-experiment Using Pre-Experiment Data) — use pre-experiment behavior as a covariate to reduce variance. At most tech companies, CUPED combined with winsorization is the standard approach for revenue experiments.
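A sketch of strategy (1), winsorization, on synthetic heavy-tailed data (the Pareto draw is just a stand-in for real revenue):

```python
import random

random.seed(0)

def winsorize(values, pct=0.99):
    """Cap every value above the pct-quantile at that quantile."""
    cap = sorted(values)[int(pct * len(values)) - 1]
    return [min(v, cap) for v in values]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# heavy-tailed synthetic "revenue": most users small, a few very large
revenue = [random.paretovariate(1.5) for _ in range(10_000)]
capped = winsorize(revenue)

print(f"raw variance:        {variance(revenue):,.1f}")
print(f"winsorized variance: {variance(capped):,.1f}")
```

Since required sample size scales linearly with variance, the variance reduction from capping the top 1% translates directly into a shorter experiment, at the cost of a small, known bias in the mean.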

Q11: What are guardrail metrics and why are they important?

💡
Model Answer: Guardrail metrics are metrics that you monitor during an A/B test to ensure the experiment is not causing unintended harm, even if the primary metric improves. They fall into two categories: (1) Trust guardrails — sanity checks that ensure the experiment is running correctly. Examples: sample ratio mismatch (SRM) checks to verify randomization is working, and latency metrics to ensure the new code does not slow the site. (2) Business guardrails — important metrics that should not degrade. Example: if testing a new feed algorithm to increase engagement, guardrails might include revenue, ad click-through rate, and user retention. A feature that boosts engagement by 5% but reduces revenue by 3% should not automatically ship. The rule is: if any guardrail is significantly violated, halt the experiment and investigate before proceeding. Pre-define guardrails and their acceptable bounds before launching the experiment.
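The SRM check from (1) is a one-degree-of-freedom chi-square goodness-of-fit test; a stdlib-only sketch with made-up counts:

```python
from math import erfc, sqrt

def srm_check(n_control, n_treatment, expected_split=0.5):
    """Chi-square goodness-of-fit test (1 df) of observed arm counts
    against the planned traffic split; a tiny p-value signals SRM."""
    total = n_control + n_treatment
    exp_c = total * expected_split
    exp_t = total - exp_c
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    p_value = erfc(sqrt(chi2 / 2))  # survival function of chi-square, 1 df
    return chi2, p_value

# planned 50/50 split, but treatment somehow got 800 extra users
chi2, p = srm_check(50_000, 50_800)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # p near 0.01: halt and investigate
```

An 800-user imbalance looks tiny (50.4% vs 49.6%) yet is statistically damning at this scale, which is why SRM alerts are automated rather than eyeballed.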

Q12: What is CUPED and how does it improve A/B testing?

💡
Model Answer: CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique developed at Microsoft. The core idea: use each user's pre-experiment behavior as a covariate to reduce the variance of the treatment effect estimate. Formally, instead of comparing raw metrics Y between treatment and control, CUPED computes an adjusted metric: Ŷ = Y - θ(X - E[X]), where X is the pre-experiment value of the same metric and θ = Cov(X,Y)/Var(X). This adjustment removes variance due to individual differences that existed before the experiment. The result: the same effect can be detected with 20-50% fewer samples, effectively making experiments run faster. CUPED works best when the pre-experiment covariate is highly correlated with the post-experiment metric (which is almost always the case — a user's past engagement strongly predicts their future engagement). It is now standard at Microsoft, Netflix, Uber, and most mature experimentation platforms.
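A minimal CUPED sketch on synthetic data, implementing the Ŷ = Y - θ(X - E[X]) adjustment above (the user model is invented to make the pre/post correlation high):

```python
import random

random.seed(1)

def cuped_adjust(y, x):
    """CUPED: y_adj = y - theta * (x - mean(x)), theta = Cov(x,y)/Var(x),
    where x is the same metric measured before the experiment."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    var = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    theta = cov / var
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

def variance(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

# synthetic users whose post-experiment metric mostly reflects their
# pre-experiment behavior: the high-correlation case where CUPED shines
pre = [random.gauss(10, 3) for _ in range(5_000)]
post = [p + random.gauss(0, 1) for p in pre]
adjusted = cuped_adjust(post, pre)

print(f"variance before CUPED: {variance(post):.2f}")
print(f"variance after CUPED:  {variance(adjusted):.2f}")
```

Note the adjustment changes only the variance, not the mean, so the treatment effect estimate is unbiased while its confidence interval shrinks.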
Pro Tip: In interviews, demonstrating knowledge of practical A/B testing challenges (network effects, novelty, skewed metrics, CUPED) sets you apart from candidates who only know the textbook hypothesis testing procedure. Companies want data scientists who have thought about what goes wrong in real experiments.