Intermediate

SQL for Data Science

10 real SQL interview questions with complete solutions. These problems cover window functions, CTEs, self-joins, aggregations, and optimization techniques commonly tested at top tech companies.

Q1: Find users who made purchases on 3 or more consecutive days.

Tables: purchases(user_id, purchase_date, amount)

Solution:
WITH distinct_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
),
dated_purchases AS (
  SELECT user_id, purchase_date,
    purchase_date - ROW_NUMBER() OVER (
      PARTITION BY user_id ORDER BY purchase_date
    ) * INTERVAL '1 day' AS grp
  FROM distinct_days
),
consecutive_groups AS (
  SELECT user_id, grp, COUNT(*) AS consecutive_days
  FROM dated_purchases
  GROUP BY user_id, grp
  HAVING COUNT(*) >= 3
)
SELECT DISTINCT user_id
FROM consecutive_groups;
Explanation: The key technique is the "gaps and islands" pattern. Subtracting a row number from each date makes consecutive dates collapse onto the same group value. For example, dates Jan 1, Jan 2, Jan 3 with row numbers 1, 2, 3 all produce the group Dec 31. We then count rows per group and keep groups with 3+ days. Deduplicating dates in a separate CTE before applying ROW_NUMBER() is essential: if a user buys twice on the same day, the duplicate dates would otherwise receive different row numbers and break the grouping.
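The gaps-and-islands arithmetic can be sanity-checked in an in-memory SQLite database. This is a sketch on invented sample data; SQLite lacks `INTERVAL` arithmetic, so the date shift uses its `date(..., '-N days')` modifier instead.

```python
# Sanity check of the gaps-and-islands trick using SQLite
# (table contents are invented for illustration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (user_id INT, purchase_date TEXT, amount REAL);
INSERT INTO purchases VALUES
  (1, '2025-01-01', 10), (1, '2025-01-02', 5), (1, '2025-01-03', 8),
  (1, '2025-01-03', 2),                       -- duplicate day: must be deduped
  (2, '2025-01-01', 7), (2, '2025-01-03', 4); -- gap on Jan 2: not consecutive
""")

rows = conn.execute("""
WITH distinct_days AS (
  SELECT DISTINCT user_id, purchase_date FROM purchases
),
grouped AS (
  SELECT user_id,
    -- subtract the row number (in days) from the date; consecutive
    -- dates collapse onto the same group value
    date(purchase_date,
         '-' || ROW_NUMBER() OVER (
           PARTITION BY user_id ORDER BY purchase_date) || ' days') AS grp
  FROM distinct_days
)
SELECT user_id FROM grouped
GROUP BY user_id, grp
HAVING COUNT(*) >= 3
""").fetchall()

print(rows)  # only user 1 has a 3-day streak: [(1,)]
```

Note how the duplicate Jan 3 row for user 1 would inflate the streak count if the dedup CTE were removed.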

Q2: Calculate the month-over-month revenue growth rate.

Tables: orders(order_id, order_date, revenue)

Solution:
WITH monthly_revenue AS (
  SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(revenue) AS total_revenue
  FROM orders
  GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
  month,
  total_revenue,
  LAG(total_revenue) OVER (ORDER BY month) AS prev_month_revenue,
  ROUND(100.0 * (total_revenue - LAG(total_revenue) OVER (ORDER BY month))
    / LAG(total_revenue) OVER (ORDER BY month), 2) AS growth_pct
FROM monthly_revenue
ORDER BY month;
Explanation: LAG() is a window function that accesses the previous row's value without a self-join. We first aggregate to monthly revenue, then compute growth as (current - previous) / previous * 100. The first month shows NULL for growth since there is no prior month. In an interview, mention that you would guard against division by zero (e.g. by wrapping the denominator in NULLIF(..., 0)) in case a month has zero revenue.
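The division-by-zero guard mentioned above can be demonstrated in SQLite (revenue figures are invented). SQLite happens to return NULL for division by zero anyway, but the `NULLIF` makes the intent explicit and portable to dialects such as PostgreSQL that raise an error instead.

```python
# Sketch of LAG() plus a NULLIF guard on the denominator, on toy data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly_revenue (month TEXT, total_revenue REAL);
INSERT INTO monthly_revenue VALUES
  ('2025-01', 0), ('2025-02', 100), ('2025-03', 150);
""")

rows = conn.execute("""
SELECT month,
  ROUND(100.0 * (total_revenue - LAG(total_revenue) OVER (ORDER BY month))
    / NULLIF(LAG(total_revenue) OVER (ORDER BY month), 0), 2) AS growth_pct
FROM monthly_revenue
ORDER BY month
""").fetchall()

# First month: NULL (no prior month). Second month: NULL (previous
# revenue was 0, so the guard kicks in). Third month: 50% growth.
print(rows)  # [('2025-01', None), ('2025-02', None), ('2025-03', 50.0)]
```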

Q3: Find the median salary per department.

Tables: employees(emp_id, department, salary)

Solution:
SELECT department,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees
GROUP BY department;

-- Alternative using window functions (works in MySQL):
WITH ranked AS (
  SELECT department, salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary) AS rn,
    COUNT(*) OVER (PARTITION BY department) AS cnt
  FROM employees
)
SELECT department, AVG(salary) AS median_salary
FROM ranked
WHERE rn IN (FLOOR((cnt + 1) / 2.0), CEIL((cnt + 1) / 2.0))
GROUP BY department;
Explanation: PERCENTILE_CONT is the cleanest approach where it is available as an ordered-set aggregate (PostgreSQL, Oracle); SQL Server supports PERCENTILE_CONT only as a window function with an OVER clause. The window function approach works in MySQL 8+: we rank salaries, find the middle position(s), and average them. For even-count groups, we average the two middle values. In interviews, always ask which SQL dialect is expected, then choose the appropriate syntax.
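The window-function median can be verified in SQLite on toy data. One caveat in this sketch: SQLite builds often lack FLOOR/CEIL, so the middle positions are computed with integer division instead — `(cnt + 1) / 2` and `(cnt + 2) / 2` pick out the same row(s) as the FLOOR/CEIL formula.

```python
# Median via ROW_NUMBER + COUNT window functions, on invented salaries.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INT, department TEXT, salary INT);
INSERT INTO employees VALUES
  (1,'eng',100),(2,'eng',200),(3,'eng',300),(4,'eng',400),  -- even count
  (5,'sales',50),(6,'sales',70),(7,'sales',90);             -- odd count
""")

rows = conn.execute("""
WITH ranked AS (
  SELECT department, salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary) AS rn,
    COUNT(*) OVER (PARTITION BY department) AS cnt
  FROM employees
)
SELECT department, AVG(salary) AS median_salary
FROM ranked
-- integer division stands in for FLOOR((cnt+1)/2.0) and CEIL((cnt+1)/2.0)
WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2)
GROUP BY department
ORDER BY department
""").fetchall()

print(rows)  # eng: avg(200, 300) = 250.0; sales: middle value 70.0
```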

Q4: Find the top 3 products by revenue in each category.

Tables: products(product_id, category, product_name), sales(sale_id, product_id, quantity, price)

Solution:
WITH product_revenue AS (
  SELECT p.category, p.product_name,
    SUM(s.quantity * s.price) AS total_revenue
  FROM products p
  JOIN sales s ON p.product_id = s.product_id
  GROUP BY p.category, p.product_name
),
ranked AS (
  SELECT *,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY total_revenue DESC) AS rnk
  FROM product_revenue
)
SELECT category, product_name, total_revenue
FROM ranked
WHERE rnk <= 3
ORDER BY category, rnk;
Explanation: DENSE_RANK() handles ties by giving the same rank to equal values (unlike ROW_NUMBER() which would arbitrarily break ties). If the interviewer wants exactly 3 rows per category regardless of ties, use ROW_NUMBER(). If they want all products tied for third place, use DENSE_RANK(). Always clarify the tie-breaking behavior in interviews — it shows attention to edge cases.
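The tie-breaking difference is easy to show side by side. This sketch uses an invented category with two products tied at the same revenue and compares ROW_NUMBER, DENSE_RANK, and RANK on the same ordering.

```python
# ROW_NUMBER vs DENSE_RANK vs RANK on tied revenue values (toy data).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_revenue (category TEXT, product_name TEXT, total_revenue INT);
INSERT INTO product_revenue VALUES
  ('toys','a',500),('toys','b',400),('toys','c',400),('toys','d',300);
""")

rows = conn.execute("""
SELECT product_name,
  ROW_NUMBER() OVER (ORDER BY total_revenue DESC, product_name) AS row_num,
  DENSE_RANK() OVER (ORDER BY total_revenue DESC) AS dense_rnk,
  RANK()       OVER (ORDER BY total_revenue DESC) AS rnk
FROM product_revenue
ORDER BY total_revenue DESC, product_name
""").fetchall()

# b and c are tied at 400: ROW_NUMBER splits them arbitrarily (2 vs 3),
# DENSE_RANK gives both rank 2 and keeps d at rank 3, while RANK gives
# both rank 2 but skips to 4 for d.
print(rows)  # [('a',1,1,1), ('b',2,2,2), ('c',3,2,2), ('d',4,3,4)]
```

With a `<= 3` filter, DENSE_RANK would include d (rank 3) but RANK would not (rank 4) — exactly the clarification worth raising with an interviewer.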

Q5: Calculate a 7-day rolling average of daily active users.

Tables: user_activity(user_id, activity_date)

Solution:
WITH daily_dau AS (
  SELECT activity_date,
    COUNT(DISTINCT user_id) AS dau
  FROM user_activity
  GROUP BY activity_date
)
SELECT activity_date, dau,
  AVG(dau) OVER (
    ORDER BY activity_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_7day_avg
FROM daily_dau
ORDER BY activity_date;
Explanation: The window frame ROWS BETWEEN 6 PRECEDING AND CURRENT ROW creates a 7-day window (current day + 6 preceding). Important: this frame counts rows, not calendar days — if any dates are missing, the window silently spans more than 7 calendar days, so first generate a complete date spine and LEFT JOIN the activity data onto it. Note that the first 6 days will have a rolling average based on fewer than 7 days; you can filter those out with a QUALIFY clause (BigQuery, Snowflake, DuckDB) or a subquery if needed.
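The date-spine idea can be sketched with a recursive CTE in SQLite (the date range and activity rows are invented). Missing days appear with a DAU of 0 instead of silently vanishing from the frame.

```python
# Generating a date spine with a recursive CTE and LEFT JOINing activity.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_activity (user_id INT, activity_date TEXT);
INSERT INTO user_activity VALUES
  (1,'2025-01-01'),(2,'2025-01-01'),(1,'2025-01-03');  -- no rows on Jan 2
""")

rows = conn.execute("""
WITH RECURSIVE spine(d) AS (
  SELECT '2025-01-01'
  UNION ALL
  SELECT date(d, '+1 day') FROM spine WHERE d < '2025-01-03'
)
SELECT spine.d, COUNT(DISTINCT a.user_id) AS dau
FROM spine
LEFT JOIN user_activity a ON a.activity_date = spine.d
GROUP BY spine.d
ORDER BY spine.d
""").fetchall()

# Jan 2 is present with dau = 0 thanks to the spine + LEFT JOIN.
print(rows)  # [('2025-01-01', 2), ('2025-01-02', 0), ('2025-01-03', 1)]
```

In PostgreSQL the same spine is usually built with `generate_series` rather than a recursive CTE.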

Q6: Find users who churned (active in month N but not in month N+1).

Tables: logins(user_id, login_date)

Solution:
WITH monthly_active AS (
  SELECT DISTINCT user_id,
    DATE_TRUNC('month', login_date) AS active_month
  FROM logins
)
SELECT a.user_id, a.active_month AS last_active_month
FROM monthly_active a
LEFT JOIN monthly_active b
  ON a.user_id = b.user_id
  AND b.active_month = a.active_month + INTERVAL '1 month'
WHERE b.user_id IS NULL
ORDER BY a.active_month, a.user_id;
Explanation: This uses a self-join pattern: for each user-month combination, we check if the same user appears in the next month. A LEFT JOIN with a NULL check on the right side identifies users who were active in one month but absent in the next. This is a classic anti-join pattern. One caveat: users whose last activity falls in the most recent month of the data will always look churned, since the following month has not happened yet — exclude the latest month (or treat it separately) when reporting. An alternative approach uses LEAD(): LEAD(active_month) OVER (PARTITION BY user_id ORDER BY active_month) and checks if the next active month is more than 1 month away.
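The LEAD() alternative mentioned above can be sketched in SQLite on invented login data, using `strftime('%Y-%m', ...)` in place of DATE_TRUNC. A user churns in a month if the next active month is missing or more than one month later.

```python
# Churn detection via LEAD() instead of a self-join (toy data).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INT, login_date TEXT);
INSERT INTO logins VALUES
  (1,'2025-01-05'),(1,'2025-02-10'),   -- active Jan and Feb
  (2,'2025-01-07');                    -- active Jan only: churned after Jan
""")

rows = conn.execute("""
WITH monthly_active AS (
  SELECT DISTINCT user_id, strftime('%Y-%m', login_date) AS active_month
  FROM logins
),
with_next AS (
  SELECT user_id, active_month,
    LEAD(active_month) OVER (
      PARTITION BY user_id ORDER BY active_month) AS next_month
  FROM monthly_active
)
SELECT user_id, active_month
FROM with_next
WHERE next_month IS NULL  -- no later activity at all
   OR next_month > strftime('%Y-%m', date(active_month || '-01', '+1 month'))
ORDER BY user_id, active_month
""").fetchall()

# User 1's February and user 2's January both show up as churn months
# (February is the last month in the data, illustrating the caveat above).
print(rows)  # [(1, '2025-02'), (2, '2025-01')]
```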

Q7: Write a query to compute a cumulative sum of daily revenue, resetting each month.

Tables: daily_revenue(date, revenue)

Solution:
SELECT
  date,
  revenue,
  SUM(revenue) OVER (
    PARTITION BY DATE_TRUNC('month', date)
    ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS mtd_revenue
FROM daily_revenue
ORDER BY date;
Explanation: PARTITION BY DATE_TRUNC('month', date) resets the cumulative sum at the start of each month. The window frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ensures we sum from the first row of the partition to the current row. This gives a running month-to-date (MTD) revenue total. This is a very common analytics query for dashboards and financial reporting. The key insight is that PARTITION BY controls the reset boundary.

Q8: Find pairs of users who both purchased the same product at least 3 times.

Tables: purchases(user_id, product_id, purchase_date)

Solution:
WITH user_product_counts AS (
  SELECT user_id, product_id, COUNT(*) AS purchase_count
  FROM purchases
  GROUP BY user_id, product_id
  HAVING COUNT(*) >= 3
)
SELECT a.user_id AS user_1, b.user_id AS user_2, a.product_id,
  a.purchase_count AS user_1_count, b.purchase_count AS user_2_count
FROM user_product_counts a
JOIN user_product_counts b
  ON a.product_id = b.product_id
  AND a.user_id < b.user_id
ORDER BY a.product_id, a.user_id;
Explanation: The self-join with a.user_id < b.user_id is crucial — it prevents duplicate pairs (user A, user B) and (user B, user A), and also prevents pairing a user with themselves. The CTE pre-filters to users who purchased each product 3+ times, reducing the join size. In an interview, discuss the performance implication: the HAVING clause reduces the data before the expensive self-join, which is much better than filtering after.

Q9: Calculate user retention: what percentage of users who signed up in January were still active in March?

Tables: users(user_id, signup_date), activity(user_id, activity_date)

Solution:
WITH jan_signups AS (
  SELECT user_id
  FROM users
  WHERE signup_date >= '2025-01-01'
    AND signup_date < '2025-02-01'
),
march_active AS (
  SELECT DISTINCT user_id
  FROM activity
  WHERE activity_date >= '2025-03-01'
    AND activity_date < '2025-04-01'
)
SELECT
  COUNT(DISTINCT j.user_id) AS jan_cohort_size,
  COUNT(DISTINCT m.user_id) AS retained_in_march,
  ROUND(100.0 * COUNT(DISTINCT m.user_id) / COUNT(DISTINCT j.user_id), 2)
    AS retention_pct
FROM jan_signups j
LEFT JOIN march_active m ON j.user_id = m.user_id;
Explanation: This is a cohort retention query. We define the cohort (January signups), then LEFT JOIN to March activity. Users with no March activity will have NULL from the LEFT JOIN, and COUNT(DISTINCT m.user_id) correctly ignores NULLs. For a complete retention table across multiple months, you would use CROSS JOIN with a months table and compute retention for each cohort-month combination. In interviews, mention that this is "N-month retention" (specifically 2-month retention here) and discuss how it differs from rolling retention.

Q10: Given a table of page views, find the most common user journey (sequence of 3 consecutive pages).

Tables: page_views(user_id, page, view_time)

Solution:
WITH ordered_views AS (
  SELECT user_id, page,
    LEAD(page, 1) OVER (PARTITION BY user_id ORDER BY view_time) AS page_2,
    LEAD(page, 2) OVER (PARTITION BY user_id ORDER BY view_time) AS page_3
  FROM page_views
)
SELECT page AS step_1, page_2 AS step_2, page_3 AS step_3,
  COUNT(*) AS journey_count
FROM ordered_views
WHERE page_2 IS NOT NULL AND page_3 IS NOT NULL
GROUP BY page, page_2, page_3
ORDER BY journey_count DESC
LIMIT 10;
Explanation: LEAD(page, 1) and LEAD(page, 2) look ahead 1 and 2 rows within each user's ordered page views to create sequences of 3 consecutive pages. The WHERE clause filters out incomplete sequences at the end of each user's session. We then GROUP BY the 3-page sequence and count occurrences. This is a common product analytics query for understanding user flows. In an interview, discuss how you might also want to sessionize the data (group views into sessions based on 30-minute inactivity gaps) before computing journeys.
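The sessionization step mentioned above follows a standard pattern: compute the gap to the previous view with LAG(), flag rows where the gap exceeds 30 minutes (or is NULL) as session starts, then take a running sum of the flags to number the sessions. A sketch on invented data, using SQLite's `julianday` to get the gap in minutes:

```python
# Sessionizing page views on a 30-minute inactivity threshold (toy data).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_views (user_id INT, page TEXT, view_time TEXT);
INSERT INTO page_views VALUES
  (1,'home',   '2025-01-01 10:00:00'),
  (1,'search', '2025-01-01 10:05:00'),
  (1,'product','2025-01-01 12:00:00');  -- > 30 min gap: new session
""")

rows = conn.execute("""
WITH with_gap AS (
  SELECT user_id, page, view_time,
    (julianday(view_time) - julianday(LAG(view_time) OVER (
       PARTITION BY user_id ORDER BY view_time))) * 24 * 60 AS gap_minutes
  FROM page_views
),
flagged AS (
  SELECT *,
    -- first view ever (NULL gap) or a long gap starts a new session
    CASE WHEN gap_minutes IS NULL OR gap_minutes > 30 THEN 1 ELSE 0 END
      AS new_session
  FROM with_gap
)
SELECT user_id, view_time,
  SUM(new_session) OVER (
    PARTITION BY user_id ORDER BY view_time) AS session_id
FROM flagged
ORDER BY user_id, view_time
""").fetchall()

print([r[2] for r in rows])  # session ids: [1, 1, 2]
```

Once `session_id` exists, the LEAD() windows in the main solution would additionally PARTITION BY session_id so journeys never cross a session boundary.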
Pro Tip: In SQL interviews, always talk through your approach before writing code. Mention edge cases (NULLs, duplicates, ties), ask about the SQL dialect (PostgreSQL vs MySQL vs BigQuery), and discuss query performance for large tables. These habits demonstrate production-level thinking.