Intermediate
GroupBy & Aggregation
GroupBy is the most commonly tested pandas operation in data science interviews. It mirrors SQL's GROUP BY and is essential for feature engineering, reporting, and data analysis. These six challenges cover the groupby patterns that come up most often.
Why this matters for ML: Feature engineering for ML models relies heavily on groupby operations: computing per-customer statistics, per-category averages, rolling aggregations for time-series features, and ratio features. Mastering groupby is non-negotiable for any ML engineer.
Shared Dataset for This Lesson
import pandas as pd
import numpy as np
# Sales dataset with multiple dimensions
sales = pd.DataFrame({
'date': pd.to_datetime(['2024-01-05', '2024-01-05', '2024-01-12', '2024-01-12',
'2024-01-19', '2024-01-19', '2024-02-02', '2024-02-02',
'2024-02-09', '2024-02-09', '2024-02-16', '2024-02-16',
'2024-03-01', '2024-03-01', '2024-03-08', '2024-03-08']),
'store': ['NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA',
'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA'],
'category': ['Electronics', 'Clothing', 'Electronics', 'Electronics',
'Clothing', 'Electronics', 'Clothing', 'Clothing',
'Electronics', 'Electronics', 'Clothing', 'Electronics',
'Electronics', 'Clothing', 'Clothing', 'Electronics'],
'revenue': [15000, 8000, 12000, 22000, 9500, 18000, 7000, 11000,
19000, 14000, 8500, 16000, 21000, 9000, 10000, 25000],
'units': [50, 120, 40, 73, 140, 60, 100, 160, 63, 47, 125, 53,
70, 130, 145, 83],
'returns': [3, 8, 2, 5, 10, 4, 7, 12, 4, 3, 9, 4, 5, 11, 13, 6]
})
Challenge 1: Basic agg() with Multiple Functions
Problem: For each store, compute: total revenue, average revenue per transaction, total units sold, and the number of transactions. Sort by total revenue descending.
Solution
# Using agg() with named aggregations: new_column=('source_column', 'function')
store_summary = sales.groupby('store').agg(
total_revenue=('revenue', 'sum'),
avg_revenue=('revenue', 'mean'),
total_units=('units', 'sum'),
num_transactions=('revenue', 'count')
).sort_values('total_revenue', ascending=False)
print(store_summary)
#        total_revenue  avg_revenue  total_units  num_transactions
# store
# LA            123000      15375.0          726                 8
# NYC           102000      12750.0          733                 8
# SQL equivalent:
# SELECT store,
# SUM(revenue) AS total_revenue,
# AVG(revenue) AS avg_revenue,
# SUM(units) AS total_units,
# COUNT(*) AS num_transactions
# FROM sales
# GROUP BY store
# ORDER BY total_revenue DESC
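If you want the result shaped like the SQL output, with the grouping key as an ordinary column rather than the index, pass as_index=False. A minimal sketch on a small stand-in frame (the column names mirror the lesson's dataset):

```python
import pandas as pd

# Small stand-in for the lesson's sales frame
sales = pd.DataFrame({
    'store':   ['NYC', 'LA', 'NYC', 'LA'],
    'revenue': [15000, 8000, 12000, 22000],
    'units':   [50, 120, 40, 73],
})

# as_index=False keeps 'store' as a regular column, like a SQL result set
store_summary = (
    sales.groupby('store', as_index=False)
         .agg(total_revenue=('revenue', 'sum'),
              num_transactions=('revenue', 'count'))
         .sort_values('total_revenue', ascending=False)
)
print(store_summary)
```

This avoids a follow-up reset_index() when the result feeds into a merge or an export.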
Challenge 2: transform() for Group-Level Broadcasting
Problem: (a) Add a column showing each transaction's revenue as a percentage of its store's total revenue. (b) Add a column showing the deviation of each transaction's revenue from its store's mean. (c) Flag transactions where revenue is more than 1 standard deviation above the store mean.
Solution
# transform() returns a Series with the SAME index as the original DataFrame,
# broadcasting the group result back to each row.
# (a) Revenue as percentage of store total
sales['pct_of_store'] = (
sales['revenue'] / sales.groupby('store')['revenue'].transform('sum') * 100
).round(1)
print(sales[['store', 'revenue', 'pct_of_store']].head(6))
#   store  revenue  pct_of_store
# 0   NYC    15000          14.7
# 1    LA     8000           6.5
# 2   NYC    12000          11.8
# 3    LA    22000          17.9
# 4   NYC     9500           9.3
# 5    LA    18000          14.6
# (b) Deviation from store mean
sales['revenue_deviation'] = (
sales['revenue'] - sales.groupby('store')['revenue'].transform('mean')
).round(0)
# (c) Flag above-average transactions (> 1 std above mean)
store_mean = sales.groupby('store')['revenue'].transform('mean')
store_std = sales.groupby('store')['revenue'].transform('std')
sales['is_outlier'] = sales['revenue'] > (store_mean + store_std)
outliers = sales[sales['is_outlier']]
print(outliers[['store', 'category', 'revenue', 'is_outlier']])
#    store     category  revenue  is_outlier
# 3     LA  Electronics    22000        True
# 8    NYC  Electronics    19000        True
# 12   NYC  Electronics    21000        True
# 15    LA  Electronics    25000        True
# KEY INSIGHT: transform() keeps the original DataFrame shape.
# agg() reduces to one row per group.
# Use transform() when you need group stats as new columns.
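The shape difference is easy to see on a toy frame (a sketch; the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b'], 'val': [1, 2, 10]})

# agg() reduces: one row per group
agg_result = df.groupby('grp')['val'].sum()

# transform() broadcasts: one row per original row
tfm_result = df.groupby('grp')['val'].transform('sum')

print(len(agg_result))      # 2 (number of groups)
print(len(tfm_result))      # 3 (same length as df)
print(tfm_result.tolist())  # [3, 3, 10]
```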
Interview favorite: "Add a column showing each row's value as a percentage of its group total" is one of the most frequently asked pandas questions. The answer is always transform('sum') in the denominator. This pattern is equivalent to SQL's SUM(revenue) OVER (PARTITION BY store).
Challenge 3: apply() for Complex Group Operations
Problem: (a) For each store, find the top 2 transactions by revenue. (b) For each store-category combination, compute the return rate (returns/units) and find the combination with the highest return rate. (c) For each store, compute the revenue-weighted average price per unit.
Solution
# (a) Top 2 transactions per store
top2_per_store = (
sales.groupby('store', group_keys=False)
.apply(lambda g: g.nlargest(2, 'revenue'))
)
print(top2_per_store[['store', 'category', 'revenue']])
# store category revenue
# 12 NYC Electronics 21000
# 8 NYC Electronics 19000
# 15 LA Electronics 25000
# 3 LA Electronics 22000
# SQL equivalent:
# SELECT * FROM (
# SELECT *, ROW_NUMBER() OVER (PARTITION BY store ORDER BY revenue DESC) AS rn
# FROM sales
# ) t WHERE rn <= 2
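A common vectorized alternative skips apply() entirely: sort once, then keep the first N rows per group with groupby().head(). A sketch on a toy frame (note that, unlike nlargest, ties are resolved purely by sort order):

```python
import pandas as pd

sales = pd.DataFrame({
    'store':   ['NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA'],
    'revenue': [15000, 8000, 12000, 22000, 19000, 25000],
})

# Sort descending once, then head(2) keeps the first 2 rows of each group
top2 = (
    sales.sort_values('revenue', ascending=False)
         .groupby('store')
         .head(2)
)
print(top2)
```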
# (b) Return rate per store-category
return_rates = (
sales.groupby(['store', 'category'])
.apply(lambda g: pd.Series({
'total_returns': g['returns'].sum(),
'total_units': g['units'].sum(),
'return_rate': (g['returns'].sum() / g['units'].sum() * 100).round(2)
}))
)
print(return_rates.sort_values('return_rate', ascending=False))
#                    total_returns  total_units  return_rate
# store category
# NYC   Clothing              39.0        510.0         7.65
# LA    Clothing              31.0        410.0         7.56
#       Electronics           22.0        316.0         6.96
# NYC   Electronics           14.0        223.0         6.28
worst = return_rates['return_rate'].idxmax()
print(f"Highest return rate: {worst} at {return_rates.loc[worst, 'return_rate']}%")
# (c) Revenue-weighted average price per unit
def weighted_avg_price(group):
"""Revenue-weighted average price = total revenue / total units."""
return group['revenue'].sum() / group['units'].sum()
wavg = sales.groupby('store').apply(weighted_avg_price).round(2)
wavg.name = 'weighted_avg_price'
print(wavg)
# store
# LA     169.42
# NYC    139.15
# Name: weighted_avg_price, dtype: float64
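Because this particular weighted average reduces to a ratio of two group sums, it can also be computed without apply() at all — one built-in aggregation pass, then column math. A sketch on a toy frame:

```python
import pandas as pd

sales = pd.DataFrame({
    'store':   ['NYC', 'NYC', 'LA', 'LA'],
    'revenue': [100.0, 200.0, 300.0, 100.0],
    'units':   [10, 10, 10, 10],
})

# Aggregate the numerator and denominator separately...
sums = sales.groupby('store').agg(
    revenue=('revenue', 'sum'),
    units=('units', 'sum'),
)
# ...then divide the resulting columns (vectorized, no Python-level loop)
wavg = (sums['revenue'] / sums['units']).round(2)
print(wavg)
# NYC: 300/20 = 15.0, LA: 400/20 = 20.0
```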
Challenge 4: Multiple Aggregations on Different Columns
Problem: Create a comprehensive monthly report by store showing: total and mean revenue, total units, return rate, best-selling category (by revenue), and number of distinct categories sold.
Solution
# Extract month for grouping
sales['month'] = sales['date'].dt.to_period('M')
# Approach: combine named aggregation with apply for complex metrics
basic_agg = sales.groupby(['month', 'store']).agg(
total_revenue=('revenue', 'sum'),
mean_revenue=('revenue', 'mean'),
total_units=('units', 'sum'),
total_returns=('returns', 'sum'),
n_categories=('category', 'nunique')
)
# Add return rate
basic_agg['return_rate'] = (
basic_agg['total_returns'] / basic_agg['total_units'] * 100
).round(2)
# Add best category per group using a separate groupby + apply
best_cat = (
sales.groupby(['month', 'store'])
.apply(lambda g: g.groupby('category')['revenue'].sum().idxmax())
)
best_cat.name = 'best_category'
# Combine
report = basic_agg.join(best_cat)
print(report)
#                total_revenue  mean_revenue  total_units  total_returns  \
# month   store
# 2024-01 LA             48000       16000.0          253             17
#         NYC            36500       12166.7          230             15
# 2024-02 LA             41000       13666.7          260             19
#         NYC            34500       11500.0          288             20
# 2024-03 LA             34000       17000.0          213             17
#         NYC            31000       15500.0          215             18
#
#                n_categories  return_rate best_category
# month   store
# 2024-01 LA                2         6.72   Electronics
#         NYC               2         6.52   Electronics
# 2024-02 LA                2         7.31   Electronics
#         NYC               2         6.94   Electronics
# 2024-03 LA                2         7.98   Electronics
#         NYC               2         8.37   Electronics
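The apply-based best-category lookup can also be done with a single grouped sum followed by a per-group idxmax. A sketch on a toy frame (the names are illustrative; .str[-1] extracts the last element of the tuple index that idxmax returns):

```python
import pandas as pd

# Toy frame standing in for the lesson's sales data
sales = pd.DataFrame({
    'month':    ['Jan', 'Jan', 'Jan', 'Feb', 'Feb'],
    'store':    ['NYC', 'NYC', 'LA', 'NYC', 'NYC'],
    'category': ['Elec', 'Cloth', 'Elec', 'Elec', 'Cloth'],
    'revenue':  [100, 50, 80, 30, 90],
})

# One grouped sum per (month, store, category)...
cat_rev = sales.groupby(['month', 'store', 'category'])['revenue'].sum()
# ...then idxmax per (month, store); each result is a (month, store, category)
# tuple, and .str[-1] pulls out the category component
best_cat = cat_rev.groupby(level=['month', 'store']).idxmax().str[-1]
print(best_cat)
```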
Challenge 5: Named Aggregations (Modern Pandas)
Problem: Using the modern named aggregation syntax (pandas ≥ 0.25), compute per-category stats: sum of revenue, mean units, max single-transaction revenue, min single-transaction revenue, and the standard deviation of revenue. Then compute a coefficient of variation (std/mean) to identify which category has more volatile sales.
Solution
# Named aggregation syntax: .agg(new_col_name=('source_col', 'agg_func'))
# This is the modern, recommended approach since pandas 0.25
category_stats = sales.groupby('category').agg(
total_revenue=('revenue', 'sum'),
avg_units=('units', 'mean'),
max_revenue=('revenue', 'max'),
min_revenue=('revenue', 'min'),
std_revenue=('revenue', 'std'),
mean_revenue=('revenue', 'mean')
).round(2)
# Coefficient of variation = std / mean (higher = more volatile)
category_stats['cv'] = (
category_stats['std_revenue'] / category_stats['mean_revenue']
).round(3)
print(category_stats)
#              total_revenue  avg_units  max_revenue  min_revenue  std_revenue  \
# category
# Clothing             63000     131.43        11000         7000      1322.88
# Electronics         162000      59.89        25000        12000      4183.30
#
#              mean_revenue     cv
# category
# Clothing           9000.0  0.147
# Electronics       18000.0  0.232
print(f"\nMore volatile category: Electronics (CV={category_stats.loc['Electronics', 'cv']})")
print("Electronics has higher CV, meaning its sales vary more relative to its mean.")
# NOTE: Named aggregation advantages over dict-based agg:
# 1. Output column names are explicit (no MultiIndex)
# 2. You can apply different functions to the same column
# 3. Code is more readable and self-documenting
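The MultiIndex point is worth seeing side by side: dict-style agg with a list of functions produces two-level column labels, while named aggregation yields flat, explicit names. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'a', 'b'], 'rev': [1.0, 3.0, 10.0]})

# Dict-style with a list of functions -> MultiIndex columns
dict_agg = df.groupby('cat').agg({'rev': ['sum', 'mean']})
print(dict_agg.columns.tolist())   # [('rev', 'sum'), ('rev', 'mean')]

# Named aggregation -> flat, self-documenting column names
named_agg = df.groupby('cat').agg(
    total_rev=('rev', 'sum'),
    avg_rev=('rev', 'mean'),
)
print(named_agg.columns.tolist())  # ['total_rev', 'avg_rev']
```

Flat names matter downstream: MultiIndex columns usually need a manual flattening step before a merge or an export.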
Challenge 6: Custom Aggregation Functions
Problem: (a) Compute the interquartile range (IQR) of revenue per store. (b) For each store, compute the percentage of transactions that exceeded $15,000. (c) For each store, compute a "health score" defined as: (total_revenue / 1000) - (return_rate * 10). Higher is better.
Solution
# (a) IQR per store using a custom function
def iqr(series):
"""Interquartile range: Q3 - Q1."""
return series.quantile(0.75) - series.quantile(0.25)
revenue_iqr = sales.groupby('store')['revenue'].agg(['median', iqr])
print(revenue_iqr)
#        median     iqr
# store
# LA      15000  8500.0
# NYC     11000  6750.0
# (b) Percentage of transactions above $15,000
def pct_above_threshold(series, threshold=15000):
"""Percentage of values above a given threshold."""
return (series > threshold).mean() * 100
high_revenue_pct = sales.groupby('store')['revenue'].agg(pct_above_threshold)
print(f"\n% of transactions above $15K:")
print(high_revenue_pct.round(1))
# store
# LA     50.0
# NYC    25.0
# (c) Health score combining multiple metrics
def health_score(group):
"""Custom health metric combining revenue and return rate."""
total_rev = group['revenue'].sum()
return_rate = group['returns'].sum() / group['units'].sum() * 100
score = (total_rev / 1000) - (return_rate * 10)
return round(score, 2)
scores = sales.groupby('store').apply(health_score)
scores.name = 'health_score'
print(f"\nStore Health Scores:")
print(scores)
# store
# LA     50.00
# NYC    29.69
# Putting it all together: comprehensive custom report
def store_report(group):
return pd.Series({
'total_revenue': group['revenue'].sum(),
'revenue_iqr': iqr(group['revenue']),
'pct_above_15k': pct_above_threshold(group['revenue']),
'return_rate': (group['returns'].sum() / group['units'].sum() * 100).round(2),
'health_score': health_score(group)
})
full_report = sales.groupby('store').apply(store_report)
print(f"\nFull Report:")
print(full_report)
Performance hierarchy: Built-in aggregations ('sum', 'mean') are fastest because they run optimized C code; named aggregation with string function names uses the same fast path. Custom Python functions, whether passed to agg() or apply(), are slowest because they run once per group at Python level. For large datasets, express your logic with built-in agg functions and vectorized math, reserving apply() for logic that truly cannot be vectorized.
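As an example of that rewrite, the health-score logic above can be expressed with one built-in aggregation pass plus vectorized column math instead of a per-group Python function. A sketch on a toy frame (the numbers are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    'store':   ['NYC', 'NYC', 'LA'],
    'revenue': [10000, 20000, 40000],
    'returns': [2, 2, 4],
    'units':   [100, 100, 200],
})

# One fast, built-in aggregation pass...
totals = sales.groupby('store').agg(
    rev=('revenue', 'sum'),
    ret=('returns', 'sum'),
    units=('units', 'sum'),
)
# ...then vectorized math: (total_revenue / 1000) - (return_rate * 10)
health = (totals['rev'] / 1000 - totals['ret'] / totals['units'] * 100 * 10).round(2)
print(health)
# NYC: 30 - 20 = 10.0, LA: 40 - 20 = 20.0
```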