Intermediate
GroupBy & Aggregation
GroupBy is the most commonly tested pandas operation in data science interviews. It mirrors SQL's GROUP BY and is essential for feature engineering, reporting, and data analysis. These six challenges cover the groupby patterns that come up most often.
Why this matters for ML: Feature engineering for ML models relies heavily on groupby operations: computing per-customer statistics, per-category averages, rolling aggregations for time-series features, and ratio features. Mastering groupby is non-negotiable for any ML engineer.
Shared Dataset for This Lesson
import pandas as pd
import numpy as np
# Sales dataset with multiple dimensions
sales = pd.DataFrame({
'date': pd.to_datetime(['2024-01-05', '2024-01-05', '2024-01-12', '2024-01-12',
'2024-01-19', '2024-01-19', '2024-02-02', '2024-02-02',
'2024-02-09', '2024-02-09', '2024-02-16', '2024-02-16',
'2024-03-01', '2024-03-01', '2024-03-08', '2024-03-08']),
'store': ['NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA',
'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA'],
'category': ['Electronics', 'Clothing', 'Electronics', 'Electronics',
'Clothing', 'Electronics', 'Clothing', 'Clothing',
'Electronics', 'Electronics', 'Clothing', 'Electronics',
'Electronics', 'Clothing', 'Clothing', 'Electronics'],
'revenue': [15000, 8000, 12000, 22000, 9500, 18000, 7000, 11000,
19000, 14000, 8500, 16000, 21000, 9000, 10000, 25000],
'units': [50, 120, 40, 73, 140, 60, 100, 160, 63, 47, 125, 53,
70, 130, 145, 83],
'returns': [3, 8, 2, 5, 10, 4, 7, 12, 4, 3, 9, 4, 5, 11, 13, 6]
})
Challenge 1: Basic agg() with Multiple Functions
Problem: For each store, compute: total revenue, average revenue per transaction, total units sold, and the number of transactions. Sort by total revenue descending.
Solution
# Using agg() with named aggregations: new_column=('source_column', 'function')
store_summary = sales.groupby('store').agg(
total_revenue=('revenue', 'sum'),
avg_revenue=('revenue', 'mean'),
total_units=('units', 'sum'),
num_transactions=('revenue', 'count')
).sort_values('total_revenue', ascending=False)
print(store_summary)
#        total_revenue  avg_revenue  total_units  num_transactions
# store
# LA            123000      15375.0          726                 8
# NYC           102000      12750.0          733                 8
# SQL equivalent:
# SELECT store,
# SUM(revenue) AS total_revenue,
# AVG(revenue) AS avg_revenue,
# SUM(units) AS total_units,
# COUNT(*) AS num_transactions
# FROM sales
# GROUP BY store
# ORDER BY total_revenue DESC
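If you want the result shaped like the SQL output, with the grouping key as an ordinary column rather than the index, pass as_index=False. A minimal sketch on a small stand-in frame (the column names mirror the lesson's dataset):

```python
import pandas as pd

# Small stand-in for the lesson's sales frame
sales = pd.DataFrame({
    'store':   ['NYC', 'LA', 'NYC', 'LA'],
    'revenue': [15000, 8000, 12000, 22000],
    'units':   [50, 120, 40, 73],
})

# as_index=False keeps 'store' as a regular column, like a SQL result set
store_summary = (
    sales.groupby('store', as_index=False)
         .agg(total_revenue=('revenue', 'sum'),
              num_transactions=('revenue', 'count'))
         .sort_values('total_revenue', ascending=False)
)
print(store_summary)
```

This avoids a follow-up reset_index() when the result feeds into a merge or an export.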
Challenge 2: transform() for Group-Level Broadcasting
Problem: (a) Add a column showing each transaction's revenue as a percentage of its store's total revenue. (b) Add a column showing the deviation of each transaction's revenue from its store's mean. (c) Flag transactions where revenue is more than 1 standard deviation above the store mean.
Solution
# transform() returns a Series with the SAME index as the original DataFrame,
# broadcasting the group result back to each row.
# (a) Revenue as percentage of store total
sales['pct_of_store'] = (
sales['revenue'] / sales.groupby('store')['revenue'].transform('sum') * 100
).round(1)
print(sales[['store', 'revenue', 'pct_of_store']].head(6))
#   store  revenue  pct_of_store
# 0   NYC    15000          14.7
# 1    LA     8000           6.5
# 2   NYC    12000          11.8
# 3    LA    22000          17.9
# 4   NYC     9500           9.3
# 5    LA    18000          14.6
# (b) Deviation from store mean
sales['revenue_deviation'] = (
sales['revenue'] - sales.groupby('store')['revenue'].transform('mean')
).round(0)
# (c) Flag above-average transactions (> 1 std above mean)
store_mean = sales.groupby('store')['revenue'].transform('mean')
store_std = sales.groupby('store')['revenue'].transform('std')
sales['is_outlier'] = sales['revenue'] > (store_mean + store_std)
outliers = sales[sales['is_outlier']]
print(outliers[['store', 'category', 'revenue', 'is_outlier']])
#    store     category  revenue  is_outlier
# 3     LA  Electronics    22000        True
# 8    NYC  Electronics    19000        True
# 12   NYC  Electronics    21000        True
# 15    LA  Electronics    25000        True
# KEY INSIGHT: transform() keeps the original DataFrame shape.
# agg() reduces to one row per group.
# Use transform() when you need group stats as new columns.
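The shape difference is easy to see on a toy frame (a sketch; the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b'], 'val': [1, 2, 10]})

# agg() reduces: one row per group
agg_result = df.groupby('grp')['val'].sum()

# transform() broadcasts: one row per original row
tfm_result = df.groupby('grp')['val'].transform('sum')

print(len(agg_result))      # 2 (number of groups)
print(len(tfm_result))      # 3 (same length as df)
print(tfm_result.tolist())  # [3, 3, 10]
```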
Interview favorite: "Add a column showing each row's value as a percentage of its group total" is one of the most frequently asked pandas questions. The answer is always transform('sum') in the denominator. This pattern is equivalent to SQL's SUM(revenue) OVER (PARTITION BY store).
Challenge 3: apply() for Complex Group Operations
Problem: (a) For each store, find the top 2 transactions by revenue. (b) For each store-category combination, compute the return rate (returns/units) and find the combination with the highest return rate. (c) For each store, compute the revenue-weighted average price per unit.
Solution
# (a) Top 2 transactions per store
top2_per_store = (
sales.groupby('store', group_keys=False)
.apply(lambda g: g.nlargest(2, 'revenue'))
)
print(top2_per_store[['store', 'category', 'revenue']])
# store category revenue
# 12 NYC Electronics 21000
# 8 NYC Electronics 19000
# 15 LA Electronics 25000
# 3 LA Electronics 22000
# SQL equivalent:
# SELECT * FROM (
# SELECT *, ROW_NUMBER() OVER (PARTITION BY store ORDER BY revenue DESC) AS rn
# FROM sales
# ) t WHERE rn <= 2
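A common vectorized alternative skips apply() entirely: sort once, then keep the first N rows per group with groupby().head(). A sketch on a toy frame (note that, unlike nlargest, ties are resolved purely by sort order):

```python
import pandas as pd

sales = pd.DataFrame({
    'store':   ['NYC', 'LA', 'NYC', 'LA', 'NYC', 'LA'],
    'revenue': [15000, 8000, 12000, 22000, 19000, 25000],
})

# Sort descending once, then head(2) keeps the first 2 rows of each group
top2 = (
    sales.sort_values('revenue', ascending=False)
         .groupby('store')
         .head(2)
)
print(top2)
```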
# (b) Return rate per store-category
return_rates = (
sales.groupby(['store', 'category'])
.apply(lambda g: pd.Series({
'total_returns': g['returns'].sum(),
'total_units': g['units'].sum(),
'return_rate': (g['returns'].sum() / g['units'].sum() * 100).round(2)
}))
)
print(return_rates.sort_values('return_rate', ascending=False))
#                    total_returns  total_units  return_rate
# store category
# NYC   Clothing              39.0        510.0         7.65
# LA    Clothing              31.0        410.0         7.56
#       Electronics           22.0        316.0         6.96
# NYC   Electronics           14.0        223.0         6.28
worst = return_rates['return_rate'].idxmax()
print(f"Highest return rate: {worst} at {return_rates.loc[worst, 'return_rate']}%")
# (c) Revenue-weighted average price per unit
def weighted_avg_price(group):
"""Revenue-weighted average price = total revenue / total units."""
return group['revenue'].sum() / group['units'].sum()
wavg = sales.groupby('store').apply(weighted_avg_price).round(2)
wavg.name = 'weighted_avg_price'
print(wavg)
# store
# LA     169.42
# NYC    139.15
# Name: weighted_avg_price, dtype: float64
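Because this particular weighted average reduces to a ratio of two group sums, it can also be computed without apply() at all — one built-in aggregation pass, then column math. A sketch on a toy frame:

```python
import pandas as pd

sales = pd.DataFrame({
    'store':   ['NYC', 'NYC', 'LA', 'LA'],
    'revenue': [100.0, 200.0, 300.0, 100.0],
    'units':   [10, 10, 10, 10],
})

# Aggregate the numerator and denominator separately...
sums = sales.groupby('store').agg(
    revenue=('revenue', 'sum'),
    units=('units', 'sum'),
)
# ...then divide the resulting columns (vectorized, no Python-level loop)
wavg = (sums['revenue'] / sums['units']).round(2)
print(wavg)
# NYC: 300/20 = 15.0, LA: 400/20 = 20.0
```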
Challenge 4: Multiple Aggregations on Different Columns
Problem: Create a comprehensive monthly report by store showing: total and mean revenue, total units, return rate, best-selling category (by revenue), and number of distinct categories sold.
Solution
# Extract month for grouping
sales['month'] = sales['date'].dt.to_period('M')
# Approach: combine named aggregation with apply for complex metrics
basic_agg = sales.groupby(['month', 'store']).agg(
total_revenue=('revenue', 'sum'),
mean_revenue=('revenue', 'mean'),
total_units=('units', 'sum'),
total_returns=('returns', 'sum'),
n_categories=('category', 'nunique')
)
# Add return rate
basic_agg['return_rate'] = (
basic_agg['total_returns'] / basic_agg['total_units'] * 100
).round(2)
# Add best category per group using a separate groupby + apply
best_cat = (
sales.groupby(['month', 'store'])
.apply(lambda g: g.groupby('category')['revenue'].sum().idxmax())
)
best_cat.name = 'best_category'
# Combine
report = basic_agg.join(best_cat)
print(report)
#                total_revenue  mean_revenue  total_units  total_returns  \
# month   store
# 2024-01 LA             48000       16000.0          253             17
#         NYC            36500       12166.7          230             15
# 2024-02 LA             41000       13666.7          260             19
#         NYC            34500       11500.0          288             20
# 2024-03 LA             34000       17000.0          213             17
#         NYC            31000       15500.0          215             18
#
#                n_categories  return_rate best_category
# month   store
# 2024-01 LA                2         6.72   Electronics
#         NYC               2         6.52   Electronics
# 2024-02 LA                2         7.31   Electronics
#         NYC               2         6.94   Electronics
# 2024-03 LA                2         7.98   Electronics
#         NYC               2         8.37   Electronics
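The apply-based best-category lookup can also be done with a single grouped sum followed by a per-group idxmax. A sketch on a toy frame (the names are illustrative; .str[-1] extracts the last element of the tuple index that idxmax returns):

```python
import pandas as pd

# Toy frame standing in for the lesson's sales data
sales = pd.DataFrame({
    'month':    ['Jan', 'Jan', 'Jan', 'Feb', 'Feb'],
    'store':    ['NYC', 'NYC', 'LA', 'NYC', 'NYC'],
    'category': ['Elec', 'Cloth', 'Elec', 'Elec', 'Cloth'],
    'revenue':  [100, 50, 80, 30, 90],
})

# One grouped sum per (month, store, category)...
cat_rev = sales.groupby(['month', 'store', 'category'])['revenue'].sum()
# ...then idxmax per (month, store); each result is a (month, store, category)
# tuple, and .str[-1] pulls out the category component
best_cat = cat_rev.groupby(level=['month', 'store']).idxmax().str[-1]
print(best_cat)
```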
Challenge 5: Named Aggregations (Modern Pandas)
Problem: Using the modern named aggregation syntax (pandas ≥ 0.25), compute per-category stats: sum of revenue, mean units, max single-transaction revenue, min single-transaction revenue, and the standard deviation of revenue. Then compute a coefficient of variation (std/mean) to identify which category has more volatile sales.
Solution
# Named aggregation syntax: .agg(new_col_name=('source_col', 'agg_func'))
# This is the modern, recommended approach since pandas 0.25
category_stats = sales.groupby('category').agg(
total_revenue=('revenue', 'sum'),
avg_units=('units', 'mean'),
max_revenue=('revenue', 'max'),
min_revenue=('revenue', 'min'),
std_revenue=('revenue', 'std'),
mean_revenue=('revenue', 'mean')
).round(2)
# Coefficient of variation = std / mean (higher = more volatile)
category_stats['cv'] = (
category_stats['std_revenue'] / category_stats['mean_revenue']
).round(3)
print(category_stats)
#              total_revenue  avg_units  max_revenue  min_revenue  std_revenue  \
# category
# Clothing             63000     131.43        11000         7000      1322.88
# Electronics         162000      59.89        25000        12000      4183.30
#
#              mean_revenue     cv
# category
# Clothing           9000.0  0.147
# Electronics       18000.0  0.232
print(f"\nMore volatile category: Electronics (CV={category_stats.loc['Electronics', 'cv']})")
print("Electronics has higher CV, meaning its sales vary more relative to its mean.")
# NOTE: Named aggregation advantages over dict-based agg:
# 1. Output column names are explicit (no MultiIndex)
# 2. You can apply different functions to the same column
# 3. Code is more readable and self-documenting
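The MultiIndex point is worth seeing side by side: dict-style agg with a list of functions produces two-level column labels, while named aggregation yields flat, explicit names. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'a', 'b'], 'rev': [1.0, 3.0, 10.0]})

# Dict-style with a list of functions -> MultiIndex columns
dict_agg = df.groupby('cat').agg({'rev': ['sum', 'mean']})
print(dict_agg.columns.tolist())   # [('rev', 'sum'), ('rev', 'mean')]

# Named aggregation -> flat, self-documenting column names
named_agg = df.groupby('cat').agg(
    total_rev=('rev', 'sum'),
    avg_rev=('rev', 'mean'),
)
print(named_agg.columns.tolist())  # ['total_rev', 'avg_rev']
```

Flat names matter downstream: MultiIndex columns usually need a manual flattening step before a merge or an export.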
Challenge 6: Custom Aggregation Functions
Problem: (a) Compute the interquartile range (IQR) of revenue per store. (b) For each store, compute the percentage of transactions that exceeded $15,000. (c) For each store, compute a "health score" defined as: (total_revenue / 1000) - (return_rate * 10). Higher is better.
Solution
# (a) IQR per store using a custom function
def iqr(series):
"""Interquartile range: Q3 - Q1."""
return series.quantile(0.75) - series.quantile(0.25)
revenue_iqr = sales.groupby('store')['revenue'].agg(['median', iqr])
print(revenue_iqr)
#        median     iqr
# store
# LA      15000  8500.0
# NYC     11000  6750.0
# (b) Percentage of transactions above $15,000
def pct_above_threshold(series, threshold=15000):
"""Percentage of values above a given threshold."""
return (series > threshold).mean() * 100
high_revenue_pct = sales.groupby('store')['revenue'].agg(pct_above_threshold)
print(f"\n% of transactions above $15K:")
print(high_revenue_pct.round(1))
# store
# LA     50.0
# NYC    25.0
# (c) Health score combining multiple metrics
def health_score(group):
"""Custom health metric combining revenue and return rate."""
total_rev = group['revenue'].sum()
return_rate = group['returns'].sum() / group['units'].sum() * 100
score = (total_rev / 1000) - (return_rate * 10)
return round(score, 2)
scores = sales.groupby('store').apply(health_score)
scores.name = 'health_score'
print(f"\nStore Health Scores:")
print(scores)
# store
# LA     50.00
# NYC    29.69
# Putting it all together: comprehensive custom report
def store_report(group):
return pd.Series({
'total_revenue': group['revenue'].sum(),
'revenue_iqr': iqr(group['revenue']),
'pct_above_15k': pct_above_threshold(group['revenue']),
'return_rate': (group['returns'].sum() / group['units'].sum() * 100).round(2),
'health_score': health_score(group)
})
full_report = sales.groupby('store').apply(store_report)
print(f"\nFull Report:")
print(full_report)
Performance hierarchy: Built-in aggregations ('sum', 'mean') are fastest because they run optimized C code; named aggregation with string function names uses the same fast path. Custom Python functions, whether passed to agg() or apply(), are slowest because they run once per group at Python level. For large datasets, express your logic with built-in agg functions and vectorized math, reserving apply() for logic that truly cannot be vectorized.
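As an example of that rewrite, the health-score logic above can be expressed with one built-in aggregation pass plus vectorized column math instead of a per-group Python function. A sketch on a toy frame (the numbers are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    'store':   ['NYC', 'NYC', 'LA'],
    'revenue': [10000, 20000, 40000],
    'returns': [2, 2, 4],
    'units':   [100, 100, 200],
})

# One fast, built-in aggregation pass...
totals = sales.groupby('store').agg(
    rev=('revenue', 'sum'),
    ret=('returns', 'sum'),
    units=('units', 'sum'),
)
# ...then vectorized math: (total_revenue / 1000) - (return_rate * 10)
health = (totals['rev'] / 1000 - totals['ret'] / totals['units'] * 100 * 10).round(2)
print(health)
# NYC: 30 - 20 = 10.0, LA: 40 - 20 = 20.0
```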