Advanced
Pivot & Reshape
Reshaping data between long and wide formats is a critical skill for data analysis and ML feature preparation. These five challenges cover the core reshape operations you will need in interviews.
Why this matters for ML: ML models require data in specific shapes. User-item interaction matrices for recommender systems need pivot. Survey responses need melt. Categorical encoding needs get_dummies. Knowing when and how to reshape data is what separates data engineers from analysts.
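For instance, building a user-item interaction matrix for a recommender is a one-line pivot. A minimal sketch, using a small hypothetical ratings table (the names `ratings`, `user`, `item`, and `rating` are illustrative, not from the lesson dataset):

```python
import pandas as pd

# Hypothetical ratings in long format, one row per interaction
ratings = pd.DataFrame({
    'user': ['u1', 'u1', 'u2', 'u3'],
    'item': ['A', 'B', 'A', 'C'],
    'rating': [5, 3, 4, 2],
})

# Pivot to a user-item matrix; unseen (user, item) pairs become 0
matrix = ratings.pivot_table(index='user', columns='item',
                             values='rating', fill_value=0)
print(matrix)
```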
Shared Dataset for This Lesson
import pandas as pd
import numpy as np
# Survey / experiment results (long format)
results = pd.DataFrame({
'student': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob',
'Charlie', 'Charlie', 'Charlie', 'Diana', 'Diana', 'Diana'],
'subject': ['Math', 'Science', 'English', 'Math', 'Science', 'English',
'Math', 'Science', 'English', 'Math', 'Science', 'English'],
'semester': ['Fall', 'Fall', 'Fall', 'Fall', 'Fall', 'Fall',
'Spring', 'Spring', 'Spring', 'Spring', 'Spring', 'Spring'],
'score': [92, 88, 95, 78, 82, 71, 85, 91, 88, 94, 87, 92],
'grade': ['A', 'B+', 'A', 'C+', 'B', 'C', 'B', 'A-', 'B+', 'A', 'B+', 'A-']
})
Challenge 1: pivot_table() for Aggregated Reshaping
Problem: (a) Create a pivot table showing average score per student (rows) by subject (columns). (b) Add row and column totals (margins). (c) Create a pivot table with multiple aggregations: mean and max score per student-subject combination.
Solution
# (a) Basic pivot table: students as rows, subjects as columns
pivot = pd.pivot_table(
results,
values='score',
index='student',
columns='subject',
aggfunc='mean'
)
print(pivot)
# subject English Math Science
# student
# Alice 95.0 92.0 88.0
# Bob 71.0 78.0 82.0
# Charlie 88.0 85.0 91.0
# Diana 92.0 94.0 87.0
# (b) With margins (row/column totals)
pivot_margins = pd.pivot_table(
results,
values='score',
index='student',
columns='subject',
aggfunc='mean',
margins=True,
margins_name='Average'
)
print(pivot_margins)
# subject English Math Science Average
# student
# Alice 95.0 92.0 88.0 91.67
# Bob 71.0 78.0 82.0 77.00
# Charlie 88.0 85.0 91.0 88.00
# Diana 92.0 94.0 87.0 91.00
# Average 86.5 87.25 87.0 86.92
# (c) Multiple aggregations
pivot_multi = pd.pivot_table(
results,
values='score',
index='student',
columns='subject',
aggfunc=['mean', 'max']
)
print(pivot_multi)
# Produces a MultiIndex on columns: (mean, English), (mean, Math), etc.
# Flatten the column names
pivot_multi.columns = ['_'.join(col) for col in pivot_multi.columns]
print(pivot_multi.head())
# mean_English mean_Math mean_Science max_English max_Math max_Science
# SQL equivalent:
# SELECT student,
# AVG(CASE WHEN subject='Math' THEN score END) AS avg_math,
# AVG(CASE WHEN subject='Science' THEN score END) AS avg_science,
# AVG(CASE WHEN subject='English' THEN score END) AS avg_english
# FROM results GROUP BY student
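A common interview follow-up is the difference between `pivot()` and `pivot_table()`: `pivot()` purely reshapes and raises a `ValueError` when an (index, columns) pair appears more than once, while `pivot_table()` resolves duplicates with an aggregation. A quick sketch on a deliberately duplicated toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'student': ['Alice', 'Alice'],
    'subject': ['Math', 'Math'],   # duplicate (student, subject) pair
    'score': [90, 100],
})

# pivot() cannot choose between 90 and 100, so it raises
try:
    df.pivot(index='student', columns='subject', values='score')
except ValueError as e:
    print(f"pivot failed: {e}")

# pivot_table() resolves the duplicate with an aggregation (mean by default)
print(df.pivot_table(index='student', columns='subject', values='score'))
# Alice / Math -> 95.0
```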
Challenge 2: melt() for Wide to Long
Problem: Given a wide-format DataFrame of monthly revenue by product, (a) convert it to long format using melt(). (b) From the melted data, find the best month for each product. (c) Convert back to wide format to verify the round-trip.
Solution
# Wide-format revenue data
revenue_wide = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Doohickey'],
'Jan': [10000, 8000, 5000],
'Feb': [12000, 7500, 6000],
'Mar': [11000, 9000, 7500],
'Apr': [13000, 8500, 5500]
})
print("Wide format:")
print(revenue_wide)
# (a) Melt to long format
revenue_long = revenue_wide.melt(
id_vars='product', # Column(s) to keep as-is
var_name='month', # Name for the "variable" column
value_name='revenue' # Name for the "value" column
)
print("\nLong format:")
print(revenue_long)
# product month revenue
# 0 Widget Jan 10000
# 1 Gadget Jan 8000
# 2 Doohickey Jan 5000
# 3 Widget Feb 12000
# ...
# (b) Best month for each product
best_month = revenue_long.loc[
revenue_long.groupby('product')['revenue'].idxmax()
][['product', 'month', 'revenue']]
print("\nBest month per product:")
print(best_month)
#      product month  revenue
# 8  Doohickey   Mar     7500
# 7     Gadget   Mar     9000
# 9     Widget   Apr    13000
# (rows follow groupby's alphabetical product order)
# (c) Round-trip: long back to wide
back_to_wide = revenue_long.pivot(
index='product',
columns='month',
values='revenue'
)
# Reorder columns to match original
back_to_wide = back_to_wide[['Jan', 'Feb', 'Mar', 'Apr']].reset_index()
back_to_wide.columns.name = None # Remove the 'month' label from columns
print("\nBack to wide:")
print(back_to_wide)
# Should match revenue_wide exactly
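When wide columns share a stub prefix (e.g. `rev_Jan`, `rev_Feb`), `pd.wide_to_long` can do the melt and the name parsing in one call. A sketch on hypothetical stub-named columns (note that non-numeric suffixes like month names need an explicit `suffix` pattern):

```python
import pandas as pd

# Hypothetical wide data with a 'rev_' stub prefix
df = pd.DataFrame({
    'product': ['Widget', 'Gadget'],
    'rev_Jan': [10000, 8000],
    'rev_Feb': [12000, 7500],
})

long_df = pd.wide_to_long(
    df, stubnames='rev', i='product', j='month',
    sep='_', suffix=r'\w+'   # default suffix pattern is numeric-only
).reset_index()
print(long_df)
```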
Challenge 3: stack() and unstack()
Problem: (a) From the pivot table of student scores by subject, use stack() to convert back to long format. (b) Create a multi-level index (student, semester) and unstack the semester level to see Fall vs Spring side by side. (c) Unstack with fill_value to handle missing combinations.
Solution
# Start with the pivot table from Challenge 1
pivot = pd.pivot_table(results, values='score', index='student', columns='subject')
# (a) stack() converts columns to rows (wide to long)
stacked = pivot.stack()
print("Stacked (Series with MultiIndex):")
print(stacked.head(6))
# student subject
# Alice English 95.0
# Math 92.0
# Science 88.0
# Bob English 71.0
# Math 78.0
# Science 82.0
# dtype: float64
# Convert to DataFrame
stacked_df = stacked.reset_index()
stacked_df.columns = ['student', 'subject', 'score']
print(stacked_df.head())
# (b) Multi-level index with unstack
multi = results.set_index(['student', 'semester', 'subject'])['score']
print("\nMulti-level indexed Series:")
print(multi.head(6))
# Unstack 'semester' to compare Fall vs Spring
semester_comparison = multi.unstack(level='semester')
print("\nFall vs Spring comparison:")
print(semester_comparison)
# semester Fall Spring
# student subject
# Alice English 95.0 NaN
# Math 92.0 NaN
# Science 88.0 NaN
# Bob English 71.0 NaN
# Math 78.0 NaN
# Science 82.0 NaN
# Charlie English NaN 88.0
# ...
# (c) Unstack with fill_value for missing combinations
filled = multi.unstack(level='semester', fill_value=0)
print("\nWith fill_value=0:")
print(filled)
# Now missing Fall/Spring combos show 0 instead of NaN
# Practical use: compute semester-over-semester improvement
# (Only meaningful where both semesters have data)
both_semesters = semester_comparison.dropna().copy()  # .copy() prevents SettingWithCopyWarning below
if len(both_semesters) > 0:
    both_semesters['improvement'] = both_semesters['Spring'] - both_semesters['Fall']
    print(both_semesters)
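`unstack()` accepts a level name or an integer position, and defaults to the innermost level. A self-contained sketch on a reduced version of the lesson data, showing that unstacking different levels produces different comparisons:

```python
import pandas as pd

results = pd.DataFrame({
    'student': ['Alice', 'Alice', 'Bob', 'Bob'],
    'semester': ['Fall', 'Spring', 'Fall', 'Spring'],
    'subject': ['Math', 'Math', 'Math', 'Math'],
    'score': [92, 95, 78, 80],
})
multi = results.set_index(['student', 'semester', 'subject'])['score']

# Levels can be addressed by name or by position; the innermost is the default
by_subject = multi.unstack('subject')   # subjects become columns
by_semester = multi.unstack(1)          # same as level='semester'
print(by_semester)
```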
Challenge 4: crosstab() for Frequency Tables
Problem: (a) Create a frequency table showing how many students received each grade per subject. (b) Normalize the crosstab to show percentages per subject (columns sum to 100%). (c) Add margins and find which subject has the highest percentage of A grades.
Solution
# (a) Basic crosstab: grade frequency per subject
freq = pd.crosstab(results['grade'], results['subject'])
print("Grade frequency per subject:")
print(freq)
# subject  English  Math  Science
# grade
# A              1     2        0
# A-             1     0        1
# B              0     1        1
# B+             1     0        2
# C              1     0        0
# C+             0     1        0
# (b) Normalize by columns (each column sums to 100%)
freq_pct = pd.crosstab(
results['grade'],
results['subject'],
normalize='columns'
) * 100
print("\nGrade distribution per subject (%):")
print(freq_pct.round(1))
# subject  English  Math  Science
# grade
# A           25.0  50.0      0.0
# A-          25.0   0.0     25.0
# B            0.0  25.0     25.0
# B+          25.0   0.0     50.0
# C           25.0   0.0      0.0
# C+           0.0  25.0      0.0
# (c) With margins: find highest % of A grades
freq_margins = pd.crosstab(
results['grade'],
results['subject'],
margins=True,
margins_name='Total'
)
print("\nWith margins:")
print(freq_margins)
# Percentage of A grades per subject
a_grades = results[results['grade'] == 'A']
a_pct = (a_grades.groupby('subject').size()
         / results.groupby('subject').size() * 100).fillna(0)  # subjects with no A grades become 0.0, not NaN
print("\n% of A grades by subject:")
print(a_pct.round(1))
best_subject = a_pct.idxmax()
print(f"Highest A-grade rate: {best_subject} ({a_pct.max():.1f}%)")
# SQL equivalent:
# SELECT subject,
# SUM(CASE WHEN grade = 'A' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS a_pct
# FROM results
# GROUP BY subject
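Two crosstab variations worth knowing: `normalize='index'` makes each row sum to 100% (and `'all'` normalizes the whole table), and passing `values=` with `aggfunc=` turns crosstab into a pivot_table over a measure instead of a count. A sketch on a small toy frame:

```python
import pandas as pd

results = pd.DataFrame({
    'subject': ['Math', 'Math', 'Science', 'Science'],
    'grade': ['A', 'B', 'A', 'A'],
    'score': [95, 82, 91, 88],
})

# normalize='index' makes each ROW sum to 100%
row_pct = pd.crosstab(results['grade'], results['subject'], normalize='index') * 100
print(row_pct)

# With values= and aggfunc=, crosstab computes means instead of counts
mean_scores = pd.crosstab(
    results['grade'], results['subject'],
    values=results['score'], aggfunc='mean'
)
print(mean_scores)
```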
Challenge 5: get_dummies() for One-Hot Encoding
Problem: (a) One-hot encode the 'subject' and 'semester' columns for ML model input. (b) Use drop_first=True to avoid the dummy variable trap (multicollinearity). (c) Given a DataFrame with mixed types, selectively encode only categorical columns and preserve numeric columns unchanged.
Solution
# (a) Basic one-hot encoding
encoded = pd.get_dummies(results, columns=['subject', 'semester'])
print("One-hot encoded columns:")
print(encoded.columns.tolist())
# ['student', 'score', 'grade', 'subject_English', 'subject_Math',
# 'subject_Science', 'semester_Fall', 'semester_Spring']
print(encoded.head(3))
# student score grade subject_English subject_Math subject_Science semester_Fall semester_Spring
# 0 Alice 92 A False True False True False
# 1 Alice 88 B+ False False True True False
# 2 Alice 95 A True False False True False
# (b) Drop first to avoid multicollinearity
# If subject has 3 values, we only need 2 dummy columns
encoded_drop = pd.get_dummies(
results,
columns=['subject', 'semester'],
drop_first=True,
dtype=int # Use 0/1 instead of True/False
)
print("\nWith drop_first (avoids dummy variable trap):")
print(encoded_drop.columns.tolist())
# 'subject_English' dropped, 'semester_Fall' dropped
# If subject_Math=0 and subject_Science=0, it must be English
print(encoded_drop.head(3))
# (c) Selective encoding preserving numeric columns
mixed_data = pd.DataFrame({
'user_id': [1, 2, 3, 4],
'age': [25, 32, 28, 45],
'income': [50000, 75000, 62000, 95000],
'education': ['Bachelor', 'Master', 'PhD', 'Bachelor'],
'city': ['NYC', 'LA', 'NYC', 'Chicago']
})
# Identify categorical columns automatically
cat_cols = mixed_data.select_dtypes(include='object').columns.tolist()
print(f"\nCategorical columns: {cat_cols}")
# Encode only categorical columns
ml_ready = pd.get_dummies(mixed_data, columns=cat_cols, drop_first=True, dtype=int)
print(ml_ready)
# user_id age income education_Master education_PhD city_LA city_NYC
# 0 1 25 50000 0 0 0 1
# 1 2 32 75000 1 0 1 0
# 2 3 28 62000 0 1 0 1
# 3 4 45 95000 0 0 0 0
print(f"\nShape: {ml_ready.shape}")
print("Ready for sklearn: all columns are numeric")
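One production caveat worth raising in an interview: get_dummies encodes whatever categories it sees, so new data with unseen or missing categories produces mismatched columns. A common pattern (sketched below; sklearn's OneHotEncoder is the more robust tool for pipelines) is to reindex the new data to the training columns:

```python
import pandas as pd

train = pd.DataFrame({'city': ['NYC', 'LA', 'NYC']})
new = pd.DataFrame({'city': ['LA', 'Boston']})   # 'Boston' was never seen in training

train_enc = pd.get_dummies(train, columns=['city'], dtype=int)
new_enc = pd.get_dummies(new, columns=['city'], dtype=int)

# Align to training columns: drop unseen dummies, add missing ones as 0
new_aligned = new_enc.reindex(columns=train_enc.columns, fill_value=0)
print(new_aligned)
#    city_LA  city_NYC
# 0        1         0
# 1        0         0
```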
ML interview tip: Always mention the dummy variable trap when encoding categoricals. With k categories, you only need k-1 dummies. Also, for tree-based models (XGBoost, Random Forest), one-hot encoding is often less efficient than ordinal encoding or target encoding. Know when each approach is appropriate.
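As a companion to the tip above, ordinal encoding can be sketched in plain pandas via `pd.Categorical` codes: one integer column replaces k-1 dummies, and passing an explicit category order matters when the categories are genuinely ordered:

```python
import pandas as pd

df = pd.DataFrame({'education': ['Bachelor', 'Master', 'PhD', 'Bachelor']})

# Ordinal encoding: one integer column instead of k-1 dummy columns
order = ['Bachelor', 'Master', 'PhD']
df['education_code'] = pd.Categorical(df['education'],
                                      categories=order, ordered=True).codes
print(df)
#   education  education_code
# 0  Bachelor               0
# 1    Master               1
# 2       PhD               2
# 3  Bachelor               0
```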
Lilly Tech Systems