Advanced

Pivot & Reshape

Reshaping data between long and wide formats is a critical skill for data analysis and ML feature preparation. These five challenges cover the reshape operations you are most likely to need in interviews.

💡
Why this matters for ML: ML models require data in specific shapes. User-item interaction matrices for recommender systems need pivot. Survey responses need melt. Categorical encoding needs get_dummies. Knowing when and how to reshape data is what separates data engineers from analysts.
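To make the first point concrete, here is a minimal sketch of building a user-item matrix with pivot_table (the ratings data is made up for illustration):

```python
import pandas as pd

# Hypothetical interaction log in long format
ratings = pd.DataFrame({
    'user': ['u1', 'u1', 'u2', 'u3'],
    'item': ['A', 'B', 'A', 'C'],
    'rating': [5, 3, 4, 2]
})

# One row per user, one column per item; unrated pairs become 0
matrix = ratings.pivot_table(index='user', columns='item',
                             values='rating', fill_value=0)
print(matrix)
```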

Shared Dataset for This Lesson

import pandas as pd
import numpy as np

# Survey / experiment results (long format)
results = pd.DataFrame({
    'student': ['Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob',
                'Charlie', 'Charlie', 'Charlie', 'Diana', 'Diana', 'Diana'],
    'subject': ['Math', 'Science', 'English', 'Math', 'Science', 'English',
                'Math', 'Science', 'English', 'Math', 'Science', 'English'],
    'semester': ['Fall', 'Fall', 'Fall', 'Fall', 'Fall', 'Fall',
                 'Spring', 'Spring', 'Spring', 'Spring', 'Spring', 'Spring'],
    'score': [92, 88, 95, 78, 82, 71, 85, 91, 88, 94, 87, 92],
    'grade': ['A', 'B+', 'A', 'C+', 'B', 'C', 'B', 'A-', 'B+', 'A', 'B+', 'A-']
})

Challenge 1: pivot_table() for Aggregated Reshaping

📝
Problem: (a) Create a pivot table showing average score per student (rows) by subject (columns). (b) Add row and column totals (margins). (c) Create a pivot table with multiple aggregations: mean and max score per student-subject combination.

Solution

# (a) Basic pivot table: students as rows, subjects as columns
pivot = pd.pivot_table(
    results,
    values='score',
    index='student',
    columns='subject',
    aggfunc='mean'
)
print(pivot)
# subject  English  Math  Science
# student
# Alice       95.0  92.0     88.0
# Bob         71.0  78.0     82.0
# Charlie     88.0  85.0     91.0
# Diana       92.0  94.0     87.0


# (b) With margins (row/column totals)
pivot_margins = pd.pivot_table(
    results,
    values='score',
    index='student',
    columns='subject',
    aggfunc='mean',
    margins=True,
    margins_name='Average'
)
print(pivot_margins)
# subject    English  Math  Science  Average
# student
# Alice         95.0  92.0     88.0    91.67
# Bob           71.0  78.0     82.0    77.00
# Charlie       88.0  85.0     91.0    88.00
# Diana         92.0  94.0     87.0    91.00
# Average       86.5  87.25    87.0    86.92


# (c) Multiple aggregations
pivot_multi = pd.pivot_table(
    results,
    values='score',
    index='student',
    columns='subject',
    aggfunc=['mean', 'max']
)
print(pivot_multi)
# Produces a MultiIndex on columns: (mean, English), (mean, Math), etc.

# Flatten the column names
pivot_multi.columns = ['_'.join(col) for col in pivot_multi.columns]
print(pivot_multi.head())
# mean_English  mean_Math  mean_Science  max_English  max_Math  max_Science

# SQL equivalent:
# SELECT student,
#        AVG(CASE WHEN subject='Math' THEN score END) AS avg_math,
#        AVG(CASE WHEN subject='Science' THEN score END) AS avg_science,
#        AVG(CASE WHEN subject='English' THEN score END) AS avg_english
# FROM results GROUP BY student
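One distinction worth calling out here, since interviewers like it: pivot() is a pure reshape and raises if an (index, column) pair is duplicated, while pivot_table() aggregates duplicates away. A minimal sketch:

```python
import pandas as pd

# Two rows share the same (index, column) pair ('a', 'x')
dup = pd.DataFrame({'k': ['a', 'a'], 'c': ['x', 'x'], 'v': [1, 3]})

try:
    dup.pivot(index='k', columns='c', values='v')
except ValueError as e:
    print(f"pivot() failed: {e}")  # duplicate entries, cannot reshape

# pivot_table() resolves the duplicates by aggregating (mean by default)
print(dup.pivot_table(index='k', columns='c', values='v'))  # ('a', 'x') -> 2.0
```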

Challenge 2: melt() for Wide to Long

📝
Problem: Given a wide-format DataFrame of monthly revenue by product, (a) convert it to long format using melt(). (b) From the melted data, find the best month for each product. (c) Convert back to wide format to verify the round-trip.

Solution

# Wide-format revenue data
revenue_wide = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Doohickey'],
    'Jan': [10000, 8000, 5000],
    'Feb': [12000, 7500, 6000],
    'Mar': [11000, 9000, 7500],
    'Apr': [13000, 8500, 5500]
})
print("Wide format:")
print(revenue_wide)


# (a) Melt to long format
revenue_long = revenue_wide.melt(
    id_vars='product',       # Column(s) to keep as-is
    var_name='month',         # Name for the "variable" column
    value_name='revenue'      # Name for the "value" column
)
print("\nLong format:")
print(revenue_long)
#      product month  revenue
# 0     Widget   Jan    10000
# 1     Gadget   Jan     8000
# 2  Doohickey   Jan     5000
# 3     Widget   Feb    12000
# ...


# (b) Best month for each product
best_month = revenue_long.loc[
    revenue_long.groupby('product')['revenue'].idxmax()
][['product', 'month', 'revenue']]
print("\nBest month per product:")
print(best_month)
#      product month  revenue
# 8  Doohickey   Mar     7500
# 7     Gadget   Mar     9000
# 9     Widget   Apr    13000
# (groupby sorts products alphabetically, hence the row order)


# (c) Round-trip: long back to wide
back_to_wide = revenue_long.pivot(
    index='product',
    columns='month',
    values='revenue'
)
# Reorder columns to match original
back_to_wide = back_to_wide[['Jan', 'Feb', 'Mar', 'Apr']].reset_index()
back_to_wide.columns.name = None  # Remove the 'month' label from columns
print("\nBack to wide:")
print(back_to_wide)
# Should match revenue_wide exactly
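As an aside, when wide columns share a common stub (say rev_Jan, rev_Feb — hypothetical names here), pd.wide_to_long can replace melt and parse the prefix for you. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['Widget', 'Gadget'],
    'rev_Jan': [10000, 8000],
    'rev_Feb': [12000, 7500]
})

# stubnames='rev' strips the common prefix; suffix=r'\w+' allows
# non-numeric suffixes like month names (the default expects digits)
long_df = pd.wide_to_long(df, stubnames='rev', i='product', j='month',
                          sep='_', suffix=r'\w+').reset_index()
print(long_df)
```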

Challenge 3: stack() and unstack()

📝
Problem: (a) From the pivot table of student scores by subject, use stack() to convert back to long format. (b) Create a multi-level index (student, semester) and unstack the semester level to see Fall vs Spring side by side. (c) Unstack with fill_value to handle missing combinations.

Solution

# Start with the pivot table from Challenge 1 (aggfunc defaults to 'mean')
pivot = pd.pivot_table(results, values='score', index='student', columns='subject')

# (a) stack() converts columns to rows (wide to long)
stacked = pivot.stack()
print("Stacked (Series with MultiIndex):")
print(stacked.head(6))
# student  subject
# Alice    English    95.0
#          Math       92.0
#          Science    88.0
# Bob      English    71.0
#          Math       78.0
#          Science    82.0
# dtype: float64

# Convert to DataFrame
stacked_df = stacked.reset_index()
stacked_df.columns = ['student', 'subject', 'score']
print(stacked_df.head())


# (b) Multi-level index with unstack
multi = results.set_index(['student', 'semester', 'subject'])['score']
print("\nMulti-level indexed Series:")
print(multi.head(6))

# Unstack 'semester' to compare Fall vs Spring
semester_comparison = multi.unstack(level='semester')
print("\nFall vs Spring comparison:")
print(semester_comparison)
# semester         Fall  Spring
# student subject
# Alice   English  95.0     NaN
#         Math     92.0     NaN
#         Science  88.0     NaN
# Bob     English  71.0     NaN
#         Math     78.0     NaN
#         Science  82.0     NaN
# Charlie English   NaN    88.0
# ...


# (c) Unstack with fill_value for missing combinations
filled = multi.unstack(level='semester', fill_value=0)
print("\nWith fill_value=0:")
print(filled)
# Now missing Fall/Spring combos show 0 instead of NaN

# Practical use: compute semester-over-semester improvement
# (only meaningful where both semesters have data; here each student
# appears in a single semester, so dropna() leaves an empty frame)
both_semesters = semester_comparison.dropna()
if len(both_semesters) > 0:
    both_semesters['improvement'] = both_semesters['Spring'] - both_semesters['Fall']
    print(both_semesters)
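A useful mental model: stack() and unstack() are inverses, so a stack/unstack round trip should reproduce the original frame. A self-contained sketch (a tiny frame rather than the lesson dataset, so it runs on its own):

```python
import pandas as pd

wide = pd.DataFrame({'Math': [92, 78], 'Science': [88, 82]},
                    index=pd.Index(['Alice', 'Bob'], name='student'))
wide.columns.name = 'subject'

# stack() moves columns into the index; unstack() moves them back
round_trip = wide.stack().unstack()
print(round_trip.equals(wide))  # True
```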

Challenge 4: crosstab() for Frequency Tables

📝
Problem: (a) Create a frequency table showing how many students received each grade per subject. (b) Normalize the crosstab to show percentages per subject (columns sum to 100%). (c) Add margins and find which subject has the highest percentage of A grades.

Solution

# (a) Basic crosstab: grade frequency per subject
freq = pd.crosstab(results['grade'], results['subject'])
print("Grade frequency per subject:")
print(freq)
# subject  English  Math  Science
# grade
# A              1     2        0
# A-             1     0        1
# B              0     1        1
# B+             1     0        2
# C              1     0        0
# C+             0     1        0


# (b) Normalize by columns (each column sums to 100%)
freq_pct = pd.crosstab(
    results['grade'],
    results['subject'],
    normalize='columns'
) * 100
print("\nGrade distribution per subject (%):")
print(freq_pct.round(1))
# subject  English   Math  Science
# grade
# A           25.0   50.0      0.0
# A-          25.0    0.0     25.0
# B            0.0   25.0     25.0
# B+          25.0    0.0     50.0
# C           25.0    0.0      0.0
# C+           0.0   25.0      0.0


# (c) With margins: find highest % of A grades
freq_margins = pd.crosstab(
    results['grade'],
    results['subject'],
    margins=True,
    margins_name='Total'
)
print("\nWith margins:")
print(freq_margins)

# Percentage of A grades per subject
a_grades = results[results['grade'] == 'A']
# fillna(0) keeps subjects with zero A grades (e.g. Science) at 0% instead of NaN
a_pct = (a_grades.groupby('subject').size()
         / results.groupby('subject').size() * 100).fillna(0)
print("\n% of A grades by subject:")
print(a_pct.round(1))
best_subject = a_pct.idxmax()
print(f"Highest A-grade rate: {best_subject} ({a_pct.max():.1f}%)")
# Highest A-grade rate: Math (50.0%)

# SQL equivalent:
# SELECT subject,
#        SUM(CASE WHEN grade = 'A' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS a_pct
# FROM results
# GROUP BY subject

Challenge 5: get_dummies() for One-Hot Encoding

📝
Problem: (a) One-hot encode the 'subject' and 'semester' columns for ML model input. (b) Use drop_first=True to avoid the dummy variable trap (multicollinearity). (c) Given a DataFrame with mixed types, selectively encode only categorical columns and preserve numeric columns unchanged.

Solution

# (a) Basic one-hot encoding
encoded = pd.get_dummies(results, columns=['subject', 'semester'])
print("One-hot encoded columns:")
print(encoded.columns.tolist())
# ['student', 'score', 'grade', 'subject_English', 'subject_Math',
#  'subject_Science', 'semester_Fall', 'semester_Spring']

print(encoded.head(3))
#   student  score grade  subject_English  subject_Math  subject_Science  semester_Fall  semester_Spring
# 0   Alice     92     A            False          True            False           True            False
# 1   Alice     88    B+            False         False             True           True            False
# 2   Alice     95     A             True         False            False           True            False
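Since pandas 1.5 there is also an inverse, pd.from_dummies, which recovers the categorical column from a dummy-only frame (it needs exactly one 1 per row unless default_category is given). A sketch:

```python
import pandas as pd

# A dummy-only frame: each row has exactly one 1
dummies = pd.DataFrame({
    'subject_English': [0, 0, 1],
    'subject_Math': [1, 0, 0],
    'subject_Science': [0, 1, 0]
})

# sep='_' tells pandas how to split the prefix back out (pandas >= 1.5)
recovered = pd.from_dummies(dummies, sep='_')
print(recovered)
#   subject
# 0     Math
# 1  Science
# 2  English
```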


# (b) Drop first to avoid multicollinearity
# If subject has 3 values, we only need 2 dummy columns
encoded_drop = pd.get_dummies(
    results,
    columns=['subject', 'semester'],
    drop_first=True,
    dtype=int  # Use 0/1 instead of True/False
)
print("\nWith drop_first (avoids dummy variable trap):")
print(encoded_drop.columns.tolist())
# 'subject_English' dropped, 'semester_Fall' dropped
# If subject_Math=0 and subject_Science=0, it must be English

print(encoded_drop.head(3))


# (c) Selective encoding preserving numeric columns
mixed_data = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'age': [25, 32, 28, 45],
    'income': [50000, 75000, 62000, 95000],
    'education': ['Bachelor', 'Master', 'PhD', 'Bachelor'],
    'city': ['NYC', 'LA', 'NYC', 'Chicago']
})

# Identify categorical columns automatically
cat_cols = mixed_data.select_dtypes(include='object').columns.tolist()
print(f"\nCategorical columns: {cat_cols}")

# Encode only categorical columns
ml_ready = pd.get_dummies(mixed_data, columns=cat_cols, drop_first=True, dtype=int)
print(ml_ready)
#    user_id  age  income  education_Master  education_PhD  city_LA  city_NYC
# 0        1   25   50000                 0              0        0         1
# 1        2   32   75000                 1              0        1         0
# 2        3   28   62000                 0              1        0         1
# 3        4   45   95000                 0              0        0         0

print(f"\nShape: {ml_ready.shape}")
print("Ready for sklearn: all columns are numeric")
💡
ML interview tip: Always mention the dummy variable trap when encoding categoricals. With k categories, you only need k-1 dummies. Also, for tree-based models (XGBoost, Random Forest), one-hot encoding is often less efficient than ordinal encoding or target encoding. Know when each approach is appropriate.