Intermediate
Merge, Join & Concat
Combining data from multiple sources is a daily task in data science. These 6 challenges cover the join patterns you are most likely to encounter in interviews, from basic inner joins to the trickier anti-join and cross join patterns.
Why this matters for ML: Training data is almost never in a single table. You merge user features with transaction logs, join prediction results with ground truth labels, and concatenate data from multiple time periods. Getting joins wrong means training on corrupted data — the most expensive bug in ML.
Shared Datasets for This Lesson
import pandas as pd
import numpy as np
# Customers table
customers = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 105, 106],
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
'city': ['New York', 'San Francisco', 'Chicago', 'New York', 'Boston', 'Chicago'],
'signup_date': pd.to_datetime(['2023-01-15', '2023-03-22', '2023-06-10',
'2023-02-01', '2023-08-05', '2023-11-20'])
})
# Orders table (some customers have no orders, some order_ids have no customer match)
orders = pd.DataFrame({
'order_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'customer_id': [101, 102, 101, 103, 102, 104, 101, 999, 103, 102],
'amount': [250, 180, 320, 150, 420, 275, 190, 500, 310, 95],
'order_date': pd.to_datetime(['2024-01-05', '2024-01-12', '2024-01-20',
'2024-02-03', '2024-02-15', '2024-02-28',
'2024-03-10', '2024-03-15', '2024-03-22',
'2024-04-01'])
})
# Products table
products = pd.DataFrame({
'product_id': ['P1', 'P2', 'P3', 'P4'],
'product_name': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig'],
'category': ['A', 'B', 'A', 'C']
})
# Regions table
regions = pd.DataFrame({
'region': ['East', 'West', 'Central'],
'manager': ['Smith', 'Jones', 'Brown']
})
Challenge 1: Inner, Left, and Outer Joins
Problem: (a) Find all orders with matching customer details (inner join). (b) Show all customers with their orders; customers without orders should still appear (left join). (c) Show the full picture: all customers and all orders, matched where possible (outer join). Count NaN rows in each case.
Solution
# (a) Inner join: only rows where customer_id exists in both tables
inner = pd.merge(orders, customers, on='customer_id', how='inner')
print(f"Inner join rows: {len(inner)}") # 9 (order with customer_id=999 excluded)
print(inner[['order_id', 'customer_id', 'name', 'amount']].head())
#    order_id  customer_id     name  amount
# 0         1          101    Alice     250
# 1         2          102      Bob     180
# 2         3          101    Alice     320
# 3         4          103  Charlie     150
# 4         5          102      Bob     420
# (an inner merge preserves the row order of the left frame, orders)
# (b) Left join: all customers, matched orders where available
left = pd.merge(customers, orders, on='customer_id', how='left')
print(f"\nLeft join rows: {len(left)}") # 11
# Eve and Frank have NaN for order columns
no_orders = left[left['order_id'].isna()]
print(f"Customers with no orders: {no_orders['name'].tolist()}")
# ['Eve', 'Frank']
# SQL equivalent:
# SELECT c.*, o.*
# FROM customers c
# LEFT JOIN orders o ON c.customer_id = o.customer_id
# (c) Outer join: everything from both sides
outer = pd.merge(customers, orders, on='customer_id', how='outer')
print(f"\nOuter join rows: {len(outer)}") # 12
orphan_orders = outer[outer['name'].isna()]
print(f"Orders with no customer match: {orphan_orders['order_id'].tolist()}")
# [8.0] (customer_id=999; order_id is float here because the unmatched customer rows add NaN, which upcasts the column)
print(f"Customers with no orders: {outer[outer['order_id'].isna()]['name'].tolist()}")
# ['Eve', 'Frank']
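Beyond picking the right how=, it helps to guard against accidental fan-out when a join key is unexpectedly duplicated. A minimal sketch with toy frames (cust and ords are illustrative names, not the lesson's tables) using merge's validate parameter:

```python
import pandas as pd

cust = pd.DataFrame({'customer_id': [1, 2], 'name': ['A', 'B']})
ords = pd.DataFrame({'customer_id': [1, 1, 2], 'amount': [10, 20, 30]})

# Passes: each customer may have many orders, but customer keys are unique
ok = pd.merge(cust, ords, on='customer_id', how='left', validate='one_to_many')
print(len(ok))  # 3

# Fails fast: 'one_to_one' is violated because ords repeats customer_id=1
try:
    pd.merge(cust, ords, on='customer_id', how='left', validate='one_to_one')
except pd.errors.MergeError as exc:
    print(f"Join cardinality check failed: {exc}")
```

When the stated cardinality is violated, merge raises pandas.errors.MergeError instead of silently multiplying rows.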
Challenge 2: Merge on Index
Problem: (a) Set customer_id as index on both DataFrames and join them using the index. (b) Merge a DataFrame with a regular column against another with an index key. (c) After merging, reset the index and compute total spend per customer.
Solution
# (a) Join on index (both sides indexed)
customers_idx = customers.set_index('customer_id')
orders_idx = orders.set_index('customer_id')
# Using .join() - joins on index by default (suffixes only apply if column names overlap; none do here)
joined = customers_idx.join(orders_idx, how='inner', lsuffix='_cust', rsuffix='_ord')
print(joined[['name', 'order_id', 'amount']].head())
# name order_id amount
# customer_id
# 101 Alice 1 250
# 101 Alice 3 320
# 101 Alice 7 190
# 102 Bob 2 180
# 102 Bob 5 420
# (b) Merge column against index
# Left df has customer_id as a column, right has it as index
result = pd.merge(
orders, # customer_id is a column
customers_idx, # customer_id is the index
left_on='customer_id',
right_index=True,
how='left'
)
print(result[['order_id', 'customer_id', 'name', 'amount']].head())
# (c) Reset index and compute total spend
customer_spend = (
joined.reset_index()
.groupby(['customer_id', 'name'])['amount']
.agg(['sum', 'count', 'mean'])
.round(2)
.sort_values('sum', ascending=False)
)
customer_spend.columns = ['total_spend', 'order_count', 'avg_order']
print(customer_spend)
#                      total_spend  order_count  avg_order
# customer_id name
# 101         Alice            760            3     253.33
# 102         Bob              695            3     231.67
# 103         Charlie          460            2     230.00
# 104         Diana            275            1     275.00
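The lsuffix/rsuffix arguments above never actually fire, because customers and orders share no non-key column names. A quick sketch with throwaway frames of what suffixes do when names really collide:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'value': [10, 20]})
right = pd.DataFrame({'key': [1, 2], 'value': [100, 200]})

# Non-key columns with the same name get suffixes so neither is silently lost
merged = pd.merge(left, right, on='key', suffixes=('_left', '_right'))
print(merged.columns.tolist())  # ['key', 'value_left', 'value_right']
```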
Challenge 3: Anti-Join (Find Non-Matching Rows)
Problem: (a) Find customers who have never placed an order (left anti-join). (b) Find orders that have no matching customer record (right anti-join). (c) Find customers who placed orders in January but NOT in March.
Solution
# (a) Customers with no orders (LEFT ANTI-JOIN)
# Method 1: merge with indicator=True, then keep the 'left_only' rows (most explicit)
merged = pd.merge(customers, orders, on='customer_id', how='left', indicator=True)
no_orders = merged[merged['_merge'] == 'left_only'][['customer_id', 'name', 'city']]
print("Customers with no orders:")
print(no_orders)
# customer_id name city
# 9 105 Eve Boston
# 10 106 Frank Chicago
# Method 2: using ~isin (shorter)
ordered_ids = orders['customer_id'].unique()
no_orders_v2 = customers[~customers['customer_id'].isin(ordered_ids)]
print(no_orders_v2[['customer_id', 'name']])
# SQL equivalent:
# SELECT c.* FROM customers c
# LEFT JOIN orders o ON c.customer_id = o.customer_id
# WHERE o.customer_id IS NULL
# (b) Orders with no customer match (RIGHT ANTI-JOIN)
merged2 = pd.merge(orders, customers, on='customer_id', how='left', indicator=True)
orphans = merged2[merged2['_merge'] == 'left_only'][['order_id', 'customer_id', 'amount']]
print("\nOrphan orders:")
print(orphans)
# order_id customer_id amount
# 7 8 999 500
# (c) Customers who ordered in January but NOT March
jan_customers = orders[orders['order_date'].dt.month == 1]['customer_id'].unique()
mar_customers = orders[orders['order_date'].dt.month == 3]['customer_id'].unique()
jan_not_mar = set(jan_customers) - set(mar_customers)
result = customers[customers['customer_id'].isin(jan_not_mar)]
print("\nOrdered in Jan but not Mar:")
print(result[['customer_id', 'name']])
# customer_id name
# 1 102 Bob (Bob ordered in Jan, Feb, Apr but not Mar)
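The indicator pattern can be wrapped in a small reusable helper (anti_join is a hypothetical name, not a pandas function), shown here with toy data:

```python
import pandas as pd

def anti_join(left: pd.DataFrame, right: pd.DataFrame, on: str) -> pd.DataFrame:
    """Rows of `left` whose `on` key has no match in `right`."""
    # Merging against only the deduplicated key column avoids row fan-out
    # and column-name collisions with the right table
    merged = left.merge(right[[on]].drop_duplicates(), on=on,
                        how='left', indicator=True)
    return merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

customers = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['A', 'B', 'C']})
orders = pd.DataFrame({'customer_id': [1, 1, 3]})
missing = anti_join(customers, orders, on='customer_id')
print(missing['customer_id'].tolist())  # [2]
```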
Interview classic: Anti-joins are one of the most frequently asked join patterns. The indicator=True parameter adds a _merge column showing 'left_only', 'right_only', or 'both'. This is cleaner than the ~isin() approach when you need to debug join results.
Challenge 4: Cross Join (Cartesian Product)
Problem: (a) Generate all possible customer-product combinations. (b) From the cross join, find which customer-product pairs have no actual orders (useful for recommendation systems). (c) Create a calendar scaffold: every customer should have a row for every month in Q1 2024, even if they had no orders that month.
Solution
# (a) Cross join: every customer paired with every product
cross = pd.merge(customers, products, how='cross')
print(f"Cross join rows: {len(cross)}") # 6 customers * 4 products = 24
print(cross[['name', 'product_name']].head(8))
# name product_name
# 0 Alice Widget
# 1 Alice Gadget
# 2 Alice Doohickey
# 3 Alice Thingamajig
# 4 Bob Widget
# 5 Bob Gadget
# 6 Bob Doohickey
# 7 Bob Thingamajig
# (b) Find missing customer-product pairs (for recommendations)
# First, simulate an order-product mapping
order_products = pd.DataFrame({
'customer_id': [101, 101, 102, 103, 104],
'product_id': ['P1', 'P3', 'P2', 'P1', 'P4']
})
all_combos = pd.merge(
customers[['customer_id', 'name']],
products[['product_id', 'product_name']],
how='cross'
)
# Left join with actual purchases to find gaps
with_purchases = pd.merge(
all_combos, order_products,
on=['customer_id', 'product_id'],
how='left',
indicator=True
)
not_purchased = with_purchases[with_purchases['_merge'] == 'left_only']
print(f"\nMissing pairs (recommendation candidates): {len(not_purchased)}")
print(not_purchased[['name', 'product_name']].head())
# (c) Calendar scaffold for time series completeness
months = pd.DataFrame({
'month': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01'])
})
scaffold = pd.merge(customers[['customer_id', 'name']], months, how='cross')
print(f"\nScaffold rows: {len(scaffold)}") # 6 customers * 3 months = 18
# Add actual order counts (derive the month with assign() so the shared
# orders table is not mutated for later challenges)
monthly = orders.assign(month=orders['order_date'].dt.to_period('M').dt.to_timestamp())
monthly_orders = monthly.groupby(['customer_id', 'month']).agg(
    order_count=('order_id', 'count'),
    total_amount=('amount', 'sum')
).reset_index()
complete = pd.merge(scaffold, monthly_orders, on=['customer_id', 'month'], how='left')
complete['order_count'] = complete['order_count'].fillna(0).astype(int)
complete['total_amount'] = complete['total_amount'].fillna(0)
print(complete[complete['name'] == 'Alice'])
# customer_id name month order_count total_amount
# 0 101 Alice 2024-01-01 2 570.0
# 1 101 Alice 2024-02-01 0 0.0
# 2 101 Alice 2024-03-01 1 190.0
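An alternative to the cross-join scaffold is to build the complete key set directly with pd.MultiIndex.from_product and reindex against it. A sketch with made-up monthly counts (two customers, three months):

```python
import pandas as pd

# Sparse monthly counts: some (customer, month) pairs are absent
monthly = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'month': pd.to_datetime(['2024-01-01', '2024-03-01', '2024-02-01']),
    'orders': [2, 1, 1],
})

# Build every (customer, month) pair directly, then reindex against it
full_idx = pd.MultiIndex.from_product(
    [[1, 2], pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01'])],
    names=['customer_id', 'month'],
)
complete = (monthly.set_index(['customer_id', 'month'])
                   .reindex(full_idx, fill_value=0)
                   .reset_index())
print(len(complete))  # 6 -- 2 customers x 3 months
```

Both approaches give the same grid; reindex avoids the intermediate merge at the cost of a MultiIndex round trip.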
Challenge 5: concat() for Stacking DataFrames
Problem: (a) Combine Q1 and Q2 order data vertically (UNION ALL). (b) Combine two DataFrames with different columns side by side (column-wise concat). (c) Handle the case where two DataFrames have overlapping but not identical columns.
Solution
# (a) Vertical concat (UNION ALL)
q1_orders = orders[orders['order_date'] < '2024-04-01'].copy()
q2_orders = pd.DataFrame({
'order_id': [11, 12, 13],
'customer_id': [101, 105, 103],
'amount': [340, 220, 180],
'order_date': pd.to_datetime(['2024-04-10', '2024-05-01', '2024-06-15'])
})
all_orders = pd.concat([q1_orders, q2_orders], ignore_index=True)
print(f"Combined orders: {len(all_orders)}") # 12 (9 Q1 orders + 3 Q2 orders)
print(all_orders.tail())
# NOTE: ignore_index=True resets the index to 0..n-1
# Without it, you get duplicate index values which cause bugs downstream
# (b) Column-wise concat (side by side)
summary_stats = pd.DataFrame({
'metric': ['total_orders', 'avg_amount', 'unique_customers'],
'q1': [9, 288.33, 5],
'q2': [3, 246.67, 3]
})
yoy_comparison = pd.DataFrame({
'metric': ['total_orders', 'avg_amount', 'unique_customers'],
'prev_year': [8, 230, 4]
})
combined = pd.concat(
[summary_stats.set_index('metric'), yoy_comparison.set_index('metric')],
axis=1
)
print(combined)
#                       q1      q2  prev_year
# metric
# total_orders        9.00    3.00          8
# avg_amount        288.33  246.67        230
# unique_customers    5.00    3.00          4
# (c) Concat with mismatched columns
df_v1 = pd.DataFrame({'id': [1, 2], 'name': ['A', 'B'], 'score': [90, 85]})
df_v2 = pd.DataFrame({'id': [3, 4], 'name': ['C', 'D'], 'grade': ['A', 'B+']})
# Default: fills missing columns with NaN
result = pd.concat([df_v1, df_v2], ignore_index=True)
print(result)
# id name score grade
# 0 1 A 90.0 NaN
# 1 2 B 85.0 NaN
# 2 3 C NaN A
# 3 4 D NaN B+
# Only keep common columns (INTERSECT columns)
common_cols = df_v1.columns.intersection(df_v2.columns)
result_common = pd.concat(
[df_v1[common_cols], df_v2[common_cols]], ignore_index=True
)
print(result_common)
# id name
# 0 1 A
# 1 2 B
# 2 3 C
# 3 4 D
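To make the ignore_index warning from part (a) concrete, here is a tiny demonstration of the duplicate-label hazard, plus the keys= alternative that records provenance instead of discarding the index (a, b, 'q1', 'q2' are throwaway names):

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})  # default index 0, 1
b = pd.DataFrame({'x': [3, 4]})  # default index 0, 1 again

# Without ignore_index=True, the labels collide
stacked = pd.concat([a, b])
print(stacked.index.is_unique)   # False
print(len(stacked.loc[0]))       # 2 -- the label 0 now matches two rows

# keys= records provenance in a new MultiIndex level instead
tagged = pd.concat([a, b], keys=['q1', 'q2'], names=['source', None])
print(tagged.loc['q1'])          # only the rows that came from `a`
```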
Challenge 6: combine_first() for Filling Gaps
Problem: You have two data sources for customer addresses. Source A is more recent but has gaps. Source B is older but more complete. Combine them so that Source A values are preferred, but Source B fills in wherever Source A has NaN.
Solution
# Source A: recent but incomplete
source_a = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 105],
'email': ['alice@new.com', np.nan, 'charlie@new.com', np.nan, 'eve@new.com'],
'phone': ['555-0101', '555-0102', np.nan, np.nan, '555-0105'],
'address': ['123 Main St', np.nan, np.nan, '456 Oak Ave', np.nan]
}).set_index('customer_id')
# Source B: older but complete
source_b = pd.DataFrame({
'customer_id': [101, 102, 103, 104, 105],
'email': ['alice@old.com', 'bob@old.com', 'charlie@old.com', 'diana@old.com', 'eve@old.com'],
'phone': ['555-9999', '555-0102', '555-0103', '555-0104', '555-9999'],
'address': ['789 Elm St', '321 Pine St', '654 Maple Dr', '456 Oak Ave', '987 Cedar Ln']
}).set_index('customer_id')
# combine_first: prefer source_a, fill gaps from source_b
combined = source_a.combine_first(source_b)
print(combined)
# email phone address
# customer_id
# 101 alice@new.com 555-0101 123 Main St
# 102 bob@old.com 555-0102 321 Pine St
# 103 charlie@new.com 555-0103 654 Maple Dr
# 104 diana@old.com 555-0104 456 Oak Ave
# 105 eve@new.com 555-0105 987 Cedar Ln
# Verify: source A wins where it has values, source B fills NaN
print(f"\nAlice email: {combined.loc[101, 'email']}") # alice@new.com (from A)
print(f"Bob email: {combined.loc[102, 'email']}") # bob@old.com (from B, A was NaN)
# Alternative approach using .fillna():
result_v2 = source_a.fillna(source_b)
print(result_v2.equals(combined)) # True
# SQL equivalent:
# SELECT customer_id,
# COALESCE(a.email, b.email) AS email,
# COALESCE(a.phone, b.phone) AS phone,
# COALESCE(a.address, b.address) AS address
# FROM source_a a
# FULL OUTER JOIN source_b b USING (customer_id)
Real-world use case:
combine_first() is commonly used when merging data from multiple ETL pipelines, combining survey data with CRM data, or creating "golden records" from multiple source systems. In ML, it is used to fill feature gaps from different data sources before training.
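A golden-record pipeline typically chains combine_first across sources ordered by trust. A minimal sketch with made-up sources (crm, survey, and legacy are illustrative names):

```python
import pandas as pd

# Three sources indexed by customer_id, most trusted first
crm = pd.DataFrame({'email': ['a@crm.com', None]}, index=[1, 2])
survey = pd.DataFrame({'email': [None, 'b@survey.com']}, index=[1, 2])
legacy = pd.DataFrame({'email': ['a@old.com', 'b@old.com']}, index=[1, 2])

# Each combine_first keeps existing values and only fills remaining gaps
golden = crm.combine_first(survey).combine_first(legacy)
print(golden['email'].tolist())  # ['a@crm.com', 'b@survey.com']
```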
Lilly Tech Systems