Intermediate

Merge, Join & Concat

Combining data from multiple sources is a daily task in data science. These six challenges cover the join patterns you are most likely to meet in interviews, from basic inner joins to the trickier anti-join and cross-join patterns.

💡
Why this matters for ML: Training data is almost never in a single table. You merge user features with transaction logs, join prediction results with ground truth labels, and concatenate data from multiple time periods. Getting joins wrong means training on corrupted data — the most expensive bug in ML.
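For instance, joining predictions to ground-truth labels should be a one-row-per-sample merge. A minimal sketch (with hypothetical `sample_id`/`prediction`/`label` columns) that also guards against silent row duplication:

```python
import pandas as pd

# Hypothetical frames: model predictions and ground-truth labels
preds = pd.DataFrame({'sample_id': [1, 2, 3], 'prediction': [0.9, 0.2, 0.7]})
labels = pd.DataFrame({'sample_id': [1, 2, 3], 'label': [1, 0, 1]})

# validate='one_to_one' makes merge raise MergeError if either side
# has duplicate keys, catching accidental row duplication early
evaluation = pd.merge(preds, labels, on='sample_id', how='inner',
                      validate='one_to_one')
print(len(evaluation))  # 3
```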

Shared Datasets for This Lesson

import pandas as pd
import numpy as np

# Customers table
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105, 106],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'city': ['New York', 'San Francisco', 'Chicago', 'New York', 'Boston', 'Chicago'],
    'signup_date': pd.to_datetime(['2023-01-15', '2023-03-22', '2023-06-10',
                                    '2023-02-01', '2023-08-05', '2023-11-20'])
})

# Orders table (some customers have no orders, some order_ids have no customer match)
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'customer_id': [101, 102, 101, 103, 102, 104, 101, 999, 103, 102],
    'amount': [250, 180, 320, 150, 420, 275, 190, 500, 310, 95],
    'order_date': pd.to_datetime(['2024-01-05', '2024-01-12', '2024-01-20',
                                   '2024-02-03', '2024-02-15', '2024-02-28',
                                   '2024-03-10', '2024-03-15', '2024-03-22',
                                   '2024-04-01'])
})

# Products table
products = pd.DataFrame({
    'product_id': ['P1', 'P2', 'P3', 'P4'],
    'product_name': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig'],
    'category': ['A', 'B', 'A', 'C']
})

# Regions table
regions = pd.DataFrame({
    'region': ['East', 'West', 'Central'],
    'manager': ['Smith', 'Jones', 'Brown']
})

Challenge 1: Inner, Left, and Outer Joins

📝
Problem: (a) Find all orders with matching customer details (inner join). (b) Show all customers with their orders; customers without orders should still appear (left join). (c) Show the full picture: all customers and all orders, matched where possible (outer join). Count NaN rows in each case.

Solution

# (a) Inner join: only rows where customer_id exists in both tables
inner = pd.merge(orders, customers, on='customer_id', how='inner')
print(f"Inner join rows: {len(inner)}")  # 9 (order with customer_id=999 excluded)
print(inner[['order_id', 'customer_id', 'name', 'amount']].head())
#    order_id  customer_id     name  amount
# 0         1          101    Alice     250
# 1         3          101    Alice     320
# 2         7          101    Alice     190
# 3         2          102      Bob     180
# 4         5          102      Bob     420


# (b) Left join: all customers, matched orders where available
left = pd.merge(customers, orders, on='customer_id', how='left')
print(f"\nLeft join rows: {len(left)}")  # 11
# Eve and Frank have NaN for order columns
no_orders = left[left['order_id'].isna()]
print(f"Customers with no orders: {no_orders['name'].tolist()}")
# ['Eve', 'Frank']

# SQL equivalent:
# SELECT c.*, o.*
# FROM customers c
# LEFT JOIN orders o ON c.customer_id = o.customer_id


# (c) Outer join: everything from both sides
outer = pd.merge(customers, orders, on='customer_id', how='outer')
print(f"\nOuter join rows: {len(outer)}")  # 12
orphan_orders = outer[outer['name'].isna()]
print(f"Orders with no customer match: {orphan_orders['order_id'].tolist()}")
# [8.0] (customer_id=999; order_id becomes float because NaNs were introduced)
print(f"Customers with no orders: {outer[outer['order_id'].isna()]['name'].tolist()}")
# ['Eve', 'Frank']

Challenge 2: Merge on Index

📝
Problem: (a) Set customer_id as index on both DataFrames and join them using the index. (b) Merge a DataFrame with a regular column against another with an index key. (c) After merging, reset the index and compute total spend per customer.

Solution

# (a) Join on index (both sides indexed)
customers_idx = customers.set_index('customer_id')
orders_idx = orders.set_index('customer_id')

# Using .join() - joins on index by default
# (lsuffix/rsuffix only matter when column names overlap; none do here)
joined = customers_idx.join(orders_idx, how='inner', lsuffix='_cust', rsuffix='_ord')
print(joined[['name', 'order_id', 'amount']].head())
#              name  order_id  amount
# customer_id
# 101         Alice         1     250
# 101         Alice         3     320
# 101         Alice         7     190
# 102           Bob         2     180
# 102           Bob         5     420


# (b) Merge column against index
# Left df has customer_id as a column, right has it as index
result = pd.merge(
    orders,              # customer_id is a column
    customers_idx,       # customer_id is the index
    left_on='customer_id',
    right_index=True,
    how='left'
)
print(result[['order_id', 'customer_id', 'name', 'amount']].head())


# (c) Reset index and compute total spend
customer_spend = (
    joined.reset_index()
    .groupby(['customer_id', 'name'])['amount']
    .agg(['sum', 'count', 'mean'])
    .round(2)
    .sort_values('sum', ascending=False)
)
customer_spend.columns = ['total_spend', 'order_count', 'avg_order']
print(customer_spend)
#                      total_spend  order_count  avg_order
# customer_id name
# 101         Alice            760            3     253.33
# 102         Bob              695            3     231.67
# 103         Charlie          460            2     230.00
# 104         Diana            275            1     275.00

Challenge 3: Anti-Join (Find Non-Matching Rows)

📝
Problem: (a) Find customers who have never placed an order (left anti-join). (b) Find orders that have no matching customer record (right anti-join). (c) Find customers who placed orders in January but NOT in March.

Solution

# (a) Customers with no orders (LEFT ANTI-JOIN)
# Method 1: merge with indicator=True (most explicit)
merged = pd.merge(customers, orders, on='customer_id', how='left', indicator=True)
no_orders = merged[merged['_merge'] == 'left_only'][['customer_id', 'name', 'city']]
print("Customers with no orders:")
print(no_orders)
#     customer_id   name     city
# 9           105    Eve   Boston
# 10          106  Frank  Chicago

# Method 2: using ~isin (shorter)
ordered_ids = orders['customer_id'].unique()
no_orders_v2 = customers[~customers['customer_id'].isin(ordered_ids)]
print(no_orders_v2[['customer_id', 'name']])

# SQL equivalent:
# SELECT c.* FROM customers c
# LEFT JOIN orders o ON c.customer_id = o.customer_id
# WHERE o.customer_id IS NULL


# (b) Orders with no customer match (RIGHT ANTI-JOIN)
merged2 = pd.merge(orders, customers, on='customer_id', how='left', indicator=True)
orphans = merged2[merged2['_merge'] == 'left_only'][['order_id', 'customer_id', 'amount']]
print("\nOrphan orders:")
print(orphans)
#    order_id  customer_id  amount
# 7         8          999     500


# (c) Customers who ordered in January but NOT March
jan_customers = orders[orders['order_date'].dt.month == 1]['customer_id'].unique()
mar_customers = orders[orders['order_date'].dt.month == 3]['customer_id'].unique()

jan_not_mar = set(jan_customers) - set(mar_customers)
result = customers[customers['customer_id'].isin(jan_not_mar)]
print("\nOrdered in Jan but not Mar:")
print(result[['customer_id', 'name']])
#    customer_id   name
# 1          102    Bob  (Bob ordered in Jan, Feb, Apr but not Mar)
💡
Interview classic: Anti-joins are one of the most frequently asked join patterns. The indicator=True parameter adds a _merge column showing 'left_only', 'right_only', or 'both'. This is cleaner than the ~isin() approach when you need to debug join results.
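All three indicator values can be seen at once in a toy outer join (hypothetical single-column frames):

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3]})
right = pd.DataFrame({'key': [2, 3, 4]})

# An outer join surfaces every _merge category: key 1 is left-only,
# keys 2 and 3 match both sides, key 4 is right-only
check = pd.merge(left, right, on='key', how='outer', indicator=True)
print(check['_merge'].tolist())
# ['left_only', 'both', 'both', 'right_only']
```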

Challenge 4: Cross Join (Cartesian Product)

📝
Problem: (a) Generate all possible customer-product combinations. (b) From the cross join, find which customer-product pairs have no actual orders (useful for recommendation systems). (c) Create a calendar scaffold: every customer should have a row for every month in Q1 2024, even if they had no orders that month.

Solution

# (a) Cross join: every customer paired with every product
cross = pd.merge(customers, products, how='cross')
print(f"Cross join rows: {len(cross)}")  # 6 customers * 4 products = 24
print(cross[['name', 'product_name']].head(8))
#      name product_name
# 0   Alice       Widget
# 1   Alice       Gadget
# 2   Alice    Doohickey
# 3   Alice  Thingamajig
# 4     Bob       Widget
# 5     Bob       Gadget
# 6     Bob    Doohickey
# 7     Bob  Thingamajig


# (b) Find missing customer-product pairs (for recommendations)
# First, simulate an order-product mapping
order_products = pd.DataFrame({
    'customer_id': [101, 101, 102, 103, 104],
    'product_id': ['P1', 'P3', 'P2', 'P1', 'P4']
})

all_combos = pd.merge(
    customers[['customer_id', 'name']],
    products[['product_id', 'product_name']],
    how='cross'
)

# Left join with actual purchases to find gaps
with_purchases = pd.merge(
    all_combos, order_products,
    on=['customer_id', 'product_id'],
    how='left',
    indicator=True
)
not_purchased = with_purchases[with_purchases['_merge'] == 'left_only']
print(f"\nMissing pairs (recommendation candidates): {len(not_purchased)}")
print(not_purchased[['name', 'product_name']].head())


# (c) Calendar scaffold for time series completeness
months = pd.DataFrame({
    'month': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01'])
})

scaffold = pd.merge(customers[['customer_id', 'name']], months, how='cross')
print(f"\nScaffold rows: {len(scaffold)}")  # 6 customers * 3 months = 18

# Add actual order counts (work on a copy so the shared orders table is untouched)
orders_m = orders.copy()
orders_m['month'] = orders_m['order_date'].dt.to_period('M').dt.to_timestamp()
monthly_orders = orders_m.groupby(['customer_id', 'month']).agg(
    order_count=('order_id', 'count'),
    total_amount=('amount', 'sum')
).reset_index()

complete = pd.merge(scaffold, monthly_orders, on=['customer_id', 'month'], how='left')
complete['order_count'] = complete['order_count'].fillna(0).astype(int)
complete['total_amount'] = complete['total_amount'].fillna(0)
print(complete[complete['name'] == 'Alice'])
#    customer_id   name      month  order_count  total_amount
# 0          101  Alice 2024-01-01            2         570.0
# 1          101  Alice 2024-02-01            0           0.0
# 2          101  Alice 2024-03-01            1         190.0

Challenge 5: concat() for Stacking DataFrames

📝
Problem: (a) Combine Q1 and Q2 order data vertically (UNION ALL). (b) Combine two DataFrames with different columns side by side (column-wise concat). (c) Handle the case where two DataFrames have overlapping but not identical columns.

Solution

# (a) Vertical concat (UNION ALL)
q1_orders = orders[orders['order_date'] < '2024-04-01'].copy()
q2_orders = pd.DataFrame({
    'order_id': [11, 12, 13],
    'customer_id': [101, 105, 103],
    'amount': [340, 220, 180],
    'order_date': pd.to_datetime(['2024-04-10', '2024-05-01', '2024-06-15'])
})

all_orders = pd.concat([q1_orders, q2_orders], ignore_index=True)
print(f"Combined orders: {len(all_orders)}")  # 12 (order 10, dated 2024-04-01, fails the < filter)
print(all_orders.tail())
# NOTE: ignore_index=True resets the index to 0..n-1
# Without it, you get duplicate index values which cause bugs downstream
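The duplicate-index hazard in a minimal, self-contained sketch:

```python
import pandas as pd

df_a = pd.DataFrame({'x': [1, 2]})
df_b = pd.DataFrame({'x': [3, 4]})

# Without ignore_index both inputs keep their 0-based indexes...
stacked = pd.concat([df_a, df_b])
print(stacked.index.tolist())  # [0, 1, 0, 1]
print(len(stacked.loc[0]))     # 2 -- .loc[0] silently returns TWO rows

# ...with ignore_index=True the result gets a fresh 0..n-1 index
clean = pd.concat([df_a, df_b], ignore_index=True)
print(clean.index.tolist())    # [0, 1, 2, 3]
```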


# (b) Column-wise concat (side by side)
summary_stats = pd.DataFrame({
    'metric': ['total_orders', 'avg_amount', 'unique_customers'],
    'q1': [10, 269, 5],
    'q2': [3, 246.67, 3]
})

yoy_comparison = pd.DataFrame({
    'metric': ['total_orders', 'avg_amount', 'unique_customers'],
    'prev_year': [8, 230, 4]
})

combined = pd.concat(
    [summary_stats.set_index('metric'), yoy_comparison.set_index('metric')],
    axis=1
)
print(combined)
#                       q1      q2  prev_year
# metric
# total_orders          10    3.00          8
# avg_amount           269  246.67        230
# unique_customers       5    3.00          4
# (q1 stays integer: the indexes align exactly, so no NaN is introduced)


# (c) Concat with mismatched columns
df_v1 = pd.DataFrame({'id': [1, 2], 'name': ['A', 'B'], 'score': [90, 85]})
df_v2 = pd.DataFrame({'id': [3, 4], 'name': ['C', 'D'], 'grade': ['A', 'B+']})

# Default: fills missing columns with NaN
result = pd.concat([df_v1, df_v2], ignore_index=True)
print(result)
#    id name  score grade
# 0   1    A   90.0   NaN
# 1   2    B   85.0   NaN
# 2   3    C    NaN     A
# 3   4    D    NaN    B+

# Only keep common columns (INTERSECT columns)
common_cols = df_v1.columns.intersection(df_v2.columns)
result_common = pd.concat(
    [df_v1[common_cols], df_v2[common_cols]], ignore_index=True
)
print(result_common)
#    id name
# 0   1    A
# 1   2    B
# 2   3    C
# 3   4    D
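A related option (not required by the challenge) is concat's keys= argument, which tags each row with the input it came from; a toy sketch:

```python
import pandas as pd

jan = pd.DataFrame({'amount': [250, 180]})
feb = pd.DataFrame({'amount': [320]})

# keys= builds a MultiIndex recording which input each row came from
tagged = pd.concat([jan, feb], keys=['jan', 'feb'], names=['month', 'row'])
print(tagged.loc['jan', 'amount'].tolist())  # [250, 180]
print(tagged.loc['feb', 'amount'].tolist())  # [320]
```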

Challenge 6: combine_first() for Filling Gaps

📝
Problem: You have two data sources for customer addresses. Source A is more recent but has gaps. Source B is older but more complete. Combine them so that Source A values are preferred, but Source B fills in wherever Source A has NaN.

Solution

# Source A: recent but incomplete
source_a = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'email': ['alice@new.com', np.nan, 'charlie@new.com', np.nan, 'eve@new.com'],
    'phone': ['555-0101', '555-0102', np.nan, np.nan, '555-0105'],
    'address': ['123 Main St', np.nan, np.nan, '456 Oak Ave', np.nan]
}).set_index('customer_id')

# Source B: older but complete
source_b = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'email': ['alice@old.com', 'bob@old.com', 'charlie@old.com', 'diana@old.com', 'eve@old.com'],
    'phone': ['555-9999', '555-0102', '555-0103', '555-0104', '555-9999'],
    'address': ['789 Elm St', '321 Pine St', '654 Maple Dr', '456 Oak Ave', '987 Cedar Ln']
}).set_index('customer_id')

# combine_first: prefer source_a, fill gaps from source_b
combined = source_a.combine_first(source_b)
print(combined)
#                        email     phone       address
# customer_id
# 101            alice@new.com  555-0101   123 Main St
# 102              bob@old.com  555-0102   321 Pine St
# 103          charlie@new.com  555-0103  654 Maple Dr
# 104            diana@old.com  555-0104   456 Oak Ave
# 105              eve@new.com  555-0105  987 Cedar Ln

# Verify: source A wins where it has values, source B fills NaN
print(f"\nAlice email: {combined.loc[101, 'email']}")    # alice@new.com (from A)
print(f"Bob email: {combined.loc[102, 'email']}")        # bob@old.com (from B, A was NaN)

# Alternative approach using .fillna():
result_v2 = source_a.fillna(source_b)
print(result_v2.equals(combined))  # True

# SQL equivalent:
# SELECT customer_id,
#        COALESCE(a.email, b.email) AS email,
#        COALESCE(a.phone, b.phone) AS phone,
#        COALESCE(a.address, b.address) AS address
# FROM source_a a
# FULL OUTER JOIN source_b b USING (customer_id)
💡
Real-world use case: combine_first() is commonly used when merging data from multiple ETL pipelines, combining survey data with CRM data, or creating "golden records" from multiple source systems. In ML, it is used to fill feature gaps from different data sources before training.
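The multi-source "golden record" pattern extends naturally by chaining; a sketch with three hypothetical sources, ordered from most to least trusted:

```python
import pandas as pd
import numpy as np

# Hypothetical golden-record build: newest source wins, older sources fill gaps
crm = pd.DataFrame({'email': ['a@crm.com', np.nan]}, index=[1, 2])
survey = pd.DataFrame({'email': [np.nan, 'b@survey.com']}, index=[1, 2])
legacy = pd.DataFrame({'email': ['a@legacy.com', 'b@legacy.com']}, index=[1, 2])

# Chain combine_first from most- to least-trusted source
golden = crm.combine_first(survey).combine_first(legacy)
print(golden['email'].tolist())  # ['a@crm.com', 'b@survey.com']
```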