Pandas for Data Interviews
Pandas is the single most tested library in data science and ML engineering interviews. This lesson covers what interviewers expect, how pandas maps to SQL, and the performance patterns that separate strong candidates from the rest.
Why Pandas Dominates Data Interviews
Every data science and ML engineering role requires manipulating tabular data. While SQL handles database queries, pandas is the standard for in-memory data transformation, feature engineering, and exploratory analysis. Interviewers test pandas because it reveals how you think about data.
Feature Engineering
Building ML features from raw data requires groupby, window functions, pivoting, and merging — all core pandas operations tested in interviews.
Data Cleaning
Real datasets have missing values, duplicates, and inconsistent formats. Interviewers test your ability to handle messy data fluently with pandas.
Exploratory Analysis
Quick data profiling, distribution analysis, and correlation checks are daily tasks. Doing them efficiently in pandas shows practical experience.
What Companies Actually Test
Pandas interview questions vary by company type. Here is what to expect:
| Company Type | Difficulty | Focus Areas | Format |
|---|---|---|---|
| Google / Meta DS | Medium-Hard | GroupBy, window functions, merge, time series | Colab notebook |
| Amazon ML | Medium | Selection, groupby, pivot, data cleaning | Whiteboard or Colab |
| Fintech / Quant | Hard | Time series, rolling windows, performance | Live coding |
| Startups | Easy-Medium | Selection, filtering, basic groupby | Take-home or live |
| Consulting | Easy-Medium | Pivot tables, merge, basic analysis | Case study + code |
Pandas vs SQL: The Mental Map
Many interview questions ask you to solve a problem in both SQL and pandas. Here is the translation table you need to internalize:
# Pandas vs SQL - Quick Reference
# ================================
# SELECT columns → df[['col1', 'col2']] or df.loc[:, ['col1', 'col2']]
# WHERE condition → df[df['col'] > 5] or df.query('col > 5')
# GROUP BY + AGG → df.groupby('col').agg({'val': 'sum'})
# ORDER BY → df.sort_values('col', ascending=False)
# JOIN → pd.merge(df1, df2, on='key', how='inner')
# DISTINCT → df['col'].unique() or df.drop_duplicates()
# HAVING → grouped.filter(lambda x: len(x) > 5)
# CASE WHEN → np.where(condition, val_if_true, val_if_false)
# WINDOW FUNCTIONS → df.groupby('col')['val'].transform('sum')
# UNION ALL → pd.concat([df1, df2], ignore_index=True)
# LIMIT → df.head(n)
# COALESCE → df['col'].fillna(df['backup_col'])
Performance Patterns That Matter
Interviewers at senior levels test whether you know the performance implications of your pandas code. Here are the key patterns:
1. Vectorized Operations vs Loops
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': range(100_000)})
# BAD: Iterating row by row - O(n) with huge constant
result = []
for idx, row in df.iterrows():
result.append(row['value'] * 2)
df['doubled'] = result
# GOOD: Vectorized operation - uses NumPy under the hood
df['doubled'] = df['value'] * 2
# GOOD: Use .apply() only when vectorization is impossible
# (e.g., complex string parsing, external API calls)
df['label'] = df['value'].apply(lambda x: 'high' if x > 50000 else 'low')
2. Memory-Efficient Data Types
# Check memory usage
print(df.memory_usage(deep=True))
# Downcast numeric types to save memory
df['id'] = df['id'].astype('int32') # 8 bytes → 4 bytes per value
df['price'] = df['price'].astype('float32') # 8 bytes → 4 bytes per value
# Use categorical for low-cardinality string columns
df['status'] = df['status'].astype('category') # Huge savings for repeated strings
# Read only needed columns from CSV
df = pd.read_csv('large_file.csv', usecols=['id', 'name', 'value'])
3. Avoiding SettingWithCopyWarning
# BAD: Chained indexing creates ambiguous copies
df[df['status'] == 'active']['value'] = 100 # SettingWithCopyWarning!
# GOOD: Use .loc for assignment
df.loc[df['status'] == 'active', 'value'] = 100
# GOOD: Use .copy() when you intentionally want a copy
active_df = df[df['status'] == 'active'].copy()
active_df['value'] = 100
The 4-Step Approach for Pandas Challenges
Use this framework for every pandas problem in this course and in real interviews:
| Step | Time | What to Do |
|---|---|---|
| 1. Understand the Data | 2 min | Look at df.head(), df.dtypes, df.shape. Identify the grain (one row = one what?). Ask about nulls. |
| 2. Plan the Approach | 2 min | State which pandas operations you will use. Compare alternatives (groupby vs pivot_table, merge vs join). |
| 3. Write the Code | 10 min | Write clean, chained pandas code. Use meaningful variable names. Prefer vectorized operations. |
| 4. Verify | 3 min | Check output shape, spot-check values, handle edge cases (empty groups, NaN in join keys). |
Course Overview
Each lesson in this course follows a consistent structure:
- Dataset setup — A realistic DataFrame you can paste into any notebook and run immediately
- Problem statement — Clear description of the expected output, modeled after real interview questions
- Solution with explanation — Complete pandas code with line-by-line commentary
- SQL equivalent — The corresponding SQL query for comparison (where applicable)
- Performance notes — When the approach matters for large datasets
Quick Self-Assessment
Before starting, try this problem without looking at the solution:
name, department, and salary, find the top earner in each department. Return their name, department, and salary.import pandas as pd
# Setup
employees = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
'department': ['Engineering', 'Engineering', 'Sales', 'Sales', 'Marketing', 'Marketing'],
'salary': [120000, 115000, 95000, 98000, 88000, 92000]
})
# Approach 1: idxmax (clean and fast)
idx = employees.groupby('department')['salary'].idxmax()
result = employees.loc[idx, ['name', 'department', 'salary']]
# Approach 2: rank + filter
employees['rank'] = employees.groupby('department')['salary'].rank(
method='dense', ascending=False
)
result = employees[employees['rank'] == 1][['name', 'department', 'salary']]
# Approach 3: merge with max
max_salaries = employees.groupby('department')['salary'].max().reset_index()
result = employees.merge(max_salaries, on=['department', 'salary'])
print(result)
# name department salary
# 0 Alice Engineering 120000
# 3 Diana Sales 98000
# 5 Frank Marketing 92000
Why this matters for ML: Finding top-K items per group is a fundamental pattern in recommendation systems, anomaly detection, and model evaluation. The groupby + idxmax pattern is used daily for tasks like finding the best model per hyperparameter configuration or the most important feature per category.