Patterns & Tips
This final lesson covers the performance patterns, method chaining techniques, and common mistakes that interviewers test at the senior level. Master these to demonstrate production-quality pandas skills.
Performance Optimization Cheat Sheet
The single most important performance rule in pandas: avoid row-by-row operations. Here is the hierarchy from fastest to slowest:
| Approach | Speed | When to Use |
|---|---|---|
| Vectorized NumPy | Fastest | Arithmetic, comparisons, boolean masks |
| Built-in pandas methods | Very fast | sum, mean, groupby with built-in agg |
| .map() / .replace() | Fast | Element-wise transformations with a dict |
| .apply() on Series | Medium | Complex per-element logic |
| .apply() on DataFrame | Slow | Row-wise logic needing multiple columns |
| itertuples() | Slow | When you absolutely need row iteration |
| iterrows() | Slowest | Almost never — avoid in production |
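The gap between the top and bottom of this table is easy to demonstrate. The sketch below is a minimal, illustrative micro-benchmark (absolute timings depend entirely on hardware); it compares vectorized addition against an itertuples loop on 100k rows:

```python
import time

import numpy as np
import pandas as pd

# Illustrative micro-benchmark on a modest 100k-row frame
df = pd.DataFrame({'a': np.random.randn(100_000),
                   'b': np.random.randn(100_000)})

t0 = time.perf_counter()
vec = df['a'] + df['b']                  # vectorized: one NumPy call
t_vec = time.perf_counter() - t0

t0 = time.perf_counter()
it = pd.Series([row.a + row.b for row in df.itertuples()], index=df.index)
t_iter = time.perf_counter() - t0

# Both produce identical results; only the speed differs.
print(f"vectorized: {t_vec:.4f}s  itertuples: {t_iter:.4f}s")
```

On a typical machine the loop is one to two orders of magnitude slower, and iterrows would be slower still.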
```python
import pandas as pd
import numpy as np

# Performance comparison on 1 million rows
df = pd.DataFrame({
    'a': np.random.randn(1_000_000),
    'b': np.random.randn(1_000_000),
    'category': np.random.choice(['X', 'Y', 'Z'], 1_000_000)
})

# FASTEST: Vectorized (NumPy under the hood)
# ~2ms for 1M rows
df['c'] = df['a'] + df['b']
df['d'] = np.where(df['a'] > 0, df['a'] * 2, df['a'] * -1)

# FAST: Built-in aggregation
# ~5ms for 1M rows
result = df.groupby('category')['a'].mean()

# MEDIUM: .apply() on Series
# ~200ms for 1M rows
df['label'] = df['a'].apply(lambda x: 'pos' if x > 0 else 'neg')

# SLOW: .apply() on DataFrame (row-wise)
# ~10s for 1M rows (thousands of times slower than vectorized!)
# df['e'] = df.apply(lambda row: row['a'] * row['b'], axis=1)  # DON'T DO THIS
df['e'] = df['a'] * df['b']  # DO THIS INSTEAD

# SLOWEST: iterrows
# ~60s for 1M rows (DO NOT USE)
# for idx, row in df.iterrows():  # NEVER in production
#     df.loc[idx, 'f'] = row['a'] + row['b']
```
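For the rare case where per-row iteration genuinely cannot be avoided (for example, each row drives a stateful external call), a less painful pattern, sketched here on toy data, is itertuples with a collected list and a single column assignment at the end:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, -2.0, 3.0], 'b': [0.5, 1.5, -0.5]})

# Prefer itertuples over iterrows, and build a list of results,
# assigning the column once instead of writing cell-by-cell with .loc.
results = []
for row in df.itertuples(index=False):
    results.append(row.a + row.b)   # stand-in for per-row external logic
df['f'] = results

print(df['f'].tolist())  # → [1.5, -0.5, 2.5]
```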
Method Chaining Best Practices
Method chaining makes pandas code more readable and avoids creating intermediate variables. Here is how to chain effectively:
```python
# BAD: Too many intermediate variables
filtered = df[df['category'] == 'X']
grouped = filtered.groupby('category')
result = grouped['a'].mean()

# GOOD: Clean method chain
result = (
    df
    .query("category == 'X'")
    .groupby('category')['a']
    .mean()
)
```
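When a chain step outgrows a lambda, .pipe() lets you slot a named function into the chain. The helper below (top_categories is our own illustrative name, not a pandas API) keeps only the most frequent categories before aggregating:

```python
import pandas as pd

df = pd.DataFrame({'category': ['X', 'X', 'Y'], 'a': [1.0, 3.0, 5.0]})

# Hypothetical helper: keep rows belonging to the n most common categories.
def top_categories(frame, col, n):
    return frame[frame[col].isin(frame[col].value_counts().head(n).index)]

result = (
    df
    .pipe(top_categories, 'category', 1)   # .pipe passes df as first argument
    .groupby('category')['a']
    .mean()
)
print(result)
```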
```python
# Real-world example: customer lifetime value pipeline
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2, 3],
    'order_date': pd.to_datetime(['2024-01-01', '2024-02-15', '2024-01-10',
                                  '2024-03-01', '2024-04-20', '2024-02-01']),
    'amount': [100, 250, 80, 150, 200, 300]
})

clv_report = (
    orders
    .assign(
        month=lambda x: x['order_date'].dt.to_period('M'),
        is_high_value=lambda x: x['amount'] > 150
    )
    .groupby('customer_id')
    .agg(
        total_spend=('amount', 'sum'),
        order_count=('amount', 'count'),
        avg_order=('amount', 'mean'),
        first_order=('order_date', 'min'),
        last_order=('order_date', 'max'),
        high_value_orders=('is_high_value', 'sum')
    )
    .assign(
        tenure_days=lambda x: (x['last_order'] - x['first_order']).dt.days,
        high_value_pct=lambda x: (x['high_value_orders'] / x['order_count'] * 100).round(1)
    )
    .sort_values('total_spend', ascending=False)
)

print(clv_report[['total_spend', 'order_count', 'avg_order', 'tenure_days', 'high_value_pct']])
#              total_spend  order_count  avg_order  tenure_days  high_value_pct
# customer_id
# 2                    430            3     143.33          101            33.3
# 1                    350            2     175.00           45            50.0
# 3                    300            1     300.00            0           100.0
```
Use .assign() to add columns within a chain, and use lambda x: inside assign to reference the current state of the DataFrame. This avoids intermediate variables and keeps your pipeline a single readable expression.
Common Anti-Patterns to Avoid
1. Growing a DataFrame in a Loop
```python
# BAD: Appending to DataFrame in a loop (O(n^2) due to copy on each append)
result = pd.DataFrame()
for i in range(1000):
    row = pd.DataFrame({'a': [i], 'b': [i * 2]})
    result = pd.concat([result, row])  # Creates a new copy each time!

# GOOD: Collect in a list, concat once at the end (O(n))
rows = []
for i in range(1000):
    rows.append({'a': i, 'b': i * 2})
result = pd.DataFrame(rows)  # Single allocation
```
2. Using apply() When Vectorization Works
```python
# BAD: apply for simple math
df['total'] = df.apply(lambda row: row['price'] * row['quantity'], axis=1)

# GOOD: Vectorized multiplication
df['total'] = df['price'] * df['quantity']

# BAD: apply for conditional logic
df['tier'] = df['price'].apply(lambda x: 'high' if x > 100 else 'low')

# GOOD: np.where or np.select
df['tier'] = np.where(df['price'] > 100, 'high', 'low')
```
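When the conditional logic has more than two branches, np.select generalizes np.where. A small sketch with illustrative price tiers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [20, 80, 150, 400]})

# np.select checks conditions in order; the first match wins,
# and rows matching nothing get the default.
conditions = [df['price'] > 300, df['price'] > 100]
choices = ['premium', 'high']
df['tier'] = np.select(conditions, choices, default='low')

print(df['tier'].tolist())  # → ['low', 'low', 'high', 'premium']
```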
3. Ignoring Data Types
```python
# BAD: Leaving string columns as object type
df = pd.read_csv('file.csv')  # All strings are 'object' type

# GOOD: Convert to optimal types after reading
df = pd.read_csv('file.csv').pipe(lambda d: d.assign(
    date=pd.to_datetime(d['date']),
    status=d['status'].astype('category'),
    user_id=d['user_id'].astype('int32'),
    amount=d['amount'].astype('float32')
))
# Can reduce memory usage by 50-80%
```
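The memory savings from dtype choices can be checked directly with memory_usage(deep=True). This small sketch compares an object-dtype string column against its category equivalent (exact byte counts vary by pandas version and data):

```python
import pandas as pd

# A repeated-string column: object dtype stores each Python string,
# while category stores the unique values once plus small integer codes.
s_obj = pd.Series(['active', 'inactive'] * 50_000)
s_cat = s_obj.astype('category')

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(f"object: {obj_bytes:,} bytes  category: {cat_bytes:,} bytes")
```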
4. Mutating During Iteration
```python
# BAD: Modifying DataFrame while iterating
for idx, row in df.iterrows():
    if row['status'] == 'active':
        df.loc[idx, 'flag'] = True  # Modifying during iteration

# GOOD: Vectorized boolean assignment
df['flag'] = df['status'] == 'active'
```
Frequently Asked Questions
**When should I use SQL instead of pandas?**
Use SQL when: data lives in a database, you need to filter before loading (SELECT only needed columns/rows), or the dataset is too large for memory. Use pandas when: data fits in memory, you need complex transformations (custom functions, ML feature engineering), or you want interactive exploration. In interviews, being able to express the same logic in both SQL and pandas shows versatility.
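As a concrete illustration of expressing the same logic both ways, here is a filtered GROUP BY written as SQL (in comments) and as its pandas equivalent, on a tiny invented orders table:

```python
import pandas as pd

orders = pd.DataFrame({'region': ['east', 'east', 'west'],
                       'amount': [100, 200, 50]})

# SQL:   SELECT region, SUM(amount) AS total
#        FROM orders
#        WHERE amount > 75
#        GROUP BY region;
total = (
    orders
    .query('amount > 75')
    .groupby('region', as_index=False)['amount']
    .sum()
    .rename(columns={'amount': 'total'})
)
print(total)
```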
**What do I do when the data doesn't fit in memory?**
Options: (1) Use pd.read_csv(chunksize=10000) to process in chunks. (2) Use usecols to load only needed columns. (3) Downcast data types (float64 to float32, object to category). (4) Switch to Polars or Dask for lazy evaluation and out-of-core processing. (5) Use SQL/Spark for the heavy filtering, then pandas for the final analysis.
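Option (1) can be sketched as a streaming aggregation; StringIO stands in for a real file path so the example is self-contained:

```python
import io

import pandas as pd

# Aggregate a CSV chunk by chunk so only one chunk is in memory at a time.
csv = io.StringIO("amount\n10\n20\n30\n40\n")
total = 0
for chunk in pd.read_csv(csv, chunksize=2):   # 2 rows per chunk
    total += chunk['amount'].sum()

print(total)  # → 100
```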
**What is the difference between transform() and apply() on a groupby?**
transform() returns a result with the same shape as the input. It broadcasts group-level results back to each row. Use it for "add a column showing each row's group mean." apply() can return any shape: a scalar (reduces), a Series (same shape), or a DataFrame (expands). Use it for complex group operations that don't fit transform()'s same-shape contract. transform() is also faster for built-in functions because it can use optimized C paths.
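A minimal side-by-side on invented data makes the shape difference concrete:

```python
import pandas as pd

df = pd.DataFrame({'team': ['a', 'a', 'b'], 'score': [10, 20, 60]})

# transform broadcasts each group's mean back to every row (same shape)...
df['team_mean'] = df.groupby('team')['score'].transform('mean')

# ...while a plain aggregation reduces to one value per group.
per_team = df.groupby('team')['score'].mean()

print(df['team_mean'].tolist())  # → [15.0, 15.0, 60.0]
print(per_team.to_dict())        # → {'a': 15.0, 'b': 60.0}
```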
**What does SettingWithCopyWarning mean, and how do I fix it?**
This warning means you might be modifying a view (reference) instead of the original DataFrame, so your changes may not stick. Fix it by: (1) Using .loc[] for all assignments: df.loc[mask, 'col'] = value. (2) Using .copy() when you intentionally want a separate DataFrame: subset = df[df['x'] > 5].copy(). (3) Avoiding chained indexing like df['col1']['col2'] — use df.loc[:, ['col1', 'col2']] instead.
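A short sketch of fixes (1) and (2) on toy data:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 6, 8], 'flag': [False, False, False]})

# Fix (1): single .loc assignment on the original frame — changes stick.
df.loc[df['x'] > 5, 'flag'] = True

# Fix (2): explicit .copy() when you want an independent subset.
subset = df[df['x'] > 5].copy()
subset['flag'] = False          # safe: modifies the copy, not df

print(df['flag'].tolist())  # → [False, True, True]
```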
**How should I handle missing data?**
It depends on the context. Options: (1) dropna() if missing rows are few and random. (2) fillna(value) with a constant (0, -1, 'Unknown'). (3) ffill() for time series (carry forward the last known value). (4) interpolate() for continuous measurements. (5) fillna(df.groupby('category')['col'].transform('mean')) for group-specific imputation. In ML, always consider whether missingness itself is informative — create a binary is_missing feature before filling.
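A sketch of option (5) plus the is_missing indicator, on invented data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'col': [10.0, np.nan, 30.0, np.nan]})

# Record where values were missing BEFORE filling, in case
# missingness itself is an informative feature.
df['col_missing'] = df['col'].isna()

# Fill each missing value with its own group's mean.
df['col'] = df['col'].fillna(df.groupby('category')['col'].transform('mean'))

print(df['col'].tolist())  # → [10.0, 10.0, 30.0, 30.0]
```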
**Should I learn pandas or Polars?**
Learn both. Pandas is still the standard in interviews and most existing codebases. Polars is faster for large datasets (10M+ rows) due to its Rust backend and lazy evaluation. Interview advice: solve the problem in pandas first (what they expect), then mention "for production with large data, I would consider Polars for its parallel execution and lower memory usage." This shows awareness of modern tools without risking unfamiliar syntax during a timed interview.
**How do I prepare for a pandas interview?**
Practice these steps: (1) Memorize the core API: groupby, merge, pivot_table, melt, resample, rolling, shift, rank. (2) Practice translating SQL to pandas and back. (3) Always start by examining the data: df.head(), df.dtypes, df.shape, df.isnull().sum(). (4) Write clean method chains. (5) Mention performance considerations unprompted. (6) Verify your output shape and spot-check values. Do 2-3 timed practice sessions per day for 2 weeks before your interview.
Quick Reference Card
```python
# ==============================================
# PANDAS INTERVIEW QUICK REFERENCE
# ==============================================

# SELECTION
df.loc[rows, cols]                          # Label-based (inclusive)
df.iloc[rows, cols]                         # Position-based (exclusive end)
df.query("col > 5")                         # SQL-like filtering
df[df['col'].isin(list)]                    # Set membership
df[df['col'].between(a, b)]                 # Range filter

# GROUPBY
df.groupby('col').agg(...)                  # Aggregate (reduces rows)
df.groupby('col').transform(...)            # Broadcast (keeps shape)
df.groupby('col').apply(...)                # Flexible (any output)
df.groupby('col').filter(...)               # Keep/drop entire groups

# MERGE
pd.merge(a, b, on='key')                    # Inner join
pd.merge(a, b, on='key', how='left')        # Left join
pd.merge(a, b, on='key', how='left', indicator=True)  # Anti-join setup
pd.merge(a, b, how='cross')                 # Cartesian product
pd.concat([a, b], ignore_index=True)        # UNION ALL
a.combine_first(b)                          # COALESCE

# RESHAPE
pd.pivot_table(df, values, index, columns)  # Long to wide
df.melt(id_vars, var_name, value_name)      # Wide to long
df.stack() / df.unstack()                   # Multi-level reshape
pd.crosstab(df['a'], df['b'])               # Frequency table
pd.get_dummies(df, columns=['col'])         # One-hot encode

# WINDOW
df['col'].rolling(n).mean()                 # Moving average
df['col'].expanding().max()                 # Cumulative max
df['col'].ewm(span=n).mean()                # Exponential MA
df['col'].shift(n)                          # Lag feature
df['col'].diff(n)                           # Period-over-period change
df['col'].rank(method='dense')              # Ranking
df['col'].cumsum()                          # Running total

# TIME SERIES
df.resample('D').mean()                     # Resample to daily
pd.date_range(start, end, freq='h')         # Generate dates
ts.dt.tz_localize('UTC')                    # Add timezone
ts.dt.tz_convert('US/Eastern')              # Convert timezone
df['col'].interpolate()                     # Fill time gaps
```