Advanced
Window Functions
Window functions compute values across a set of rows related to the current row without collapsing the result. They are essential for time series analysis, ranking, and computing running statistics. These 6 challenges cover the patterns that senior-level interviews demand.
Why this matters for ML: Rolling averages are used in feature engineering for time series models. Exponential moving averages power financial models. Rank functions are essential for learning-to-rank algorithms. Shift/diff operations create lag features that feed into LSTM and gradient boosting models.
Shared Dataset for This Lesson
import pandas as pd
import numpy as np
# Daily stock prices for multiple tickers
stocks = pd.DataFrame({
    'date': pd.to_datetime(
        ['2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05', '2024-01-08',
         '2024-01-09', '2024-01-10', '2024-01-11', '2024-01-12', '2024-01-16'] * 2
    ),
    'ticker': ['AAPL'] * 10 + ['GOOGL'] * 10,
    'close': [185.50, 186.20, 184.80, 187.10, 188.50,
              187.90, 189.20, 190.50, 189.80, 191.00,
              140.20, 141.50, 139.80, 142.30, 143.80,
              142.90, 144.50, 145.20, 144.80, 146.00],
    'volume': [52_000_000, 48_000_000, 55_000_000, 47_000_000, 51_000_000,
               49_000_000, 53_000_000, 46_000_000, 50_000_000, 54_000_000,
               28_000_000, 25_000_000, 30_000_000, 26_000_000, 29_000_000,
               27_000_000, 31_000_000, 24_000_000, 28_000_000, 32_000_000]
})
stocks = stocks.sort_values(['ticker', 'date']).reset_index(drop=True)
Challenge 1: Rolling Window Calculations
Problem: For each stock: (a) Compute the 3-day simple moving average (SMA) of close price. (b) Compute the 3-day rolling standard deviation of close price. (c) Identify days where the close price crosses above the 3-day SMA (bullish crossover signal).
Solution
# (a) 3-day Simple Moving Average per ticker
stocks['sma_3'] = (
    stocks.groupby('ticker')['close']
    .transform(lambda x: x.rolling(window=3).mean())
).round(2)
print(stocks[stocks['ticker'] == 'AAPL'][['date', 'close', 'sma_3']])
# date close sma_3
# 0 2024-01-02 185.50 NaN
# 1 2024-01-03 186.20 NaN
# 2 2024-01-04 184.80 185.50
# 3 2024-01-05 187.10 186.03
# 4 2024-01-08 188.50 186.80
# 5 2024-01-09 187.90 187.83
# 6 2024-01-10 189.20 188.53
# 7 2024-01-11 190.50 189.20
# 8 2024-01-12 189.80 189.83
# 9 2024-01-16 191.00 190.43
# NOTE: First (window-1) rows are NaN because not enough data
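Those leading NaNs are controlled by `min_periods`: with `min_periods=1` the window starts emitting averages over whatever data it has so far. A small standalone sketch on the first four AAPL closes:

```python
import math
import pandas as pd

s = pd.Series([185.50, 186.20, 184.80, 187.10])

# Default: first (window - 1) values are NaN
default = s.rolling(window=3).mean().round(2)
# min_periods=1: partial windows are averaged instead of producing NaN
partial = s.rolling(window=3, min_periods=1).mean().round(2)

print(default.tolist())  # first two entries are NaN
print(partial.tolist())  # [185.5, 185.85, 185.5, 186.03]
```

Whether to keep NaNs or use partial windows depends on downstream use: models usually prefer dropping the incomplete rows, dashboards often prefer `min_periods=1`.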
# (b) 3-day rolling standard deviation
stocks['rolling_std'] = (
    stocks.groupby('ticker')['close']
    .transform(lambda x: x.rolling(window=3).std())
).round(3)
# (c) Bullish crossover: price crosses above SMA
# Previous day was below SMA, current day is above SMA
stocks['prev_close'] = stocks.groupby('ticker')['close'].shift(1)
# Shift the SMA per ticker too -- a plain stocks['sma_3'].shift(1) would
# leak the last AAPL value into the first GOOGL row
stocks['prev_sma'] = stocks.groupby('ticker')['sma_3'].shift(1)
stocks['crossover'] = (
    (stocks['close'] > stocks['sma_3']) &
    (stocks['prev_close'] <= stocks['prev_sma'])
)
crossovers = stocks[stocks['crossover']]
print("\nBullish crossovers:")
print(crossovers[['date', 'ticker', 'close', 'sma_3']])
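The grouped shift matters: calling `shift()` on the raw column instead of the grouped column leaks the last value of one group into the first row of the next. A toy two-group frame (hypothetical data) makes this visible:

```python
import pandas as pd

df = pd.DataFrame({'g': ['A', 'A', 'B', 'B'], 'x': [1, 2, 3, 4]})

# Plain shift: the first 'B' row receives the last 'A' value (leakage)
df['bad_lag'] = df['x'].shift(1)
# Grouped shift: each group restarts with NaN
df['good_lag'] = df.groupby('g')['x'].shift(1)
print(df)
```

Here `bad_lag` for the first 'B' row is 2 (an 'A' value), while `good_lag` is correctly NaN.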
Challenge 2: Expanding Window (Cumulative)
Problem: (a) Compute the cumulative maximum close price for each stock (all-time high at each point). (b) Compute the drawdown: how far the current price is below the all-time high (in percentage). (c) Find the maximum drawdown for each stock.
Solution
# (a) Cumulative max (all-time high) using expanding window
stocks['cummax'] = stocks.groupby('ticker')['close'].transform(
    lambda x: x.expanding().max()
)
# Alternative using cummax() directly:
stocks['cummax_v2'] = stocks.groupby('ticker')['close'].cummax()
print(stocks[stocks['ticker'] == 'AAPL'][['date', 'close', 'cummax']])
# date close cummax
# 0 2024-01-02 185.50 185.50
# 1 2024-01-03 186.20 186.20
# 2 2024-01-04 184.80 186.20 <-- all-time high stays at 186.20
# 3 2024-01-05 187.10 187.10
# 4 2024-01-08 188.50 188.50
# ...
# (b) Drawdown: (current - peak) / peak * 100
stocks['drawdown_pct'] = (
    (stocks['close'] - stocks['cummax']) / stocks['cummax'] * 100
).round(3)
print(stocks[stocks['ticker'] == 'AAPL'][['date', 'close', 'cummax', 'drawdown_pct']])
# Drawdown is always <= 0 (0 means at all-time high)
# (c) Maximum drawdown per stock
max_drawdown = stocks.groupby('ticker')['drawdown_pct'].min()
print(f"\nMaximum drawdown by stock:")
print(max_drawdown)
# ticker
# AAPL -0.752 (worst dip was 0.75% below ATH)
# GOOGL -1.201 (worst dip was 1.20% below ATH)
# SQL equivalent:
# SELECT ticker, MIN(drawdown_pct) AS max_drawdown
# FROM (
# SELECT ticker, close,
# MAX(close) OVER (PARTITION BY ticker ORDER BY date) AS cummax,
# (close - MAX(close) OVER (PARTITION BY ticker ORDER BY date))
# / MAX(close) OVER (PARTITION BY ticker ORDER BY date) * 100 AS drawdown_pct
# FROM stocks
# ) t
# GROUP BY ticker
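The running-peak logic also has a compact NumPy form via `np.maximum.accumulate`, which is handy as a cross-check. A standalone sketch on made-up prices:

```python
import numpy as np

prices = np.array([100.0, 105.0, 102.0, 110.0, 99.0])

# Running all-time high at each point: [100, 105, 105, 110, 110]
running_peak = np.maximum.accumulate(prices)
drawdown_pct = (prices - running_peak) / running_peak * 100

print(drawdown_pct.round(3))      # 0 wherever the price sits at its peak
print(round(drawdown_pct.min(), 3))  # maximum drawdown = most negative value
```

This is the same computation the grouped `expanding().max()` performs per ticker, just without the pandas machinery.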
Challenge 3: Exponentially Weighted Moving Average (EWM)
Problem: (a) Compute the exponentially weighted moving average with span=3 for each stock. (b) Compare SMA vs EWM: which reacts faster to price changes? (c) Create a signal: when EWM crosses above SMA, it suggests an uptrend.
Solution
# (a) EWM with span=3 (gives more weight to recent observations)
stocks['ewm_3'] = (
    stocks.groupby('ticker')['close']
    .transform(lambda x: x.ewm(span=3, adjust=False).mean())
).round(2)
print(stocks[stocks['ticker'] == 'AAPL'][['date', 'close', 'sma_3', 'ewm_3']])
# date close sma_3 ewm_3
# 0 2024-01-02 185.50 NaN 185.50
# 1 2024-01-03 186.20 NaN 185.85
# 2 2024-01-04 184.80 185.50 185.33
# 3 2024-01-05 187.10 186.03 186.21
# 4 2024-01-08 188.50 186.80 187.36
# ...
# KEY: span=3 means alpha = 2/(3+1) = 0.5
# Recent prices have exponentially higher weight
# EWM reacts faster to recent changes than SMA
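The `adjust=False` recursion can be verified by hand: each value is `alpha * x_t + (1 - alpha) * prev`, with `alpha = 2 / (span + 1)`. A quick check against pandas on the first few AAPL closes:

```python
import pandas as pd

s = pd.Series([185.50, 186.20, 184.80, 187.10])
alpha = 2 / (3 + 1)  # span=3 -> alpha = 0.5

# Manual recursion: seed with the first value, then blend in each new one
manual = [s.iloc[0]]
for x in s.iloc[1:]:
    manual.append(alpha * x + (1 - alpha) * manual[-1])

pandas_ewm = s.ewm(span=3, adjust=False).mean()
print([round(v, 2) for v in manual])
print(pandas_ewm.round(2).tolist())  # same values
```

With `adjust=True` (the default) pandas instead normalizes by the sum of weights, which mostly affects the early values of the series.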
# (b) Reaction speed comparison
# After a sharp move, EWM follows more quickly
stocks['sma_lag'] = abs(stocks['close'] - stocks['sma_3'])
stocks['ewm_lag'] = abs(stocks['close'] - stocks['ewm_3'])
avg_lag = stocks.groupby('ticker').agg(
    avg_sma_distance=('sma_lag', 'mean'),
    avg_ewm_distance=('ewm_lag', 'mean')
).round(3)
print("\nAverage distance from price (lower = tracks more closely):")
print(avg_lag)
# EWM typically has lower average distance, meaning it tracks price better
# (c) EWM-SMA crossover signal
stocks['ewm_above_sma'] = stocks['ewm_3'] > stocks['sma_3']
stocks['prev_ewm_above'] = stocks.groupby('ticker')['ewm_above_sma'].shift(1)
stocks['trend_signal'] = (
    (stocks['ewm_above_sma'] == True) &
    (stocks['prev_ewm_above'] == False)
)
signals = stocks[stocks['trend_signal']][['date', 'ticker', 'close', 'sma_3', 'ewm_3']]
print("\nUptrend signals (EWM crosses above SMA):")
print(signals)
Challenge 4: Ranking Within Groups
Problem: (a) Rank each day's volume within its stock ticker (highest volume = rank 1). (b) Compute percentile ranks: what percentile is each day's close price within its stock history? (c) Find the top 3 highest-volume days across all stocks, with their rank within each stock.
Solution
# (a) Rank by volume within each ticker (dense rank, highest first)
stocks['volume_rank'] = stocks.groupby('ticker')['volume'].rank(
    method='dense', ascending=False
).astype(int)
print(stocks[stocks['ticker'] == 'AAPL'][['date', 'volume', 'volume_rank']].head())
# date volume volume_rank
# 0 2024-01-02 52000000 4
# 1 2024-01-03 48000000 8
# 2 2024-01-04 55000000 1
# 3 2024-01-05 47000000 9
# 4 2024-01-08 51000000 5
# Rank methods explained:
# 'dense' : 1, 2, 3, 3, 4 (no gaps after ties)
# 'min' : 1, 2, 3, 3, 5 (SQL-style RANK)
# 'first' : 1, 2, 3, 4, 5 (by position, no ties)
# 'average' : 1, 2, 3.5, 3.5, 5 (default)
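Those four methods can be reproduced on a tiny series with one tie (the same values as the comment above):

```python
import pandas as pd

# One tie at 30 to show how each method breaks it
s = pd.Series([10, 20, 30, 30, 40])
for method in ['dense', 'min', 'first', 'average']:
    print(f"{method:8}", s.rank(method=method).tolist())
# dense    [1.0, 2.0, 3.0, 3.0, 4.0]
# min      [1.0, 2.0, 3.0, 3.0, 5.0]
# first    [1.0, 2.0, 3.0, 4.0, 5.0]
# average  [1.0, 2.0, 3.5, 3.5, 5.0]
```

In interviews, `method='min'` is the one to reach for when asked to replicate SQL's `RANK()`, and `'dense'` for `DENSE_RANK()`.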
# (b) Percentile rank: what % of values are below this?
stocks['price_percentile'] = (
    stocks.groupby('ticker')['close']
    .rank(pct=True)
    * 100
).round(1)
print(stocks[stocks['ticker'] == 'GOOGL'][['date', 'close', 'price_percentile']])
# A percentile of 90% means this day's close is at or above 90% of that stock's closes
# (c) Top 3 volume days across all stocks
stocks['global_volume_rank'] = stocks['volume'].rank(
    method='dense', ascending=False
).astype(int)
top3_volume = stocks.nsmallest(3, 'global_volume_rank')
print("\nTop 3 volume days:")
print(top3_volume[['date', 'ticker', 'volume', 'volume_rank', 'global_volume_rank']])
# SQL equivalent:
# SELECT *, DENSE_RANK() OVER (PARTITION BY ticker ORDER BY volume DESC) AS volume_rank,
# DENSE_RANK() OVER (ORDER BY volume DESC) AS global_rank
# FROM stocks
Challenge 5: shift() and diff() for Lag Features
Problem: (a) Compute the daily return (percentage change from previous day) for each stock. (b) Create lag features: previous 1, 2, and 3 day closing prices as separate columns. (c) Compute the 3-day momentum: price change over the last 3 trading days, both absolute and percentage.
Solution
# (a) Daily returns using pct_change (which uses shift internally)
stocks['daily_return'] = (
    stocks.groupby('ticker')['close']
    .pct_change()
    * 100
).round(3)
# Manual calculation using shift:
stocks['daily_return_v2'] = (
    (stocks['close'] - stocks.groupby('ticker')['close'].shift(1))
    / stocks.groupby('ticker')['close'].shift(1)
    * 100
).round(3)
print(stocks[stocks['ticker'] == 'AAPL'][['date', 'close', 'daily_return']].head())
# date close daily_return
# 0 2024-01-02 185.50 NaN
# 1 2024-01-03 186.20 0.377
# 2 2024-01-04 184.80 -0.752
# 3 2024-01-05 187.10 1.245
# 4 2024-01-08 188.50 0.748
# (b) Lag features (critical for time series ML models)
for lag in [1, 2, 3]:
    stocks[f'close_lag_{lag}'] = stocks.groupby('ticker')['close'].shift(lag)
print(stocks[stocks['ticker'] == 'AAPL'][
    ['date', 'close', 'close_lag_1', 'close_lag_2', 'close_lag_3']
].head(5))
# date close close_lag_1 close_lag_2 close_lag_3
# 0 2024-01-02 185.50 NaN NaN NaN
# 1 2024-01-03 186.20 185.50 NaN NaN
# 2 2024-01-04 184.80 186.20 185.50 NaN
# 3 2024-01-05 187.10 184.80 186.20 185.50
# 4 2024-01-08 188.50 187.10 184.80 186.20
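Before these lag columns feed a model, the leading NaN rows need handling; the simplest option is dropping them. A standalone sketch with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({
    'g': ['A'] * 5,
    'y': [1.0, 2.0, 3.0, 4.0, 5.0],
})
# Same pattern as above: lag features via grouped shift
for lag in [1, 2, 3]:
    df[f'y_lag_{lag}'] = df.groupby('g')['y'].shift(lag)

# Rows missing any lag are dropped: the first max(lag) rows per group
model_ready = df.dropna()
print(len(df), '->', len(model_ready))  # 5 -> 2
```

Filling with a sentinel or back-propagating values is also possible, but filling lags with future-derived statistics reintroduces the leakage the lags were designed to avoid.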
# (c) 3-day momentum using diff()
stocks['momentum_3d'] = stocks.groupby('ticker')['close'].diff(3).round(2)
stocks['momentum_3d_pct'] = (
    stocks['momentum_3d'] / stocks.groupby('ticker')['close'].shift(3) * 100
).round(3)
print(stocks[stocks['ticker'] == 'AAPL'][
    ['date', 'close', 'momentum_3d', 'momentum_3d_pct']
].head(6))
# date close momentum_3d momentum_3d_pct
# 0 2024-01-02 185.50 NaN NaN
# 1 2024-01-03 186.20 NaN NaN
# 2 2024-01-04 184.80 NaN NaN
# 3 2024-01-05 187.10 1.60 0.863
# 4 2024-01-08 188.50 2.30 1.235
# 5 2024-01-09 187.90 3.10 1.677
ML feature engineering: Lag features and momentum are among the most powerful features for time series models (XGBoost, LSTM). In interviews, always mention that you must be careful about data leakage: lag features must only use past data, never future data. Also, the first N rows will have NaN values that need handling (drop or fill).
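One concrete leakage trap: a rolling mean whose window ends at the current row already contains today's value, so it partially encodes any target derived from the same day. Shifting before rolling keeps the feature strictly in the past; a minimal sketch:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Leaky: the window ends at the current row, so it includes today's value
leaky = s.rolling(3).mean()
# Safe: shift first so the window covers only days t-1, t-2, t-3
safe = s.shift(1).rolling(3).mean()

print(leaky.tolist())
print(safe.tolist())  # every value lags the leaky version by one row
```

The same one-row shift applies to rolling std, EWM, and any other window feature used to predict a same-day target.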
Challenge 6: Cumulative Operations
Problem: (a) Compute the cumulative sum of volume for each stock (running total). (b) Compute the cumulative product of (1 + daily_return) to get the total return index. (c) For each stock, find the day where cumulative volume first exceeded 200 million.
Solution
# (a) Cumulative volume (running total)
stocks['cum_volume'] = stocks.groupby('ticker')['volume'].cumsum()
print(stocks[stocks['ticker'] == 'AAPL'][['date', 'volume', 'cum_volume']].head())
# date volume cum_volume
# 0 2024-01-02 52000000 52000000
# 1 2024-01-03 48000000 100000000
# 2 2024-01-04 55000000 155000000
# 3 2024-01-05 47000000 202000000
# 4 2024-01-08 51000000 253000000
# (b) Total return index using cumulative product
# 1 + return gives the daily multiplier; cumulative product gives total growth
stocks['return_factor'] = 1 + stocks['daily_return'].fillna(0) / 100
stocks['return_index'] = (
    stocks.groupby('ticker')['return_factor']
    .cumprod()
).round(4)
print(stocks[stocks['ticker'] == 'AAPL'][['date', 'close', 'daily_return', 'return_index']])
# date close daily_return return_index
# 0 2024-01-02 185.50 NaN 1.0000
# 1 2024-01-03 186.20 0.377 1.0038
# 2 2024-01-04 184.80 -0.752 0.9962
# 3 2024-01-05 187.10 1.245 1.0086
# ...
# 9 2024-01-16 191.00 0.632 1.0296
# Total return: 2.96% over the period
# (c) First day cumulative volume exceeded 200M
def first_above_threshold(group, col, threshold):
    """Return (row, cumulative value) at the first point the running total of col exceeds threshold."""
    cumulative = group[col].cumsum()
    above = cumulative[cumulative > threshold]
    if len(above) > 0:
        return group.loc[above.index[0]], above.iloc[0]
    return None, None

threshold = 200_000_000
for ticker, group in stocks.groupby('ticker'):
    first_day, cum_vol = first_above_threshold(group, 'volume', threshold)
    if first_day is not None:
        print(f"{ticker}: Cumulative volume first exceeded 200M on {first_day['date'].date()}"
              f" (cumulative: {cum_vol:,})")
# AAPL: Cumulative volume first exceeded 200M on 2024-01-05 (cumulative: 202,000,000)
# GOOGL: Cumulative volume first exceeded 200M on 2024-01-11 (cumulative: 220,000,000)
# SQL equivalent:
# SELECT ticker, MIN(date) AS first_date
# FROM (
# SELECT *, SUM(volume) OVER (PARTITION BY ticker ORDER BY date) AS cum_vol
# FROM stocks
# ) t
# WHERE cum_vol > 200000000
# GROUP BY ticker
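The loop above can also be vectorized: filter rows where the running total exceeds the threshold, then take the first surviving row per group. A standalone sketch on toy data (hypothetical tickers and volumes):

```python
import pandas as pd

df = pd.DataFrame({
    'ticker': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': pd.to_datetime(['2024-01-02', '2024-01-03', '2024-01-04'] * 2),
    'volume': [60, 80, 70, 40, 50, 120],
})
df['cum_vol'] = df.groupby('ticker')['volume'].cumsum()

# Keep only rows past the threshold, then the earliest such row per ticker
first_above = (
    df[df['cum_vol'] > 100]
    .groupby('ticker')
    .head(1)
)
print(first_above[['ticker', 'date', 'cum_vol']])
```

This mirrors the SQL version: the filter plays the role of the `WHERE cum_vol > ...` clause and `head(1)` per group plays the role of `MIN(date)`.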