Advanced

Window Functions

Window functions compute values across a set of rows related to the current row without collapsing the result. They are essential for time series analysis, ranking, and computing running statistics. These 6 challenges cover the patterns that senior-level interviews demand.

💡
Why this matters for ML: Rolling averages are used in feature engineering for time series models. Exponential moving averages power financial models. Rank functions are essential for learning-to-rank algorithms. Shift/diff operations create lag features that feed into LSTM and gradient boosting models.

Shared Dataset for This Lesson

import pandas as pd
import numpy as np

# Daily stock prices for multiple tickers
stocks = pd.DataFrame({
    'date': pd.to_datetime(
        ['2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05', '2024-01-08',
         '2024-01-09', '2024-01-10', '2024-01-11', '2024-01-12', '2024-01-16'] * 2
    ),
    'ticker': ['AAPL'] * 10 + ['GOOGL'] * 10,
    'close': [185.50, 186.20, 184.80, 187.10, 188.50,
              187.90, 189.20, 190.50, 189.80, 191.00,
              140.20, 141.50, 139.80, 142.30, 143.80,
              142.90, 144.50, 145.20, 144.80, 146.00],
    'volume': [52_000_000, 48_000_000, 55_000_000, 47_000_000, 51_000_000,
               49_000_000, 53_000_000, 46_000_000, 50_000_000, 54_000_000,
               28_000_000, 25_000_000, 30_000_000, 26_000_000, 29_000_000,
               27_000_000, 31_000_000, 24_000_000, 28_000_000, 32_000_000]
})
stocks = stocks.sort_values(['ticker', 'date']).reset_index(drop=True)

Challenge 1: Rolling Window Calculations

📝
Problem: For each stock: (a) Compute the 3-day simple moving average (SMA) of close price. (b) Compute the 3-day rolling standard deviation of close price. (c) Identify days where the close price crosses above the 3-day SMA (bullish crossover signal).

Solution

# (a) 3-day Simple Moving Average per ticker
stocks['sma_3'] = (
    stocks.groupby('ticker')['close']
    .transform(lambda x: x.rolling(window=3).mean())
).round(2)

print(stocks[stocks['ticker'] == 'AAPL'][['date', 'close', 'sma_3']])
#         date   close   sma_3
# 0 2024-01-02  185.50     NaN
# 1 2024-01-03  186.20     NaN
# 2 2024-01-04  184.80  185.50
# 3 2024-01-05  187.10  186.03
# 4 2024-01-08  188.50  186.80
# 5 2024-01-09  187.90  187.83
# 6 2024-01-10  189.20  188.53
# 7 2024-01-11  190.50  189.20
# 8 2024-01-12  189.80  189.83
# 9 2024-01-16  191.00  190.43

# NOTE: the first (window - 1) rows of each ticker are NaN -- not enough data to fill the window
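If the leading NaNs are unwanted (e.g. for plotting, or to keep early rows usable), `rolling` also accepts `min_periods`, which averages however many observations exist so far. A minimal sketch on a toy series:

```python
import pandas as pd

s = pd.Series([185.50, 186.20, 184.80, 187.10])

# min_periods=1: average whatever is available instead of emitting NaN
sma = s.rolling(window=3, min_periods=1).mean().round(2)
print(sma.tolist())  # [185.5, 185.85, 185.5, 186.03]
```

Keep in mind the first two values are partial-window averages, so they are noisier than a true 3-day SMA.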


# (b) 3-day rolling standard deviation
stocks['rolling_std'] = (
    stocks.groupby('ticker')['close']
    .transform(lambda x: x.rolling(window=3).std())
).round(3)


# (c) Bullish crossover: price crosses above SMA
# Previous day was below SMA, current day is above SMA
stocks['prev_close'] = stocks.groupby('ticker')['close'].shift(1)
stocks['prev_sma'] = stocks.groupby('ticker')['sma_3'].shift(1)  # shift within each ticker
stocks['crossover'] = (
    (stocks['close'] > stocks['sma_3']) &
    (stocks['prev_close'] <= stocks['prev_sma'])
)
crossovers = stocks[stocks['crossover']]
print("\nBullish crossovers:")
print(crossovers[['date', 'ticker', 'close', 'sma_3']])

Challenge 2: Expanding Window (Cumulative)

📝
Problem: (a) Compute the cumulative maximum close price for each stock (all-time high at each point). (b) Compute the drawdown: how far the current price is below the all-time high (in percentage). (c) Find the maximum drawdown for each stock.

Solution

# (a) Cumulative max (all-time high) using expanding window
stocks['cummax'] = stocks.groupby('ticker')['close'].transform(
    lambda x: x.expanding().max()
)

# Alternative using cummax() directly:
stocks['cummax_v2'] = stocks.groupby('ticker')['close'].cummax()

print(stocks[stocks['ticker'] == 'AAPL'][['date', 'close', 'cummax']])
#         date   close  cummax
# 0 2024-01-02  185.50  185.50
# 1 2024-01-03  186.20  186.20
# 2 2024-01-04  184.80  186.20  <-- all-time high stays at 186.20
# 3 2024-01-05  187.10  187.10
# 4 2024-01-08  188.50  188.50
# ...


# (b) Drawdown: (current - peak) / peak * 100
stocks['drawdown_pct'] = (
    (stocks['close'] - stocks['cummax']) / stocks['cummax'] * 100
).round(3)

print(stocks[stocks['ticker'] == 'AAPL'][['date', 'close', 'cummax', 'drawdown_pct']])
# Drawdown is always <= 0 (0 means at all-time high)


# (c) Maximum drawdown per stock
max_drawdown = stocks.groupby('ticker')['drawdown_pct'].min()
print("\nMaximum drawdown by stock:")
print(max_drawdown)
# ticker
# AAPL    -0.752    (worst dip was 0.75% below ATH)
# GOOGL   -1.201    (worst dip was 1.20% below ATH)

# SQL equivalent:
# SELECT ticker, MIN(drawdown_pct) AS max_drawdown
# FROM (
#   SELECT ticker, close,
#          MAX(close) OVER (PARTITION BY ticker ORDER BY date) AS cummax,
#          (close - MAX(close) OVER (PARTITION BY ticker ORDER BY date))
#           / MAX(close) OVER (PARTITION BY ticker ORDER BY date) * 100 AS drawdown_pct
#   FROM stocks
# ) t
# GROUP BY ticker
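A common follow-up is *when* the maximum drawdown occurred. `idxmin` on the drawdown column recovers the row per ticker; a minimal self-contained sketch (using a toy subset of the lesson's prices):

```python
import pandas as pd

df = pd.DataFrame({
    'ticker': ['AAPL'] * 3 + ['GOOGL'] * 3,
    'date': pd.to_datetime(['2024-01-02', '2024-01-03', '2024-01-04'] * 2),
    'close': [185.50, 186.20, 184.80, 140.20, 141.50, 139.80],
})
df['cummax'] = df.groupby('ticker')['close'].cummax()
df['drawdown_pct'] = ((df['close'] - df['cummax']) / df['cummax'] * 100).round(3)

# idxmin returns the row label of the deepest drawdown within each ticker
worst = df.loc[df.groupby('ticker')['drawdown_pct'].idxmin()]
print(worst[['ticker', 'date', 'drawdown_pct']])
```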

Challenge 3: Exponentially Weighted Moving Average (EWM)

📝
Problem: (a) Compute the exponentially weighted moving average with span=3 for each stock. (b) Compare SMA vs EWM: which reacts faster to price changes? (c) Create a signal: when EWM crosses above SMA, it suggests an uptrend.

Solution

# (a) EWM with span=3 (gives more weight to recent observations)
stocks['ewm_3'] = (
    stocks.groupby('ticker')['close']
    .transform(lambda x: x.ewm(span=3, adjust=False).mean())
).round(2)

print(stocks[stocks['ticker'] == 'AAPL'][['date', 'close', 'sma_3', 'ewm_3']])
#         date   close   sma_3   ewm_3
# 0 2024-01-02  185.50     NaN  185.50
# 1 2024-01-03  186.20     NaN  185.85
# 2 2024-01-04  184.80  185.50  185.33
# 3 2024-01-05  187.10  186.03  186.21
# 4 2024-01-08  188.50  186.80  187.36
# ...

# KEY: span=3 means alpha = 2/(3+1) = 0.5
# Recent prices have exponentially higher weight
# EWM reacts faster to recent changes than SMA
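With `adjust=False` the update rule is the recurrence `ewm_t = alpha * x_t + (1 - alpha) * ewm_{t-1}`, seeded with the first observation. A minimal sketch verifying `ewm()` against the hand-rolled recurrence (toy prices chosen so the arithmetic is exact):

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0])
alpha = 2 / (3 + 1)  # span=3 -> alpha = 0.5

# pandas EWM with adjust=False
ewm = prices.ewm(span=3, adjust=False).mean()

# hand-rolled recurrence: ewm_t = alpha*x_t + (1-alpha)*ewm_{t-1}
manual = [float(prices.iloc[0])]
for x in prices.iloc[1:]:
    manual.append(float(alpha * x + (1 - alpha) * manual[-1]))

print(ewm.tolist())  # [100.0, 101.0, 101.0]
print(manual)        # [100.0, 101.0, 101.0]
```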


# (b) Reaction speed comparison
# After a sharp move, EWM follows more quickly
stocks['sma_lag'] = (stocks['close'] - stocks['sma_3']).abs()
stocks['ewm_lag'] = (stocks['close'] - stocks['ewm_3']).abs()

avg_lag = stocks.groupby('ticker').agg(
    avg_sma_distance=('sma_lag', 'mean'),
    avg_ewm_distance=('ewm_lag', 'mean')
).round(3)
print("\nAverage distance from price (lower = tracks more closely):")
print(avg_lag)
# EWM typically has lower average distance, meaning it tracks price better


# (c) EWM-SMA crossover signal
stocks['ewm_above_sma'] = stocks['ewm_3'] > stocks['sma_3']
stocks['prev_ewm_above'] = stocks.groupby('ticker')['ewm_above_sma'].shift(1)
stocks['trend_signal'] = (
    (stocks['ewm_above_sma'] == True) &
    (stocks['prev_ewm_above'] == False)
)

signals = stocks[stocks['trend_signal']][['date', 'ticker', 'close', 'sma_3', 'ewm_3']]
print("\nUptrend signals (EWM crosses above SMA):")
print(signals)

Challenge 4: Ranking Within Groups

📝
Problem: (a) Rank each day's volume within its stock ticker (highest volume = rank 1). (b) Compute percentile ranks: what percentile is each day's close price within its stock history? (c) Find the top 3 highest-volume days across all stocks, with their rank within each stock.

Solution

# (a) Rank by volume within each ticker (dense rank, highest first)
stocks['volume_rank'] = stocks.groupby('ticker')['volume'].rank(
    method='dense', ascending=False
).astype(int)

print(stocks[stocks['ticker'] == 'AAPL'][['date', 'volume', 'volume_rank']].head())
#         date    volume  volume_rank
# 0 2024-01-02  52000000            4
# 1 2024-01-03  48000000            8
# 2 2024-01-04  55000000            1
# 3 2024-01-05  47000000            9
# 4 2024-01-08  51000000            5

# Rank methods explained:
# 'dense'   : 1, 2, 3, 3, 4  (no gaps after ties)
# 'min'     : 1, 2, 3, 3, 5  (SQL-style RANK)
# 'first'   : 1, 2, 3, 4, 5  (by position, no ties)
# 'average' : 1, 2, 3.5, 3.5, 5  (default)


# (b) Percentile rank: what % of values are below this?
stocks['price_percentile'] = (
    stocks.groupby('ticker')['close']
    .rank(pct=True)
    * 100
).round(1)

print(stocks[stocks['ticker'] == 'GOOGL'][['date', 'close', 'price_percentile']])
# A percentile rank of 90 means this close sits at the 90th percentile:
# 90% of that stock's closes (itself included) are at or below it


# (c) Top 3 volume days across all stocks
stocks['global_volume_rank'] = stocks['volume'].rank(
    method='dense', ascending=False
).astype(int)

top3_volume = stocks.nsmallest(3, 'global_volume_rank')
print("\nTop 3 volume days:")
print(top3_volume[['date', 'ticker', 'volume', 'volume_rank', 'global_volume_rank']])

# SQL equivalent:
# SELECT *, DENSE_RANK() OVER (PARTITION BY ticker ORDER BY volume DESC) AS volume_rank,
#           DENSE_RANK() OVER (ORDER BY volume DESC) AS global_rank
# FROM stocks
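The rank-method cheat sheet from part (a) is easy to verify on a tiny series with one tie:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 30, 40])  # tie at 30

for method in ['dense', 'min', 'first', 'average']:
    print(f"{method:8s}", s.rank(method=method).tolist())
# dense    [1.0, 2.0, 3.0, 3.0, 4.0]
# min      [1.0, 2.0, 3.0, 3.0, 5.0]
# first    [1.0, 2.0, 3.0, 4.0, 5.0]
# average  [1.0, 2.0, 3.5, 3.5, 5.0]
```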

Challenge 5: shift() and diff() for Lag Features

📝
Problem: (a) Compute the daily return (percentage change from previous day) for each stock. (b) Create lag features: previous 1, 2, and 3 day closing prices as separate columns. (c) Compute the 3-day momentum: price change over the last 3 trading days, both absolute and percentage.

Solution

# (a) Daily returns using pct_change (which uses shift internally)
stocks['daily_return'] = (
    stocks.groupby('ticker')['close']
    .pct_change()
    * 100
).round(3)

# Manual calculation using shift:
stocks['daily_return_v2'] = (
    (stocks['close'] - stocks.groupby('ticker')['close'].shift(1))
    / stocks.groupby('ticker')['close'].shift(1)
    * 100
).round(3)

print(stocks[stocks['ticker'] == 'AAPL'][['date', 'close', 'daily_return']].head())
#         date   close  daily_return
# 0 2024-01-02  185.50          NaN
# 1 2024-01-03  186.20        0.377
# 2 2024-01-04  184.80       -0.752
# 3 2024-01-05  187.10        1.245
# 4 2024-01-08  188.50        0.748


# (b) Lag features (critical for time series ML models)
for lag in [1, 2, 3]:
    stocks[f'close_lag_{lag}'] = stocks.groupby('ticker')['close'].shift(lag)

print(stocks[stocks['ticker'] == 'AAPL'][
    ['date', 'close', 'close_lag_1', 'close_lag_2', 'close_lag_3']
].head(5))
#         date   close  close_lag_1  close_lag_2  close_lag_3
# 0 2024-01-02  185.50          NaN          NaN          NaN
# 1 2024-01-03  186.20       185.50          NaN          NaN
# 2 2024-01-04  184.80       186.20       185.50          NaN
# 3 2024-01-05  187.10       184.80       186.20       185.50
# 4 2024-01-08  188.50       187.10       184.80       186.20


# (c) 3-day momentum using diff()
stocks['momentum_3d'] = stocks.groupby('ticker')['close'].diff(3).round(2)
stocks['momentum_3d_pct'] = (
    stocks['momentum_3d'] / stocks.groupby('ticker')['close'].shift(3) * 100
).round(3)

print(stocks[stocks['ticker'] == 'AAPL'][
    ['date', 'close', 'momentum_3d', 'momentum_3d_pct']
].head(6))
#         date   close  momentum_3d  momentum_3d_pct
# 0 2024-01-02  185.50          NaN              NaN
# 1 2024-01-03  186.20          NaN              NaN
# 2 2024-01-04  184.80          NaN              NaN
# 3 2024-01-05  187.10         1.60            0.863
# 4 2024-01-08  188.50         2.30            1.235
# 5 2024-01-09  187.90         3.10            1.677

💡
ML feature engineering: Lag features and momentum are among the most powerful features for time series models (XGBoost, LSTM). In interviews, always mention that you must be careful about data leakage: lag features must only use past data, never future data. Also, the first N rows will have NaN values that need handling (drop or fill).
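The NaN handling mentioned above usually comes down to one `dropna` over the lag columns once all features are built; filling them instead would fabricate history the model never had. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'ticker': ['AAPL'] * 4,
    'close': [185.50, 186.20, 184.80, 187.10],
})
lag_cols = []
for lag in [1, 2]:
    col = f'close_lag_{lag}'
    df[col] = df.groupby('ticker')['close'].shift(lag)
    lag_cols.append(col)

# Drop the rows with incomplete lag history before training
train = df.dropna(subset=lag_cols)
print(len(train))  # 2 -- the first 2 rows lack a full set of lags
```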

Challenge 6: Cumulative Operations

📝
Problem: (a) Compute the cumulative sum of volume for each stock (running total). (b) Compute the cumulative product of (1 + daily_return) to get the total return index. (c) For each stock, find the day where cumulative volume first exceeded 200 million.

Solution

# (a) Cumulative volume (running total)
stocks['cum_volume'] = stocks.groupby('ticker')['volume'].cumsum()

print(stocks[stocks['ticker'] == 'AAPL'][['date', 'volume', 'cum_volume']].head())
#         date    volume   cum_volume
# 0 2024-01-02  52000000    52000000
# 1 2024-01-03  48000000   100000000
# 2 2024-01-04  55000000   155000000
# 3 2024-01-05  47000000   202000000
# 4 2024-01-08  51000000   253000000


# (b) Total return index using cumulative product
# 1 + return gives the daily multiplier; cumulative product gives total growth
stocks['return_factor'] = 1 + stocks['daily_return'].fillna(0) / 100
stocks['return_index'] = (
    stocks.groupby('ticker')['return_factor']
    .cumprod()
).round(4)

print(stocks[stocks['ticker'] == 'AAPL'][['date', 'close', 'daily_return', 'return_index']])
#         date   close  daily_return  return_index
# 0 2024-01-02  185.50          NaN        1.0000
# 1 2024-01-03  186.20        0.377        1.0038
# 2 2024-01-04  184.80       -0.752        0.9962
# 3 2024-01-05  187.10        1.244        1.0086
# ...
# 9 2024-01-16  191.00        0.631        1.0296
# Total return: 2.96% over the period
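A quick sanity check on part (b): built from exact (unrounded) returns, the cumulative product telescopes back to `close / first_close`, so the return index should match that ratio. (The lesson's small differences come only from rounding `daily_return` to 3 decimals first.)

```python
import pandas as pd

close = pd.Series([185.50, 186.20, 184.80, 187.10])

# Return index from exact daily returns
factors = 1 + close.pct_change().fillna(0)
return_index = factors.cumprod()

# Telescoping product: (c1/c0) * (c2/c1) * ... = c_t/c0
print(return_index.round(4).tolist())            # [1.0, 1.0038, 0.9962, 1.0086]
print((close / close.iloc[0]).round(4).tolist())  # same values
```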


# (c) First day cumulative volume exceeded 200M
def first_above_threshold(group, col, threshold):
    """Return the first row where the cumulative value of `col` exceeds threshold."""
    cumulative = group[col].cumsum()
    above = cumulative[cumulative > threshold]
    if len(above) > 0:
        return group.loc[above.index[0]]
    return None

threshold = 200_000_000
for ticker, group in stocks.groupby('ticker'):
    row = first_above_threshold(group, 'volume', threshold)
    if row is not None:
        cum = group.loc[:row.name, 'volume'].sum()
        print(f"{ticker}: Cumulative volume first exceeded 200M on {row['date'].date()}"
              f" (cumulative: {cum:,})")
# AAPL: Cumulative volume first exceeded 200M on 2024-01-05 (cumulative: 202,000,000)
# GOOGL: Cumulative volume first exceeded 200M on 2024-01-11 (cumulative: 220,000,000)

# SQL equivalent:
# SELECT ticker, MIN(date) AS first_date
# FROM (
#   SELECT *, SUM(volume) OVER (PARTITION BY ticker ORDER BY date) AS cum_vol
#   FROM stocks
# ) t
# WHERE cum_vol > 200000000
# GROUP BY ticker
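The loop in (c) can also be vectorized, mirroring the SQL above: filter to rows past the threshold, then take the first surviving row per ticker (rows are already in date order within each ticker). A self-contained sketch with made-up GOOGL volumes:

```python
import pandas as pd

df = pd.DataFrame({
    'ticker': ['AAPL'] * 4 + ['GOOGL'] * 4,
    'date': pd.to_datetime(['2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'] * 2),
    'volume': [52_000_000, 48_000_000, 55_000_000, 47_000_000,   # lesson's AAPL start
               60_000_000, 70_000_000, 80_000_000, 50_000_000],  # hypothetical GOOGL volumes
})
df['cum_volume'] = df.groupby('ticker')['volume'].cumsum()

# Rows past the threshold, then the first such row in each ticker
first_cross = df[df['cum_volume'] > 200_000_000].groupby('ticker').first()
print(first_cross[['date', 'cum_volume']])
# AAPL first crosses on 2024-01-05 (202,000,000); GOOGL on 2024-01-04 (210,000,000)
```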