Time Series Operations

Time series data is everywhere: financial markets, IoT sensors, web analytics, and user behavior logs. These 5 challenges cover the pandas operations that fintech, trading, and data engineering interviews demand.

💡
Why this matters for ML: Time series feature engineering is critical for forecasting models (ARIMA, Prophet, LSTM). Resampling creates training data at the right granularity. Lag features feed into gradient boosting models. Timezone handling prevents data leakage when training on global data.

Shared Dataset for This Lesson

import pandas as pd
import numpy as np

# Sensor data: temperature readings every hour for a week
np.random.seed(42)
dates = pd.date_range('2024-03-01', periods=168, freq='h')  # 7 days * 24 hours
sensor_data = pd.DataFrame({
    'timestamp': np.tile(dates, 2),                 # 168 hourly stamps per sensor
    'sensor_id': np.repeat(['S1', 'S2'], 168),
    'temperature': np.round(
        20 + 5 * np.sin(np.tile(np.arange(168), 2) * 2 * np.pi / 24)  # Daily cycle
        + np.random.normal(0, 1.5, 336), 1                            # Random noise
    ),
    'humidity': np.round(np.random.uniform(30, 80, 336), 1)
})
sensor_data = sensor_data.sort_values(['sensor_id', 'timestamp']).reset_index(drop=True)

Challenge 1: resample() for Temporal Aggregation

📝
Problem: (a) Resample hourly sensor data to daily averages per sensor. (b) Resample to 6-hour windows and compute min, max, and mean temperature. (c) Resample to weekly data and compute the temperature range (max - min) per sensor per week.

Solution

# resample() requires a DatetimeIndex
sensor_ts = sensor_data.set_index('timestamp')

# (a) Daily average per sensor
daily_avg = (
    sensor_ts.groupby('sensor_id')
    .resample('D')['temperature']
    .mean()
    .round(1)
)
print("Daily average temperature:")
print(daily_avg.head(10))
# sensor_id  timestamp
# S1         2024-03-01    19.8
#            2024-03-02    20.2
#            2024-03-03    19.5
# ...

# Reset index to get a clean DataFrame
daily_df = daily_avg.reset_index()
daily_df.columns = ['sensor_id', 'date', 'avg_temp']
print(daily_df.head())


# (b) 6-hour windows with multiple aggregations
six_hour = (
    sensor_ts.groupby('sensor_id')
    .resample('6h')['temperature']
    .agg(['min', 'max', 'mean'])
    .round(1)
)
print("\n6-hour temperature summary:")
print(six_hour.head(8))
# Shows 4 windows per day: 00:00-05:59, 06:00-11:59, 12:00-17:59, 18:00-23:59


# (c) Weekly temperature range
weekly_range = (
    sensor_ts.groupby('sensor_id')
    .resample('W')['temperature']
    .agg(temp_min='min', temp_max='max')
)
weekly_range['temp_range'] = weekly_range['temp_max'] - weekly_range['temp_min']
print("\nWeekly temperature range:")
print(weekly_range.round(1))

# SQL equivalent:
# SELECT sensor_id,
#        DATE_TRUNC('day', timestamp) AS day,
#        AVG(temperature) AS avg_temp
# FROM sensor_data
# GROUP BY sensor_id, DATE_TRUNC('day', timestamp)
# ORDER BY sensor_id, day
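Setting the index first is the classic pattern, but `resample()` can also work directly off a column via its `on=` parameter, which saves the `set_index` step. A minimal sketch on a toy frame (the names here are illustrative, not part of the sensor dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'ts': pd.date_range('2024-03-01', periods=6, freq='h'),
    'val': [1, 2, 3, 4, 5, 6],
})

# 'on' tells resample to bin by a column instead of the index
out = df.resample('3h', on='ts')['val'].sum()
print(out.tolist())  # [6, 15] -> bins 00:00-02:59 and 03:00-05:59
```

The result is still indexed by the bin edges, so downstream code behaves exactly as if you had set the index first.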

Challenge 2: date_range() and Filling Time Gaps

📝
Problem: You have sensor data with some missing timestamps. (a) Create a complete hourly date range and identify missing timestamps. (b) Fill missing values using forward fill (carry last known value). (c) Fill missing values using interpolation (linear between known points). Compare the two approaches.

Solution

# Create data with intentional gaps
sparse_data = sensor_data[sensor_data['sensor_id'] == 'S1'].copy()
# Drop 10 random rows to simulate missing readings
drop_idx = np.random.choice(sparse_data.index, size=10, replace=False)
sparse_data = sparse_data.drop(drop_idx)
print(f"Original rows: {len(sensor_data[sensor_data['sensor_id'] == 'S1'])}")
print(f"Sparse rows: {len(sparse_data)}")

# (a) Create complete date range and identify gaps
full_range = pd.date_range(
    start=sparse_data['timestamp'].min(),
    end=sparse_data['timestamp'].max(),
    freq='h'
)

# Reindex to complete range
sparse_ts = sparse_data.set_index('timestamp')
complete = sparse_ts.reindex(full_range)
complete.index.name = 'timestamp'

missing = complete[complete['temperature'].isna()]
print(f"\nMissing timestamps: {len(missing)}")
print(missing.index.tolist()[:5])  # Show first 5 missing timestamps


# (b) Forward fill: carry last known value
ffilled = complete.copy()
ffilled['temperature'] = ffilled['temperature'].ffill()
ffilled['humidity'] = ffilled['humidity'].ffill()
ffilled['fill_method'] = 'ffill'

# Back fill for any remaining NaN at the start
ffilled['temperature'] = ffilled['temperature'].bfill()
ffilled['humidity'] = ffilled['humidity'].bfill()


# (c) Linear interpolation: estimate between known points
interpolated = complete.copy()
interpolated['temperature'] = interpolated['temperature'].interpolate(method='linear')
interpolated['humidity'] = interpolated['humidity'].interpolate(method='linear')
interpolated['fill_method'] = 'interpolation'

# Compare the two approaches for a specific gap
print("\nComparison for filled values:")
comparison = pd.DataFrame({
    'original': complete['temperature'],
    'ffill': ffilled['temperature'],
    'interpolated': interpolated['temperature']
})
# Show the first few previously-missing rows with their filled values
gap_area = comparison[comparison['original'].isna()].head(3)
print(gap_area.round(1))

# Key difference:
# ffill: uses the last known value (flat line) - good for categorical or step data
# interpolation: estimates linearly between points - good for continuous measurements
print("\nUse ffill for: status codes, categories, prices (last traded)")
print("Use interpolation for: temperature, pressure, continuous sensor readings")
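The difference between the two fill strategies is easiest to see on a toy series with a two-reading gap (illustrative values, not the sensor data):

```python
import pandas as pd
import numpy as np

# Two missing readings between known values 10.0 and 16.0
s = pd.Series([10.0, np.nan, np.nan, 16.0])

print(s.ffill().tolist())        # [10.0, 10.0, 10.0, 16.0] - flat carry-forward
print(s.interpolate().tolist())  # [10.0, 12.0, 14.0, 16.0] - linear ramp
```

Forward fill produces a step; interpolation draws a straight line through the gap, which is usually the better estimate for a smoothly varying quantity like temperature.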

Challenge 3: Time Zone Handling

📝
Problem: You receive data from sensors in different time zones. (a) Localize naive timestamps to UTC. (b) Convert UTC timestamps to US/Eastern and US/Pacific. (c) Merge data from sensors in different time zones, aligning by UTC time, then report results in each local time zone.

Solution

# (a) Localize naive timestamps to UTC
# "Naive" = no timezone info; "Aware" = has timezone info
utc_data = sensor_data.copy()
utc_data['timestamp_utc'] = utc_data['timestamp'].dt.tz_localize('UTC')
print(f"Naive: {utc_data['timestamp'].iloc[0]}")
print(f"UTC:   {utc_data['timestamp_utc'].iloc[0]}")


# (b) Convert to different time zones
utc_data['timestamp_eastern'] = utc_data['timestamp_utc'].dt.tz_convert('US/Eastern')
utc_data['timestamp_pacific'] = utc_data['timestamp_utc'].dt.tz_convert('US/Pacific')

print(f"\nSame moment in time:")
print(f"UTC:     {utc_data['timestamp_utc'].iloc[0]}")
print(f"Eastern: {utc_data['timestamp_eastern'].iloc[0]}")  # -5 hours
print(f"Pacific: {utc_data['timestamp_pacific'].iloc[0]}")  # -8 hours


# (c) Merge sensors from different time zones
# Sensor A is in New York, Sensor B is in Los Angeles
sensor_a = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-03-01 10:00', '2024-03-01 11:00',
                                  '2024-03-01 12:00']),
    'value_a': [22.5, 23.1, 23.8]
})
sensor_a['timestamp'] = sensor_a['timestamp'].dt.tz_localize('US/Eastern')

sensor_b = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-03-01 07:00', '2024-03-01 08:00',
                                  '2024-03-01 09:00']),
    'value_b': [18.2, 18.9, 19.5]
})
sensor_b['timestamp'] = sensor_b['timestamp'].dt.tz_localize('US/Pacific')

# Convert both to UTC for alignment
sensor_a['utc'] = sensor_a['timestamp'].dt.tz_convert('UTC')
sensor_b['utc'] = sensor_b['timestamp'].dt.tz_convert('UTC')

# Merge on UTC time
merged = pd.merge(sensor_a, sensor_b, on='utc', suffixes=('_eastern', '_pacific'))
print("\nMerged on UTC alignment:")
print(merged[['utc', 'value_a', 'value_b']])
#                        utc  value_a  value_b
# 0 2024-03-01 15:00:00+00:00    22.5     18.2
# 1 2024-03-01 16:00:00+00:00    23.1     18.9
# 2 2024-03-01 17:00:00+00:00    23.8     19.5

# 10:00 Eastern = 07:00 Pacific = 15:00 UTC (all the same moment)
💡
Data leakage warning: In ML, timezone mishandling is a common source of data leakage. If your training data is in UTC but your prediction target is in local time, you might accidentally use future data. Always standardize to UTC internally and convert to local time only for display purposes.
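A related pitfall worth raising in interviews: `tz_localize` can fail outright around DST transitions, because some local wall-clock times never occur. A small sketch using the 2024 US spring-forward date:

```python
import pandas as pd

# 02:30 on 2024-03-10 does not exist in US/Eastern (clocks jump 02:00 -> 03:00)
t = pd.Timestamp('2024-03-10 02:30')
try:
    t.tz_localize('US/Eastern')
except Exception as e:
    print(type(e).__name__)  # NonExistentTimeError

# The 'nonexistent' parameter shifts the bad time forward instead of raising
shifted = t.tz_localize('US/Eastern', nonexistent='shift_forward')
print(shifted)  # 2024-03-10 03:00:00-04:00
```

There is a matching `ambiguous=` parameter for the fall-back transition, when one wall-clock hour occurs twice.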

Challenge 4: Rolling Windows on Date-Indexed Data

📝
Problem: (a) Compute a 24-hour rolling average temperature (calendar-based, not row-based). (b) Compute the rolling 3-day max temperature. (c) For each sensor, find the hour of day with the highest average temperature across all days.

Solution

# Set timestamp as index for time-based rolling
ts = sensor_data.set_index('timestamp')

# (a) 24-hour rolling average (calendar-based window)
# Using '24h' ensures the window is time-based, not row-count-based
ts['rolling_24h_avg'] = (
    ts.groupby('sensor_id')['temperature']
    .transform(lambda x: x.rolling('24h').mean())
).round(1)

print(ts[ts.index.normalize() == '2024-03-02'][
    ['sensor_id', 'temperature', 'rolling_24h_avg']
].head(6))


# (b) 3-day rolling maximum
ts['rolling_3d_max'] = (
    ts.groupby('sensor_id')['temperature']
    .transform(lambda x: x.rolling('3D').max())
).round(1)

# Compare row-based vs time-based rolling:
# rolling(24)    = exactly 24 rows (works only if data is regular)
# rolling('24h') = all rows within 24 hours (handles irregular data)


# (c) Hour-of-day analysis: find peak temperature hour
sensor_data['hour'] = sensor_data['timestamp'].dt.hour

hourly_profile = (
    sensor_data.groupby(['sensor_id', 'hour'])['temperature']
    .mean()
    .round(1)
)

# Find peak hour per sensor
peak_hours = hourly_profile.groupby('sensor_id').idxmax()
print("\nPeak temperature hour per sensor:")
for sensor, (_, hour) in peak_hours.items():
    avg_temp = hourly_profile.loc[(sensor, hour)]
    print(f"  {sensor}: Hour {hour}:00 (avg {avg_temp} C)")

# Full daily profile for visualization
daily_profile = hourly_profile.unstack(level='sensor_id')
print("\nHourly temperature profile:")
print(daily_profile.head(8))
# hour   S1     S2
# 0     17.8   18.2
# 1     16.5   17.1
# ...
# 6     20.1   20.5
# 7     22.3   22.8
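The row-based vs. time-based distinction called out in the comments above is easy to verify on a small irregular series (toy values):

```python
import pandas as pd

# Three readings at irregular spacing: 00:00, 00:30, 02:00
s = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(['2024-03-01 00:00', '2024-03-01 00:30', '2024-03-01 02:00']),
)

print(s.rolling(2).sum().tolist())     # [nan, 3.0, 5.0] - always exactly 2 rows
print(s.rolling('1h').sum().tolist())  # [1.0, 3.0, 3.0] - only rows within the past hour
```

At 02:00 the time-based window contains just one reading, because nothing else falls inside the trailing hour; the row-based window blindly grabs the previous row from two hours earlier. Note also that offset windows default to `min_periods=1`, so they produce no leading NaN.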

Challenge 5: Lag Features for Time Series ML

📝
Problem: Build a feature matrix for predicting the next hour's temperature. Include: (a) lag features (t-1, t-2, t-3 temperatures), (b) rolling statistics (3-hour mean, std), (c) time-based features (hour of day, day of week, is_weekend), and (d) handle the NaN rows created by lag/rolling operations.

Solution

# Work with a single sensor for clarity
s1 = sensor_data[sensor_data['sensor_id'] == 'S1'].copy()
s1 = s1.set_index('timestamp').sort_index()

# (a) Lag features
for lag in [1, 2, 3]:
    s1[f'temp_lag_{lag}'] = s1['temperature'].shift(lag)

# (b) Rolling statistics (past 3 hours, NOT including current)
s1['temp_rolling_mean_3h'] = s1['temperature'].shift(1).rolling(3).mean().round(2)
s1['temp_rolling_std_3h'] = s1['temperature'].shift(1).rolling(3).std().round(2)
s1['humidity_rolling_mean_3h'] = s1['humidity'].shift(1).rolling(3).mean().round(2)

# (c) Time-based features
s1['hour'] = s1.index.hour
s1['day_of_week'] = s1.index.dayofweek  # 0=Monday, 6=Sunday
s1['is_weekend'] = (s1['day_of_week'] >= 5).astype(int)

# Cyclical encoding for hour (so hour 23 is close to hour 0)
s1['hour_sin'] = np.sin(2 * np.pi * s1['hour'] / 24).round(4)
s1['hour_cos'] = np.cos(2 * np.pi * s1['hour'] / 24).round(4)

# (d) Target variable: next hour's temperature
s1['target'] = s1['temperature'].shift(-1)  # What we want to predict

# Drop rows with NaN (from lags and rolling)
feature_cols = [
    'temp_lag_1', 'temp_lag_2', 'temp_lag_3',
    'temp_rolling_mean_3h', 'temp_rolling_std_3h',
    'humidity_rolling_mean_3h',
    'hour_sin', 'hour_cos', 'day_of_week', 'is_weekend'
]

ml_data = s1[feature_cols + ['target']].dropna()
print(f"Feature matrix shape: {ml_data.shape}")
print(f"Rows dropped due to NaN: {len(s1) - len(ml_data)}")

print("\nFeature matrix sample:")
print(ml_data.head())

# Train/test split (time-based, NOT random!)
# Careful: label slicing like ml_data.loc[:'2024-03-05'] is INCLUSIVE of March 5,
# so slicing both sides with the same label would leak March 5 into train AND test.
cutoff = pd.Timestamp('2024-03-06')
X_train = ml_data.loc[ml_data.index < cutoff, feature_cols]
y_train = ml_data.loc[ml_data.index < cutoff, 'target']
X_test = ml_data.loc[ml_data.index >= cutoff, feature_cols]
y_test = ml_data.loc[ml_data.index >= cutoff, 'target']

print(f"\nTrain: {X_train.shape[0]} rows ({X_train.index.min().date()} to {X_train.index.max().date()})")
print(f"Test:  {X_test.shape[0]} rows ({X_test.index.min().date()} to {X_test.index.max().date()})")

# CRITICAL: Never use random train/test split for time series!
# Always split by time to prevent data leakage.
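Why bother with the sin/cos encoding from part (c)? On the raw `hour` column, 23 and 0 look maximally far apart; on the unit circle they are neighbors. A quick check with a hypothetical helper:

```python
import numpy as np

def hour_to_xy(h):
    # Map an hour of day onto the unit circle
    return np.sin(2 * np.pi * h / 24), np.cos(2 * np.pi * h / 24)

def hour_dist(h1, h2):
    # Euclidean distance between two hours in (sin, cos) space
    (x1, y1), (x2, y2) = hour_to_xy(h1), hour_to_xy(h2)
    return np.hypot(x1 - x2, y1 - y2)

print(round(hour_dist(23, 0), 3))  # 0.261 - adjacent hours are close
print(round(hour_dist(0, 12), 3))  # 2.0   - opposite hours are maximally far
```

With the raw integer encoding, a distance-sensitive model would treat 23:00 and 00:00 as 23 units apart even though they are one hour apart in real time.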
💡
Common interview mistake: Using sklearn.model_selection.train_test_split on time series data. This randomly shuffles rows, causing future data to leak into the training set. Always use a temporal cutoff: train on past data, test on future data. Mention this proactively in interviews to show you understand data leakage.
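The single temporal cutoff generalizes to walk-forward validation: several successive cutoffs, each fold training only on data strictly before its cutoff. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd
import numpy as np

idx = pd.date_range('2024-03-01', periods=8, freq='D')
df = pd.DataFrame({'y': np.arange(8)}, index=idx)

sizes = []
# Expanding walk-forward: later folds see more history, test sets never overlap train
for cutoff in pd.to_datetime(['2024-03-05', '2024-03-07']):
    train, test = df[df.index < cutoff], df[df.index >= cutoff]
    sizes.append((len(train), len(test)))
print(sizes)  # [(4, 4), (6, 2)]
```

This is the same idea behind sklearn's TimeSeriesSplit: every fold respects the arrow of time, so validation scores reflect how the model would actually be used.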