Data Pipeline Coding in Interviews
Data pipeline problems are the bread and butter of ML and data engineering interviews. This lesson covers what companies actually test, the patterns you need to know, and a systematic approach to solving pipeline coding challenges under time pressure.
Why Pipeline Coding Matters
Every ML model depends on clean, reliable data. Companies know that the most common cause of ML failures is not the model — it is bad data. That is why data pipeline coding has become a core interview topic for data engineers, ML engineers, and even backend engineers working with data-intensive systems.
ETL / ELT
Extract-Transform-Load is the foundation. Interviewers test your ability to parse formats, flatten structures, map schemas, and handle incremental loads.
Data Quality
Validation, null handling, deduplication, and integrity checks. Companies lose millions from bad data — they want engineers who catch problems early.
Streaming
Windowed aggregation, event ordering, late arrivals, and sessionization. Real-time systems require different thinking than batch pipelines.
What Companies Actually Test
Pipeline interview questions vary by role and company. Here is what to expect:
| Company Type | Difficulty | Focus Areas | Format |
|---|---|---|---|
| Google / Meta DE | Medium-Hard | ETL design, streaming, schema evolution | Colab or whiteboard |
| Amazon Data | Medium | Batch processing, validation, S3 pipelines | Live coding |
| Fintech / Trading | Hard | Streaming, low latency, data quality | System design + code |
| Startups | Easy-Medium | CSV/JSON parsing, basic ETL, validation | Take-home or live |
| ML Platform Teams | Medium-Hard | Feature pipelines, caching, performance | Design + implementation |
Core Pipeline Patterns
Most pipeline interview questions reduce to combinations of these fundamental patterns:
```python
from collections import deque

# Helper functions (extract, transform, load, passes_all_checks,
# aggregate, process) are placeholders for your business logic.

# Pattern 1: Extract → Transform → Load
def etl_pipeline(source_path, target_path):
    """Classic ETL: read raw data, transform, write clean output."""
    raw_data = extract(source_path)    # Read from source
    clean_data = transform(raw_data)   # Apply business logic
    load(clean_data, target_path)      # Write to destination

# Pattern 2: Validate → Route
def validate_and_route(records):
    """Split records into valid and invalid streams."""
    valid, invalid = [], []
    for record in records:
        if passes_all_checks(record):
            valid.append(record)
        else:
            invalid.append(record)
    return valid, invalid

# Pattern 3: Windowed Aggregation
def sliding_window_aggregate(events, window_size):
    """Compute running aggregates over a sliding time window."""
    window = deque()
    results = []
    for event in events:
        window.append(event)
        # Evict events older than window_size relative to the newest event
        while window and event['ts'] - window[0]['ts'] > window_size:
            window.popleft()
        results.append(aggregate(window))
    return results

# Pattern 4: Incremental Processing
def incremental_load(source, state):
    """Process only new/changed records since the last run."""
    last_watermark = state.get('last_watermark', 0)
    new_records = [r for r in source if r['updated_at'] > last_watermark]
    process(new_records)
    if new_records:  # Guard: max() on an empty sequence raises ValueError
        state['last_watermark'] = max(r['updated_at'] for r in new_records)
    return state
```
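Deduplication, called out earlier under data quality, is worth having ready as a fifth pattern. A minimal sketch that keeps the most recent version of each record per business key (the `id` and `updated_at` field names are illustrative assumptions, not a fixed schema):

```python
# Pattern 5 (sketch): Deduplicate, keeping the latest version of each record.
# Field names 'id' and 'updated_at' are illustrative.
def dedupe_latest(records):
    """Keep one record per 'id', preferring the highest 'updated_at'."""
    latest = {}
    for record in records:
        key = record['id']
        if key not in latest or record['updated_at'] > latest[key]['updated_at']:
            latest[key] = record
    return list(latest.values())

rows = [
    {'id': 1, 'updated_at': 10, 'value': 'old'},
    {'id': 1, 'updated_at': 20, 'value': 'new'},
    {'id': 2, 'updated_at': 5,  'value': 'only'},
]
print(dedupe_latest(rows))
# [{'id': 1, 'updated_at': 20, 'value': 'new'}, {'id': 2, 'updated_at': 5, 'value': 'only'}]
```

A dict keyed by business key gives O(n) time and naturally handles out-of-order duplicates, which a naive "drop consecutive repeats" approach would miss.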
The 5-Step Framework for Pipeline Problems
Use this approach for every pipeline challenge in this course and in real interviews:
| Step | Time | What to Do |
|---|---|---|
| 1. Clarify the Data | 2 min | Ask about format (CSV, JSON, Avro), volume, schema, and edge cases (nulls, duplicates, encoding). |
| 2. Identify the Pattern | 1 min | Is this ETL, validation, streaming, or feature engineering? Name the pattern explicitly. |
| 3. Define the Contract | 2 min | State input format, output format, error handling strategy, and idempotency requirements. |
| 4. Implement | 10 min | Write clean Python. Use generators for memory efficiency. Handle edge cases explicitly. |
| 5. Test & Optimize | 5 min | Write test cases for happy path, edge cases, and failure modes. Discuss scalability. |
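Step 4's advice about generators can be made concrete. Below is a sketch of a streaming-friendly transform stage that cleans one record at a time instead of materializing the whole dataset (the `user_id` and `amount` field names are invented for illustration):

```python
def transform_stream(records):
    """Lazily clean records: drop rows missing required fields, normalize types."""
    for record in records:
        if record.get('user_id') is None:  # Edge case: skip incomplete rows
            continue
        yield {
            'user_id': int(record['user_id']),
            'amount': float(record.get('amount', 0.0)),  # Default missing amounts
        }

raw = [{'user_id': '1', 'amount': '9.5'}, {'amount': '3'}, {'user_id': '2'}]
print(list(transform_stream(raw)))
# [{'user_id': 1, 'amount': 9.5}, {'user_id': 2, 'amount': 0.0}]
```

Because this is a generator, memory usage stays constant per record no matter how large the input is — a point worth stating out loud when discussing scalability in step 5.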
Tools and Libraries You Should Know
Pipeline interviews rarely require specific framework knowledge, but mentioning these shows awareness:
| Category | Tools | When to Mention |
|---|---|---|
| Batch ETL | Apache Spark, dbt, Airflow | Large-scale data transformations |
| Streaming | Kafka, Flink, Spark Streaming | Real-time event processing |
| Validation | Great Expectations, Pydantic, Cerberus | Data quality and schema enforcement |
| Feature Stores | Feast, Tecton, Hopsworks | ML feature engineering pipelines |
| Orchestration | Airflow, Prefect, Dagster | Pipeline scheduling and dependency management |
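You rarely get these libraries in an interview environment, so it helps to show the same idea in plain Python. A stdlib-only sketch of a schema check in the spirit of tools like Pydantic or Cerberus (the schema and field names are invented for illustration):

```python
SCHEMA = {'user_id': int, 'email': str, 'score': float}  # Illustrative schema

def validate(record, schema=SCHEMA):
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(record[field]).__name__}")
    return errors

print(validate({'user_id': 1, 'email': 'a@b.com', 'score': 0.9}))  # []
print(validate({'user_id': '1', 'score': 0.9}))
# ['user_id: expected int, got str', 'missing field: email']
```

Returning a list of violations rather than raising on the first failure mirrors how validation libraries report all problems at once, which is what you want when routing bad records to a quarantine stream.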
Quick Self-Assessment
Before starting, try this problem without looking at the solution:
```python
import json

def flatten_record(record, parent_key='', sep='_'):
    """Flatten a nested dictionary into a single-level dict.

    Example:
        {'user': {'name': 'Alice', 'age': 30}, 'score': 95}
    becomes
        {'user_name': 'Alice', 'user_age': 30, 'score': 95}
    """
    items = []
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.extend(flatten_record(value, new_key, sep).items())
        elif isinstance(value, list):
            # Convert lists to JSON strings for SQL compatibility
            items.append((new_key, json.dumps(value)))
        else:
            items.append((new_key, value))
    return dict(items)

# Test
records = [
    {'user': {'name': 'Alice', 'address': {'city': 'NYC', 'zip': '10001'}}, 'score': 95},
    {'user': {'name': 'Bob', 'address': {'city': 'LA', 'zip': '90001'}}, 'score': 88},
]
flat = [flatten_record(r) for r in records]
for row in flat:
    print(row)
# {'user_name': 'Alice', 'user_address_city': 'NYC', 'user_address_zip': '10001', 'score': 95}
# {'user_name': 'Bob', 'user_address_city': 'LA', 'user_address_zip': '90001', 'score': 88}
```
Why this matters: JSON flattening is one of the most common ETL operations. APIs return nested JSON, but data warehouses need flat tables. Variants of this function run in production pipelines at most data-driven companies.
Course Overview
Each lesson in this course follows a consistent structure:
- Problem statement — Clear description of input, expected output, and constraints
- Dataset setup — Realistic data you can paste into any Python environment
- Solution with explanation — Complete code with line-by-line commentary
- Complexity analysis — Time and space complexity for each solution
- Production considerations — How the solution would change at scale
Lilly Tech Systems