Beginner

Data Pipeline Coding in Interviews

Data pipeline problems are the bread and butter of ML and data engineering interviews. This lesson covers what companies actually test, the patterns you need to know, and a systematic approach to solving pipeline coding challenges under time pressure.

Why Pipeline Coding Matters

Every ML model depends on clean, reliable data. Companies know that the most common cause of ML failures is not the model — it is bad data. That is why data pipeline coding has become a core interview topic for data engineers, ML engineers, and even backend engineers working with data-intensive systems.


ETL / ELT

Extract-Transform-Load is the foundation. Interviewers test your ability to parse formats, flatten structures, map schemas, and handle incremental loads.
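As a concrete sketch of the extract-and-map step, here is a minimal CSV-to-schema mapping using only the standard library; the column names (`id`, `email`) and target fields are hypothetical, made up for illustration:

```python
import csv
import io

def extract_transform(csv_text):
    """Extract rows from raw CSV text and map them onto a target schema.

    The source columns ('id', 'email') and target fields are hypothetical.
    """
    out = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        out.append({
            'user_id': int(row['id']),      # rename and cast to the target type
            'email': row['email'].lower(),  # normalize casing
        })
    return out

raw = "id,email\n1,Alice@Example.com\n2,BOB@example.com\n"
print(extract_transform(raw))
```

The same shape generalizes to JSON or Avro sources: parse, rename, cast, normalize.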

Data Quality

Validation, null handling, deduplication, and integrity checks. Companies lose millions from bad data — they want engineers who catch problems early.
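A minimal sketch of required-field validation plus deduplication, assuming records are dicts keyed by hypothetical `id` and `email` fields:

```python
def clean(records, required=('id', 'email')):
    """Drop records missing required fields, then dedupe on 'id' (first wins).

    Field names are hypothetical; real checks would be schema-driven.
    """
    seen = set()
    out = []
    for r in records:
        if any(r.get(f) in (None, '') for f in required):
            continue  # fails the null / required-field check
        if r['id'] in seen:
            continue  # duplicate primary key
        seen.add(r['id'])
        out.append(r)
    return out
```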

Streaming

Windowed aggregation, event ordering, late arrivals, and sessionization. Real-time systems require different thinking than batch pipelines.
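Sessionization, for example, can be sketched as gap-based grouping; this assumes events arrive sorted by a hypothetical `ts` timestamp field:

```python
def sessionize(events, gap=30):
    """Split a user's time-ordered events into sessions at inactivity gaps.

    Assumes events are sorted by a 'ts' timestamp field (hypothetical name);
    'gap' is the maximum idle time, in the same units as 'ts'.
    """
    sessions, current = [], []
    for event in events:
        if current and event['ts'] - current[-1]['ts'] > gap:
            sessions.append(current)  # idle too long: close the session
            current = []
        current.append(event)
    if current:
        sessions.append(current)  # flush the final open session
    return sessions
```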

What Companies Actually Test

Pipeline interview questions vary by role and company. Here is what to expect:

Company Type | Difficulty | Focus Areas | Format
Google / Meta DE | Medium-Hard | ETL design, streaming, schema evolution | Colab or whiteboard
Amazon Data | Medium | Batch processing, validation, S3 pipelines | Live coding
Fintech / Trading | Hard | Streaming, low latency, data quality | System design + code
Startups | Easy-Medium | CSV/JSON parsing, basic ETL, validation | Take-home or live
ML Platform Teams | Medium-Hard | Feature pipelines, caching, performance | Design + implementation

Core Pipeline Patterns

Most pipeline interview questions reduce to combinations of these fundamental patterns:

# Pattern 1: Extract → Transform → Load
def etl_pipeline(source_path, target_path):
    """Classic ETL: read raw data, transform, write clean output."""
    raw_data = extract(source_path)       # Read from source
    clean_data = transform(raw_data)      # Apply business logic
    load(clean_data, target_path)         # Write to destination

# Pattern 2: Validate → Route
def validate_and_route(records):
    """Split records into valid and invalid streams."""
    valid, invalid = [], []
    for record in records:
        if passes_all_checks(record):
            valid.append(record)
        else:
            invalid.append(record)
    return valid, invalid

# Pattern 3: Windowed Aggregation
def sliding_window_aggregate(events, window_size):
    """Compute running aggregates over a sliding window."""
    from collections import deque
    window = deque()
    results = []
    for event in events:
        window.append(event)
        while window and event['ts'] - window[0]['ts'] > window_size:
            window.popleft()
        results.append(aggregate(window))
    return results

# Pattern 4: Incremental Processing
def incremental_load(source, state):
    """Process only new/changed records since last run."""
    last_watermark = state.get('last_watermark', 0)
    new_records = [r for r in source if r['updated_at'] > last_watermark]
    process(new_records)
    if new_records:  # guard: max() raises ValueError on an empty list
        state['last_watermark'] = max(r['updated_at'] for r in new_records)
    return state
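A quick way to sanity-check Pattern 3 is to plug in a concrete aggregate. The self-contained sketch below computes a mean over a hypothetical `value` field and, like the pattern above, assumes events are sorted by `ts`:

```python
from collections import deque

def sliding_mean(events, window_size):
    """Pattern 3 with a concrete aggregate: mean of 'value' over the window.

    Assumes events are sorted by 'ts'; the field names are hypothetical.
    """
    window = deque()
    results = []
    for event in events:
        window.append(event)
        # Evict events older than window_size relative to the newest event
        while window and event['ts'] - window[0]['ts'] > window_size:
            window.popleft()
        results.append(sum(e['value'] for e in window) / len(window))
    return results

events = [{'ts': 0, 'value': 10}, {'ts': 5, 'value': 20}, {'ts': 20, 'value': 30}]
print(sliding_mean(events, window_size=10))  # [10.0, 15.0, 30.0]
```

Tracing a small input like this by hand is also a good way to show an interviewer the eviction logic is correct.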

The 5-Step Framework for Pipeline Problems

Use this approach for every pipeline challenge in this course and in real interviews:

Step | Time | What to Do
1. Clarify the Data | 2 min | Ask about format (CSV, JSON, Avro), volume, schema, and edge cases (nulls, duplicates, encoding).
2. Identify the Pattern | 1 min | Is this ETL, validation, streaming, or feature engineering? Name the pattern explicitly.
3. Define the Contract | 2 min | State input format, output format, error handling strategy, and idempotency requirements.
4. Implement | 10 min | Write clean Python. Use generators for memory efficiency. Handle edge cases explicitly.
5. Test & Optimize | 5 min | Write test cases for happy path, edge cases, and failure modes. Discuss scalability.
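Step 3 can be made concrete by writing the contract down before implementing. One lightweight way is a docstring on the function itself; the parser below is a hypothetical example for an `amount` column:

```python
def parse_amount(raw):
    """Contract for a hypothetical 'amount' column parser.

    Input:  raw string value from the source file.
    Output: float, or None when the value is unparseable.
    Errors: bad values come back as None for dead-lettering by the caller;
            nothing is raised, so one bad row cannot kill the batch.
    Idempotency: pure function, safe to re-run on the same input.
    """
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None
```

Stating the error and idempotency behavior up front is exactly what interviewers mean by "define the contract."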
💡 Interview tip: When an interviewer asks a pipeline question, always clarify three things before coding: (1) What is the expected data volume? (2) Is this batch or streaming? (3) What should happen when data is invalid? These questions show you think like a production engineer.

Tools and Libraries You Should Know

Pipeline interviews rarely require specific framework knowledge, but mentioning these shows awareness:

Category | Tools | When to Mention
Batch ETL | Apache Spark, dbt, Airflow | Large-scale data transformations
Streaming | Kafka, Flink, Spark Streaming | Real-time event processing
Validation | Great Expectations, Pydantic, Cerberus | Data quality and schema enforcement
Feature Stores | Feast, Tecton, Hopsworks | ML feature engineering pipelines
Orchestration | Airflow, Prefect, Dagster | Pipeline scheduling and dependency management

Quick Self-Assessment

Before starting, try this problem without looking at the solution:

📝 Problem: Given a list of JSON records where each record may have nested fields, write a function that flattens all records into a flat dictionary format suitable for loading into a SQL table. Handle nested dicts by joining keys with underscores.
import json

def flatten_record(record, parent_key='', sep='_'):
    """Flatten a nested dictionary into a single-level dict.

    Example:
        {'user': {'name': 'Alice', 'age': 30}, 'score': 95}
        becomes
        {'user_name': 'Alice', 'user_age': 30, 'score': 95}
    """
    items = []
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.extend(flatten_record(value, new_key, sep).items())
        elif isinstance(value, list):
            # Convert lists to JSON strings for SQL compatibility
            items.append((new_key, json.dumps(value)))
        else:
            items.append((new_key, value))
    return dict(items)

# Test
records = [
    {'user': {'name': 'Alice', 'address': {'city': 'NYC', 'zip': '10001'}}, 'score': 95},
    {'user': {'name': 'Bob', 'address': {'city': 'LA', 'zip': '90001'}}, 'score': 88},
]

flat = [flatten_record(r) for r in records]
for row in flat:
    print(row)
# {'user_name': 'Alice', 'user_address_city': 'NYC', 'user_address_zip': '10001', 'score': 95}
# {'user_name': 'Bob', 'user_address_city': 'LA', 'user_address_zip': '90001', 'score': 88}

Why this matters: JSON flattening is one of the most common ETL operations. APIs return nested JSON, but data warehouses need flat tables. Variants of this function show up in production pipelines at most data-driven companies.

📝 How to use this course: Type out every solution yourself. Modify the inputs and re-solve. After finishing a lesson, try solving each challenge again without looking at the answer. Each lesson has 5 practical challenges with complete solutions and complexity analysis.

Course Overview

Each lesson in this course follows a consistent structure:

  • Problem statement — Clear description of input, expected output, and constraints
  • Dataset setup — Realistic data you can paste into any Python environment
  • Solution with explanation — Complete code with line-by-line commentary
  • Complexity analysis — Time and space complexity for each solution
  • Production considerations — How the solution would change at scale