Beginner

Data Pipeline Coding in Interviews

Data pipeline problems are the bread and butter of ML and data engineering interviews. This lesson covers what companies actually test, the patterns you need to know, and a systematic approach to solving pipeline coding challenges under time pressure.

Why Pipeline Coding Matters

Every ML model depends on clean, reliable data. Companies know that the most common cause of ML failures is not the model — it is bad data. That is why data pipeline coding has become a core interview topic for data engineers, ML engineers, and even backend engineers working with data-intensive systems.


ETL / ELT

Extract-Transform-Load is the foundation. Interviewers test your ability to parse formats, flatten structures, map schemas, and handle incremental loads.
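As a concrete sketch of the extract-and-map step, here is a minimal CSV-to-schema mapping using only the standard library; the column names (`id`, `email`) and target fields are hypothetical, made up for illustration:

```python
import csv
import io

def extract_transform(csv_text):
    """Extract rows from raw CSV text and map them onto a target schema.

    The source columns ('id', 'email') and target fields are hypothetical.
    """
    out = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        out.append({
            'user_id': int(row['id']),      # rename and cast to the target type
            'email': row['email'].lower(),  # normalize casing
        })
    return out

raw = "id,email\n1,Alice@Example.com\n2,BOB@example.com\n"
print(extract_transform(raw))
```

The same shape generalizes to JSON or Avro sources: parse, rename, cast, normalize.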

Data Quality

Validation, null handling, deduplication, and integrity checks. Companies lose millions from bad data — they want engineers who catch problems early.
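A minimal sketch of required-field validation plus deduplication, assuming records are dicts keyed by hypothetical `id` and `email` fields:

```python
def clean(records, required=('id', 'email')):
    """Drop records missing required fields, then dedupe on 'id' (first wins).

    Field names are hypothetical; real checks would be schema-driven.
    """
    seen = set()
    out = []
    for r in records:
        if any(r.get(f) in (None, '') for f in required):
            continue  # fails the null / required-field check
        if r['id'] in seen:
            continue  # duplicate primary key
        seen.add(r['id'])
        out.append(r)
    return out
```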

Streaming

Windowed aggregation, event ordering, late arrivals, and sessionization. Real-time systems require different thinking than batch pipelines.
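Sessionization, for example, can be sketched as gap-based grouping; this assumes events arrive sorted by a hypothetical `ts` timestamp field:

```python
def sessionize(events, gap=30):
    """Split a user's time-ordered events into sessions at inactivity gaps.

    Assumes events are sorted by a 'ts' timestamp field (hypothetical name);
    'gap' is the maximum idle time, in the same units as 'ts'.
    """
    sessions, current = [], []
    for event in events:
        if current and event['ts'] - current[-1]['ts'] > gap:
            sessions.append(current)  # idle too long: close the session
            current = []
        current.append(event)
    if current:
        sessions.append(current)  # flush the final open session
    return sessions
```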

What Companies Actually Test

Pipeline interview questions vary by role and company. Here is what to expect:

Company Type | Difficulty | Focus Areas | Format
Google / Meta DE | Medium-Hard | ETL design, streaming, schema evolution | Colab or whiteboard
Amazon Data | Medium | Batch processing, validation, S3 pipelines | Live coding
Fintech / Trading | Hard | Streaming, low latency, data quality | System design + code
Startups | Easy-Medium | CSV/JSON parsing, basic ETL, validation | Take-home or live
ML Platform Teams | Medium-Hard | Feature pipelines, caching, performance | Design + implementation

Core Pipeline Patterns

Most pipeline interview questions reduce to combinations of these fundamental patterns:

# Pattern 1: Extract → Transform → Load
def etl_pipeline(source_path, target_path):
    """Classic ETL: read raw data, transform, write clean output."""
    raw_data = extract(source_path)       # Read from source
    clean_data = transform(raw_data)      # Apply business logic
    load(clean_data, target_path)         # Write to destination

# Pattern 2: Validate → Route
def validate_and_route(records):
    """Split records into valid and invalid streams."""
    valid, invalid = [], []
    for record in records:
        if passes_all_checks(record):
            valid.append(record)
        else:
            invalid.append(record)
    return valid, invalid

# Pattern 3: Windowed Aggregation
def sliding_window_aggregate(events, window_size):
    """Compute running aggregates over a sliding window."""
    from collections import deque
    window = deque()
    results = []
    for event in events:
        window.append(event)
        while window and event['ts'] - window[0]['ts'] > window_size:
            window.popleft()
        results.append(aggregate(window))
    return results

# Pattern 4: Incremental Processing
def incremental_load(source, state):
    """Process only new/changed records since last run."""
    last_watermark = state.get('last_watermark', 0)
    new_records = [r for r in source if r['updated_at'] > last_watermark]
    process(new_records)
    if new_records:  # guard: max() raises ValueError on an empty list
        state['last_watermark'] = max(r['updated_at'] for r in new_records)
    return state
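A quick way to sanity-check Pattern 3 is to plug in a concrete aggregate. The self-contained sketch below computes a mean over a hypothetical `value` field and, like the pattern above, assumes events are sorted by `ts`:

```python
from collections import deque

def sliding_mean(events, window_size):
    """Pattern 3 with a concrete aggregate: mean of 'value' over the window.

    Assumes events are sorted by 'ts'; the field names are hypothetical.
    """
    window = deque()
    results = []
    for event in events:
        window.append(event)
        # Evict events older than window_size relative to the newest event
        while window and event['ts'] - window[0]['ts'] > window_size:
            window.popleft()
        results.append(sum(e['value'] for e in window) / len(window))
    return results

events = [{'ts': 0, 'value': 10}, {'ts': 5, 'value': 20}, {'ts': 20, 'value': 30}]
print(sliding_mean(events, window_size=10))  # [10.0, 15.0, 30.0]
```

Tracing a small input like this by hand is also a good way to show an interviewer the eviction logic is correct.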

The 5-Step Framework for Pipeline Problems

Use this approach for every pipeline challenge in this course and in real interviews:

Step | Time | What to Do
1. Clarify the Data | 2 min | Ask about format (CSV, JSON, Avro), volume, schema, and edge cases (nulls, duplicates, encoding).
2. Identify the Pattern | 1 min | Is this ETL, validation, streaming, or feature engineering? Name the pattern explicitly.
3. Define the Contract | 2 min | State input format, output format, error handling strategy, and idempotency requirements.
4. Implement | 10 min | Write clean Python. Use generators for memory efficiency. Handle edge cases explicitly.
5. Test & Optimize | 5 min | Write test cases for happy path, edge cases, and failure modes. Discuss scalability.
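Step 3 can be made concrete by writing the contract down before implementing. One lightweight way is a docstring on the function itself; the parser below is a hypothetical example for an `amount` column:

```python
def parse_amount(raw):
    """Contract for a hypothetical 'amount' column parser.

    Input:  raw string value from the source file.
    Output: float, or None when the value is unparseable.
    Errors: bad values come back as None for dead-lettering by the caller;
            nothing is raised, so one bad row cannot kill the batch.
    Idempotency: pure function, safe to re-run on the same input.
    """
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None
```

Stating the error and idempotency behavior up front is exactly what interviewers mean by "define the contract."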
💡 Interview tip: When an interviewer asks a pipeline question, always clarify three things before coding: (1) What is the expected data volume? (2) Is this batch or streaming? (3) What should happen when data is invalid? These questions show you think like a production engineer.

Tools and Libraries You Should Know

Pipeline interviews rarely require specific framework knowledge, but mentioning these shows awareness:

Category | Tools | When to Mention
Batch ETL | Apache Spark, dbt, Airflow | Large-scale data transformations
Streaming | Kafka, Flink, Spark Streaming | Real-time event processing
Validation | Great Expectations, Pydantic, Cerberus | Data quality and schema enforcement
Feature Stores | Feast, Tecton, Hopsworks | ML feature engineering pipelines
Orchestration | Airflow, Prefect, Dagster | Pipeline scheduling and dependency management

Quick Self-Assessment

Before starting, try this problem without looking at the solution:

📝 Problem: Given a list of JSON records where each record may have nested fields, write a function that flattens all records into a flat dictionary format suitable for loading into a SQL table. Handle nested dicts by joining keys with underscores.
import json

def flatten_record(record, parent_key='', sep='_'):
    """Flatten a nested dictionary into a single-level dict.

    Example:
        {'user': {'name': 'Alice', 'age': 30}, 'score': 95}
        becomes
        {'user_name': 'Alice', 'user_age': 30, 'score': 95}
    """
    items = []
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.extend(flatten_record(value, new_key, sep).items())
        elif isinstance(value, list):
            # Convert lists to JSON strings for SQL compatibility
            items.append((new_key, json.dumps(value)))
        else:
            items.append((new_key, value))
    return dict(items)

# Test
records = [
    {'user': {'name': 'Alice', 'address': {'city': 'NYC', 'zip': '10001'}}, 'score': 95},
    {'user': {'name': 'Bob', 'address': {'city': 'LA', 'zip': '90001'}}, 'score': 88},
]

flat = [flatten_record(r) for r in records]
for row in flat:
    print(row)
# {'user_name': 'Alice', 'user_address_city': 'NYC', 'user_address_zip': '10001', 'score': 95}
# {'user_name': 'Bob', 'user_address_city': 'LA', 'user_address_zip': '90001', 'score': 88}

Why this matters: JSON flattening is one of the most common ETL operations. APIs return nested JSON, but data warehouses need flat tables. Variants of this function show up in production pipelines at most data-driven companies.

📝 How to use this course: Type out every solution yourself. Modify the inputs and re-solve. After finishing a lesson, try solving each challenge again without looking at the answer. Each lesson has 5 practical challenges with complete solutions and complexity analysis.

Course Overview

Each lesson in this course follows a consistent structure:

  • Problem statement — Clear description of input, expected output, and constraints
  • Dataset setup — Realistic data you can paste into any Python environment
  • Solution with explanation — Complete code with line-by-line commentary
  • Complexity analysis — Time and space complexity for each solution
  • Production considerations — How the solution would change at scale