Advanced

Query Optimization

Writing correct SQL is step one. Writing SQL that runs efficiently on billions of rows is what separates production engineers from notebook practitioners. These 5 challenges cover the optimization topics that senior-level interviews test.

💡
Why this matters for ML: ML feature pipelines process terabytes of data. A poorly written query in a dbt model or Airflow DAG can take hours instead of minutes. Understanding EXPLAIN plans, indexing, and partitioning is essential for data engineers building the infrastructure that ML depends on.

Challenge 1: Reading EXPLAIN Plans

📝
Problem: You have a slow query joining orders with customers. Use EXPLAIN to understand the query plan, identify the bottleneck, and propose a fix. Walk through what each node in the plan means.

Solution

-- The slow query
SELECT c.name, COUNT(*) AS order_count, SUM(o.amount) AS total_spent
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.name
ORDER BY total_spent DESC;

-- Run EXPLAIN ANALYZE (PostgreSQL)
EXPLAIN ANALYZE
SELECT c.name, COUNT(*), SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.name
ORDER BY SUM(o.amount) DESC;

-- Example output (simplified):
-- Sort  (cost=1250..1255 rows=20)
--   Sort Key: sum(o.amount) DESC
--   -> HashAggregate  (cost=1200..1220 rows=20)
--       Group Key: c.name
--       -> Hash Join  (cost=50..1100 rows=5000)
--           Hash Cond: (o.customer_id = c.customer_id)
--           -> Seq Scan on orders o  (cost=0..950 rows=5000)   << BOTTLENECK
--               Filter: (order_date >= '2024-01-01')
--           -> Hash  (cost=25..25 rows=1000)
--               -> Seq Scan on customers c

-- ANALYSIS:
-- 1. "Seq Scan on orders" = full table scan. No index on order_date.
-- 2. Hash Join is fine for this size. Nested Loop would be bad.
-- 3. HashAggregate is efficient for GROUP BY with few groups.

-- FIX: Create an index on the filter column
CREATE INDEX idx_orders_date ON orders (order_date);

-- For covering index (even faster, avoids table lookup):
CREATE INDEX idx_orders_date_covering
    ON orders (order_date, customer_id, amount);

-- After index: Seq Scan becomes Index Scan or Bitmap Index Scan
-- Expected improvement: 10-100x on large tables

Key EXPLAIN terms: Seq Scan = reads every row (bad for large tables with selective filters). Index Scan = uses an index (good). Bitmap Index Scan = batch index lookup (good for medium selectivity). Hash Join = builds hash table for join (good for medium tables). Nested Loop = for each outer row, scans inner table (good only with index on inner table).
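To make the Seq-Scan-vs-Index-Scan shift concrete, here is a minimal sketch using SQLite's EXPLAIN QUERY PLAN from Python (table and index names are invented for the demo). SQLite's plan vocabulary is "SCAN"/"SEARCH" rather than PostgreSQL's "Seq Scan"/"Index Scan", but the plan change after creating the index is the same idea:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
             "amount REAL, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(i, i % 100, 10.0 * i, f"2024-{(i % 12) + 1:02d}-01") for i in range(1000)],
)

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM orders WHERE order_date >= '2024-12-01'"
p_before = plan(query)   # full table scan (SCAN): no index exists yet
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")
p_after = plan(query)    # index seek (SEARCH ... USING INDEX)
print(p_before)
print(p_after)
```

The same before/after comparison with `EXPLAIN ANALYZE` is exactly what an interviewer expects you to narrate in PostgreSQL.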

Challenge 2: Indexing Strategy

📝
Problem: You have an e-commerce database with these common query patterns. Design the optimal indexing strategy. Explain which indexes to create and why, and which to avoid.

Solution

-- Query Pattern 1: Find orders by customer (very frequent)
SELECT * FROM orders WHERE customer_id = 101;
-- Index: CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Query Pattern 2: Find orders by date range + status (frequent)
SELECT * FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31' AND status = 'completed';
-- Composite index needed, and column order matters: put EQUALITY columns
-- first, then the RANGE column, so the index can seek directly to the
-- matching status and scan the date range within it:
CREATE INDEX idx_orders_status_date ON orders (status, order_date);

-- Query Pattern 3: Aggregation by category (daily report)
SELECT category, SUM(amount) FROM orders GROUP BY category;
-- Covering index: CREATE INDEX idx_orders_cat_amt ON orders (category, amount);
-- "Covering" means the index contains all needed columns - no table lookup required.

-- Query Pattern 4: Find latest order per customer
SELECT DISTINCT ON (customer_id) * FROM orders ORDER BY customer_id, order_date DESC;
-- Composite index: CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date DESC);

-- ANTI-PATTERNS: Indexes to AVOID
-- 1. Don't index low-cardinality columns like booleans (the planner usually prefers a scan anyway; a partial index can serve rare values)
-- 2. Don't index columns in small tables (< 1000 rows)
-- 3. Don't create redundant indexes (idx on (A, B) covers queries on (A) alone)
-- 4. Don't over-index write-heavy tables (each INSERT/UPDATE must update all indexes)

-- RULE OF THUMB:
-- Read-heavy tables: more indexes OK (5-8)
-- Write-heavy tables: minimal indexes (2-3)
-- Each index costs ~1-5% on INSERT/UPDATE performance
Composite index column order: For a composite index (A, B, C), queries can use it for filters on (A), (A, B), or (A, B, C), but NOT for (B) alone or (B, C). This is called the "leftmost prefix" rule. In interviews, always explain this when discussing composite indexes.
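The leftmost-prefix rule can be verified directly. A hedged sketch with a hypothetical table `t` and composite index `idx_abc` in SQLite (whose planner applies the same rule; "SEARCH" means an index seek, "SCAN" a full scan):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE INDEX idx_abc ON t (a, b, c)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(i % 10, i % 7, i) for i in range(500)])

def plan(sql):
    return " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

p_a  = plan("SELECT * FROM t WHERE a = 1")            # seek: a is the leftmost column
p_ab = plan("SELECT * FROM t WHERE a = 1 AND b = 2")  # seek: (a, b) is a prefix
p_b  = plan("SELECT * FROM t WHERE b = 2")            # no seek: b alone is not a prefix
print(p_a)
print(p_ab)
print(p_b)
```

(Without up-to-date statistics the planner cannot apply tricks like skip-scan, so the third query falls back to a scan, exactly as the rule predicts.)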

Challenge 3: Query Rewriting for Performance

📝
Problem: Rewrite these slow queries into faster equivalents. Explain why the original is slow and the rewrite is faster.

Solution

-- SLOW: Correlated subquery (runs once per row)
SELECT *
FROM orders o
WHERE amount > (
    SELECT AVG(amount) FROM orders WHERE category = o.category
);

-- FAST: Window function (single pass)
SELECT *
FROM (
    SELECT *, AVG(amount) OVER (PARTITION BY category) AS cat_avg
    FROM orders
) t
WHERE amount > cat_avg;
-- Why faster: One table scan instead of N scans (one per row).

-- ========================================

-- SLOW: SELECT DISTINCT with large result set
SELECT DISTINCT customer_id
FROM orders
WHERE order_date >= '2024-01-01';

-- FAST: EXISTS (stops at the first matching order per customer)
SELECT customer_id
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND o.order_date >= '2024-01-01'
);
-- (Equivalent only if every customer_id in orders exists in customers.)
-- Or use GROUP BY:
SELECT customer_id FROM orders WHERE order_date >= '2024-01-01' GROUP BY customer_id;

-- ========================================

-- SLOW: OR conditions that prevent index use
SELECT * FROM orders WHERE customer_id = 101 OR order_date = '2024-01-15';

-- FAST: UNION to allow index use on each condition
SELECT * FROM orders WHERE customer_id = 101
UNION
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- Why faster: Each SELECT can use its own index. An OR across different
-- columns often forces a full table scan (though PostgreSQL can sometimes
-- combine indexes with a BitmapOr instead).

-- ========================================

-- SLOW: Function on indexed column (kills index use)
SELECT * FROM orders WHERE EXTRACT(YEAR FROM order_date) = 2024;

-- FAST: Range condition (preserves index)
SELECT * FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
-- Why: Applying a function to a column prevents index use. A predicate that
-- CAN use an index is called "sargable" (Search ARGument-able).
-- Always keep the indexed column "bare" on one side of the comparison.
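The sargability point is easy to demonstrate. A sketch in SQLite (where `strftime` stands in for `YEAR()`/`EXTRACT`; the table name is invented for the demo): wrapping the indexed column in a function forces a full scan, while the bare range predicate lets the planner seek on the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, order_date TEXT)")
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, 10.0 * i, f"202{i % 3}-0{(i % 9) + 1}-15") for i in range(300)])

def plan(sql):
    return " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function on the column: full scan, the index cannot be used
p_func = plan("SELECT * FROM orders WHERE strftime('%Y', order_date) = '2024'")
# Bare column with a range: index seek (SEARCH ... USING INDEX)
p_range = plan("SELECT * FROM orders "
               "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'")
print(p_func)
print(p_range)
```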

Challenge 4: Avoiding the N+1 Query Problem

📝
Problem: An application fetches all customers, then loops through each to fetch their orders. This creates N+1 queries. Rewrite it as a single SQL query and explain the performance difference.

Solution

-- N+1 PATTERN (pseudo-code showing what the application does):
-- Query 1: SELECT * FROM customers;                          -- 1 query
-- For each customer:
--   Query N: SELECT * FROM orders WHERE customer_id = ?;     -- N queries
-- Total: 1 + N queries (if 1000 customers = 1001 queries!)

-- FIX 1: Single JOIN query
SELECT
    c.customer_id,
    c.name,
    c.email,
    o.order_id,
    o.amount,
    o.order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_date;
-- Total: 1 query. Network round trips: 1 instead of 1001.

-- FIX 2: If you need aggregated data, not individual orders
SELECT
    c.customer_id,
    c.name,
    COUNT(o.order_id) AS total_orders,
    COALESCE(SUM(o.amount), 0) AS total_spent,
    MAX(o.order_date) AS last_order
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name
ORDER BY total_spent DESC;

-- FIX 3: Batch IN query (when you can't use JOIN)
-- Instead of N separate queries:
SELECT * FROM orders WHERE customer_id IN (101, 102, 103, ...);
-- Total: 2 queries instead of N+1.

-- PERFORMANCE COMPARISON:
-- Assume 1000 customers, 10ms per query, 5ms network latency:
-- N+1:      1001 * (10ms + 5ms) = 15 seconds
-- Single:   1 * (50ms + 5ms)    = 55 milliseconds
-- Speedup:  ~270x

Key insight: The N+1 problem is the most common performance issue in web applications. ORMs (Django, SQLAlchemy, ActiveRecord) often create N+1 queries silently through lazy loading. Always check the query log in development. Use select_related/prefetch_related (Django) or joinedload/selectinload (SQLAlchemy) to generate JOINs or batched IN queries automatically.
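The query-count difference is easy to show end to end. A minimal sketch of the N+1 pattern vs the single JOIN, using an in-memory SQLite database (an in-memory DB has no network latency, so the real round-trip savings are larger than anything measured here; names are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"c{i}") for i in range(100)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 100, float(i)) for i in range(1000)])

# N+1 pattern: one query for customers, then one query per customer
queries = 0
totals_n1 = {}
customers = conn.execute("SELECT customer_id FROM customers").fetchall()
queries += 1
for (cid,) in customers:
    rows = conn.execute("SELECT amount FROM orders WHERE customer_id = ?",
                        (cid,)).fetchall()
    queries += 1
    totals_n1[cid] = sum(a for (a,) in rows)
print(queries)  # 101 queries for 100 customers

# Single JOIN + aggregate: one query, same result
totals_join = dict(conn.execute("""
    SELECT c.customer_id, COALESCE(SUM(o.amount), 0)
    FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id
""").fetchall())
assert totals_join == totals_n1
```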

Challenge 5: Table Partitioning for Large Datasets

📝
Problem: You have an events table with 5 billion rows spanning 3 years. Queries almost always filter by date range. Design a partitioning strategy and explain how it improves query performance.

Solution

-- Strategy: Range partition by month on event_date

-- PostgreSQL declarative partitioning:
CREATE TABLE events (
    event_id BIGINT,
    user_id INT,
    event_type VARCHAR(50),
    event_data JSONB,
    event_date TIMESTAMP NOT NULL
) PARTITION BY RANGE (event_date);

-- Create monthly partitions
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
CREATE TABLE events_2024_03 PARTITION OF events
    FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
-- ... create one per month for 3 years = 36 partitions

-- Create indexes on the partitioned parent; they cascade to every partition (PostgreSQL 11+)
CREATE INDEX idx_events_user ON events (user_id);
CREATE INDEX idx_events_type_date ON events (event_type, event_date);

-- HOW IT HELPS:
-- Without partitioning:
EXPLAIN SELECT * FROM events WHERE event_date BETWEEN '2024-03-01' AND '2024-03-31';
-- Seq Scan on events  (rows=5,000,000,000) -- scans ALL 5B rows!

-- With partitioning:
-- The planner uses "partition pruning" to only scan the March partition
-- Index Scan on events_2024_03  (rows=~140,000,000) -- scans only 1 month!
-- That's ~36x fewer rows scanned.

-- PARTITION MAINTENANCE (automated with pg_partman):
-- 1. Create future partitions ahead of time
-- 2. Detach and archive old partitions: ALTER TABLE events DETACH PARTITION events_2022_01;
-- 3. Dropping an old partition is instant (vs DELETE, which is very slow)

-- WHEN TO PARTITION:
-- Table > 100M rows AND queries almost always filter on the partition key
-- NOT worth it for small tables or queries that span all partitions

-- OTHER PARTITION TYPES:
-- RANGE: dates, IDs (most common)
-- LIST: region, status, category (finite set of values)
-- HASH: uniform distribution when no natural range key
💡
Interview tip: When discussing partitioning, mention these trade-offs: (1) Partition pruning only works when the partition key is in the WHERE clause. (2) Cross-partition queries (e.g., joining two partitioned tables) can be slower. (3) Too many partitions (thousands) create overhead. (4) Unique constraints must include the partition key.
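To make partition pruning concrete, here is a back-of-envelope simulation: one physical table per month, with the application routing a date-range query only to partitions that overlap it. PostgreSQL's planner does this routing automatically at plan/execution time; this sketch (table names invented, mirroring the events_YYYY_MM scheme above) just makes the "touch fewer partitions" effect visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per month, each with its [lo, hi) date range
partitions = {
    "events_2024_01": ("2024-01-01", "2024-02-01"),
    "events_2024_02": ("2024-02-01", "2024-03-01"),
    "events_2024_03": ("2024-03-01", "2024-04-01"),
}
for table in partitions:
    conn.execute(f"CREATE TABLE {table} (event_id INTEGER, event_date TEXT)")
for i in range(300):
    month = (i % 3) + 1
    conn.execute(f"INSERT INTO events_2024_{month:02d} VALUES (?, ?)",
                 (i, f"2024-{month:02d}-15"))

def query_range(start, end):
    """Partition pruning by hand: skip partitions outside [start, end)."""
    rows, scanned = [], []
    for table, (lo, hi) in partitions.items():
        if lo < end and hi > start:  # partition range overlaps query range
            scanned.append(table)
            rows += conn.execute(
                f"SELECT * FROM {table} WHERE event_date >= ? AND event_date < ?",
                (start, end)).fetchall()
    return rows, scanned

rows, scanned = query_range("2024-03-01", "2024-04-01")
print(scanned)     # only the March partition is consulted
print(len(rows))   # 100
```

Note the pruning condition only fires because the query filters on the partition key, which is trade-off (1) from the tip above.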