Query Optimization
Writing correct SQL is step one. Writing SQL that runs efficiently on billions of rows is what separates production engineers from notebook practitioners. These 5 challenges cover the optimization topics that senior-level interviews test.
Challenge 1: Reading EXPLAIN Plans
Solution
-- The slow query
SELECT c.name, COUNT(*) AS order_count, SUM(o.amount) AS total_spent
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.name
ORDER BY total_spent DESC;
-- Run EXPLAIN ANALYZE (PostgreSQL)
EXPLAIN ANALYZE
SELECT c.name, COUNT(*), SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.name
ORDER BY SUM(o.amount) DESC;
-- Example output (simplified):
-- Sort  (cost=1250..1255 rows=20)
--   Sort Key: sum(o.amount) DESC
--   ->  HashAggregate  (cost=1200..1220 rows=20)
--         Group Key: c.name
--         ->  Hash Join  (cost=50..1100 rows=5000)
--               Hash Cond: (o.customer_id = c.customer_id)
--               ->  Seq Scan on orders o  (cost=0..950 rows=5000)   << BOTTLENECK
--                     Filter: (order_date >= '2024-01-01')
--               ->  Hash  (cost=25..25 rows=1000)
--                     ->  Seq Scan on customers c
-- ANALYSIS:
-- 1. "Seq Scan on orders" = full table scan. No index on order_date.
-- 2. Hash Join is fine for this size. Nested Loop would be bad.
-- 3. HashAggregate is efficient for GROUP BY with few groups.
-- FIX: Create an index on the filter column
CREATE INDEX idx_orders_date ON orders (order_date);
-- For covering index (even faster, avoids table lookup):
CREATE INDEX idx_orders_date_covering
ON orders (order_date, customer_id, amount);
-- After index: Seq Scan becomes Index Scan or Bitmap Index Scan
-- Expected improvement: 10-100x on large tables
Key EXPLAIN terms: Seq Scan = reads every row (bad for large tables with selective filters). Index Scan = uses an index (good). Bitmap Index Scan = batch index lookup (good for medium selectivity). Hash Join = builds hash table for join (good for medium tables). Nested Loop = for each outer row, scans inner table (good only with index on inner table).
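To see the Seq Scan vs. Index Scan distinction hands-on without a PostgreSQL instance, here is a minimal sketch using SQLite's EXPLAIN QUERY PLAN. SQLite prints "SCAN" where PostgreSQL says "Seq Scan" and "SEARCH" where PostgreSQL says "Index Scan", but the concept is identical: the same query goes from a full table scan to an index lookup once the filter column is indexed. The schema here is illustrative, not a real dataset.

```python
# Watch a full scan become an index lookup after CREATE INDEX.
# SQLite's plan vocabulary: SCAN = read every row, SEARCH = use an index.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
             "amount REAL, order_date TEXT)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT * FROM orders WHERE order_date >= '2024-01-01'"
before = plan(q)   # full table scan: no index on order_date
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")
after = plan(q)    # index seek on the range predicate

print(before)
print(after)
```

Plan text varies slightly between SQLite versions, so match on the SCAN/SEARCH keywords rather than exact strings.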
Challenge 2: Indexing Strategy
Solution
-- Query Pattern 1: Find orders by customer (very frequent)
SELECT * FROM orders WHERE customer_id = 101;
-- Index: CREATE INDEX idx_orders_customer ON orders (customer_id);
-- Query Pattern 2: Find orders by date range + status (frequent)
SELECT * FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31' AND status = 'completed';
-- Composite index -- column order matters!
-- A tempting order is (order_date, status), but with the RANGE column first
-- the index cannot also narrow on status. Put the EQUALITY column first:
CREATE INDEX idx_orders_status_date ON orders (status, order_date);
-- Query Pattern 3: Aggregation by category (daily report)
SELECT category, SUM(amount) FROM orders GROUP BY category;
-- Covering index: CREATE INDEX idx_orders_cat_amt ON orders (category, amount);
-- "Covering" means the index contains all needed columns - no table lookup required.
-- Query Pattern 4: Find latest order per customer
SELECT DISTINCT ON (customer_id) * FROM orders ORDER BY customer_id, order_date DESC;
-- Composite index: CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date DESC);
-- ANTI-PATTERNS: Indexes to AVOID
-- 1. Don't index boolean columns (low cardinality = full scan is faster)
-- 2. Don't index columns in small tables (< 1000 rows)
-- 3. Don't create redundant indexes (idx on (A, B) covers queries on (A) alone)
-- 4. Don't over-index write-heavy tables (each INSERT/UPDATE must update all indexes)
-- RULE OF THUMB:
-- Read-heavy tables: more indexes OK (5-8)
-- Write-heavy tables: minimal indexes (2-3)
-- Each index costs ~1-5% on INSERT/UPDATE performance
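The "leftmost prefix" behavior of composite indexes can be observed directly with SQLite's EXPLAIN QUERY PLAN (a sketch; table and column names are illustrative, and plan wording differs from PostgreSQL's): filters on (a) or (a, b) can seek the index on (a, b, c), while a filter on b alone cannot.

```python
# With a composite index on (a, b, c), only a leading prefix of columns
# lets the engine SEARCH (seek) the index; filtering on b alone forces a SCAN.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE INDEX idx_abc ON t (a, b, c)")

def plan(sql):
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

uses_prefix = plan("SELECT * FROM t WHERE a = 1")            # SEARCH using idx_abc
uses_two    = plan("SELECT * FROM t WHERE a = 1 AND b = 2")  # SEARCH using idx_abc
middle_only = plan("SELECT * FROM t WHERE b = 2")            # SCAN: cannot seek on b

# Note: SQLite may still SCAN the index itself for the b-only query (it covers
# all selected columns), but it cannot SEARCH it -- every entry must be read.
print(uses_prefix)
print(uses_two)
print(middle_only)
```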
Key insight: With a composite index on (A, B, C), queries can use it for filters on (A), (A, B), or (A, B, C), but NOT for (B) alone or (B, C). This is called the "leftmost prefix" rule. In interviews, always explain this when discussing composite indexes.
Challenge 3: Query Rewriting for Performance
Solution
-- SLOW: Correlated subquery (runs once per row)
SELECT *
FROM orders o
WHERE amount > (
SELECT AVG(amount) FROM orders WHERE category = o.category
);
-- FAST: Window function (single pass)
SELECT *
FROM (
SELECT *, AVG(amount) OVER (PARTITION BY category) AS cat_avg
FROM orders
) t
WHERE amount > cat_avg;
-- Why faster: One table scan instead of N scans (one per row).
-- ========================================
-- SLOW: SELECT DISTINCT with large result set
SELECT DISTINCT customer_id
FROM orders
WHERE order_date >= '2024-01-01';
-- FAST: EXISTS (stops at first match per group)
SELECT customer_id
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders WHERE customer_id = c.customer_id AND order_date >= '2024-01-01'
);
-- Or use GROUP BY:
SELECT customer_id FROM orders WHERE order_date >= '2024-01-01' GROUP BY customer_id;
-- ========================================
-- SLOW: OR conditions that prevent index use
SELECT * FROM orders WHERE customer_id = 101 OR order_date = '2024-01-15';
-- FAST: UNION to allow index use on each condition
SELECT * FROM orders WHERE customer_id = 101
UNION
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- Why faster: Each SELECT can use its own index. OR often forces a full table scan.
-- ========================================
-- SLOW: Function on indexed column (kills index)
SELECT * FROM orders WHERE EXTRACT(YEAR FROM order_date) = 2024;
-- FAST: Range condition (preserves index)
SELECT * FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
-- Why: Applying a function to a column prevents index use. A predicate the
-- index can serve directly is called "sargable" (Search ARGument ABLE).
-- Always keep the indexed column "bare" on one side of the comparison.
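The function-kills-index rule can be demonstrated with a small SQLite sketch, where strftime plays the role of YEAR()/EXTRACT (names and schema are illustrative): the wrapped predicate forces a scan, the equivalent bare range predicate seeks the index.

```python
# Sargability demo: a function on the indexed column forces a full SCAN;
# the equivalent bare range predicate can SEARCH the index.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
conn.execute("CREATE INDEX idx_date ON orders (order_date)")

def plan(sql):
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

func_on_column = plan(
    "SELECT order_id FROM orders WHERE strftime('%Y', order_date) = '2024'")
bare_range = plan(
    "SELECT order_id FROM orders "
    "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'")

print(func_on_column)  # SCAN: the index cannot be seeked through the function
print(bare_range)      # SEARCH using idx_date
```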
Challenge 4: Avoiding the N+1 Query Problem
Solution
-- N+1 PATTERN (pseudo-code showing what the application does):
-- Query 1: SELECT * FROM customers; -- 1 query
-- For each customer:
-- Query N: SELECT * FROM orders WHERE customer_id = ?; -- N queries
-- Total: 1 + N queries (if 1000 customers = 1001 queries!)
-- FIX 1: Single JOIN query
SELECT
c.customer_id,
c.name,
c.email,
o.order_id,
o.amount,
o.order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_date;
-- Total: 1 query. Network round trips: 1 instead of 1001.
-- FIX 2: If you need aggregated data, not individual orders
SELECT
c.customer_id,
c.name,
COUNT(o.order_id) AS total_orders,
COALESCE(SUM(o.amount), 0) AS total_spent,
MAX(o.order_date) AS last_order
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name
ORDER BY total_spent DESC;
-- FIX 3: Batch IN query (when you can't use JOIN)
-- Instead of N separate queries:
SELECT * FROM orders WHERE customer_id IN (101, 102, 103, ...);
-- Total: 2 queries instead of N+1.
-- PERFORMANCE COMPARISON:
-- Assume 1000 customers, 10ms per query, 5ms network latency:
-- N+1: 1001 * (10ms + 5ms) = 15 seconds
-- Single: 1 * (50ms + 5ms) = 55 milliseconds
-- Speedup: ~270x
Key insight: The N+1 problem is the most common performance issue in web applications. ORMs (Django, SQLAlchemy, ActiveRecord) often create N+1 queries silently with lazy loading. Always check the query log in development. Use select_related (Django) or eager_loading (SQLAlchemy) to generate JOINs automatically.
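The query-count difference is easy to make visible outside an ORM. A minimal sketch using sqlite3's trace callback (the schema and the choice of 5 customers are illustrative) counts the statements each approach actually issues: 1 + N for the loop, 2 for the batched IN query.

```python
# Count the statements issued by the N+1 pattern vs. a batched IN query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(i,) for i in range(1, 6)])

queries = []
conn.set_trace_callback(lambda sql: queries.append(sql))  # log every statement

# N+1 pattern: one query for customers, then one query per customer
customer_ids = [r[0] for r in conn.execute("SELECT customer_id FROM customers")]
for cid in customer_ids:
    conn.execute("SELECT * FROM orders WHERE customer_id = ?", (cid,)).fetchall()
n_plus_one = len(queries)   # 1 + 5 = 6 statements

# Batched: one query for customers, one IN query for all their orders
queries.clear()
ids = [r[0] for r in conn.execute("SELECT customer_id FROM customers")]
placeholders = ",".join("?" * len(ids))
conn.execute(f"SELECT * FROM orders WHERE customer_id IN ({placeholders})",
             ids).fetchall()
batched = len(queries)      # 2 statements

print(n_plus_one, batched)
```

With 1000 customers the loop would issue 1001 statements while the batched version still issues 2, which is exactly where the ~270x round-trip savings comes from.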
Challenge 5: Table Partitioning for Large Datasets
Solution
-- Strategy: Range partition by month on event_date
-- PostgreSQL declarative partitioning:
CREATE TABLE events (
event_id BIGINT,
user_id INT,
event_type VARCHAR(50),
event_data JSONB,
event_date TIMESTAMP NOT NULL
) PARTITION BY RANGE (event_date);
-- Create monthly partitions
CREATE TABLE events_2024_01 PARTITION OF events
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
CREATE TABLE events_2024_03 PARTITION OF events
FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
-- ... create one per month for 3 years = 36 partitions
-- Create indexes on the parent table (PostgreSQL 11+ automatically creates a matching index on every partition)
CREATE INDEX idx_events_user ON events (user_id);
CREATE INDEX idx_events_type_date ON events (event_type, event_date);
-- HOW IT HELPS:
-- Without partitioning:
EXPLAIN SELECT * FROM events WHERE event_date BETWEEN '2024-03-01' AND '2024-03-31';
-- Seq Scan on events (rows=5,000,000,000) -- scans ALL 5B rows!
-- With partitioning:
-- The planner uses "partition pruning" to only scan the March partition
-- Index Scan on events_2024_03 (rows=~140,000,000) -- scans only 1 month!
-- That's ~36x fewer rows scanned.
-- PARTITION MAINTENANCE (automated with pg_partman):
-- 1. Create future partitions ahead of time
-- 2. Detach and archive old partitions: ALTER TABLE events DETACH PARTITION events_2022_01;
-- 3. Dropping an old partition is instant (vs DELETE, which is very slow)
-- WHEN TO PARTITION:
-- Table > 100M rows AND queries almost always filter on the partition key
-- NOT worth it for small tables or queries that span all partitions
-- OTHER PARTITION TYPES:
-- RANGE: dates, IDs (most common)
-- LIST: region, status, category (finite set of values)
-- HASH: uniform distribution when no natural range key
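The "create future partitions ahead of time" step of partition maintenance is mechanical enough to script. A minimal sketch (not pg_partman itself; the `partition_ddl` helper and the `events_YYYY_MM` naming are assumptions following the convention above) generates the monthly CREATE TABLE ... PARTITION OF statements:

```python
# Generate monthly partition DDL ahead of time, pg_partman-style.
from datetime import date

def month_start(d):
    return date(d.year, d.month, 1)

def next_month(d):
    # Roll over December -> January of the next year.
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)

def partition_ddl(start, months):
    """Yield DDL for `months` monthly partitions beginning at `start`'s month."""
    lo = month_start(start)
    for _ in range(months):
        hi = next_month(lo)
        name = f"events_{lo.year}_{lo.month:02d}"
        yield (f"CREATE TABLE {name} PARTITION OF events "
               f"FOR VALUES FROM ('{lo}') TO ('{hi}');")
        lo = hi

ddl = list(partition_ddl(date(2024, 4, 1), 3))
for stmt in ddl:
    print(stmt)
```

Running this monthly from a scheduler (or simply generating a year at a time) keeps partitions ahead of incoming data; pg_partman automates the same idea plus detach/retention.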
Lilly Tech Systems