ML System Design Round
The ML system design round is where senior candidates differentiate themselves. This is not about knowing the right model — it is about designing an end-to-end system that works reliably in production at scale.
The 4-Step Framework
Use this framework for every ML system design question. Interviewers expect you to drive the conversation through these phases:
Clarify
Define the problem precisely. What is the business objective? Who are the users? What are the constraints (latency, scale, budget)? What data is available? What is the success metric?
Design
Draw the high-level architecture. Data pipeline, feature engineering, model selection, training infrastructure, serving layer, and monitoring. Start simple, then add complexity.
Deep Dive
Go deep on 2–3 critical components. The interviewer will guide you toward their area of interest. Show depth in feature engineering, model architecture, or serving infrastructure.
Trade-offs
Discuss alternatives, limitations, and how you would iterate. What would you do differently with more data, time, or compute? What could go wrong and how would you detect it?
Walkthrough 1: Design a Recommendation System
Prompt: "Design a movie recommendation system for a streaming platform with 100 million users and 50,000 titles."
Step 1: Clarify
Questions you should ask:
- "What is the primary business metric? Watch time? Subscriptions? Content diversity?"
- "What data do we have? Watch history, ratings, browsing behavior, demographics?"
- "What is the latency requirement? How fast must recommendations load?"
- "Should we personalize for new users (cold start)?"
Assumptions: Primary metric is watch time. We have watch history, ratings, and browsing data. Latency requirement is <200ms. Cold start is important (5% of users are new).
Step 2: High-Level Architecture
RECOMMENDATION SYSTEM ARCHITECTURE
====================================
[User Request] --> [API Gateway]
|
[Candidate Generation] -- 1000 candidates from 50K titles
|
[Feature Store] <-- [Feature Pipeline (batch + streaming)]
|
[Ranking Model] -- Score 1000 candidates
|
[Business Rules] -- Filter by region, age rating, licensing
|
[Re-ranking] -- Diversity, freshness, explore/exploit
|
[Response: Top 50]
Offline Components:
[Data Warehouse] --> [Training Pipeline] --> [Model Registry]
[Event Stream] --> [Feature Pipeline] --> [Feature Store]
[A/B Testing Framework] --> [Metrics Dashboard]
Step 3: Deep Dive — Two-Stage Architecture
Stage 1: Candidate Generation
- Collaborative Filtering: Matrix factorization (ALS) on user-item interaction matrix. Produces user and item embeddings. Fast approximate nearest neighbor search (FAISS) retrieves top 500 candidates in <10ms.
- Content-Based: TF-IDF or BERT embeddings of movie metadata (genre, cast, plot). Retrieves 300 similar items to recent watches.
- Popularity-Based: Trending titles and new releases add 200 candidates for diversity and cold start.
- Union of all candidates: ~1000 unique titles.
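The embedding-retrieval step above can be sketched as a brute-force nearest-neighbor search over item embeddings (a production system would use an approximate index such as FAISS; the random embeddings here are purely illustrative):

```python
import numpy as np

def top_k_candidates(user_vec: np.ndarray, item_vecs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k items whose embeddings are most similar
    to the user embedding (cosine similarity, brute force)."""
    # Normalize so a dot product equals cosine similarity.
    u = user_vec / np.linalg.norm(user_vec)
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = items @ u
    # argpartition finds the top-k in O(n); sort only those k.
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]

rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(50_000, 64))   # 50K title embeddings
user_vec = rng.normal(size=64)
candidates = top_k_candidates(user_vec, item_vecs, k=500)
print(len(candidates))  # 500
```

FAISS replaces the brute-force scan with an approximate index, which is how the <10ms budget is met at this scale.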
Stage 2: Ranking Model
- Model: Gradient-boosted decision tree (LightGBM) or deep ranking model (DCN-v2).
- Features: User features (watch history embeddings, demographics, time-of-day), item features (genre, release year, popularity, average rating), cross features (user-genre affinity, historical CTR for this user-item pair).
- Label: Binary — did the user watch more than 70% of the title? (proxy for engagement)
- Training: Retrained daily on the last 30 days of data. Evaluated on next-day holdout.
Step 4: Trade-offs
| Decision | Alternative | Why We Chose This |
|---|---|---|
| Two-stage (candidate gen + ranking) | Single model scoring all 50K items | Scoring all 50K items per request at 100M users is computationally infeasible |
| LightGBM ranker | Deep neural ranker (DCN-v2) | LightGBM is faster to train, easier to debug, and competitive on tabular features |
| Daily retraining | Real-time online learning | Daily is sufficient for content recommendations; online learning adds complexity |
| Watch 70% as positive label | Clicks or explicit ratings as labels | Clicks are noisy (clickbait); ratings are sparse; watch completion signals true engagement |
Walkthrough 2: Design a Fraud Detection System
Prompt: "Design a real-time fraud detection system for a payment platform processing 10,000 transactions per second."
Step 1: Clarify
Primary metric: minimize financial loss while keeping false positive rate under 1% (to avoid blocking legitimate transactions). Latency: decision must be made within 100ms. Data: transaction amount, merchant, location, device, user history.
Step 2: Architecture
FRAUD DETECTION ARCHITECTURE
==============================
[Transaction] --> [Real-time Feature Engine]
|
Features:
- Transaction: amount, merchant category, country
- Velocity: txns in last 1h/24h/7d for this user
- Device: fingerprint, IP geolocation, is_new_device
- Behavioral: time since last txn, amount deviation
|
[Model Ensemble]
├── Rules Engine (hard rules: amount > $10K, blocked countries)
├── LightGBM (tabular features, fast inference)
└── Neural Network (sequence of recent transactions)
|
[Decision Engine]
├── score > 0.9 --> BLOCK (auto-decline)
├── 0.5 < score < 0.9 --> REVIEW (manual queue)
└── score < 0.5 --> APPROVE
|
[Feedback Loop]
- Chargebacks (delayed label, 30-90 days)
- Manual review outcomes (same-day label)
- Customer reports (real-time label)
Step 3: Deep Dive — Real-Time Feature Engineering
Real-time feature engineering is the most critical component: fraud patterns depend heavily on velocity features (how many transactions in the last hour) and deviation features (is this amount unusual for this user?).
- Streaming aggregations: Use Apache Flink or Kafka Streams to maintain sliding window counters per user (count, sum, avg in last 1h, 24h, 7d).
- Feature store: Redis for low-latency feature retrieval (<5ms). Precomputed features updated in near-real-time.
- Graph features: Shared device/IP/email across accounts (fraud rings). Computed in batch, refreshed hourly.
Step 4: Key Trade-offs
- Latency vs. accuracy: More features and larger models improve accuracy but increase latency. We use a lightweight model for real-time scoring and a heavier model for the review queue.
- Label delay: Chargebacks arrive 30–90 days after the transaction. We use manual review outcomes for faster feedback and retrain weekly.
- Adversarial adaptation: Fraudsters adapt to the model. We monitor model performance daily and retrain with the latest fraud patterns.
Walkthrough 3: Design a Search Ranking System
Prompt: "Design the search ranking system for an e-commerce platform with 500 million products."
Step 1: Clarify
Metric: revenue per search (combination of click-through rate, conversion rate, and order value). Latency: <200ms end-to-end. Must handle typos, synonyms, and multi-language queries.
Step 2: Architecture
SEARCH RANKING ARCHITECTURE
==============================
[User Query: "wireless headphones under $50"]
|
[Query Understanding]
├── Spell correction ("wireles" → "wireless")
├── Query expansion (synonyms: "earbuds", "earphones")
├── Intent classification (product search vs. brand search)
└── Entity extraction (category: headphones, price: <$50)
|
[Retrieval: Elasticsearch / Inverted Index]
└── BM25 + semantic search (bi-encoder embeddings)
└── Returns top 1000 candidates
|
[Feature Engineering]
├── Query-product relevance (BM25 score, semantic similarity)
├── Product quality (reviews, rating, return rate)
├── User personalization (past purchases, click history)
└── Business signals (margin, inventory, promoted)
|
[Learning-to-Rank Model: LambdaMART]
└── Scores and re-ranks 1000 candidates
|
[Business Rules + Diversity]
├── Remove out-of-stock items
├── Apply price filter
└── Ensure brand diversity in top 10
|
[Return Top 48 Results]
Step 3: Deep Dive — Learning to Rank
Model choice: LambdaMART (gradient-boosted trees optimized for ranking metrics like NDCG). Advantages: handles tabular features well, fast inference, interpretable feature importance.
Training data: Click logs with position bias correction. A click at position 10 is more valuable than a click at position 1 (the user scrolled past 9 other results). Use inverse propensity scoring to debias.
Key features (ranked by importance):
- Query-title BM25 score
- Semantic similarity (bi-encoder cosine similarity)
- Historical CTR for this product
- Product review score (weighted by recency)
- Price competitiveness (percentile within category)
- User-product affinity (based on purchase history)
Step 4: Trade-offs
- BM25 vs. semantic search: BM25 handles exact keyword matches well but misses synonyms. Semantic search (dense retrieval) captures meaning but is computationally expensive. We use a hybrid: BM25 for recall, semantic reranking for precision.
- Optimizing for clicks vs. revenue: Optimizing for clicks may surface cheap, clickable products. Optimizing for revenue may surface expensive products that do not convert. We use a blended metric: 0.6 * conversion_rate + 0.3 * revenue_per_click + 0.1 * CTR.
- Online vs. offline evaluation: NDCG on offline holdout data is our primary offline metric. A/B tests measure revenue per search online. We only ship models that win on both.