DL Interview Overview
Before diving into specific topics, understand what companies actually ask in deep learning interviews, the depth expected at different seniority levels, and how to structure your answers for maximum impact — whether on a whiteboard, shared screen, or in conversation.
What Companies Actually Ask
Deep learning interviews vary by company type, but they generally fall into three categories. Knowing which category your target company falls into shapes your preparation strategy.
Big Tech (Google, Meta, Amazon, Apple)
Format: 45–60 min rounds, typically 1–2 DL-specific rounds in an ML role loop. Expect a mix of theory questions and coding (implement a layer, write a training loop). They test breadth across architectures and depth on at least one topic.
Example questions: "Explain how multi-head attention works and write it in PyTorch." "Why does batch norm behave differently at train vs. inference time?" "Design a model for [real product feature]."
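The batch norm question comes up often enough that it is worth seeing concretely. A minimal sketch: in training mode BatchNorm normalizes with the current batch's statistics (and updates its running averages); in eval mode it normalizes with those running averages, so the same input produces different outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(4)          # 4 features
x = torch.randn(8, 4) * 3 + 5   # batch with nonzero mean and variance

# Training mode: normalize with *batch* statistics, update running stats.
bn.train()
y_train = bn(x)

# Eval mode: normalize with the *running* averages instead.
bn.eval()
y_eval = bn(x)

# Same input, different outputs -- the core of the interview answer.
print(torch.allclose(y_train, y_eval))  # False
```

Being able to explain why the running averages exist (inference may see batch size 1, where batch statistics are meaningless) is the depth interviewers look for.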
AI Startups (OpenAI, Anthropic, Cohere, Scale)
Format: Deeper technical dives, often 2–3 DL rounds. Expect questions on recent papers, scaling laws, and implementation details. They care about whether you have actually trained large models and debugged real issues.
Example questions: "Walk me through the diffusion process step by step." "How would you debug a model that trains fine on 1 GPU but diverges on 8 GPUs?" "What happens to loss when you double the model size?"
Applied ML Teams (Fintech, Healthcare, Autonomous)
Format: Focus on practical application. Fewer theory deep-dives, more "how would you solve this problem" scenarios. They want to see you can pick the right model for the right problem and handle real-world data issues.
Example questions: "We have 10K labeled medical images. How would you build a classifier?" "What's your approach to handling class imbalance in fraud detection?" "How do you decide between a CNN and a Transformer for this task?"
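For the class-imbalance question, one standard technique worth knowing cold is class-weighted loss. A hedged sketch with made-up counts: weighting the rare positive class by the negative/positive ratio makes each fraud example count proportionally more in the loss.

```python
import torch
import torch.nn as nn

# Hypothetical fraud dataset: positives are ~1% of transactions.
n_neg, n_pos = 9900, 100
pos_weight = torch.tensor([n_neg / n_pos])  # 99.0

# BCEWithLogitsLoss multiplies the positive-class term by pos_weight.
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])
loss = criterion(logits, targets)
```

In an interview, pair this with the trade-off: reweighting raises recall on the rare class at the cost of more false positives, and alternatives (resampling, focal loss, threshold tuning) each shift that balance differently.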
Depth Expected by Level
| Level | Theory Depth | Coding Depth | What They're Really Testing |
|---|---|---|---|
| Junior / New Grad | Know the standard architectures (CNN, RNN, Transformer), activation functions, loss functions, and basic optimization | Implement a simple model, training loop, or a single layer in PyTorch | Do you have solid fundamentals? Can you learn quickly? |
| Mid-Level (3–5 yrs) | Understand trade-offs between architectures, explain why techniques work (not just what they do), discuss recent developments | Implement multi-head attention, custom loss functions, data loading pipelines | Have you actually built and debugged models? Can you make design decisions? |
| Senior / Staff (5+ yrs) | Deep understanding of scaling laws, training dynamics, numerical stability, distributed training, architecture search spaces | Design training infrastructure, implement complex architectures, optimize for production | Can you lead a modeling effort end-to-end? Can you mentor others? |
How to Draw Architectures on a Whiteboard
Whiteboard explanations are a critical part of DL interviews. Here is a structured approach that works for any architecture.
- Big picture first: Draw the high-level data flow (input → processing blocks → output) before any details. Label the input and output shapes.
- Zoom into one block: Pick the most important block and expand it. For a Transformer, expand the self-attention mechanism. For a ResNet, expand the residual block.
- Add the math: Write the key equations next to the diagram. For attention: Q, K, V projections and the softmax(QK^T/sqrt(d_k))V formula. Keep notation clean.
- Discuss trade-offs: End by mentioning why this design choice was made and what alternatives exist. This is what separates good from great answers.
Example: Drawing a Transformer Block
Step 1 - Big Picture:
Input (batch, seq_len, d_model)
|
[Multi-Head Attention] + Residual Connection
|
[Layer Norm]
|
[Feed-Forward Network] + Residual Connection
|
[Layer Norm]
|
Output (batch, seq_len, d_model)
Step 2 - Zoom into Multi-Head Attention:
Input X (batch, seq_len, d_model)
|
|--- W_Q ---> Q (batch, heads, seq_len, d_k)
|--- W_K ---> K (batch, heads, seq_len, d_k)
|--- W_V ---> V (batch, heads, seq_len, d_k)
|
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
|
Concat all heads --> W_O --> Output
Step 3 - Key Math:
d_k = d_model / num_heads
Attention scores: (seq_len x d_k) @ (d_k x seq_len) = (seq_len x seq_len)
Memory: O(seq_len^2) -- this is why context length is expensive
Step 4 - Trade-offs:
"Multi-head lets the model attend to different representation subspaces.
Alternative: single-head with larger d_k has same parameter count
but empirically performs worse. Recent work (GQA, MQA) shares K,V
heads to reduce KV-cache memory at inference time."
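The whiteboard steps above translate directly into code, and "implement multi-head attention" is a common live-coding prompt. A minimal sketch (batch-first, no masking or dropout) that reproduces the shapes from Step 2:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads          # d_k = d_model / num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        # Project, then split d_model into heads: (batch, heads, seq_len, d_k)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        # Scores: (seq_len x d_k) @ (d_k x seq_len) = (seq_len x seq_len) per head
        scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                           # (batch, heads, seq_len, d_k)

        # Concat heads back to (batch, seq_len, d_model), then W_O projection
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(out)

mha = MultiHeadAttention(d_model=64, num_heads=8)
y = mha(torch.randn(2, 10, 64))  # output shape: (2, 10, 64)
```

Note the `(seq_len x seq_len)` scores tensor: that is the O(seq_len^2) memory cost from Step 3, made visible in code.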
What Distinguishes Strong vs. Weak Candidates
Strong Candidates
- Explain why something works, not just what it does
- Connect theory to practical experience: "I used this technique when..."
- Acknowledge trade-offs and limitations without being asked
- Write clean, runnable code — import statements, correct tensor shapes
- Say "I don't know, but here's how I'd reason about it" when stuck
- Ask clarifying questions about the problem before jumping to a solution
Weak Candidates
- Memorize definitions without understanding the underlying intuition
- Cannot write code that actually runs — wrong shapes, missing dimensions
- Present everything as "the best" without discussing when it fails
- Cannot connect different topics (e.g., how batch norm relates to training stability)
- Panic when asked about something they have not seen before
- Give textbook answers without any personal experience or insight
How to Structure Your Study Plan
With limited time, prioritize based on your target company and level. Here is a recommended order:
| Week | Focus Area | What to Practice |
|---|---|---|
| Week 1 | Neural Network Fundamentals | Activation functions, backpropagation, loss functions, regularization. Implement a 2-layer MLP in PyTorch from scratch. |
| Week 2 | CNNs + RNNs | Conv layer math, pooling, ResNet skip connections, LSTM gates. Implement a CNN classifier and an LSTM text classifier. |
| Week 3 | Transformers | Self-attention math, multi-head attention, positional encoding, BERT vs GPT. Implement self-attention from scratch. |
| Week 4 | Training + Generative Models | Optimizers, learning rate schedules, mixed precision, GANs, diffusion. Focus on debugging scenarios. |
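Week 1's exercise (a small MLP plus training loop, written from scratch) might look like the following sketch, here on a toy XOR task chosen purely for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy task: XOR, the classic problem a linear model cannot solve.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = nn.Sequential(
    nn.Linear(2, 16),   # hidden layer
    nn.ReLU(),
    nn.Linear(16, 1),   # output logit
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
criterion = nn.BCEWithLogitsLoss()

# The canonical training loop: zero grads, forward, loss, backward, step.
for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
```

Being able to write this loop fluently, and explain why `zero_grad()` is needed (gradients accumulate by default), is exactly the fundamentals check a junior-level coding round targets.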
Common Interview Formats
Rapid-Fire Theory (15 min)
Quick questions testing breadth. "What's the difference between L1 and L2 regularization?" "Why do we use ReLU instead of sigmoid?" Aim for 30–60 second answers. Conciseness matters.
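For the L1 vs. L2 question specifically, a 30-second answer lands better with the gradients in hand. A minimal sketch of both penalty terms:

```python
import torch

weights = torch.tensor([0.5, -1.0, 0.0, 2.0])

l1_penalty = weights.abs().sum()    # L1: sum |w|   -> 3.5
l2_penalty = (weights ** 2).sum()   # L2: sum w^2  -> 5.25

# The gradients explain the behavioral difference:
#   d/dw |w|  = sign(w): constant pull toward zero -> sparse weights
#   d/dw w^2  = 2w: pull proportional to w -> small but nonzero weights
```

That gradient contrast is the concise "why": L1 zeroes weights out (feature selection), L2 shrinks them smoothly.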
Deep Dive (20–30 min)
One topic explored in depth. "Walk me through how a Transformer processes a sentence." Expect follow-up questions that go deeper. Use the whiteboard framework above.
Live Coding (30–45 min)
"Implement multi-head attention in PyTorch." You will write real code, typically in a shared editor. Practice writing PyTorch code without autocomplete or documentation.
Design Problem (45–60 min)
"Design a model to detect hate speech in images with text." Combines architecture choice, data strategy, training approach, and deployment considerations.
Key Takeaways
- Know your target: Big Tech tests breadth, AI startups test depth, applied teams test practical judgment
- Always explain why, not just what — this is the single most important interview skill
- Practice whiteboard drawing: big picture first, then zoom in, then add math, then discuss trade-offs
- Write runnable PyTorch code with correct tensor shapes — practice without autocomplete
- The remaining lessons in this course cover every major topic area with real Q&A pairs