DL Interview Overview
Before diving into specific topics, understand what companies actually ask in deep learning interviews, the depth expected at different seniority levels, and how to structure your answers for maximum impact — whether on a whiteboard, shared screen, or in conversation.
What Companies Actually Ask
Deep learning interviews vary by company type, but they generally fall into three categories. Knowing which category your target company falls into shapes your preparation strategy.
Big Tech (Google, Meta, Amazon, Apple)
Format: 45–60 min rounds, typically 1–2 DL-specific rounds in an ML role loop. Expect a mix of theory questions and coding (implement a layer, write a training loop). They test breadth across architectures and depth on at least one topic.
Example questions: "Explain how multi-head attention works and write it in PyTorch." "Why does batch norm behave differently at train vs. inference time?" "Design a model for [real product feature]."
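The batch norm question comes up often enough that it is worth seeing concretely. A minimal sketch: in training mode BatchNorm normalizes with the current batch's statistics (and updates its running averages); in eval mode it normalizes with those running averages, so the same input produces different outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(4)          # 4 features
x = torch.randn(8, 4) * 3 + 5   # batch with nonzero mean and variance

# Training mode: normalize with *batch* statistics, update running stats.
bn.train()
y_train = bn(x)

# Eval mode: normalize with the *running* averages instead.
bn.eval()
y_eval = bn(x)

# Same input, different outputs -- the core of the interview answer.
print(torch.allclose(y_train, y_eval))  # False
```

Being able to explain why the running averages exist (inference may see batch size 1, where batch statistics are meaningless) is the depth interviewers look for.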
AI Startups (OpenAI, Anthropic, Cohere, Scale)
Format: Deeper technical dives, often 2–3 DL rounds. Expect questions on recent papers, scaling laws, and implementation details. They care about whether you have actually trained large models and debugged real issues.
Example questions: "Walk me through the diffusion process step by step." "How would you debug a model that trains fine on 1 GPU but diverges on 8 GPUs?" "What happens to loss when you double the model size?"
Applied ML Teams (Fintech, Healthcare, Autonomous)
Format: Focus on practical application. Fewer theory deep-dives, more "how would you solve this problem" scenarios. They want to see you can pick the right model for the right problem and handle real-world data issues.
Example questions: "We have 10K labeled medical images. How would you build a classifier?" "What's your approach to handling class imbalance in fraud detection?" "How do you decide between a CNN and a Transformer for this task?"
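For the class-imbalance question, one standard technique worth knowing cold is class-weighted loss. A hedged sketch with made-up counts: weighting the rare positive class by the negative/positive ratio makes each fraud example count proportionally more in the loss.

```python
import torch
import torch.nn as nn

# Hypothetical fraud dataset: positives are ~1% of transactions.
n_neg, n_pos = 9900, 100
pos_weight = torch.tensor([n_neg / n_pos])  # 99.0

# BCEWithLogitsLoss multiplies the positive-class term by pos_weight.
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])
loss = criterion(logits, targets)
```

In an interview, pair this with the trade-off: reweighting raises recall on the rare class at the cost of more false positives, and alternatives (resampling, focal loss, threshold tuning) each shift that balance differently.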
Depth Expected by Level
| Level | Theory Depth | Coding Depth | What They're Really Testing |
|---|---|---|---|
| Junior / New Grad | Know the standard architectures (CNN, RNN, Transformer), activation functions, loss functions, and basic optimization | Implement a simple model, training loop, or a single layer in PyTorch | Do you have solid fundamentals? Can you learn quickly? |
| Mid-Level (3–5 yrs) | Understand trade-offs between architectures, explain why techniques work (not just what they do), discuss recent developments | Implement multi-head attention, custom loss functions, data loading pipelines | Have you actually built and debugged models? Can you make design decisions? |
| Senior / Staff (5+ yrs) | Deep understanding of scaling laws, training dynamics, numerical stability, distributed training, architecture search spaces | Design training infrastructure, implement complex architectures, optimize for production | Can you lead a modeling effort end-to-end? Can you mentor others? |
How to Draw Architectures on a Whiteboard
Whiteboard explanations are a critical part of DL interviews. Here is a structured approach that works for any architecture.
- Big picture first: Draw the high-level data flow (input → processing blocks → output) before any details. Label the input and output shapes.
- Zoom into one block: Pick the most important block and expand it. For a Transformer, expand the self-attention mechanism. For a ResNet, expand the residual block.
- Add the math: Write the key equations next to the diagram. For attention: Q, K, V projections and the softmax(QK^T/sqrt(d_k))V formula. Keep notation clean.
- Discuss trade-offs: End by mentioning why this design choice was made and what alternatives exist. This is what separates good from great answers.
Example: Drawing a Transformer Block
Step 1 - Big Picture:
Input (batch, seq_len, d_model)
|
[Multi-Head Attention] + Residual Connection
|
[Layer Norm]
|
[Feed-Forward Network] + Residual Connection
|
[Layer Norm]
|
Output (batch, seq_len, d_model)
Step 2 - Zoom into Multi-Head Attention:
Input X (batch, seq_len, d_model)
|
|--- W_Q ---> Q (batch, heads, seq_len, d_k)
|--- W_K ---> K (batch, heads, seq_len, d_k)
|--- W_V ---> V (batch, heads, seq_len, d_k)
|
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
|
Concat all heads --> W_O --> Output
Step 3 - Key Math:
d_k = d_model / num_heads
Attention scores: (seq_len x d_k) @ (d_k x seq_len) = (seq_len x seq_len)
Memory: O(seq_len^2) -- this is why context length is expensive
Step 4 - Trade-offs:
"Multi-head lets the model attend to different representation subspaces.
Alternative: single-head with larger d_k has same parameter count
but empirically performs worse. Recent work (GQA, MQA) shares K,V
heads to reduce KV-cache memory at inference time."
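The whiteboard steps above translate directly into code, and "implement multi-head attention" is a common live-coding prompt. A minimal sketch (batch-first, no masking or dropout) that reproduces the shapes from Step 2:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads          # d_k = d_model / num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        # Project, then split d_model into heads: (batch, heads, seq_len, d_k)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        # Scores: (seq_len x d_k) @ (d_k x seq_len) = (seq_len x seq_len) per head
        scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                           # (batch, heads, seq_len, d_k)

        # Concat heads back to (batch, seq_len, d_model), then W_O projection
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(out)

mha = MultiHeadAttention(d_model=64, num_heads=8)
y = mha(torch.randn(2, 10, 64))  # output shape: (2, 10, 64)
```

Note the `(seq_len x seq_len)` scores tensor: that is the O(seq_len^2) memory cost from Step 3, made visible in code.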
What Distinguishes Strong vs. Weak Candidates
Strong Candidates
- Explain why something works, not just what it does
- Connect theory to practical experience: "I used this technique when..."
- Acknowledge trade-offs and limitations without being asked
- Write clean, runnable code — import statements, correct tensor shapes
- Say "I don't know, but here's how I'd reason about it" when stuck
- Ask clarifying questions about the problem before jumping to a solution
Weak Candidates
- Memorize definitions without understanding the underlying intuition
- Cannot write code that actually runs — wrong shapes, missing dimensions
- Present everything as "the best" without discussing when it fails
- Cannot connect different topics (e.g., how batch norm relates to training stability)
- Panic when asked about something they have not seen before
- Give textbook answers without any personal experience or insight
How to Structure Your Study Plan
With limited time, prioritize based on your target company and level. Here is a recommended order:
| Week | Focus Area | What to Practice |
|---|---|---|
| Week 1 | Neural Network Fundamentals | Activation functions, backpropagation, loss functions, regularization. Implement a 2-layer MLP in PyTorch from scratch. |
| Week 2 | CNNs + RNNs | Conv layer math, pooling, ResNet skip connections, LSTM gates. Implement a CNN classifier and an LSTM text classifier. |
| Week 3 | Transformers | Self-attention math, multi-head attention, positional encoding, BERT vs GPT. Implement self-attention from scratch. |
| Week 4 | Training + Generative Models | Optimizers, learning rate schedules, mixed precision, GANs, diffusion. Focus on debugging scenarios. |
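Week 1's exercise (a small MLP plus training loop, written from scratch) might look like the following sketch, here on a toy XOR task chosen purely for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy task: XOR, the classic problem a linear model cannot solve.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = nn.Sequential(
    nn.Linear(2, 16),   # hidden layer
    nn.ReLU(),
    nn.Linear(16, 1),   # output logit
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
criterion = nn.BCEWithLogitsLoss()

# The canonical training loop: zero grads, forward, loss, backward, step.
for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
```

Being able to write this loop fluently, and explain why `zero_grad()` is needed (gradients accumulate by default), is exactly the fundamentals check a junior-level coding round targets.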
Common Interview Formats
Rapid-Fire Theory (15 min)
Quick questions testing breadth. "What's the difference between L1 and L2 regularization?" "Why do we use ReLU instead of sigmoid?" Aim for 30–60 second answers. Conciseness matters.
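For the L1 vs. L2 question specifically, a 30-second answer lands better with the gradients in hand. A minimal sketch of both penalty terms:

```python
import torch

weights = torch.tensor([0.5, -1.0, 0.0, 2.0])

l1_penalty = weights.abs().sum()    # L1: sum |w|   -> 3.5
l2_penalty = (weights ** 2).sum()   # L2: sum w^2  -> 5.25

# The gradients explain the behavioral difference:
#   d/dw |w|  = sign(w): constant pull toward zero -> sparse weights
#   d/dw w^2  = 2w: pull proportional to w -> small but nonzero weights
```

That gradient contrast is the concise "why": L1 zeroes weights out (feature selection), L2 shrinks them smoothly.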
Deep Dive (20–30 min)
One topic explored in depth. "Walk me through how a Transformer processes a sentence." Expect follow-up questions that go deeper. Use the whiteboard framework above.
Live Coding (30–45 min)
"Implement multi-head attention in PyTorch." You will write real code, typically in a shared editor. Practice writing PyTorch code without autocomplete or documentation.
Design Problem (45–60 min)
"Design a model to detect hate speech in images with text." Combines architecture choice, data strategy, training approach, and deployment considerations.
Key Takeaways
- Know your target: Big Tech tests breadth, AI startups test depth, applied teams test practical judgment
- Always explain why, not just what — this is the single most important interview skill
- Practice whiteboard drawing: big picture first, then zoom in, then add math, then discuss trade-offs
- Write runnable PyTorch code with correct tensor shapes — practice without autocomplete
- The remaining lessons in this course cover every major topic area with real Q&A pairs