Research Coding Round
The research coding round evaluates whether you can translate mathematical ideas into working code, implement papers from scratch, design rigorous experiments, and write reproducible research code. Unlike a software engineering interview, this round focuses not on algorithmic puzzles but on ML implementation fluency.
What Research Coding Interviews Test
Research coding interviews at top AI labs differ fundamentally from software engineering coding interviews. Here is what interviewers are actually evaluating:
Paper Implementation
Can you read a paper's method section and implement it correctly? This tests whether you understand the math deeply enough to translate equations into code, handle edge cases the paper glosses over, and debug when your implementation does not reproduce the paper's results.
PyTorch Fluency
Can you write custom nn.Module classes, implement training loops, handle gradient computation, use autograd correctly, and manipulate tensors efficiently? Interviewers expect you to write PyTorch code as fluently as a software engineer writes Python.
Experiment Design
Can you design a controlled experiment that isolates the effect of a specific change? This includes choosing baselines, ablation studies, proper evaluation metrics, statistical significance testing, and hyperparameter search strategies.
Debugging Intuition
When training fails, can you diagnose the problem? Interviewers may intentionally give you buggy code and ask you to find and fix issues. Common bugs: incorrect loss computation, gradient flow problems, shape mismatches, numerical instability.
Common Research Coding Questions
Q1: Implement Multi-Head Self-Attention from Scratch
What they expect: Write a PyTorch nn.Module that implements multi-head self-attention as described in "Attention Is All You Need." Include the linear projections for Q, K, V, the scaled dot-product attention computation, the masking logic (for causal/padding masks), and the output projection.
Key implementation details to get right:
- Split the d_model dimension across heads: d_k = d_model / num_heads
- Scale attention scores by 1/sqrt(d_k) to prevent softmax saturation
- Apply causal mask before softmax (set masked positions to -inf, not 0)
- Reshape operations: (batch, seq_len, d_model) → (batch, num_heads, seq_len, d_k)
- Use a single linear projection for all heads and then reshape, rather than separate projections per head (more efficient)
Common mistakes: Forgetting to scale scores by 1/sqrt(d_k), applying the mask after softmax instead of before, incorrect tensor reshaping that mixes up head and sequence dimensions, not handling variable-length sequences with padding masks.
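The implementation details above can be sketched as a minimal PyTorch module. This is one reasonable layout, not the only one: it uses a single fused QKV projection, scales scores by 1/sqrt(d_k), and applies the causal mask before softmax. The class and argument names are illustrative.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention sketch (illustrative, not optimized)."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must divide evenly across heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # One fused projection for Q, K, V across all heads (more efficient
        # than separate per-head projections).
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, causal: bool = True) -> torch.Tensor:
        B, T, C = x.shape                      # (batch, seq_len, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, num_heads, seq_len, d_k)
        q = q.view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product scores: (batch, num_heads, seq_len, seq_len)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if causal:
            # Mask future positions with -inf BEFORE softmax.
            mask = torch.triu(
                torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
            )
            scores = scores.masked_fill(mask, float("-inf"))
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.out(out)
```

Note how the shape comments track each reshape: the most common interview bug is a `view`/`transpose` that silently mixes the head and sequence dimensions.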
Q2: Implement the DDPM Forward and Reverse Process
What they expect: Implement the forward diffusion process (adding noise) and the reverse denoising step for a Denoising Diffusion Probabilistic Model.
Forward process: q(x_t | x_0) = N(x_t; sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I). This allows sampling x_t directly from x_0 without iterating through all steps. Implement the noise schedule (linear or cosine beta schedule), compute alpha_bar_t cumulatively, and write the sampling function.
Training objective: Sample a random timestep t, add noise to get x_t, predict the noise with the model, and compute MSE loss between predicted and actual noise: L = E[||epsilon - epsilon_theta(x_t, t)||²].
Reverse process: x_{t-1} = (1/sqrt(alpha_t)) * (x_t - (beta_t / sqrt(1-alpha_bar_t)) * epsilon_theta(x_t, t)) + sigma_t * z, where z ~ N(0, I) for t > 1 and z = 0 for t = 1.
Key details: Correctly precomputing alpha_bar, handling the variance at t=1 (no noise added), and implementing the timestep embedding for the noise prediction network.
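The closed-form forward process and the training objective above can be sketched in a few lines. This assumes a linear beta schedule and a caller-supplied noise-prediction model; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear beta schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # precomputed alpha_bar_t

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in closed form, without iterating over steps."""
    # Broadcast alpha_bar_t over all non-batch dimensions of x0.
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

def ddpm_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Training objective: L = E[||epsilon - epsilon_theta(x_t, t)||^2]."""
    t = torch.randint(0, T, (x0.shape[0],))      # random timestep per sample
    noise = torch.randn_like(x0)                 # epsilon
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(model(x_t, t), noise)      # predict the added noise
```

Precomputing `alpha_bar` once (rather than recomputing the cumulative product each step) is exactly the kind of detail interviewers watch for.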
Q3: Implement a Training Loop with Proper Logging and Evaluation
What they expect: Write a complete training loop that includes gradient accumulation, learning rate scheduling, periodic evaluation, checkpointing, and logging. This tests your understanding of practical training infrastructure.
Elements of a strong answer:
- Gradient accumulation for effective batch sizes larger than GPU memory allows
- Mixed precision training (torch.cuda.amp) with GradScaler
- Learning rate warmup followed by cosine decay
- Gradient clipping by global norm
- Periodic evaluation on a held-out validation set with model.eval() and torch.no_grad()
- Saving and loading checkpoints (model state_dict, optimizer state_dict, scheduler state, epoch, best metric)
- Logging to wandb or tensorboard
- Setting random seeds for reproducibility (random, numpy, torch, torch.cuda)
Experiment Design Principles
Interviewers often ask: "How would you set up an experiment to test hypothesis X?" Strong answers follow these principles:
| Principle | What It Means | Common Violation |
|---|---|---|
| Controlled comparison | Change exactly one variable at a time between experiments | Changing the model architecture AND the learning rate simultaneously |
| Fair baselines | Tune baselines with the same effort as your method | Using default hyperparameters for baselines but tuning your method extensively |
| Multiple seeds | Report mean and standard deviation across at least 3 random seeds | Reporting a single run that happened to get the best result |
| Ablation studies | Remove or modify each component to measure its individual contribution | Presenting a method with 5 components but no ablation showing which ones matter |
| Compute-matched comparison | Compare methods at the same total compute budget | Comparing a method trained for 100 epochs against a baseline trained for 10 epochs |
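For the multiple-seeds principle, reporting is trivial once the runs exist; the point is to run them. A minimal example with hypothetical accuracy numbers:

```python
from statistics import mean, stdev

# Hypothetical final accuracies from the same config run with 3 seeds.
runs = [0.842, 0.851, 0.838]
print(f"accuracy: {mean(runs):.3f} +/- {stdev(runs):.3f}")
```

Reporting mean plus standard deviation (or a confidence interval) makes clear whether a claimed improvement exceeds seed-to-seed noise.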
Reproducibility Checklist
Research labs value reproducibility highly. In an interview, mentioning these practices signals professionalism:
- Random seeds: Set seeds for Python random, NumPy, PyTorch CPU and CUDA, and set torch.backends.cudnn.deterministic = True
- Environment: Pin exact package versions (requirements.txt or conda environment.yml). Record PyTorch, CUDA, and cuDNN versions.
- Hyperparameters: Log all hyperparameters to config files. Use tools like Hydra or argparse with full configs saved alongside results.
- Data versioning: Hash your datasets. Use tools like DVC for data versioning. Document any preprocessing steps.
- Compute details: Report GPU type, number of GPUs, total training time, and effective batch size.
- Code versioning: Tag the exact git commit used for each experiment. Store configs, logs, and results together.
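The seed-setting item from the checklist is short enough to show in full. One common helper, covering Python, NumPy, and PyTorch (CPU and CUDA):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    """Seed all common RNG sources for reproducibility."""
    random.seed(seed)                            # Python stdlib RNG
    np.random.seed(seed)                         # NumPy RNG
    torch.manual_seed(seed)                      # PyTorch CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True    # deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False       # disable nondeterministic autotuning
```

Note that cuDNN determinism can slow training, and some PyTorch ops remain nondeterministic on GPU even with these flags, which is worth mentioning in an interview.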
Code Quality for Research
Research code does not need to be production-quality, but it does need to be correct, readable, and modifiable. Here are the standards interviewers expect:
Correct First, Fast Second
Write a naive, clearly correct implementation first. Optimize only after you have verified correctness. In an interview, a correct but slow implementation beats a fast but buggy one every time.
Shape Annotations
Comment tensor shapes at key points: # (batch, seq_len, d_model). This prevents shape-related bugs and shows the interviewer you are tracking shapes through every operation.
Modular Design
Separate the model, training loop, data loading, and evaluation into distinct functions or classes. This makes it easy to swap components for ablation studies.
Sanity Checks
Before training at scale, verify: (1) the model can overfit a single batch, (2) loss decreases on the training set, (3) gradients are not zero or infinite, (4) output shapes are correct throughout the pipeline.
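The first and third checks can be combined into one small helper: if the model, loss, and optimizer are wired correctly, training loss on a single fixed batch should drop toward zero, and every gradient should stay finite. A sketch (function name and thresholds are illustrative):

```python
import torch
import torch.nn as nn

def overfit_single_batch(model, x, y, steps: int = 200, lr: float = 1e-2) -> float:
    """Sanity check: a correctly wired model should drive training loss
    near zero on one fixed batch. Returns the final loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        # Check that gradients exist and are finite (no NaN/inf).
        for p in model.parameters():
            assert p.grad is not None and torch.isfinite(p.grad).all()
        opt.step()
    return loss.item()
```

If the loss plateaus well above zero here, there is no point launching a large run: the bug is in the model, the loss, or the data pipeline, not in scale.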
Key Takeaways
- Research coding interviews test paper implementation, PyTorch fluency, experiment design, and debugging intuition — not LeetCode
- Practice implementing core components from scratch: attention, diffusion, training loops, loss functions
- Emphasize correctness over speed, use shape annotations, and build modular code
- Know experiment design principles: controlled comparisons, fair baselines, multiple seeds, ablation studies
- Mention reproducibility practices (seeds, environment pinning, config logging) to signal research maturity
Lilly Tech Systems