Research Coding Round

The research coding round evaluates whether you can translate mathematical ideas into working code, implement papers from scratch, design rigorous experiments, and write reproducible research code. Unlike software engineering interviews, the focus is not on algorithmic puzzles but on ML implementation fluency.

What Research Coding Interviews Test

Research coding interviews at top AI labs differ fundamentally from software engineering coding interviews. Here is what interviewers are actually evaluating:

Paper Implementation

Can you read a paper's method section and implement it correctly? This tests whether you understand the math deeply enough to translate equations into code, handle edge cases the paper glosses over, and debug when your implementation does not reproduce the paper's results.

PyTorch Fluency

Can you write custom nn.Module classes, implement training loops, handle gradient computation, use autograd correctly, and manipulate tensors efficiently? Interviewers expect you to write PyTorch code as fluently as a software engineer writes Python.

Experiment Design

Can you design a controlled experiment that isolates the effect of a specific change? This includes choosing baselines, ablation studies, proper evaluation metrics, statistical significance testing, and hyperparameter search strategies.

Debugging Intuition

When training fails, can you diagnose the problem? Interviewers may intentionally give you buggy code and ask you to find and fix issues. Common bugs: incorrect loss computation, gradient flow problems, shape mismatches, numerical instability.

Common Research Coding Questions

Q1: Implement Multi-Head Self-Attention from Scratch

💡

What they expect: Write a PyTorch nn.Module that implements multi-head self-attention as described in "Attention Is All You Need." Include the linear projections for Q, K, V, the scaled dot-product attention computation, the masking logic (for causal/padding masks), and the output projection.

Key implementation details to get right:

  • Split the d_model dimension across heads: d_k = d_model / num_heads
  • Scale attention scores by 1/sqrt(d_k) to prevent softmax saturation
  • Apply causal mask before softmax (set masked positions to -inf, not 0)
  • Reshape operations: (batch, seq_len, d_model) → (batch, num_heads, seq_len, d_k)
  • Use a single linear projection for all heads and then reshape, rather than separate projections per head (more efficient)

Common mistakes: Forgetting to scale by sqrt(d_k), applying mask after softmax instead of before, incorrect tensor reshaping that mixes up head and sequence dimensions, not handling variable-length sequences with padding masks.
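A minimal sketch of such a module, covering the details above (class and variable names are our own choices, not from any particular codebase; dropout and padding-mask handling are omitted for brevity):

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must divide evenly across heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Single projection for all heads' Q, K, V, reshaped later (more efficient
        # than separate per-head projections).
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)  # each (B, T, D)

        def split_heads(t):
            # (B, T, D) -> (B, num_heads, T, d_k)
            return t.view(B, T, self.num_heads, self.d_k).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        # Scale by 1/sqrt(d_k) to prevent softmax saturation.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)  # (B, H, T, T)
        if mask is not None:
            # Mask BEFORE softmax: masked positions -> -inf, not 0.
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = scores.softmax(dim=-1)
        out = attn @ v                                   # (B, H, T, d_k)
        out = out.transpose(1, 2).contiguous().view(B, T, D)
        return self.out(out)
```

With a lower-triangular `mask`, the output at position 0 depends only on position 0 — a quick way to verify the causal masking is applied on the correct dimension.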

Q2: Implement the DDPM Forward and Reverse Process

💡

What they expect: Implement the forward diffusion process (adding noise) and the reverse denoising step for a Denoising Diffusion Probabilistic Model.

Forward process: q(x_t | x_0) = N(x_t; sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I). This allows sampling x_t directly from x_0 without iterating through all steps. Implement the noise schedule (linear or cosine beta schedule), compute alpha_bar_t cumulatively, and write the sampling function.

Training objective: Sample a random timestep t, add noise to get x_t, predict the noise with the model, and compute MSE loss between predicted and actual noise: L = E[||epsilon - epsilon_theta(x_t, t)||²].

Reverse process: x_{t-1} = (1/sqrt(alpha_t)) * (x_t - (beta_t / sqrt(1-alpha_bar_t)) * epsilon_theta(x_t, t)) + sigma_t * z, where z ~ N(0, I) for t > 1 and z = 0 for t = 1.

Key details: Correctly precomputing alpha_bar, handling the variance at t=1 (no noise added), and implementing the timestep embedding for the noise prediction network.
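The pieces above can be sketched as follows, for flat `(batch, features)` data (image tensors need extra broadcast dimensions). We use a linear beta schedule and the common choice sigma_t = sqrt(beta_t); timesteps are 0-indexed here, so the "no noise at t = 1" rule becomes t == 0:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear beta schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # precomputed alpha_bar_t

def q_sample(x0, t, noise):
    # Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    ab = alpha_bar[t].view(-1, 1)            # broadcast over feature dim
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

def ddpm_loss(model, x0):
    # Training objective: sample t, noise to x_t, regress the noise with MSE.
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    xt = q_sample(x0, t, noise)
    return F.mse_loss(model(xt, t), noise)

@torch.no_grad()
def p_sample_step(model, xt, t):
    # Reverse process step from the formula above; z = 0 at the final step (t == 0).
    beta_t, alpha_t, ab_t = betas[t], alphas[t], alpha_bar[t]
    eps = model(xt, torch.full((xt.shape[0],), t, dtype=torch.long))
    mean = (xt - beta_t / (1 - ab_t).sqrt() * eps) / alpha_t.sqrt()
    if t > 0:
        return mean + beta_t.sqrt() * torch.randn_like(xt)  # sigma_t = sqrt(beta_t)
    return mean
```

Sampling then iterates `p_sample_step` from t = T-1 down to 0, starting from pure Gaussian noise.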

Q3: Implement a Training Loop with Proper Logging and Evaluation

💡

What they expect: Write a complete training loop that includes gradient accumulation, learning rate scheduling, periodic evaluation, checkpointing, and logging. This tests your understanding of practical training infrastructure.

Elements of a strong answer:

  • Gradient accumulation for effective batch sizes larger than GPU memory allows
  • Mixed precision training (torch.cuda.amp) with GradScaler
  • Learning rate warmup followed by cosine decay
  • Gradient clipping by global norm
  • Periodic evaluation on a held-out validation set with model.eval() and torch.no_grad()
  • Saving and loading checkpoints (model state_dict, optimizer state_dict, scheduler state, epoch, best metric)
  • Logging to wandb or tensorboard
  • Setting random seeds for reproducibility (random, numpy, torch, torch.cuda)
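A CPU-friendly skeleton covering several of these elements (gradient accumulation, linear warmup with cosine decay, global-norm clipping, eval under no_grad). The function name, defaults, and the assumption of a regression-style `(x, y)` loader with MSE loss are illustrative; AMP and checkpointing are left as comments:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_loop(model, train_loader, val_loader, steps, accum=4,
               lr=3e-4, warmup=100, clip=1.0):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    # Linear warmup followed by cosine decay, via a single LR multiplier.
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt,
        lambda s: s / warmup if s < warmup
        else 0.5 * (1 + math.cos(math.pi * (s - warmup) / max(1, steps - warmup))))
    data_iter = iter(train_loader)
    for step in range(steps):
        model.train()
        opt.zero_grad()
        for _ in range(accum):                 # gradient accumulation
            try:
                x, y = next(data_iter)
            except StopIteration:
                data_iter = iter(train_loader)
                x, y = next(data_iter)
            loss = F.mse_loss(model(x), y) / accum  # normalize by accum steps
            loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # global norm
        opt.step()
        sched.step()
    # Periodic evaluation (shown once here) under eval mode and no_grad.
    model.eval()
    with torch.no_grad():
        val_loss = sum(F.mse_loss(model(x), y).item()
                       for x, y in val_loader) / len(val_loader)
    # Checkpointing would save model/opt/sched state_dicts plus step and best metric:
    # torch.save({"model": model.state_dict(), "opt": opt.state_dict(), ...}, path)
    return val_loss
```

In a real run, the evaluation and checkpoint blocks would sit inside the loop, triggered every N steps, with metrics logged to wandb or tensorboard.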

Experiment Design Principles

Interviewers often ask: "How would you set up an experiment to test hypothesis X?" Strong answers follow these principles:

| Principle | What It Means | Common Violation |
| --- | --- | --- |
| Controlled comparison | Change exactly one variable at a time between experiments | Changing the model architecture AND the learning rate simultaneously |
| Fair baselines | Tune baselines with the same effort as your method | Using default hyperparameters for baselines but tuning your method extensively |
| Multiple seeds | Report mean and standard deviation across at least 3 random seeds | Reporting a single run that happened to get the best result |
| Ablation studies | Remove or modify each component to measure its individual contribution | Presenting a method with 5 components but no ablation showing which ones matter |
| Compute-matched comparison | Compare methods at the same total compute budget | Comparing a method trained for 100 epochs against a baseline trained for 10 epochs |
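The multiple-seeds principle in practice: aggregate the per-seed results and report mean ± standard deviation rather than the best run (the accuracy values below are hypothetical):

```python
import statistics

# Hypothetical results from the same experiment run with three seeds.
accuracies = {0: 0.913, 1: 0.907, 2: 0.921}

mean = statistics.mean(accuracies.values())
std = statistics.stdev(accuracies.values())   # sample std dev across seeds
print(f"accuracy: {mean:.3f} \u00b1 {std:.3f} over {len(accuracies)} seeds")
```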

Reproducibility Checklist

Research labs value reproducibility highly. In an interview, mentioning these practices signals professionalism:

💡
  • Random seeds: Set seeds for Python random, NumPy, PyTorch CPU and CUDA, and set torch.backends.cudnn.deterministic = True
  • Environment: Pin exact package versions (requirements.txt or conda environment.yml). Record PyTorch, CUDA, and cuDNN versions.
  • Hyperparameters: Log all hyperparameters to config files. Use tools like Hydra or argparse with full configs saved alongside results.
  • Data versioning: Hash your datasets. Use tools like DVC for data versioning. Document any preprocessing steps.
  • Compute details: Report GPU type, number of GPUs, total training time, and effective batch size.
  • Code versioning: Tag the exact git commit used for each experiment. Store configs, logs, and results together.
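The seeding item from the checklist fits in one helper (the function name is ours; the CUDA calls are safe no-ops on CPU-only machines):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Seed every RNG the checklist mentions: Python, NumPy, PyTorch CPU + CUDA.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)            # seeds CPU and all CUDA devices
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for determinism in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Calling `set_seed` with the same value before two runs makes RNG draws identical across them (modulo non-deterministic ops, which PyTorch's reproducibility notes cover separately).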

Code Quality for Research

Research code does not need to be production-quality, but it does need to be correct, readable, and modifiable. Here are the standards interviewers expect:

Correct First, Fast Second

Write a naive, clearly correct implementation first. Optimize only after you have verified correctness. In an interview, a correct but slow implementation beats a fast but buggy one every time.

Shape Annotations

Comment tensor shapes at key points: # (batch, seq_len, d_model). This prevents shape-related bugs and shows the interviewer you are thinking about the computation graph.

Modular Design

Separate the model, training loop, data loading, and evaluation into distinct functions or classes. This makes it easy to swap components for ablation studies.

Sanity Checks

Before training at scale, verify: (1) the model can overfit a single batch, (2) loss decreases on the training set, (3) gradients are not zero or infinite, (4) output shapes are correct throughout the pipeline.
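Checks (1) and (3) can be bundled into a quick pre-flight test before any large run (the helper name and tiny model below are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def can_overfit_one_batch(model, x, y, steps=300, lr=1e-2):
    # Sanity check (1): a healthy model should drive loss near zero on one batch.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(model(x), y)
        loss.backward()
        # Sanity check (3): gradients should be finite (no NaN/inf).
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        assert all(torch.isfinite(g).all() for g in grads), "non-finite gradients"
        opt.step()
    return loss.item()

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
x, y = torch.randn(8, 4), torch.randn(8, 1)
final = can_overfit_one_batch(model, x, y)
```

If `final` is not close to zero, something upstream (loss, data pipeline, gradient flow) is broken, and scaling up will only make the bug more expensive to find.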

Key Takeaways

💡
  • Research coding interviews test paper implementation, PyTorch fluency, experiment design, and debugging intuition — not LeetCode
  • Practice implementing core components from scratch: attention, diffusion, training loops, loss functions
  • Emphasize correctness over speed, use shape annotations, and build modular code
  • Know experiment design principles: controlled comparisons, fair baselines, multiple seeds, ablation studies
  • Mention reproducibility practices (seeds, environment pinning, config logging) to signal research maturity