Implementing Papers
Turn paper descriptions and equations into working code. Implementing a paper yourself is one of the most reliable ways to understand AI research in depth.
From Equation to Code
The core skill: translating mathematical notation into PyTorch or NumPy operations.
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (batch, heads, seq_len, d_k)
    K: (batch, heads, seq_len, d_k)
    V: (batch, heads, seq_len, d_v)
    """
    d_k = Q.size(-1)

    # QK^T / sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    # Optional mask (for decoder self-attention): positions where mask == 0 are hidden
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    # softmax over the key dimension
    attention_weights = F.softmax(scores, dim=-1)

    # Multiply by V
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
```
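As a usage sketch, the function can be called with a causal mask to see the shapes it produces. The tensor sizes and the `causal_mask` name are illustrative assumptions, and the function body is repeated in compact form only so the snippet runs on its own.

```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, mask=None):
    # Same computation as in the section's function, repeated for a standalone run.
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights


torch.manual_seed(0)
batch, heads, seq_len, d_k, d_v = 2, 4, 5, 8, 16  # illustrative sizes
Q = torch.randn(batch, heads, seq_len, d_k)
K = torch.randn(batch, heads, seq_len, d_k)
V = torch.randn(batch, heads, seq_len, d_v)

# Causal mask: 1 where a query may attend (key position <= query position), 0 elsewhere.
# Its (seq_len, seq_len) shape broadcasts over the batch and head dimensions.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

output, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(output.shape)   # torch.Size([2, 4, 5, 16])
print(weights.shape)  # torch.Size([2, 4, 5, 5])
```

Each row of `weights` sums to 1, and every entry above the diagonal is exactly 0 because `exp(-inf)` is 0.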
Step-by-Step Implementation Process
Identify the Core Algorithm
Find the main contribution. It's usually in Section 3 (Methods). Look for pseudocode, algorithm boxes, or key equations.
Check for Official Code
Search Papers With Code, the paper's footnotes, or the authors' GitHub profiles. Treat official code as the ground truth: where it disagrees with the paper's text, the code usually reflects what was actually run.
Start with a Minimal Version
Don't implement everything at once. Build the simplest version first, test it, then add complexity (multi-head attention, layer norm, etc.).
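A minimal sketch of this incremental approach, using attention as the running example. The helper names `split_heads` and `merge_heads` are hypothetical, and the learned projection matrices of real multi-head attention are left out on purpose so that each step stays easy to test.

```python
import math

import torch
import torch.nn.functional as F


def attention(Q, K, V):
    # Minimal version: one scaled dot-product attention, no mask, no projections.
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    return F.softmax(scores, dim=-1) @ V


def split_heads(x, num_heads):
    # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_model // num_heads)
    batch, seq_len, d_model = x.shape
    return x.view(batch, seq_len, num_heads, d_model // num_heads).transpose(1, 2)


def merge_heads(x):
    # (batch, num_heads, seq_len, d_head) -> (batch, seq_len, num_heads * d_head)
    batch, num_heads, seq_len, d_head = x.shape
    return x.transpose(1, 2).reshape(batch, seq_len, num_heads * d_head)


torch.manual_seed(0)
x = torch.randn(2, 5, 16)  # (batch, seq_len, d_model), illustrative sizes

# Step 1: the simplest version, tested on its own.
single = attention(x, x, x)
assert single.shape == x.shape

# Step 2: add one piece of complexity (multiple heads) and test again.
heads = split_heads(x, num_heads=4)
multi = merge_heads(attention(heads, heads, heads))
assert multi.shape == x.shape

# Regression check: with a single head, the extended version must match step 1.
h1 = split_heads(x, num_heads=1)
one_head = merge_heads(attention(h1, h1, h1))
assert torch.allclose(one_head, single, atol=1e-5)
```

The last check is the useful habit: each added feature should reduce to the previous, already tested version in some special case.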
Verify with Known Results
First run on a small input or dataset where you know the expected output. Once that passes, compare your results with the paper's reported numbers on standard benchmarks.
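For attention, both kinds of check fit in a few lines. This is a sketch that assumes PyTorch 2.0 or later, where `F.scaled_dot_product_attention` is available as a reference. `my_attention` is a hypothetical name for the implementation under test.

```python
import math

import torch
import torch.nn.functional as F


def my_attention(Q, K, V):
    # Implementation under test (unmasked scaled dot-product attention).
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    return F.softmax(scores, dim=-1) @ V


torch.manual_seed(0)
Q = torch.randn(2, 4, 5, 8)
K = torch.randn(2, 4, 5, 8)
V = torch.randn(2, 4, 5, 8)

# Check 1: an input with a known answer. With all-zero queries every score is 0,
# the softmax is uniform, and each output row equals the mean of V over positions.
uniform_out = my_attention(torch.zeros_like(Q), K, V)
expected = V.mean(dim=-2, keepdim=True).expand_as(V)
assert torch.allclose(uniform_out, expected, atol=1e-6)

# Check 2: agreement with a trusted reference implementation (PyTorch >= 2.0).
reference = F.scaled_dot_product_attention(Q, K, V)
assert torch.allclose(my_attention(Q, K, V), reference, atol=1e-5)
```

Checks like these catch most implementation bugs in seconds, before any time is spent on a full benchmark run.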
Common Math-to-Code Patterns
| Math Notation | PyTorch/NumPy Code |
|---|---|
| Matrix multiply: AB | torch.matmul(A, B) or A @ B |
| Element-wise multiply: a ⊙ b | a * b |
| Transpose: A^T | A.T (2-D only) or A.transpose(-2, -1) (batched) |
| Summation: Σx_i | torch.sum(x, dim=0) |
| Softmax: exp(x_i) / Σexp(x_j) | F.softmax(x, dim=-1) |
| L2 norm: ‖x‖_2 | torch.linalg.vector_norm(x, ord=2) |
| Concatenation: [a; b] | torch.cat([a, b], dim=-1) |
| Gradient: ∇L | loss.backward() (autograd) |
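
The patterns in the table can be tried on small tensors whose results are easy to verify by hand. The values below are illustrative. The snippet uses `torch.linalg.vector_norm` for the L2 norm, which the PyTorch documentation recommends over the older `torch.norm`.

```python
import torch
import torch.nn.functional as F

A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B = torch.tensor([[0.0, 1.0], [1.0, 0.0]])
x = torch.tensor([3.0, 4.0])

matmul = A @ B                            # AB            -> [[2, 1], [4, 3]]
hadamard = A * B                          # element-wise  -> [[0, 2], [3, 0]]
transposed = A.transpose(-2, -1)          # A^T           -> [[1, 3], [2, 4]]
total = torch.sum(x, dim=0)               # sum_i x_i     -> 7
probs = F.softmax(x, dim=-1)              # exp(x_i) / sum_j exp(x_j), sums to 1
l2 = torch.linalg.vector_norm(x, ord=2)   # ||x||_2       -> 5
joined = torch.cat([x, x], dim=-1)        # [x; x]        -> shape (4,)

# Gradient: for L = sum_i w_i^2, autograd should produce dL/dw = 2w.
w = torch.tensor([1.0, -2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
print(w.grad)  # tensor([ 2., -4.])
```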
Debugging Implementations
- Shape mismatches: Print tensor shapes at every step. Most bugs are dimension errors.
- Gradient issues: Use torch.autograd.gradcheck to verify custom backward passes.
- Numerical stability: Log-sum-exp trick for softmax, epsilon values to avoid division by zero.
- Hyperparameters: Use the exact values from the paper's appendix. Small differences in learning rate or batch size can prevent reproduction.
- Random seeds: Set all random seeds for reproducibility, for example torch.manual_seed(42), along with the seeds of any other random number generators the code uses.
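Several of the checks in this list can be sketched together. The names `set_seed`, `stable_log_softmax`, and `Square` are hypothetical helpers written for illustration. The toy `Square` op stands in for whatever custom backward pass a paper requires.

```python
import random

import numpy as np
import torch


def set_seed(seed=42):
    # Seed the RNGs a typical PyTorch script draws from.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def stable_log_softmax(x, dim=-1):
    # Log-sum-exp trick: subtract the max first so exp() cannot overflow.
    shifted = x - x.max(dim=dim, keepdim=True).values
    return shifted - shifted.exp().sum(dim=dim, keepdim=True).log()


class Square(torch.autograd.Function):
    # Toy custom op with a hand-written backward pass.
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output


# Random seeds: re-seeding reproduces the same draws.
set_seed(42)
first = torch.randn(3)
set_seed(42)
second = torch.randn(3)
assert torch.equal(first, second)

# Shape mismatches: assert the expected shape and print the actual one on failure.
logits = torch.tensor([[1000.0, 1001.0, 1002.0]])
assert logits.shape == (1, 3), logits.shape

# Numerical stability: the naive formula overflows to inf / inf = nan here,
# while the shifted version stays finite.
naive = torch.log(torch.exp(logits) / torch.exp(logits).sum(dim=-1, keepdim=True))
stable = stable_log_softmax(logits)
print(torch.isnan(naive).any().item(), torch.isfinite(stable).all().item())  # True True

# Gradient issues: gradcheck compares the analytic backward pass with finite
# differences. It expects float64 inputs that require gradients.
x = torch.randn(4, dtype=torch.double, requires_grad=True)
grad_ok = torch.autograd.gradcheck(Square.apply, (x,), eps=1e-6, atol=1e-4)
print(grad_ok)  # True
```

In practice `torch.log_softmax` already applies this stabilization, so the hand-written version matters mainly when a paper defines a custom normalization.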
Resources for Implementation
Papers With Code
Find existing implementations, benchmarks, and datasets for nearly any ML paper.
Annotated Implementations
nn.labml.ai provides line-by-line annotated PyTorch implementations of major papers.
Yannic Kilcher
YouTube channel with paper explanations that help bridge the gap between reading a paper and understanding it.
The Illustrated Series
Jay Alammar's "Illustrated Transformer" and similar posts provide visual explanations of key architectures.