RNN & Sequence Models
12 real interview questions on recurrent architectures. While Transformers have largely replaced RNNs, these questions still appear frequently because they test fundamental understanding of sequence modeling, gating mechanisms, and the historical motivation for attention.
Q1: How does a vanilla RNN work? What is the fundamental equation?
At each time step t, a vanilla RNN computes: h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h), where h_t is the hidden state, x_t is the input, W_hh is the recurrent weight matrix, and W_xh is the input weight matrix.
The output at each step: y_t = W_hy * h_t + b_y.
The key property is that the same weights W_hh, W_xh are shared across all time steps — this is what makes RNNs able to process sequences of arbitrary length. The hidden state h_t acts as a "memory" that carries information from previous time steps.
Limitation: In practice, vanilla RNNs can only effectively remember ~10-20 time steps due to vanishing gradients.
import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    """Manual implementation of a single RNN cell"""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_xh = nn.Linear(input_size, hidden_size)
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=False)
        self.tanh = nn.Tanh()

    def forward(self, x_t, h_prev):
        # h_t = tanh(W_xh * x_t + W_hh * h_{t-1})
        h_t = self.tanh(self.W_xh(x_t) + self.W_hh(h_prev))
        return h_t

# Process a sequence manually
cell = VanillaRNNCell(input_size=10, hidden_size=20)
seq_len, batch_size = 5, 2
x = torch.randn(seq_len, batch_size, 10)
h = torch.zeros(batch_size, 20)
outputs = []
for t in range(seq_len):
    h = cell(x[t], h)
    outputs.append(h)
outputs = torch.stack(outputs)  # (seq_len, batch, hidden_size)
Q2: Explain the vanishing gradient problem in RNNs. Why is it worse for RNNs than for feedforward networks?
During backpropagation through time (BPTT), the gradient of the loss with respect to a hidden state h_t involves a product of Jacobians: dL/dh_t = dL/dh_T * product(dh_{k+1}/dh_k, k=t to T-1). Each Jacobian dh_{k+1}/dh_k = diag(tanh'(z)) * W_hh.
Why it is worse for RNNs: In feedforward networks, each layer has its own weight matrix, so the product involves different matrices. In RNNs, the same matrix W_hh is multiplied repeatedly. If the largest eigenvalue of W_hh is < 1, the product shrinks exponentially. If > 1, it explodes exponentially. The tanh derivative (max 1.0, usually < 1) makes vanishing the more common problem.
Result: The network cannot learn long-range dependencies. If a word at position 5 is important for predicting a word at position 100, the gradient signal is essentially zero by the time it reaches position 5.
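The repeated multiplication by the same Jacobian can be demonstrated numerically. A minimal sketch (the hidden size, weight scale, and step count are arbitrary illustration choices; the tanh' factor, which only shrinks the product further, is omitted):

```python
import torch

torch.manual_seed(0)
hidden = 8
# Recurrent weight scaled so its largest singular value is < 1
W_hh = torch.randn(hidden, hidden) * 0.1
grad = torch.ones(hidden)  # stand-in for dL/dh_T
for step in range(50):
    # One step of BPTT: multiply by the transposed recurrent Jacobian
    grad = W_hh.t() @ grad
    if step % 10 == 9:
        print(f"after {step + 1} steps: grad norm = {grad.norm():.2e}")
```

The printed norms shrink geometrically; with a scale that pushes the largest singular value above 1, the same loop would show explosion instead.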
Q3: How does LSTM solve the vanishing gradient problem? Walk through all the gates.
LSTM introduces a cell state c_t that acts as a highway for gradient flow. Information flows through the cell state with only linear operations (addition and element-wise multiplication), avoiding the repeated nonlinear squashing that causes vanishing gradients.
The four components:
- Forget gate (f_t): f_t = sigmoid(W_f * [h_{t-1}, x_t] + b_f). Decides what to forget from the previous cell state. Values near 0 forget, near 1 remember.
- Input gate (i_t): i_t = sigmoid(W_i * [h_{t-1}, x_t] + b_i). Decides which new information to store.
- Candidate values: c_tilde = tanh(W_c * [h_{t-1}, x_t] + b_c). New candidate information to potentially add.
- Output gate (o_t): o_t = sigmoid(W_o * [h_{t-1}, x_t] + b_o). Decides what to output from the cell state.
Cell state update: c_t = f_t * c_{t-1} + i_t * c_tilde. This is the key — the gradient through c_t only involves element-wise multiplication by f_t (which is close to 1 for important information), avoiding the vanishing problem.
Hidden state: h_t = o_t * tanh(c_t).
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    """Manual LSTM cell implementation"""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # All four gates in one matrix multiply for efficiency
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        combined = torch.cat([h_prev, x_t], dim=-1)
        gates = self.gates(combined)
        # Split into four gates
        i, f, g, o = gates.chunk(4, dim=-1)
        i = torch.sigmoid(i)  # Input gate
        f = torch.sigmoid(f)  # Forget gate
        g = torch.tanh(g)     # Candidate values
        o = torch.sigmoid(o)  # Output gate
        c_t = f * c_prev + i * g   # Cell state update
        h_t = o * torch.tanh(c_t)  # Hidden state
        return h_t, c_t

# PyTorch built-in (use this in practice)
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2,
               batch_first=True, dropout=0.2, bidirectional=True)
x = torch.randn(32, 50, 128)  # (batch, seq_len, features)
output, (h_n, c_n) = lstm(x)
print(f"Output: {output.shape}")  # [32, 50, 512] (256*2 for bidirectional)
Q4: How does GRU differ from LSTM? When would you prefer one over the other?
GRU (Gated Recurrent Unit) simplifies LSTM by merging the cell state and hidden state into a single state, and using two gates instead of three:
- Update gate (z_t): z_t = sigmoid(W_z * [h_{t-1}, x_t]). Combines LSTM's forget and input gates into one. With the update rule below, values near 0 keep the previous state and values near 1 replace it with the new candidate.
- Reset gate (r_t): r_t = sigmoid(W_r * [h_{t-1}, x_t]). Controls how much of the previous state to use when computing the candidate.
- State update: h_t = (1 - z_t) * h_{t-1} + z_t * tanh(W * [r_t * h_{t-1}, x_t]).
GRU advantages: Fewer parameters (3 matrices vs. 4), faster training, often comparable performance.
When to use LSTM: When you need fine-grained control over memory (separate forget and input decisions), very long sequences, or when the task requires remembering information for many steps.
When to use GRU: When you have less data, want faster training, or the sequences are moderate length. Often the default choice unless LSTM is clearly better for your specific task.
In practice: The performance difference is usually small. Both are largely replaced by Transformers for most tasks today.
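The two-gate update above translates directly into code. A minimal sketch following the convention used in the equations here (h_t = (1 - z) * h_prev + z * candidate; note PyTorch's built-in nn.GRU uses the mirrored convention), with layer names chosen for illustration:

```python
import torch
import torch.nn as nn

class GRUCell(nn.Module):
    """Manual GRU cell: two gates, no separate cell state."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_z = nn.Linear(input_size + hidden_size, hidden_size)  # Update gate
        self.W_r = nn.Linear(input_size + hidden_size, hidden_size)  # Reset gate
        self.W_h = nn.Linear(input_size + hidden_size, hidden_size)  # Candidate

    def forward(self, x_t, h_prev):
        combined = torch.cat([h_prev, x_t], dim=-1)
        z = torch.sigmoid(self.W_z(combined))  # How much to update
        r = torch.sigmoid(self.W_r(combined))  # How much past to use in candidate
        h_tilde = torch.tanh(self.W_h(torch.cat([r * h_prev, x_t], dim=-1)))
        return (1 - z) * h_prev + z * h_tilde

cell = GRUCell(input_size=10, hidden_size=20)
h = cell(torch.randn(2, 10), torch.zeros(2, 20))
print(h.shape)  # torch.Size([2, 20])
```

The single returned state plays the role of both c_t and h_t in an LSTM, which is where the parameter savings come from.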
Q5: What is a bidirectional RNN? When is it appropriate and when is it not?
A bidirectional RNN processes the sequence in both directions (left-to-right and right-to-left) and concatenates the hidden states: h_t = [h_forward_t; h_backward_t]. This gives each position context from both the past and the future.
Appropriate: Tasks where the entire sequence is available at once — text classification, named entity recognition, speech recognition, BERT-style masked language modeling.
Not appropriate: Autoregressive generation (language models, text generation) where you generate one token at a time and cannot look at future tokens. Online/streaming tasks where future inputs are not available.
Common mistake: Using bidirectional RNNs for language generation. GPT-style models are unidirectional for a reason — at generation time, future tokens do not exist.
Q6: Explain the sequence-to-sequence (encoder-decoder) architecture.
Encoder: Processes the entire input sequence and compresses it into a fixed-size context vector (the final hidden state). For "How are you?" in translation, the encoder reads all 3 words and produces a vector summarizing the meaning.
Decoder: Takes the context vector and generates the output sequence one token at a time, autoregressively. Each output token is fed as input to the next decoder step.
The bottleneck problem: Compressing an entire input sequence into a single fixed-size vector loses information, especially for long sequences. This is the key motivation for attention mechanisms — instead of one vector, let the decoder look at all encoder hidden states.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, input_vocab, output_vocab, embed_dim, hidden_dim):
        super().__init__()
        self.encoder_embed = nn.Embedding(input_vocab, embed_dim)
        self.decoder_embed = nn.Embedding(output_vocab, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.output_proj = nn.Linear(hidden_dim, output_vocab)

    def forward(self, src, tgt):
        # Encode: process entire source sequence
        src_emb = self.encoder_embed(src)
        _, (h, c) = self.encoder(src_emb)  # h is the context vector
        # Decode: generate target sequence using context
        tgt_emb = self.decoder_embed(tgt)
        decoder_out, _ = self.decoder(tgt_emb, (h, c))
        logits = self.output_proj(decoder_out)
        return logits

model = Seq2Seq(input_vocab=5000, output_vocab=5000,
                embed_dim=256, hidden_dim=512)
src = torch.randint(0, 5000, (2, 10))  # batch=2, src_len=10
tgt = torch.randint(0, 5000, (2, 15))  # batch=2, tgt_len=15
out = model(src, tgt)
print(f"Output: {out.shape}")  # [2, 15, 5000]
Q7: What is the attention mechanism in the context of RNNs? How does Bahdanau attention work?
Motivation: Instead of compressing the entire input into one vector, let the decoder "attend" to different parts of the input at each decoding step. When translating "The cat sat on the mat," the decoder should focus on "cat" when generating the subject in the target language.
Bahdanau attention (additive attention):
- Compute alignment scores: e_ij = v^T * tanh(W_1 * h_decoder_i + W_2 * h_encoder_j) for each encoder position j
- Normalize to get attention weights: alpha_ij = softmax(e_ij) over all j
- Compute context vector: c_i = sum(alpha_ij * h_encoder_j)
- Combine context with decoder state for prediction
Key insight: This is a soft, differentiable form of "looking up" relevant information — a precursor to the attention mechanism in Transformers. The main difference is that Transformer attention uses dot-product similarity instead of additive scoring, and applies attention within a single sequence (self-attention) rather than between encoder and decoder.
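The three steps above can be sketched as a module. A minimal illustrative implementation of additive attention (the dimension names and module structure are assumptions, not a canonical API):

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: e_j = v^T tanh(W1 * s + W2 * h_j)."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W1 = nn.Linear(dec_dim, attn_dim, bias=False)  # Projects decoder state
        self.W2 = nn.Linear(enc_dim, attn_dim, bias=False)  # Projects encoder states
        self.v = nn.Linear(attn_dim, 1, bias=False)         # Scores each position

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim); enc_states: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.W1(dec_state).unsqueeze(1) + self.W2(enc_states)
        ))                                        # (batch, src_len, 1)
        weights = torch.softmax(scores, dim=1)    # Normalize over source positions
        context = (weights * enc_states).sum(dim=1)  # (batch, enc_dim)
        return context, weights.squeeze(-1)

attn = BahdanauAttention(dec_dim=32, enc_dim=64, attn_dim=16)
context, weights = attn(torch.randn(2, 32), torch.randn(2, 7, 64))
print(context.shape, weights.shape)  # torch.Size([2, 64]) torch.Size([2, 7])
```

The attention weights sum to 1 over the source positions, and the context vector is recomputed at every decoder step.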
Q8: What is teacher forcing? What problems does it cause?
Teacher forcing: During training, feed the ground-truth previous token as input to the decoder at each step, rather than the decoder's own prediction. This provides a stronger learning signal and speeds up convergence.
Problem — exposure bias: During training, the decoder always sees correct previous tokens. During inference, it sees its own predictions, which may contain errors. The decoder has never learned to recover from mistakes, so errors compound (one wrong token leads to more wrong tokens).
Solutions:
- Scheduled sampling: Gradually decrease the probability of using teacher forcing during training (e.g., start at 100%, decrease to 25%)
- Sequence-level training: Use REINFORCE or other RL methods to optimize sequence-level metrics (BLEU) directly
- Beam search at inference: Consider multiple candidate sequences instead of greedy decoding
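Scheduled sampling amounts to a per-step coin flip inside the decoding loop. A simplified training-loop sketch (the cell, projection, and ratio are illustrative assumptions, not a fixed recipe):

```python
import random
import torch
import torch.nn as nn

def decode_train(cell, embed, proj, tgt, h0, c0, teacher_forcing_ratio=0.5):
    """Decode a target sequence, mixing teacher forcing and model predictions."""
    # tgt: (batch, tgt_len) ground-truth token ids; tgt[:, 0] is the start token
    logits_all = []
    h, c = h0, c0
    inp = tgt[:, 0]
    for t in range(1, tgt.size(1)):
        h, c = cell(embed(inp), (h, c))
        logits = proj(h)
        logits_all.append(logits)
        if random.random() < teacher_forcing_ratio:
            inp = tgt[:, t]              # Teacher forcing: feed ground truth
        else:
            inp = logits.argmax(dim=-1)  # Feed the model's own prediction
    return torch.stack(logits_all, dim=1)  # (batch, tgt_len - 1, vocab)

vocab, embed_dim, hidden = 100, 16, 32
cell = nn.LSTMCell(embed_dim, hidden)
embed = nn.Embedding(vocab, embed_dim)
proj = nn.Linear(hidden, vocab)
tgt = torch.randint(0, vocab, (4, 12))
h0 = c0 = torch.zeros(4, hidden)
out = decode_train(cell, embed, proj, tgt, h0, c0)
print(out.shape)  # torch.Size([4, 11, 100])
```

Annealing teacher_forcing_ratio toward a lower value over training epochs gives the scheduled-sampling curriculum described above.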
Q9: How do you handle variable-length sequences in PyTorch?
Batching variable-length sequences requires padding shorter sequences to the maximum length and masking padded positions so they do not affect computation.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Variable length sequences
seqs = [torch.randn(5, 10), torch.randn(3, 10), torch.randn(7, 10)]
lengths = [5, 3, 7]

# Pad to same length
padded = pad_sequence(seqs, batch_first=True, padding_value=0.0)
print(f"Padded shape: {padded.shape}")  # [3, 7, 10]

# Sort by length (required with the default enforce_sorted=True;
# pass enforce_sorted=False to let PyTorch handle unsorted batches)
lengths_tensor = torch.tensor(lengths)
sorted_lengths, sort_idx = lengths_tensor.sort(descending=True)
padded_sorted = padded[sort_idx]

# Pack - LSTM will skip padded positions
packed = pack_padded_sequence(padded_sorted, sorted_lengths, batch_first=True)
lstm = nn.LSTM(10, 20, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack back to padded tensor
output, output_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(f"Output shape: {output.shape}")  # [3, 7, 20]

# h_n contains the LAST (non-padded) hidden state for each sequence
# This is what you use for classification
Q10: Why have Transformers largely replaced RNNs? What advantages do RNNs still have?
Why Transformers won:
- Parallelization: RNNs process tokens sequentially (h_t depends on h_{t-1}). Transformers process all tokens in parallel. This makes Transformers orders of magnitude faster to train on modern GPUs.
- Long-range dependencies: In a Transformer, every token can directly attend to every other token (O(1) path length). In an RNN, information must pass through O(n) recurrence steps, suffering from vanishing gradients.
- Scalability: Transformers scale better with data and compute (scaling laws). Larger Transformers consistently improve, while larger RNNs hit diminishing returns.
Remaining RNN advantages:
- Linear inference cost: RNNs process each new token in O(1) time with constant memory. Transformers have O(n) per-token cost due to attending to all previous tokens (though KV-cache helps).
- Very long sequences: For sequences of 100K+ tokens, Transformer O(n^2) attention is prohibitively expensive. Recent architectures like Mamba (state-space models) combine RNN-like linear scaling with Transformer-like performance.
- Streaming/online: RNNs naturally process one token at a time with fixed memory. Good for real-time audio, sensor data.
Q11: What are state-space models (Mamba, S4) and how do they relate to RNNs?
State-space models (SSMs) are a class of sequence models based on continuous-time dynamical systems: dx/dt = Ax + Bu, y = Cx + Du. When discretized, they become a linear recurrence similar to an RNN but with structured state matrices.
Key innovations:
- S4 (2022): Uses a special diagonal-plus-low-rank structure for the state matrix A that can model very long-range dependencies. Can be computed as either a recurrence (for inference) or a convolution (for training), getting the best of both worlds.
- Mamba (2023): Adds input-dependent (selective) state transitions — the A, B, C matrices depend on the input, making it more expressive. Linear time training and inference.
Interview context: SSMs represent the cutting edge of efficient sequence modeling. They achieve Transformer-level performance on many tasks with linear rather than quadratic complexity. Know that they exist and can explain the trade-off: Transformers (quadratic, highly parallelizable, excellent performance) vs. SSMs (linear, efficient inference, competitive performance).
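The discretized recurrence can be sketched in a few lines. This uses plain dense matrices for clarity; real S4/Mamba use structured state matrices and a careful discretization scheme, so this is only an illustration of the shared linear-recurrence skeleton:

```python
import torch

torch.manual_seed(0)
state_dim, seq_len = 4, 8
A = 0.9 * torch.eye(state_dim)  # Stable state transition (structured in S4/Mamba)
B = torch.randn(state_dim)      # Input projection
C = torch.randn(state_dim)      # Output projection
u = torch.randn(seq_len)        # Scalar input sequence

x = torch.zeros(state_dim)
ys = []
for k in range(seq_len):
    x = A @ x + B * u[k]            # Linear recurrence: like an RNN with no tanh
    ys.append(torch.dot(C, x).item())
print(ys)
```

Because the recurrence is linear, it can be unrolled into a convolution for parallel training (the S4 trick), and in Mamba the A, B, C matrices become functions of the input at each step.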
Q12: Implement a text classification model using LSTM in PyTorch.
A common coding question. The key is handling variable-length sequences correctly and extracting the right hidden state for classification.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes,
                 num_layers=2, dropout=0.3, bidirectional=True):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim, num_layers=num_layers,
            batch_first=True, dropout=dropout if num_layers > 1 else 0,
            bidirectional=bidirectional
        )
        self.dropout = nn.Dropout(dropout)
        direction_factor = 2 if bidirectional else 1
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * direction_factor, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, input_ids, lengths):
        # input_ids: (batch, max_seq_len), lengths: (batch,)
        embedded = self.dropout(self.embedding(input_ids))
        # Pack for efficient LSTM processing
        packed = nn.utils.rnn.pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        # h_n shape: (num_layers * num_directions, batch, hidden_dim)
        # Take the last layer's forward and backward hidden states
        if self.lstm.bidirectional:
            hidden = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        else:
            hidden = h_n[-1]
        return self.classifier(self.dropout(hidden))

# Usage
model = LSTMClassifier(vocab_size=10000, embed_dim=128,
                       hidden_dim=256, num_classes=5)
input_ids = torch.randint(1, 10000, (4, 50))  # batch=4, max_len=50
lengths = torch.tensor([50, 35, 42, 28])
logits = model(input_ids, lengths)
print(f"Output: {logits.shape}")  # [4, 5]
Key Takeaways
- Vanilla RNNs suffer from vanishing gradients; LSTM solves this with a cell state highway for gradient flow
- LSTM has 4 components (forget, input, candidate, output gates); GRU simplifies to 2 (update, reset) with comparable performance
- Bidirectional RNNs are for encoding (full sequence available), not generation
- Teacher forcing causes exposure bias — the model never learns to recover from its own mistakes
- Transformers replaced RNNs because of parallelization and better long-range dependencies, but RNNs/SSMs have linear inference cost
- Know how to handle variable-length sequences with pack_padded_sequence in PyTorch
Lilly Tech Systems