Neural Network Fundamentals
15 real interview questions covering the building blocks of every deep learning model. Each question includes a model answer at the depth interviewers expect, plus PyTorch code where relevant.
Q1: Why do we need non-linear activation functions? What happens if every layer uses a linear activation?
If every layer uses a linear activation f(x) = ax + b, then the entire network collapses into a single linear transformation regardless of depth. Mathematically: layer2(layer1(x)) = W2(W1*x + b1) + b2 = (W2*W1)*x + (W2*b1 + b2) = W'x + b'. No matter how many layers you stack, the result is equivalent to one linear layer.
Non-linear activations allow the network to learn non-linear decision boundaries, which is necessary for virtually all real-world problems (image classification, language modeling, etc.). The Universal Approximation Theorem states that a feedforward network with at least one hidden layer and a non-polynomial activation can approximate any continuous function on a compact subset of R^n to arbitrary accuracy, given enough hidden units.
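The collapse is easy to verify numerically. This sketch builds two stacked linear layers with no activation in between, then constructs the single equivalent layer W' = W2·W1, b' = W2·b1 + b2 and checks that the outputs match (layer sizes here are arbitrary, chosen for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between
stacked = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 3))

# Collapse into one equivalent layer: W' = W2 @ W1, b' = W2 @ b1 + b2
W1, b1 = stacked[0].weight, stacked[0].bias
W2, b2 = stacked[1].weight, stacked[1].bias
collapsed = nn.Linear(4, 3)
with torch.no_grad():
    collapsed.weight.copy_(W2 @ W1)
    collapsed.bias.copy_(W2 @ b1 + b2)

x = torch.randn(10, 4)
# Outputs agree to numerical precision: the extra depth added nothing
print(torch.allclose(stacked(x), collapsed(x), atol=1e-5))  # True
```

Insert an `nn.ReLU()` between the two layers and the collapse no longer works, which is exactly the point.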
Q2: Compare ReLU, Leaky ReLU, GELU, and Swish. When would you use each?
ReLU: f(x) = max(0, x) — Simple, fast, and works well in most CNNs. Problem: "dying ReLU" where neurons output 0 for all inputs and stop learning because the gradient is 0 for negative inputs.
Leaky ReLU: f(x) = x if x > 0, else alpha*x (alpha ~ 0.01) — Fixes dying ReLU by allowing a small gradient for negative values. Use when you observe many dead neurons during training.
GELU: f(x) = x * Phi(x) where Phi is the CDF of the standard normal — Smooth approximation of ReLU that allows small negative values through probabilistically. Default in Transformers (BERT, GPT). Use for NLP and Transformer architectures.
Swish/SiLU: f(x) = x * sigmoid(x) — Smooth, non-monotonic. Empirically outperforms ReLU in deep networks (EfficientNet). Use for very deep architectures where smoothness helps gradient flow.
import torch
import torch.nn as nn
# All activations in PyTorch
relu = nn.ReLU()
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
gelu = nn.GELU()
silu = nn.SiLU() # Swish
x = torch.randn(5)
print(f"Input: {x}")
print(f"ReLU: {relu(x)}")
print(f"Leaky ReLU: {leaky_relu(x)}")
print(f"GELU: {gelu(x)}")
print(f"Swish/SiLU: {silu(x)}")
Q3: Explain the vanishing and exploding gradient problems. How does each affect training?
Vanishing gradients: During backpropagation, gradients are multiplied through each layer via the chain rule. If these multiplied values are consistently < 1 (e.g., sigmoid derivatives which max out at 0.25), gradients shrink exponentially with depth. Early layers receive near-zero gradients and stop learning. Symptoms: loss plateaus early, early layer weights barely change.
Exploding gradients: If multiplied gradient values are consistently > 1, gradients grow exponentially. Symptoms: loss becomes NaN, weights become very large, training diverges.
Solutions for vanishing: ReLU activations, residual connections (ResNet), LSTM/GRU gates, proper weight initialization (He, Xavier), batch normalization.
Solutions for exploding: Gradient clipping (torch.nn.utils.clip_grad_norm_), proper initialization, learning rate reduction, batch normalization.
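The vanishing effect is easy to observe directly. A minimal sketch (the helper `first_layer_grad_norm` and the depth/width values are illustrative choices, not from any library) builds a deep MLP and measures how much gradient reaches the first layer:

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(activation, depth=20, dim=32):
    # Build a deep MLP and measure the gradient reaching the first layer
    torch.manual_seed(0)
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(dim, dim), activation()]
    net = nn.Sequential(*layers)
    out = net(torch.randn(8, dim)).sum()
    out.backward()
    return net[0].weight.grad.norm().item()

# Sigmoid derivatives (at most 0.25) shrink the gradient at every layer;
# ReLU passes gradients through its active units unchanged
print(f"sigmoid: {first_layer_grad_norm(nn.Sigmoid):.2e}")
print(f"relu:    {first_layer_grad_norm(nn.ReLU):.2e}")
```

The sigmoid network's first-layer gradient is many orders of magnitude smaller than the ReLU network's, which is the vanishing problem in miniature.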
# Gradient clipping in PyTorch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for batch in dataloader:
    loss = model(batch)
    loss.backward()
    # Clip gradients to max norm of 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
Q4: What is Xavier (Glorot) initialization and why does it work? When should you use He initialization instead?
Xavier initialization: Weights are drawn from a distribution with variance 2/(fan_in + fan_out), where fan_in and fan_out are the number of input and output units. This keeps the variance of activations and gradients roughly constant across layers, preventing both vanishing and exploding signals.
He initialization: Weights are drawn with variance 2/fan_in. This is designed specifically for ReLU activations, which zero out half the distribution. Xavier assumes a symmetric activation like tanh and would cause activations to shrink with ReLU.
Rule of thumb: Use Xavier for sigmoid/tanh activations. Use He (Kaiming) for ReLU/Leaky ReLU. Use the defaults in PyTorch nn.Linear (Kaiming uniform) unless you have a reason not to.
import torch.nn as nn
import torch.nn.init as init
layer = nn.Linear(256, 128)
# Xavier initialization (for tanh/sigmoid)
init.xavier_uniform_(layer.weight)
init.zeros_(layer.bias)
# He/Kaiming initialization (for ReLU)
init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
init.zeros_(layer.bias)
# Custom init for an entire model
def init_weights(m):
    if isinstance(m, nn.Linear):
        init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            init.zeros_(m.bias)

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10)
)
model.apply(init_weights)
Q5: How does dropout work? Why is it applied differently during training and inference?
During training: Each neuron's output is independently set to 0 with probability p (typically 0.1–0.5). The remaining activations are scaled by 1/(1-p) to maintain the expected value. This prevents co-adaptation — neurons cannot rely on specific other neurons being present, forcing redundant representations.
During inference: All neurons are active (no dropout). Because we scaled during training (inverted dropout), no adjustment is needed at inference time. This is handled automatically by model.eval() in PyTorch.
Why it works: Dropout can be interpreted as training an ensemble of 2^n subnetworks (where n is the number of dropout-eligible neurons) that share weights. At inference, we effectively average all these subnetworks. It also acts as a regularizer, reducing overfitting.
Common mistake in interviews: Forgetting to call model.eval() before inference, which leaves dropout active and produces noisy, non-reproducible predictions.
import torch
import torch.nn as nn
class MLPWithDropout(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, drop_rate=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=drop_rate),  # 30% of neurons dropped
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=drop_rate),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)
model = MLPWithDropout(784, 256, 10)
# Training: dropout is active
model.train()
out_train = model(torch.randn(1, 784))
# Inference: dropout is disabled
model.eval()
with torch.no_grad():
    out_eval = model(torch.randn(1, 784))
Q6: Explain batch normalization. What problem does it solve and how does it work mathematically?
Problem: Internal covariate shift — as weights in earlier layers change during training, the distribution of inputs to later layers shifts, making training unstable and slow. (This is the original paper's motivation; later analysis suggests the main benefit is actually smoothing the loss landscape, but the covariate-shift explanation is still the standard interview answer.)
How it works: For each mini-batch, BatchNorm normalizes the activations to zero mean and unit variance, then applies a learnable affine transformation:
1. Compute batch mean: mu_B = (1/m) * sum(x_i)
2. Compute batch variance: sigma_B^2 = (1/m) * sum((x_i - mu_B)^2)
3. Normalize: x_hat = (x - mu_B) / sqrt(sigma_B^2 + epsilon)
4. Scale and shift: y = gamma * x_hat + beta (gamma and beta are learnable)
Training vs. inference: During training, uses batch statistics. During inference, uses running averages of mean and variance accumulated during training. This is why batch size matters for BatchNorm — small batches give noisy statistics.
Benefits: Allows higher learning rates, acts as a regularizer, reduces sensitivity to initialization, smooths the loss landscape.
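The four steps can be verified by computing them by hand and comparing against `nn.BatchNorm1d` in training mode (batch and feature sizes here are arbitrary). Note that BatchNorm normalizes with the biased batch variance:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 10)  # (batch, features)

bn = nn.BatchNorm1d(10)
bn.train()  # use batch statistics, not running averages
out_builtin = bn(x)

# Steps 1-4 by hand
mu = x.mean(dim=0)                       # 1. batch mean
var = x.var(dim=0, unbiased=False)       # 2. biased batch variance
x_hat = (x - mu) / torch.sqrt(var + bn.eps)  # 3. normalize
out_manual = bn.weight * x_hat + bn.bias     # 4. gamma * x_hat + beta

print(torch.allclose(out_builtin, out_manual, atol=1e-5))  # True
```

Switching to `bn.eval()` would instead normalize with the running mean and variance accumulated during training, which is the training/inference distinction described above.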
Q7: Compare BatchNorm, LayerNorm, GroupNorm, and InstanceNorm. When do you use each?
BatchNorm: Normalizes across the batch dimension. Best for CNNs with large batch sizes. Breaks down with small batches or in sequence models where sequence lengths vary.
LayerNorm: Normalizes across the feature dimension for each sample independently. Default for Transformers. Works with any batch size, even batch_size=1. No dependency on other samples in the batch.
GroupNorm: Divides channels into groups and normalizes within each group. Good compromise when batch size is small (e.g., object detection with large images). Typically use 32 groups.
InstanceNorm: Normalizes each channel of each sample independently. Used in style transfer where per-instance statistics carry style information.
# For CNNs with shape (batch, channels, height, width)
bn = nn.BatchNorm2d(64) # Normalizes across batch for each channel
gn = nn.GroupNorm(32, 64) # 32 groups of 2 channels each
in_ = nn.InstanceNorm2d(64) # Each channel, each sample independently
# For Transformers with shape (batch, seq_len, d_model)
ln = nn.LayerNorm(512) # Normalizes across d_model for each token
# Common mistake: using BatchNorm in Transformers
# BatchNorm would normalize across different tokens in different sequences
# which doesn't make semantic sense
Q8: What are skip connections (residual connections) and why do they enable training of very deep networks?
What: A skip connection adds the input of a block directly to its output: y = F(x) + x, where F is the block's transformation. The network only needs to learn the residual F(x) = y - x rather than the full mapping.
Why they work:
- Gradient flow: The addition creates a direct gradient path that bypasses the layers. During backpropagation, the gradient of the identity path is always 1, preventing vanishing gradients regardless of depth.
- Easy to learn identity: If a layer is not needed, the network can simply set F(x) to 0, making the block an identity mapping. Without skip connections, learning the identity is surprisingly hard.
- Ensemble effect: A ResNet with n blocks can be viewed as an ensemble of 2^n paths of different lengths, providing implicit ensemble regularization.
Impact: Before ResNet (2015), networks deeper than ~20 layers were hard to train. ResNet enabled 152+ layer networks with better accuracy.
import torch
import torch.nn as nn
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )
        self.activation = nn.ReLU()

    def forward(self, x):
        # Skip connection: add input to block output
        return self.activation(self.block(x) + x)

# Stack multiple residual blocks
class DeepResNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_blocks=10):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.blocks = nn.Sequential(
            *[ResidualBlock(hidden_dim) for _ in range(num_blocks)]
        )
        self.output_proj = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.input_proj(x)
        x = self.blocks(x)
        return self.output_proj(x)
Q9: Explain cross-entropy loss. Why is it preferred over MSE for classification?
Cross-entropy loss: L = -sum(y_i * log(p_i)), where y is the one-hot true label and p is the predicted probability. For binary classification: L = -[y*log(p) + (1-y)*log(1-p)].
Why not MSE for classification?
- Gradient magnitude: With MSE + sigmoid, the gradient is proportional to (p-y) * p * (1-p). When the prediction is confidently wrong (p near 0 or 1), the sigmoid derivative p*(1-p) is near 0, making gradients tiny — the model learns slowly from its biggest mistakes. Cross-entropy's gradient is simply (p-y), giving strong gradients for wrong predictions.
- Convexity: Cross-entropy is convex with respect to the logits (pre-softmax values) when combined with softmax. MSE is not, creating more local minima.
- Probabilistic interpretation: Cross-entropy is the negative log-likelihood under a categorical distribution, making it the principled choice from a maximum likelihood perspective.
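The clean (p - y) gradient can be checked against autograd. This sketch compares PyTorch's gradient of `F.cross_entropy` with respect to the logits against the analytic formula softmax(z) - one_hot(y) (five classes chosen arbitrarily):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 5, requires_grad=True)
target = torch.tensor([2])

loss = F.cross_entropy(logits, target)
loss.backward()

# Analytic gradient w.r.t. logits: softmax(z) - one_hot(y)
p = F.softmax(logits.detach(), dim=1)
y_onehot = F.one_hot(target, num_classes=5).float()
print(torch.allclose(logits.grad, p - y_onehot, atol=1e-6))  # True
```

Because this gradient never passes through a saturating sigmoid derivative, a confidently wrong prediction (p near 1 for the wrong class) produces a large update rather than a vanishing one.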
Q10: What is the difference between SGD, Adam, and AdamW? When would you choose each?
SGD with momentum: v_t = beta*v_{t-1} + grad; w = w - lr*v_t. Simple, well-understood. Often achieves the best final performance on CNNs with careful tuning. Used in most vision papers (ResNet, EfficientNet). Requires more tuning of learning rate and schedule.
Adam: Maintains per-parameter adaptive learning rates using first moment (mean) and second moment (uncentered variance) of gradients. Converges faster with less tuning. Problem: L2 regularization in Adam doesn't work the same as in SGD because the adaptive learning rates interact with the weight decay.
AdamW: Decouples weight decay from the gradient-based update. Applies weight decay directly to the weights rather than adding it to the gradient. This is the correct implementation of weight decay for adaptive optimizers. Default for Transformers (BERT, GPT, etc.).
When to use each: SGD+momentum for CNNs where you can tune carefully. AdamW for Transformers and when you want fast convergence with less tuning. Avoid plain Adam — use AdamW instead.
# SGD with momentum - good for CNNs
optimizer_sgd = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4
)
# AdamW - good for Transformers
optimizer_adamw = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999),
    eps=1e-8, weight_decay=0.01
)
# Common learning rate schedules
from torch.optim.lr_scheduler import CosineAnnealingLR, OneCycleLR
# Cosine annealing (popular with SGD)
scheduler = CosineAnnealingLR(optimizer_sgd, T_max=100, eta_min=1e-6)
# One-cycle (fast convergence)
scheduler = OneCycleLR(
    optimizer_adamw, max_lr=1e-3, total_steps=10000
)
Q11: What is backpropagation? Walk through it for a simple 2-layer network.
Backpropagation is an efficient algorithm for computing gradients of the loss with respect to all parameters using the chain rule, working backwards from the output layer.
Forward pass: z1 = W1*x + b1, a1 = ReLU(z1), z2 = W2*a1 + b2, y_hat = softmax(z2), L = cross_entropy(y, y_hat).
Backward pass:
- dL/dz2 = y_hat - y (softmax + cross-entropy gradient)
- dL/dW2 = (dL/dz2) * a1^T
- dL/db2 = dL/dz2
- dL/da1 = W2^T * (dL/dz2)
- dL/dz1 = (dL/da1) * ReLU'(z1) — element-wise multiply with 1 where z1 > 0, 0 otherwise
- dL/dW1 = (dL/dz1) * x^T
- dL/db1 = dL/dz1
The key insight is that we compute and cache intermediate values during the forward pass, then reuse them during the backward pass. PyTorch's autograd handles this automatically by building a computation graph.
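The derivation above can be checked numerically. This sketch runs the same forward pass on a tiny network (dimensions 4 → 3 → 2, chosen arbitrarily), computes each gradient by hand exactly as derived, and compares against autograd:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 1)   # single input as a column vector
y = torch.tensor([1])   # true class index
W1 = torch.randn(3, 4, requires_grad=True)
b1 = torch.zeros(3, 1, requires_grad=True)
W2 = torch.randn(2, 3, requires_grad=True)
b2 = torch.zeros(2, 1, requires_grad=True)

# Forward pass (z1 and a1 are cached for the backward pass)
z1 = W1 @ x + b1
a1 = torch.relu(z1)
z2 = W2 @ a1 + b2
loss = F.cross_entropy(z2.T, y)  # softmax + cross-entropy in one op
loss.backward()

# Manual backward pass, mirroring the derivation above
y_hat = torch.softmax(z2.detach(), dim=0)
dz2 = y_hat - F.one_hot(y, 2).float().T       # dL/dz2 = y_hat - y
dW2 = dz2 @ a1.detach().T                     # dL/dW2
db2 = dz2                                     # dL/db2
da1 = W2.detach().T @ dz2                     # dL/da1
dz1 = da1 * (z1.detach() > 0).float()         # dL/dz1 via ReLU'(z1)
dW1 = dz1 @ x.T                               # dL/dW1
db1 = dz1                                     # dL/db1

print(torch.allclose(W2.grad, dW2, atol=1e-6))  # True
print(torch.allclose(W1.grad, dW1, atol=1e-6))  # True
```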
Q12: What is the difference between L1 and L2 regularization? When do you prefer one over the other?
L1 regularization (Lasso): Adds sum(|w_i|) to the loss. Produces sparse weights (many exactly zero). Good for feature selection — effectively removes irrelevant features. The gradient is +1 or -1 (sign of weight), which pushes small weights to exactly zero.
L2 regularization (Ridge/Weight Decay): Adds sum(w_i^2) to the loss. Produces small but non-zero weights. Distributes weight magnitude more evenly. The gradient is proportional to the weight value, so large weights are penalized more heavily. More commonly used in deep learning.
When to use L1: When you suspect many features are irrelevant and want automatic feature selection. Sparse models are also more interpretable and compress better.
When to use L2: Default choice for deep learning. When you want to prevent large weights without zeroing them out. Better for dense networks where all features contribute.
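In PyTorch, L2 is usually applied through the optimizer's `weight_decay`, while L1 has no optimizer hook and must be added to the loss by hand. A minimal sketch (the `l1_penalty` helper and the lambda value are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)

# L2 via the optimizer: AdamW's decoupled weight decay
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# L1 must be added to the loss manually
def l1_penalty(model, lam=1e-4):
    return lam * sum(p.abs().sum() for p in model.parameters())

x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y) + l1_penalty(model)
loss.backward()
optimizer.step()
```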
Q13: What is the dying ReLU problem and how do you detect and fix it?
Problem: A ReLU neuron "dies" when its input is always negative — the output is always 0 and the gradient is always 0, so the weights never update. Once dead, a neuron stays dead permanently. This can happen when a large gradient update pushes the bias so negative that no input can produce a positive pre-activation.
Detection: Monitor the percentage of zero activations across layers. If a layer has >50% zero activations consistently, you likely have dying neurons. In PyTorch, hook into the forward pass to track this.
Fixes:
- Use Leaky ReLU or ELU (non-zero gradient for negative inputs)
- Lower the learning rate (prevents large weight updates)
- Use He initialization (properly scales for ReLU)
- Add batch normalization before ReLU (keeps pre-activations centered)
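The detection step can be implemented with forward hooks. This sketch (helper names and the small MLP are illustrative) records the fraction of zero activations at each ReLU for one batch; in practice you would track these fractions across training steps:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(100, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Hook each ReLU and record the fraction of zero activations
zero_fractions = {}
def make_hook(name):
    def hook(module, inputs, output):
        zero_fractions[name] = (output == 0).float().mean().item()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(make_hook(name))

model(torch.randn(256, 100))
for name, frac in zero_fractions.items():
    # Sustained fractions well above 50% suggest dying neurons
    print(f"layer {name}: {frac:.0%} zeros")
```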
Q14: Explain the bias-variance tradeoff in the context of neural networks. How do you diagnose each?
High bias (underfitting): The model is too simple to capture the underlying patterns. Training loss is high. Train and validation losses are both high and close together. Fix: increase model capacity (more layers/neurons), train longer, reduce regularization.
High variance (overfitting): The model memorizes training data but fails to generalize. Training loss is low but validation loss is much higher. Fix: add dropout, increase weight decay, use data augmentation, get more data, reduce model size, early stopping.
The sweet spot: Modern deep learning often uses very large models (low bias) combined with strong regularization (controlled variance). This "double descent" phenomenon shows that very large models can actually generalize better than medium-sized ones, contradicting the classical U-shaped bias-variance curve.
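The diagnosis logic above reduces to comparing train and validation losses. A toy helper (entirely illustrative; the thresholds are hypothetical and must be tuned per task) makes the decision rule explicit:

```python
def diagnose(train_loss, val_loss, tolerable_gap=0.1, target_loss=0.5):
    # Hypothetical thresholds -- tune both for your task and loss scale
    if train_loss > target_loss:
        return "high bias (underfitting): increase capacity or train longer"
    if val_loss - train_loss > tolerable_gap:
        return "high variance (overfitting): regularize or get more data"
    return "reasonable fit"

print(diagnose(1.2, 1.3))   # both losses high -> high bias
print(diagnose(0.1, 0.9))   # large gap -> high variance
print(diagnose(0.2, 0.25))  # low losses, small gap -> reasonable fit
```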
Q15: Implement a complete training loop in PyTorch with all best practices.
A production-quality training loop includes: model.train()/model.eval() toggling, gradient zeroing, gradient clipping, learning rate scheduling, validation loop with torch.no_grad(), and metric tracking.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import CosineAnnealingLR
def train_model(model, train_loader, val_loader, epochs=50, lr=1e-3):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    best_val_loss = float('inf')

    for epoch in range(epochs):
        # --- Training ---
        model.train()
        train_loss, correct, total = 0.0, 0, 0
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            train_loss += loss.item() * inputs.size(0)
            _, predicted = outputs.max(1)
            correct += predicted.eq(targets).sum().item()
            total += targets.size(0)
        scheduler.step()
        train_loss /= total
        train_acc = correct / total

        # --- Validation ---
        model.eval()
        val_loss, val_correct, val_total = 0.0, 0, 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, targets)
                val_loss += loss.item() * inputs.size(0)
                _, predicted = outputs.max(1)
                val_correct += predicted.eq(targets).sum().item()
                val_total += targets.size(0)
        val_loss /= val_total
        val_acc = val_correct / val_total

        print(f"Epoch {epoch+1}/{epochs} | "
              f"Train Loss: {train_loss:.4f} Acc: {train_acc:.4f} | "
              f"Val Loss: {val_loss:.4f} Acc: {val_acc:.4f}")

        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best_model.pt')
    return model
Key Takeaways
- Non-linearity is essential — without it, deep networks collapse to a single linear layer
- Use He initialization for ReLU, Xavier for tanh/sigmoid, and GELU/SiLU for Transformers
- BatchNorm for CNNs, LayerNorm for Transformers — know why each is appropriate
- Skip connections solve vanishing gradients and enable training networks with 100+ layers
- Use AdamW for Transformers, SGD+momentum for CNNs, and always write complete training loops with eval mode and no_grad
Lilly Tech Systems