Intermediate

CNN Interview Questions

12 real interview questions on Convolutional Neural Networks, from basic convolution math to modern architectures. These questions are commonly asked at Google, Meta, and computer vision companies.

Q1: How does a convolution operation work? Calculate the output size given input, kernel, stride, and padding.

A

A 2D convolution slides a kernel (filter) across the input, computing element-wise multiplication and summation at each position. The kernel shares weights across all spatial positions (weight sharing), which is what makes CNNs parameter-efficient compared to fully connected layers.

Output size formula: O = floor((W - K + 2P) / S) + 1, where W = input size, K = kernel size, P = padding, S = stride.

Example: Input 32x32, kernel 3x3, stride 1, padding 1: O = (32 - 3 + 2) / 1 + 1 = 32. Same padding preserves spatial dimensions.

Parameter count: For C_in input channels and C_out output channels with kernel K: params = C_out * (C_in * K * K + 1). A 3x3 conv with 64 input and 128 output channels: 128 * (64 * 9 + 1) = 73,856 parameters.

import torch
import torch.nn as nn

# Standard convolution
conv = nn.Conv2d(
    in_channels=3,      # RGB input
    out_channels=64,     # 64 output feature maps
    kernel_size=3,       # 3x3 kernel
    stride=1,
    padding=1            # 'same' padding
)

x = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
out = conv(x)
print(f"Input:  {x.shape}")        # [1, 3, 224, 224]
print(f"Output: {out.shape}")       # [1, 64, 224, 224]
print(f"Params: {sum(p.numel() for p in conv.parameters())}")  # 1,792

Q2: What is the difference between valid, same, and full padding?

A

Valid (no padding): P=0. Output shrinks by (K-1) pixels per dimension. 224x224 input with 3x3 kernel gives 222x222 output. Each output pixel only uses valid input pixels.

Same padding: P = (K-1)/2 (for odd kernels with stride 1). Output has the same spatial dimensions as input. 224x224 stays 224x224. Most common in modern architectures.

Full padding: P = K-1. Output is larger than input. Used in transposed convolutions for upsampling. Each input pixel contributes to K*K output positions.
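The three padding modes are easy to verify directly in PyTorch. A quick sketch (single-channel input; the channel counts here are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 224, 224)

# Valid: P=0, output shrinks by K-1 = 2 pixels per dimension
valid = nn.Conv2d(1, 1, kernel_size=3, padding=0)
print(valid(x).shape)   # [1, 1, 222, 222]

# Same: P=(K-1)/2 = 1, spatial size preserved (stride 1)
same = nn.Conv2d(1, 1, kernel_size=3, padding=1)
print(same(x).shape)    # [1, 1, 224, 224]

# Full: P=K-1 = 2, output grows by K-1 per dimension
full = nn.Conv2d(1, 1, kernel_size=3, padding=2)
print(full(x).shape)    # [1, 1, 226, 226]
```

Each shape follows from the formula O = floor((W - K + 2P) / S) + 1 with W=224, K=3, S=1.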

Q3: Explain max pooling vs. average pooling. What is global average pooling and why has it replaced fully connected layers?

A

Max pooling: Takes the maximum value in each window. Preserves the strongest feature activation. Provides slight translation invariance. Most commonly 2x2 with stride 2, halving spatial dimensions.

Average pooling: Takes the mean of each window. Smoother, less aggressive. Can lose important activations but retains more spatial context.

Global average pooling (GAP): Averages each feature map into a single value, reducing (batch, C, H, W) to (batch, C). Used at the end of modern CNNs (ResNet, EfficientNet) instead of flattening + FC layers. Benefits: no parameters, no overfitting risk, works with any input resolution, acts as a structural regularizer.

Classic approach (VGG): Flatten 7x7x512 = 25,088 dimensions, then FC(25088, 4096) = 102M parameters. Modern approach (ResNet-18/34, which end with 512-channel feature maps): GAP 7x7x512 to 512, then FC(512, 1000) = 512K parameters. That is a ~200x reduction in parameters.
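The parameter gap between the two designs can be checked in a few lines; this sketch mirrors the VGG and ResNet-18/34 numbers above:

```python
import torch
import torch.nn as nn

feat = torch.randn(2, 512, 7, 7)  # final feature map: (batch, C, H, W)

# Classic (VGG-style): flatten + huge FC layer
fc_classic = nn.Linear(512 * 7 * 7, 4096)
print(sum(p.numel() for p in fc_classic.parameters()))  # 102,768,640 (~102.8M)

# Modern: global average pooling + small FC layer
gap = nn.AdaptiveAvgPool2d(1)
pooled = gap(feat).flatten(1)     # (2, 512) -- one value per feature map
fc_modern = nn.Linear(512, 1000)
print(sum(p.numel() for p in fc_modern.parameters()))   # 513,000 (~512K)
print(pooled.shape)               # [2, 512] -- works for any input H, W
```

Because AdaptiveAvgPool2d(1) collapses any spatial size to 1x1, the same classifier head works at any input resolution.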

Q4: What is the receptive field and why does it matter?

A

Definition: The receptive field of a neuron is the region of the input image that affects its output. A single 3x3 conv has a 3x3 receptive field. Two stacked 3x3 convs have a 5x5 receptive field. Three stacked 3x3 convs have a 7x7 receptive field.

Why it matters: The receptive field determines what context the network can "see" at each layer. For object detection, the receptive field at the output layer must be large enough to encompass the objects you want to detect. If the receptive field is too small, the network literally cannot see the whole object.

How to increase it: Stack more layers (additive), use larger kernels (expensive), use dilated convolutions (efficient), use pooling/strided convolutions (multiplicative).

Key insight for interviews: Two stacked 3x3 convolutions have the same receptive field as one 5x5 convolution (both are 5x5) but with fewer parameters: 2*(3*3) = 18 vs. 5*5 = 25. Three stacked 3x3 convs equal one 7x7 conv: 3*9 = 27 vs. 49. This is why VGG and later architectures use small 3x3 kernels exclusively.
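The growth rules above (additive for stacked convs, multiplicative for strides) follow from a standard recurrence. A minimal helper to compute it (the function name is mine, not a library API):

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel_size, stride) layers.

    Standard recurrence: r += (k - 1) * j, then j *= s, where j is the
    cumulative stride ("jump") between adjacent output positions.
    """
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r

print(receptive_field([(3, 1), (3, 1)]))          # 5: two 3x3 convs = one 5x5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7: three 3x3 convs = one 7x7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: the stride-2 pool doubles
                                                  #    the growth of later layers
```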

Q5: What is a 1x1 convolution and why is it useful?

A

A 1x1 convolution applies a fully connected layer independently at each spatial position across all channels. It does not look at neighboring pixels — it only mixes information across channels.

Uses:

  • Channel reduction: Reduce channels from 256 to 64 before an expensive 3x3 conv (bottleneck design in ResNet). This dramatically reduces computation.
  • Channel expansion: Increase channels to add representational capacity.
  • Cross-channel interaction: Learn relationships between different feature maps without spatial mixing.
  • Adding non-linearity: With a ReLU after it, adds a non-linear transformation without changing spatial dimensions.

Computation savings: Going from 256 channels directly to 256 channels with a 3x3 conv: 256*256*9 = 589,824 multiply-adds per pixel. With a 1x1 bottleneck (256→64→256): 256*64 + 64*64*9 + 64*256 = 69,632. That is 8.5x fewer operations.
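The arithmetic above is worth being able to reproduce on the spot; a quick sanity check:

```python
# Per-pixel multiply-adds, direct 3x3 vs. 1x1 bottleneck (256 -> 64 -> 256)
direct = 256 * 256 * 9                          # one 3x3 conv, 256 -> 256
bottleneck = 256 * 64 + 64 * 64 * 9 + 64 * 256  # 1x1 down, 3x3, 1x1 up
print(direct)                                   # 589824
print(bottleneck)                               # 69632
print(round(direct / bottleneck, 1))            # 8.5
```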

Q6: Explain the ResNet architecture. What is the bottleneck block?

A

ResNet (2015) introduced residual connections that allow training networks with 100+ layers. The key insight: instead of learning H(x), learn the residual F(x) = H(x) - x, so the output is F(x) + x.

Basic block (ResNet-18/34): Two 3x3 convolutions with a skip connection. Used for shallower variants.

Bottleneck block (ResNet-50/101/152): Three convolutions: 1x1 (reduce channels) → 3x3 (process) → 1x1 (expand channels). The 1x1 convolutions form a bottleneck that reduces computation while maintaining representational power.

When dimensions change: When spatial dimensions are halved (stride 2) or channels change, the skip connection uses a 1x1 conv with matching stride to align dimensions.

import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """ResNet bottleneck block: 1x1 -> 3x3 -> 1x1"""
    expansion = 4

    def __init__(self, in_channels, mid_channels, stride=1, downsample=None):
        super().__init__()
        out_channels = mid_channels * self.expansion
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample  # 1x1 conv for skip when dims change

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity  # Residual connection
        return self.relu(out)

Q7: Compare VGG, ResNet, and EfficientNet. What are the key architectural innovations of each?

A
  • VGG-16 (2014): Showed that small 3x3 kernels stacked deep beat large kernels. Simple, uniform architecture. 138M params, 71.5% top-1 (ImageNet).
  • ResNet-50 (2015): Skip connections enable 150+ layer networks. Bottleneck blocks for efficiency. 25.6M params, 76.1% top-1.
  • EfficientNet-B0 (2019): Compound scaling (width, depth, resolution scaled together). MBConv blocks with squeeze-and-excitation. Neural architecture search. 5.3M params, 77.1% top-1.

Trend: Each generation achieves better accuracy with fewer parameters by using smarter architectural patterns rather than brute-force depth/width.

Q8: What is depthwise separable convolution and why is it used in MobileNet/EfficientNet?

A

A depthwise separable convolution splits a standard convolution into two steps:

1. Depthwise convolution: Apply a separate K*K filter to each input channel independently. If input has C channels, use C separate K*K filters. This handles spatial filtering.

2. Pointwise convolution (1x1): Apply C_out 1x1 filters across all channels. This handles cross-channel mixing.

Computation savings: Standard conv: C_in * C_out * K^2 * H * W operations. Depthwise separable: C_in * K^2 * H * W + C_in * C_out * H * W. Ratio: roughly 1/C_out + 1/K^2. For a 3x3 conv with 256 output channels, this is ~8-9x fewer operations.

# Standard convolution
standard_conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
# Params: 64 * 128 * 3 * 3 + 128 = 73,856

# Depthwise separable convolution
depthwise_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)  # groups=in_channels
pointwise_conv = nn.Conv2d(64, 128, kernel_size=1)
# Params: (64 * 3 * 3 + 64) + (64 * 128 + 128) = 640 + 8,320 = 8,960  (~8.2x fewer)

Q9: How does transfer learning work for CNNs? When do you freeze layers vs. fine-tune?

A

Transfer learning uses a model pre-trained on a large dataset (typically ImageNet) and adapts it to a new task. Lower layers learn general features (edges, textures) that transfer well. Higher layers learn task-specific features.

Strategy 1 - Feature extraction (freeze all): Replace the final classifier layer, freeze all other weights. Train only the new classifier. Use when: small target dataset, target domain is similar to ImageNet.

Strategy 2 - Fine-tune top layers: Freeze early layers, unfreeze later layers + new classifier. Use when: medium dataset, some domain shift.

Strategy 3 - Fine-tune all: Unfreeze everything, train with a small learning rate. Use when: large dataset or significant domain shift (e.g., medical images).

Key trick: Use discriminative learning rates — lower LR for early layers (1e-5), higher for later layers (1e-3). This prevents destroying the pre-trained features.

import torchvision.models as models

# Load pre-trained ResNet-50
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Strategy 1: Feature extraction - freeze everything
for param in model.parameters():
    param.requires_grad = False

# Replace classifier for new task (e.g., 5 classes)
model.fc = nn.Linear(model.fc.in_features, 5)
# Only model.fc parameters are trainable

# Strategy 2: Fine-tune with discriminative learning rates
optimizer = torch.optim.AdamW([
    {'params': model.layer1.parameters(), 'lr': 1e-5},
    {'params': model.layer2.parameters(), 'lr': 5e-5},
    {'params': model.layer3.parameters(), 'lr': 1e-4},
    {'params': model.layer4.parameters(), 'lr': 5e-4},
    {'params': model.fc.parameters(), 'lr': 1e-3},
], weight_decay=0.01)

Q10: What is a dilated (atrous) convolution? When would you use it?

A

A dilated convolution inserts gaps (zeros) between kernel elements, expanding the receptive field without increasing parameters or reducing resolution. A 3x3 kernel with dilation rate 2 has a 5x5 receptive field but only 9 parameters.

Use cases:

  • Semantic segmentation (DeepLab): Need large receptive field but cannot afford to lose spatial resolution through pooling
  • Audio processing (WaveNet): Exponentially increasing dilation rates (1, 2, 4, 8, ...) capture long-range temporal dependencies efficiently

Receptive field formula: R = K + (K-1) * (dilation - 1). For 3x3 with dilation 4: R = 3 + 2*3 = 9.
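Both claims (larger receptive field, unchanged parameter count) can be checked directly; a sketch with a single-channel input:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

# 3x3 kernel, dilation 2: 5x5 receptive field, still only 9 weights
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)
print(dilated(x).shape)  # [1, 1, 32, 32] -- padding=dilation preserves size
print(sum(p.numel() for p in dilated.parameters()))  # 10 (9 weights + 1 bias)

# Effective receptive field: R = K + (K-1) * (dilation - 1)
for d in (1, 2, 4, 8):
    print(d, 3 + 2 * (d - 1))  # 3, 5, 9, 17 -- WaveNet-style growth
```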

Q11: What is a transposed convolution (deconvolution)? How does it differ from upsampling + convolution?

A

Transposed convolution: Increases spatial dimensions by inserting zeros between input pixels and applying a convolution. The output size formula is: O = (I - 1) * S - 2P + K. Learnable upsampling — the network learns how to upsample.

Problem: Transposed convolutions create checkerboard artifacts due to uneven overlap of the kernel at different positions.

Alternative: Nearest-neighbor or bilinear upsampling followed by a regular convolution. Avoids checkerboard artifacts. Used in modern architectures (U-Net variants, StyleGAN2).
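Both upsampling paths produce the same output shape; a sketch for 2x upsampling (the kernel sizes here are common choices, not the only valid ones):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)

# Learnable upsampling: transposed conv, O = (I-1)*S - 2P + K = 15*2 - 2 + 4 = 32
# (K=4, S=2, P=1 gives even kernel overlap, which reduces checkerboard artifacts)
tconv = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
print(tconv(x).shape)    # [1, 32, 32, 32]

# Artifact-free alternative: fixed upsampling, then a regular conv
up_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(64, 32, kernel_size=3, padding=1),
)
print(up_conv(x).shape)  # [1, 32, 32, 32]
```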

Interview tip: If asked "how do you upsample feature maps?", mention both approaches and explain why the upsampling + conv approach is preferred in practice despite transposed convolution being more "elegant."

Q12: Implement a simple CNN classifier for CIFAR-10 in PyTorch.

A

This is a common live coding question. The interviewer wants to see that you understand the data flow through conv → activation → pool → flatten → FC layers, and that you can calculate output dimensions correctly.

import torch
import torch.nn as nn

class CIFAR10CNN(nn.Module):
    """Simple CNN for CIFAR-10 (32x32x3 input, 10 classes)"""
    def __init__(self):
        super().__init__()
        # Feature extractor
        self.features = nn.Sequential(
            # Block 1: 32x32x3 -> 32x32x32 -> 16x16x32
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.25),

            # Block 2: 16x16x32 -> 16x16x64 -> 8x8x64
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.25),

            # Block 3: 8x8x64 -> 8x8x128 -> 4x4x128
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            nn.Dropout2d(0.25),
        )

        # Classifier
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # Global average pooling -> 1x1x128
            nn.Flatten(),
            nn.Linear(128, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Verify shapes
model = CIFAR10CNN()
x = torch.randn(2, 3, 32, 32)
print(f"Output shape: {model(x).shape}")  # [2, 10]
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")  # ~330K

Key Takeaways

💡
  • Always calculate output dimensions: O = floor((W - K + 2P) / S) + 1
  • Two 3x3 convs = one 5x5 receptive field with fewer parameters — this is why modern CNNs use small kernels
  • 1x1 convolutions reduce channels in bottleneck blocks, saving 8-9x computation
  • Global average pooling replaced FC layers, cutting parameters by 200x with no accuracy loss
  • Transfer learning strategy depends on dataset size and domain similarity: freeze, partial fine-tune, or full fine-tune
  • Depthwise separable convolutions (MobileNet) cut computation ~8x at minimal accuracy cost