Intermediate

Image Classification Questions

These 12 questions cover the foundational image classification concepts that appear in nearly every computer vision interview. CNN architectures, transfer learning, data augmentation, and evaluation metrics are tested across all CV roles.

Q1: Explain how a convolutional layer works. What are its key parameters?

💡 Model Answer:

A convolutional layer slides a set of learnable filters (kernels) across the input feature map, computing element-wise multiplication and summation at each position to produce an output feature map.

Key parameters:

  • Number of filters (out_channels): Each filter learns to detect a different feature. More filters = more capacity but more compute.
  • Kernel size: Typically 3x3 (most common), 5x5, or 1x1 (for channel mixing). Larger kernels capture broader spatial patterns but are computationally expensive.
  • Stride: Step size for sliding the filter. Stride=2 halves the spatial dimensions (used instead of pooling in modern architectures like ResNet-D).
  • Padding: Added border pixels. "Same" padding (padding=kernel_size//2) preserves spatial dimensions. "Valid" padding (no padding) shrinks the output.
  • Dilation: Spacing between kernel elements. Dilation=2 gives a 3x3 kernel an effective receptive field of 5x5 without increasing parameters.

Output size formula (per spatial dimension): output = floor((input + 2*padding - dilation*(kernel-1) - 1) / stride) + 1

Parameter count: kernel_h * kernel_w * in_channels * out_channels + out_channels (bias)
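Both formulas can be sanity-checked directly against nn.Conv2d (a quick sketch; the 64→128-channel, 3x3, stride-2 layer is just an illustrative choice):

```python
import torch
import torch.nn as nn

# Conv layer: 3x3 kernel, stride 2, padding 1, dilation 1 (default)
conv = nn.Conv2d(in_channels=64, out_channels=128,
                 kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 64, 56, 56)
out = conv(x)
# Output size: floor((56 + 2*1 - 1*(3-1) - 1) / 2) + 1 = floor(55/2) + 1 = 28
print(out.shape)  # torch.Size([1, 128, 28, 28])

# Parameter count: 3*3*64*128 + 128 (bias) = 73856
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 73856
```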

Q2: Why do CNNs use pooling layers? Compare max pooling vs average pooling.

💡 Model Answer:

Pooling layers reduce spatial dimensions, lowering computational cost and providing a degree of translation invariance. They aggregate local features into a summary statistic.

| Aspect | Max Pooling | Average Pooling |
| --- | --- | --- |
| Operation | Takes the maximum value in each window | Takes the mean of values in each window |
| Behavior | Retains the strongest activation (most prominent feature) | Smooths activations, retains the overall distribution |
| When to use | Feature detection tasks where the presence of a feature matters more than its exact magnitude | Tasks where global context matters; often used as global average pooling before the classifier head |
| Gradient flow | Gradient flows only through the max element (sparse gradients) | Gradient is distributed equally across all elements in the window |

Modern trend: Many recent architectures (ConvNeXt, EfficientNetV2) replace explicit pooling with strided convolutions, which learn the downsampling operation. Global Average Pooling (GAP) before the final classifier is nearly universal, replacing the large fully-connected layers used in AlexNet/VGG.
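The contrast is easy to see on a single 2x2 window, along with the GAP pattern mentioned above (a small sketch; the 512-channel feature map is an arbitrary example):

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])  # shape (1, 1, 2, 2)

max_out = nn.MaxPool2d(kernel_size=2)(x)  # keeps the strongest activation
avg_out = nn.AvgPool2d(kernel_size=2)(x)  # keeps the mean
print(max_out.item())  # 4.0
print(avg_out.item())  # 2.5

# Global Average Pooling: collapse any HxW to 1x1 before the classifier head
feat = torch.randn(8, 512, 7, 7)
gap = nn.AdaptiveAvgPool2d(1)(feat)       # shape (8, 512, 1, 1)
logits = nn.Linear(512, 1000)(gap.flatten(1))
print(logits.shape)  # torch.Size([8, 1000])
```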

Q3: Explain ResNet's skip connections. Why do they work?

💡 Model Answer:

ResNet introduced residual connections (skip connections) where the input to a block is added to the block's output: y = F(x) + x. Instead of learning the desired mapping H(x) directly, the network learns the residual F(x) = H(x) - x.

Why they work (three complementary explanations):

  • Gradient flow: During backpropagation, the identity shortcut provides a direct gradient path from later layers to earlier layers, mitigating vanishing gradients. The gradient of the identity is 1, so it never shrinks.
  • Easier optimization: Learning to output zero (making F(x)=0, so the block is an identity) is easier than learning an identity mapping from scratch. This means adding layers can never hurt — worst case, they learn to be no-ops.
  • Ensemble effect: Veit et al. (2016) showed that ResNets behave like ensembles of many shorter networks. Removing individual layers has little impact because information flows through multiple paths of different lengths.

Variants:

  • Pre-activation ResNet: Places BN and ReLU before the convolution (BN-ReLU-Conv instead of Conv-BN-ReLU). Improves gradient flow further.
  • Bottleneck block: Uses 1x1 → 3x3 → 1x1 convolutions to reduce channel dimensions, enabling deeper networks (ResNet-50/101/152).
  • ResNeXt: Adds grouped convolutions for more capacity at the same compute cost.
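A minimal sketch of the basic residual block described above, covering only the identity-shortcut case (same channels, stride 1); real ResNets add a 1x1 projection on the skip path when dimensions change:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet-style basic block: y = F(x) + x (post-activation variant)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # the skip path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                  # residual addition: F(x) + x
        return self.relu(out)

block = BasicBlock(64)
x = torch.randn(2, 64, 32, 32)
print(block(x).shape)  # torch.Size([2, 64, 32, 32]) — shape is preserved
```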

Q4: Compare the evolution of CNN architectures from AlexNet to EfficientNet.

💡 Model Answer:
| Architecture | Year | Key Innovation | Top-1 ImageNet | Params |
| --- | --- | --- | --- | --- |
| AlexNet | 2012 | First deep CNN to win ImageNet. Used ReLU, dropout, GPU training | 63.3% | 61M |
| VGG-16 | 2014 | Showed that depth matters. Stacked 3x3 convolutions (two 3x3 = one 5x5 receptive field, fewer params) | 74.4% | 138M |
| GoogLeNet/Inception | 2014 | Inception modules with parallel branches (1x1, 3x3, 5x5, pool). Reduced params via 1x1 bottlenecks | 74.8% | 6.8M |
| ResNet-50 | 2015 | Skip connections enabling 50–152 layers. Bottleneck blocks. Batch normalization | 76.1% | 25.6M |
| DenseNet-121 | 2017 | Dense connections: each layer receives features from all previous layers. Extreme feature reuse | 74.4% | 8M |
| MobileNetV2 | 2018 | Inverted residuals with depthwise separable convolutions. Designed for mobile/edge devices | 72.0% | 3.4M |
| EfficientNet-B0 | 2019 | Compound scaling (depth, width, resolution scaled together). NAS-designed. Squeeze-and-Excitation blocks | 77.1% | 5.3M |
| ConvNeXt | 2022 | Modernized ResNet with ViT design choices: patchify stem, larger kernels (7x7), fewer normalization layers, GELU | 82.1% | 29M |

Key insight for interviews: The trend has been toward deeper networks with better gradient flow (skip connections), efficient parameter use (depthwise separable convolutions, bottleneck blocks), and automatic architecture design (NAS). ConvNeXt showed that pure CNNs can match vision transformers when modernized with similar training recipes.

Q5: What is transfer learning and when should you use it?

💡 Model Answer:

Transfer learning uses a model pretrained on a large dataset (typically ImageNet with 1.2M images) as a starting point for a new task. The pretrained model has already learned general visual features (edges, textures, shapes, parts) that transfer to most vision tasks.

Two main strategies:

  • Feature extraction: Freeze all pretrained layers, replace only the classifier head, train the head on your data. Best when you have very little data (<1K images) or your task is similar to ImageNet.
  • Fine-tuning: Unfreeze some or all pretrained layers and train with a low learning rate. Best when you have moderate data (1K–100K images) and your domain differs from ImageNet (e.g., medical images, satellite imagery).

Best practices:

  • Use differential learning rates: lower LR for early layers (1e-5), higher for later layers and head (1e-3)
  • Gradually unfreeze layers (progressive unfreezing) to avoid catastrophic forgetting
  • Match the input preprocessing (normalization) to what the pretrained model expects
  • For very different domains (medical, satellite), consider pretraining on domain-specific data first
import torchvision.models as models
import torch.nn as nn

# Feature extraction: freeze backbone, train only head
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for param in model.parameters():
    param.requires_grad = False

# Replace classifier for 10-class problem
model.fc = nn.Linear(model.fc.in_features, 10)

# Fine-tuning: unfreeze last 2 residual blocks
for param in model.layer4.parameters():
    param.requires_grad = True
for param in model.layer3.parameters():
    param.requires_grad = True

Q6: What is batch normalization and why is it important in CNNs?

💡 Model Answer:

Batch normalization normalizes activations across the batch dimension for each channel: x_norm = (x - mean) / sqrt(var + eps), followed by learnable scale (gamma) and shift (beta) parameters: y = gamma * x_norm + beta.

Why it works:

  • Reduces internal covariate shift (the original motivation): Stabilizes the distribution of inputs to each layer, so later layers do not have to constantly adapt to shifting input distributions. Later analysis (Santurkar et al., 2018) attributes the benefit mainly to a smoother loss landscape.
  • Enables higher learning rates: Without BN, high learning rates cause divergence. BN makes the loss landscape smoother, allowing faster training.
  • Acts as regularization: The noise from batch statistics (different mini-batches produce slightly different means and variances) provides a regularizing effect similar to dropout.

Key interview details:

  • Training vs inference: During training, uses batch statistics. During inference, uses running mean and variance accumulated during training. This is why model.eval() matters.
  • Small batch sizes: BN becomes unstable with very small batches (<16). Use Group Normalization or Layer Normalization instead.
  • Placement: Original paper: Conv → BN → ReLU. Pre-activation ResNet: BN → ReLU → Conv (better gradient flow).
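The training-vs-inference distinction can be demonstrated in a few lines (a small sketch; the batch shape and the shifted input statistics are arbitrary):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
x = torch.randn(16, 3, 8, 8) * 5 + 2    # input with non-standard mean/std

bn.train()
y_train = bn(x)                          # normalizes with THIS batch's stats
# In training mode the per-channel output mean is ~0 by construction
print(y_train.mean(dim=(0, 2, 3)))       # close to zeros

bn.eval()
y_eval = bn(x)                           # uses running_mean / running_var
# Running stats were updated only once (momentum 0.1), so outputs differ —
# this is exactly why forgetting model.eval() changes predictions
print(torch.allclose(y_train, y_eval))   # False
```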

Q7: Describe 5 data augmentation techniques and when each is most effective.

💡 Model Answer:
| Technique | What It Does | When Most Effective |
| --- | --- | --- |
| Random Horizontal Flip | Mirrors image left-right with 50% probability | Nearly always. Do not use for tasks where orientation matters (text recognition, medical imaging with laterality) |
| Random Crop + Resize | Crops a random portion and resizes to target size | Forces the model to learn scale invariance. The default augmentation for ImageNet training (RandomResizedCrop) |
| Color Jitter | Randomly adjusts brightness, contrast, saturation, hue | Outdoor scenes with variable lighting. Less useful for medical/industrial images where color carries diagnostic information |
| Mixup / CutMix | Mixup: blends two images and their labels. CutMix: pastes a patch from one image onto another | Improves calibration and reduces overconfidence. CutMix works better than Mixup for object detection because it preserves local features |
| RandAugment | Applies N random transformations from a predefined set, each at magnitude M | When you want strong augmentation without tuning individual transforms. Only 2 hyperparameters (N, M) instead of dozens |
import torchvision.transforms as T

# Standard ImageNet augmentation pipeline
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4,
                  saturation=0.4, hue=0.1),
    T.RandAugment(num_ops=2, magnitude=9),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

# CutMix (applied at the batch level)
from torchvision.transforms import v2
cutmix = v2.CutMix(num_classes=1000)
# Apply: images, labels = cutmix(images, labels)

Q8: What is the difference between 1x1 convolution and a fully connected layer?

💡 Model Answer:

A 1x1 convolution operates independently at each spatial position, mixing channels without any spatial context. A fully connected layer flattens the entire feature map and connects every neuron to every output.

  • 1x1 convolution: Parameters = in_channels * out_channels + out_channels. Preserves spatial structure. Works on any input size. Used for channel dimensionality reduction (bottleneck blocks), channel mixing, and adding non-linearity without spatial processing.
  • Fully connected layer: Parameters = (H * W * in_channels) * out_features + out_features. Destroys spatial structure. Fixed input size. Used at the end for classification.

Key insight: A 1x1 convolution applied to a 1x1 feature map is mathematically identical to a fully connected layer. This is why modern architectures use Global Average Pooling (reducing spatial dims to 1x1) followed by a 1x1 conv (or equivalently, a linear layer) — it provides input-size flexibility with minimal parameters.

Use in Network-in-Network and Inception: 1x1 convolutions were popularized by Lin et al. (2013) and used extensively in GoogLeNet to reduce channel dimensions before expensive 3x3 and 5x5 convolutions, dramatically cutting computation.
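The equivalence claim is easy to verify numerically. A sketch that copies a 1x1 conv's weights into an nn.Linear and compares outputs on a post-GAP (1x1 spatial) feature map:

```python
import torch
import torch.nn as nn

# A 1x1 conv applied to a 1x1 feature map is the same linear map as nn.Linear
conv = nn.Conv2d(512, 1000, kernel_size=1)
fc = nn.Linear(512, 1000)

# Copy the conv weights (1000, 512, 1, 1) into the linear layer (1000, 512)
with torch.no_grad():
    fc.weight.copy_(conv.weight.squeeze(-1).squeeze(-1))
    fc.bias.copy_(conv.bias)

x = torch.randn(4, 512, 1, 1)            # e.g. after Global Average Pooling
out_conv = conv(x).flatten(1)
out_fc = fc(x.flatten(1))
print(torch.allclose(out_conv, out_fc, atol=1e-5))  # True

# Identical parameter counts: 512*1000 weights + 1000 biases
print(sum(p.numel() for p in conv.parameters()))    # 513000
```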

Q9: How do you handle class imbalance in image classification?

💡 Model Answer:

Class imbalance is common in real-world CV: medical imaging (1% positive), defect detection (0.1% defective), wildlife monitoring (rare species). Here are the main strategies, ordered by effectiveness:

  • Weighted loss function: Assign higher weight to minority classes. CrossEntropyLoss(weight=class_weights) where weights are inversely proportional to class frequency. Simple and effective as a baseline.
  • Focal Loss: Adds a modulating factor (1-p_t)^gamma that down-weights easy examples and focuses on hard ones. Originally designed for dense object detection (RetinaNet) but works well for classification too. Gamma=2 is the standard starting point.
  • Oversampling minority classes: Use WeightedRandomSampler to sample minority classes more frequently. Each epoch sees a balanced distribution regardless of actual class frequencies.
  • Data augmentation on minority classes: Apply heavier augmentation specifically to underrepresented classes. Can be combined with oversampling.
  • Two-stage training: First train on balanced samples, then fine-tune the classifier head on the natural distribution. This decouples representation learning from classifier calibration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler

# Strategy 1: Weighted Cross-Entropy
class_counts = [5000, 500, 50]  # highly imbalanced
weights = 1.0 / torch.tensor(class_counts, dtype=torch.float)
weights = weights / weights.sum()  # normalize
criterion = nn.CrossEntropyLoss(weight=weights)

# Strategy 2: Focal Loss
class FocalLoss(nn.Module):
    def __init__(self, alpha=None, gamma=2.0):
        super().__init__()
        self.alpha = alpha  # class weights
        self.gamma = gamma

    def forward(self, logits, targets):
        ce_loss = nn.functional.cross_entropy(
            logits, targets, weight=self.alpha, reduction='none')
        pt = torch.exp(-ce_loss)
        focal_loss = ((1 - pt) ** self.gamma) * ce_loss
        return focal_loss.mean()

# Strategy 3: WeightedRandomSampler (dataset is e.g. an ImageFolder with .targets)
sample_weights = [weights[label].item() for label in dataset.targets]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights))
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

Q10: Explain depthwise separable convolutions. Why are they efficient?

💡
Model Answer:

A depthwise separable convolution factors a standard convolution into two steps:

  1. Depthwise convolution: Applies a single filter per input channel (groups=in_channels). Each channel is filtered independently. This captures spatial patterns within each channel.
  2. Pointwise convolution: A 1x1 convolution that mixes the channel outputs from the depthwise step. This captures cross-channel correlations.

Computational savings:

  • Standard conv: K*K * C_in * C_out * H * W operations
  • Depthwise separable: K*K * C_in * H * W + C_in * C_out * H * W
  • Reduction factor: 1/C_out + 1/K^2. For K=3, C_out=256: ~9x fewer operations

Used in: MobileNet (V1, V2, V3), EfficientNet, Xception. The inverted residual block in MobileNetV2 uses: pointwise expansion → depthwise 3x3 → pointwise projection, with a skip connection on the narrow (projected) representation.
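The savings are easy to confirm by counting parameters (a sketch with illustrative sizes C_in=64, C_out=256, K=3; the same ratio holds for operations at a fixed spatial size):

```python
import torch.nn as nn

C_in, C_out, K = 64, 256, 3

# Standard convolution: K*K * C_in * C_out weights
standard = nn.Conv2d(C_in, C_out, K, padding=1, bias=False)

# Depthwise separable = depthwise (groups=C_in) + pointwise 1x1
depthwise = nn.Conv2d(C_in, C_in, K, padding=1, groups=C_in, bias=False)
pointwise = nn.Conv2d(C_in, C_out, 1, bias=False)

def n_params(*mods):
    return sum(p.numel() for m in mods for p in m.parameters())

print(n_params(standard))                 # 3*3*64*256 = 147456
print(n_params(depthwise, pointwise))     # 3*3*64 + 64*256 = 16960
# Ratio ≈ 8.7x, matching 1 / (1/C_out + 1/K^2)
print(n_params(standard) / n_params(depthwise, pointwise))
```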

Q11: Write a complete image classification training loop in PyTorch.

💡 Model Answer:

This is a production-quality training loop that interviewers expect you to write from memory:

import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from torch.optim.lr_scheduler import CosineAnnealingLR

def train_classifier(data_dir, num_classes, epochs=30, lr=1e-3):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Data pipeline with augmentation
    train_transform = T.Compose([
        T.RandomResizedCrop(224),
        T.RandomHorizontalFlip(),
        T.ColorJitter(0.4, 0.4, 0.4, 0.1),
        T.ToTensor(),
        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    val_transform = T.Compose([
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])

    train_ds = ImageFolder(f"{data_dir}/train", train_transform)
    val_ds = ImageFolder(f"{data_dir}/val", val_transform)
    train_loader = DataLoader(train_ds, batch_size=32, shuffle=True,
                              num_workers=4, pin_memory=True)
    val_loader = DataLoader(val_ds, batch_size=64, num_workers=4,
                            pin_memory=True)

    # Model: fine-tune pretrained ResNet-50
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    model = model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    scaler = torch.amp.GradScaler("cuda")  # mixed precision

    best_acc = 0.0
    for epoch in range(epochs):
        # Training
        model.train()
        running_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()

            with torch.amp.autocast("cuda"):  # FP16 forward
                outputs = model(images)
                loss = criterion(outputs, labels)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            running_loss += loss.item()

        scheduler.step()

        # Validation
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                _, predicted = outputs.max(1)
                total += labels.size(0)
                correct += predicted.eq(labels).sum().item()

        acc = 100.0 * correct / total
        print(f"Epoch {epoch+1}/{epochs} | "
              f"Loss: {running_loss/len(train_loader):.4f} | "
              f"Val Acc: {acc:.2f}%")

        if acc > best_acc:
            best_acc = acc
            torch.save(model.state_dict(), "best_model.pth")

    return model

Q12: What metrics do you use to evaluate image classification models?

💡 Model Answer:
| Metric | Formula / Description | When to Use |
| --- | --- | --- |
| Accuracy | Correct / Total | Balanced datasets only. Misleading when classes are imbalanced (99% accuracy by predicting the majority class) |
| Top-5 Accuracy | Correct label in top-5 predictions | Large-scale classification (ImageNet). Shows if the model is "close" even when top-1 is wrong |
| Precision | TP / (TP + FP) | When false positives are costly (spam filtering, content moderation) |
| Recall | TP / (TP + FN) | When false negatives are costly (medical diagnosis, defect detection) |
| F1 Score | 2 * (Precision * Recall) / (Precision + Recall) | Imbalanced datasets. Use macro-F1 (average per class) or weighted-F1 (weighted by class support) |
| AUC-ROC | Area under the ROC curve (TPR vs FPR) | Binary classification. Threshold-independent evaluation. Useful when the operating threshold is unknown |
| Confusion Matrix | Table of actual vs predicted classes | Always. Reveals which classes the model confuses, guiding data collection and augmentation strategy |
| Calibration (ECE) | Expected Calibration Error | Safety-critical applications where model confidence must match the true probability of correctness |

Interview tip: Always ask "what is the cost of different types of errors?" before choosing a metric. In medical imaging, missing a tumor (false negative) is far worse than a false alarm (false positive), so you optimize for recall. In content moderation, blocking legitimate content (false positive) hurts user experience, so precision matters more.
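Several of these metrics fall straight out of the confusion matrix. A small self-contained sketch (the toy predictions are made up for illustration):

```python
import torch

def confusion_matrix(preds, targets, num_classes):
    """Rows = actual class, columns = predicted class."""
    cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for t, p in zip(targets, preds):
        cm[t, p] += 1
    return cm

preds   = torch.tensor([0, 0, 1, 1, 1, 2, 2, 0])
targets = torch.tensor([0, 1, 1, 1, 2, 2, 2, 0])
cm = confusion_matrix(preds, targets, num_classes=3)
print(cm)  # diagonal = correct predictions per class

tp = cm.diag().float()
precision = tp / cm.sum(dim=0)   # column sums = predicted counts
recall    = tp / cm.sum(dim=1)   # row sums = actual counts
f1 = 2 * precision * recall / (precision + recall)
print(f"macro-F1: {f1.mean():.3f}")
```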