Beginner

CNN Fundamentals

A comprehensive guide to the fundamentals of convolutional neural networks (CNNs) and the building blocks of CNN architectures.

What Are Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a class of deep learning architectures specifically designed to process data with grid-like topology, such as images. Unlike fully connected networks where every neuron connects to every neuron in the next layer, CNNs use local connectivity patterns inspired by the visual cortex of animals. This architectural choice dramatically reduces the number of parameters while preserving the ability to learn spatial hierarchies of features.

A CNN learns to detect simple features like edges and corners in early layers, then combines them into more complex features like textures and shapes in middle layers, and finally recognizes high-level concepts like objects and faces in deeper layers. This hierarchical feature learning is what makes CNNs so powerful for vision tasks.

Core CNN Operations

Convolution

The convolution operation slides a small filter (kernel) across the input image, computing element-wise multiplication and summation at each position. Each filter learns to detect a specific feature regardless of where it appears in the image; shifting the input shifts the resulting feature map by the same amount (translation equivariance). A 3x3 filter has only 9 learnable parameters per input channel but can detect its pattern anywhere in a 1000x1000 image.

import torch
import torch.nn as nn

# A basic convolution layer
conv = nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=64,  # 64 different filters
    kernel_size=3,    # 3x3 filter
    stride=1,         # Move 1 pixel at a time
    padding=1,        # Pad to maintain spatial dimensions
)
# Input: (batch, 3, 224, 224)
# Output: (batch, 64, 224, 224)

Pooling

Pooling layers reduce the spatial dimensions of feature maps, providing translation invariance and reducing computation. Max pooling selects the maximum value in each window, while average pooling computes the mean. Modern architectures often replace pooling with strided convolutions, which learn the downsampling operation.
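A minimal sketch of both downsampling options, assuming a 64-channel feature map at 224x224: max pooling with a 2x2 window, and a strided convolution that produces the same output size but with learnable weights.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 224, 224)

# Max pooling: fixed operation, halves spatial dimensions
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).shape)  # torch.Size([1, 64, 112, 112])

# Strided convolution: learns its own downsampling
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
print(strided(x).shape)  # torch.Size([1, 64, 112, 112])
```

Both halve the spatial dimensions, but the strided convolution adds parameters while the pooling layer has none.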

Activation Functions

ReLU (Rectified Linear Unit) is the standard activation for CNNs: f(x) = max(0, x). It introduces non-linearity, enables the network to learn complex functions, and mitigates the vanishing gradient problem of sigmoid/tanh. Variants like Leaky ReLU, PReLU, and GELU are used in modern architectures, in part to address ReLU's "dying neuron" problem, where a unit outputs zero for all inputs and stops learning.
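To make the difference concrete, here is ReLU next to Leaky ReLU on a few sample values (the 0.1 negative slope is an illustrative choice, not a standard default):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

# ReLU zeroes all negative inputs
relu = nn.ReLU()
print(relu(x))  # tensor([0., 0., 0., 1., 3.])

# Leaky ReLU passes a small gradient through for negative inputs
leaky = nn.LeakyReLU(negative_slope=0.1)
print(leaky(x))  # tensor([-0.2000, -0.0500, 0.0000, 1.0000, 3.0000])
```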

💡
Parameter efficiency: A fully connected layer connecting a 224x224x3 image to 1000 neurons would need 150 million parameters. A 3x3 convolutional layer with 64 filters needs only 1,728 parameters. This is why CNNs dominate computer vision.
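The counts in the tip above can be verified directly (PyTorch layers also include bias terms, which add slightly to the weight-only figures):

```python
import torch.nn as nn

# Fully connected: 224*224*3 = 150,528 inputs to 1000 neurons
fc = nn.Linear(224 * 224 * 3, 1000)
fc_params = sum(p.numel() for p in fc.parameters())
print(fc_params)    # 150529000 (150,528,000 weights + 1,000 biases)

# Convolutional: 64 filters of size 3x3 over 3 input channels
conv = nn.Conv2d(3, 64, kernel_size=3)
conv_params = sum(p.numel() for p in conv.parameters())
print(conv_params)  # 1792 (1,728 weights + 64 biases)
```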

CNN Architecture Template

Most CNN architectures follow a common template: alternating convolutional and pooling layers for feature extraction, followed by fully connected layers for classification. The spatial dimensions decrease through the network while the channel (depth) dimension increases, creating an increasingly abstract representation.

  1. Input layer — Raw image pixels (e.g., 224x224x3 for RGB)
  2. Convolutional blocks — Conv + BatchNorm + ReLU + Pooling, repeated multiple times
  3. Global average pooling — Reduces each feature map to a single number
  4. Classification head — One or more fully connected layers with softmax output
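The four stages above can be sketched as a small PyTorch module. `SimpleCNN` and its layer sizes are hypothetical choices for illustration, not a named architecture:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3 -> 32 channels, spatial 224 -> 112
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 2: 32 -> 64 channels, spatial 112 -> 56
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.head = nn.Linear(64, num_classes)  # classification head

    def forward(self, x):
        x = self.gap(self.features(x))
        return self.head(x.flatten(1))

model = SimpleCNN()
out = model(torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 10])
```

Note how the spatial dimensions shrink (224 to 56) while the channel count grows (3 to 64), exactly the pattern described above.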

Receptive Field

The receptive field is the region of the input image that influences a particular neuron's output. Deeper layers have larger receptive fields because they combine information from multiple earlier layers. Understanding receptive fields is crucial for choosing kernel sizes and network depth. A network that needs to recognize large objects needs a large receptive field, which requires either deep networks, large kernels, or dilated convolutions.

Common pitfall: If your receptive field is smaller than the objects you are trying to detect, the network cannot possibly learn to recognize them. Always calculate the effective receptive field of your architecture and ensure it covers the largest features you need to detect.
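The (theoretical) receptive field of a stack of layers can be computed iteratively: each layer adds `(kernel - 1) * jump` pixels, where `jump` is the product of all preceding strides. A sketch, applied to an assumed example stack:

```python
def receptive_field(layers):
    """Receptive field for a list of (kernel_size, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field
        jump *= stride             # strides compound the step size
    return rf

# Three 3x3 convs with 2x2 pools between them (hypothetical stack)
layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_field(layers))  # 18
```

A single 3x3 conv sees 3 pixels; this five-layer stack already sees 18, showing how depth and strides grow the receptive field quickly.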

Batch Normalization

Batch normalization normalizes the activations of each layer to have zero mean and unit variance. This stabilizes training, allows higher learning rates, and acts as a regularizer. It is applied after the convolution and before the activation function in most architectures. Despite its ubiquity, the theoretical understanding of why batch normalization works remains an active research area.
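A minimal sketch of the Conv + BatchNorm + ReLU ordering described above:

```python
import torch
import torch.nn as nn

# Conv -> BatchNorm -> ReLU: BN sits between convolution and activation
block = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1, bias=False),  # bias is redundant before BN
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

x = torch.randn(8, 3, 32, 32)
y = block(x)
print(y.shape)  # torch.Size([8, 64, 32, 32])
```

The convolution's bias is disabled because batch normalization subtracts the per-channel mean immediately afterward, cancelling any bias term.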

In the next lesson, we will trace the evolution of CNN architectures from LeNet in 1998 to AlexNet in 2012, which launched the deep learning revolution in computer vision.