CNN Fundamentals
A guide to CNN fundamentals, laying the groundwork for studying CNN architectures.
What Are Convolutional Neural Networks?
Convolutional Neural Networks (CNNs) are a class of deep learning architectures specifically designed to process data with grid-like topology, such as images. Unlike fully connected networks where every neuron connects to every neuron in the next layer, CNNs use local connectivity patterns inspired by the visual cortex of animals. This architectural choice dramatically reduces the number of parameters while preserving the ability to learn spatial hierarchies of features.
A CNN learns to detect simple features like edges and corners in early layers, then combines them into more complex features like textures and shapes in middle layers, and finally recognizes high-level concepts like objects and faces in deeper layers. This hierarchical feature learning is what makes CNNs so powerful for vision tasks.
Core CNN Operations
Convolution
The convolution operation slides a small filter (kernel) across the input image, computing an element-wise multiplication and summation at each position. Because the same filter weights are reused at every position, each filter learns to detect a specific feature regardless of where it appears in the image (translation equivariance, often loosely called translation invariance). A 3x3 filter has only 9 learnable parameters per input channel, yet it can detect its pattern anywhere in a 1000x1000 image.
import torch
import torch.nn as nn

# A basic convolution layer
conv = nn.Conv2d(
    in_channels=3,    # RGB input
    out_channels=64,  # 64 different filters
    kernel_size=3,    # 3x3 filter
    stride=1,         # move 1 pixel at a time
    padding=1,        # pad to preserve spatial dimensions
)
# Input:  (batch, 3, 224, 224)
# Output: (batch, 64, 224, 224)
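The shape comments above can be verified by pushing a dummy batch through the layer; a quick sanity-check sketch:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
x = torch.randn(2, 3, 224, 224)  # batch of 2 RGB images
y = conv(x)
print(tuple(y.shape))  # (2, 64, 224, 224): padding=1 with a 3x3 kernel preserves 224x224
```

With kernel size k, stride s, and padding p, the output size is floor((n + 2p - k) / s) + 1; here (224 + 2 - 3)/1 + 1 = 224.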
Pooling
Pooling layers reduce the spatial dimensions of feature maps, providing translation invariance and reducing computation. Max pooling selects the maximum value in each window, while average pooling computes the mean. Modern architectures often replace pooling with strided convolutions, which learn the downsampling operation.
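Both downsampling options can be compared directly; a minimal sketch (the layer sizes are illustrative, not from the text):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

# Max pooling: fixed operation, halves spatial dimensions (56 -> 28)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
y_pool = pool(x)
print(tuple(y_pool.shape))  # (1, 64, 28, 28)

# Strided convolution: learned downsampling with the same output size
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
y_conv = strided(x)
print(tuple(y_conv.shape))  # (1, 64, 28, 28)
```

The strided convolution reaches the same spatial size but adds learnable parameters, which is why modern architectures often prefer it.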
Activation Functions
ReLU (Rectified Linear Unit) is the standard activation for CNNs: f(x) = max(0, x). It introduces non-linearity, enabling the network to learn complex functions, and it mitigates the vanishing-gradient problem that plagues sigmoid and tanh. Variants such as Leaky ReLU, PReLU, and GELU are used in modern architectures.
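The behavior of ReLU and two of its variants can be seen on a small tensor; the sample values below are arbitrary:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

relu_out = nn.ReLU()(x)            # negatives clipped to 0
leaky_out = nn.LeakyReLU(0.1)(x)   # negatives scaled by 0.1 instead of zeroed
gelu_out = nn.GELU()(x)            # smooth, probabilistic gating

print(relu_out)   # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(leaky_out)  # tensor([-0.2000, -0.0500, 0.0000, 1.5000])
```

Leaky ReLU keeps a small gradient for negative inputs, avoiding "dead" units whose output and gradient are always zero.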
CNN Architecture Template
Most CNN architectures follow a common template: alternating convolutional and pooling layers for feature extraction, followed by fully connected layers for classification. The spatial dimensions decrease through the network while the channel (depth) dimension increases, creating an increasingly abstract representation.
- Input layer — Raw image pixels (e.g., 224x224x3 for RGB)
- Convolutional blocks — Conv + BatchNorm + ReLU + Pooling, repeated multiple times
- Global average pooling — Reduces each feature map to a single number
- Classification head — One or more fully connected layers with softmax output
Receptive Field
The receptive field is the region of the input image that influences a particular neuron's output. Deeper layers have larger receptive fields because they combine information from multiple earlier layers. Understanding receptive fields is crucial for choosing kernel sizes and network depth. A network that needs to recognize large objects needs a large receptive field, which requires either deep networks, large kernels, or dilated convolutions.
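The growth of the receptive field can be computed layer by layer with the standard recurrence r_new = r + (k - 1) * j, where j is the cumulative stride ("jump"); the layer stack below is an arbitrary example:

```python
# Each entry: (name, kernel size, stride)
layers = [("conv3x3", 3, 1), ("conv3x3", 3, 1), ("pool2x2", 2, 2), ("conv3x3", 3, 1)]

r, j = 1, 1  # a single input pixel sees itself; jump starts at 1
for name, k, s in layers:
    r = r + (k - 1) * j  # kernel widens the field, scaled by cumulative stride
    j = j * s            # stride multiplies the jump for later layers
    print(f"{name}: receptive field = {r}")
# Final receptive field after all four layers: 10
```

Note how the conv after the pooling layer grows the field by 4 pixels instead of 2: strided layers make every subsequent kernel cover more of the input, which is why downsampling is an efficient way to enlarge receptive fields.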
Batch Normalization
Batch normalization normalizes the activations of each layer to have zero mean and unit variance. This stabilizes training, allows higher learning rates, and acts as a regularizer. It is applied after the convolution and before the activation function in most architectures. Despite its ubiquity, the theoretical understanding of why batch normalization works remains an active research area.
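The Conv -> BatchNorm -> ReLU ordering described above looks like this in PyTorch (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

# Batch norm sits between the convolution and the activation
block = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),  # normalizes each of the 16 channels over the batch
    nn.ReLU(),
)
block.train()  # in training mode, BN uses batch statistics
x = torch.randn(8, 3, 32, 32)
y = block(x)
print(tuple(y.shape))        # (8, 16, 32, 32)
print(bool((y >= 0).all()))  # True: ReLU leaves no negative activations
```

A common convention is to set `bias=False` on a convolution followed by batch norm, since BN's learnable shift makes the conv bias redundant.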
In the next lesson, we will trace the evolution of CNN architectures from LeNet in 1998 to AlexNet in 2012, which launched the deep learning revolution in computer vision.
Lilly Tech Systems