Beginner

Neural Networks

Learn the building blocks of deep learning: artificial neurons, activation functions, network layers, forward propagation, backpropagation, and gradient descent.

Artificial Neurons and Perceptrons

An artificial neuron is the basic unit of a neural network, inspired loosely by biological neurons. It takes multiple inputs, applies weights to each, sums them up, adds a bias, and passes the result through an activation function:

Mathematical Formula
# Neuron computation:
output = activation(w1*x1 + w2*x2 + ... + wn*xn + bias)

# In vector notation:
output = activation(W · X + b)

The perceptron, invented by Frank Rosenblatt in 1958, was the first artificial neuron. It uses a step function as its activation: output is 1 if the weighted sum exceeds a threshold, 0 otherwise. While limited (it can only learn linearly separable patterns), it laid the foundation for modern neural networks.
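The weighted-sum-plus-step-activation computation can be sketched in a few lines of plain Python (a minimal illustration; the `perceptron` helper and the AND-gate weights are hand-picked for the example):

```python
def perceptron(inputs, weights, bias, threshold=0.0):
    """Classic perceptron: weighted sum plus bias, then a step activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > threshold else 0

# An AND gate is linearly separable, so a single perceptron can represent it:
and_weights, and_bias = [1.0, 1.0], -1.5
print(perceptron([1, 1], and_weights, and_bias))  # 1
print(perceptron([1, 0], and_weights, and_bias))  # 0
print(perceptron([0, 0], and_weights, and_bias))  # 0
```

XOR, by contrast, is not linearly separable: no choice of weights and bias makes this single neuron compute it, which is exactly the limitation noted above.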

Activation Functions

Activation functions introduce non-linearity into the network, allowing it to learn complex patterns. Without them, stacking layers would be equivalent to a single linear transformation.

  • ReLU: max(0, x). Range [0, +inf). Default for hidden layers; fast and helps avoid the vanishing-gradient problem.
  • Sigmoid: 1 / (1 + e^(-x)). Range (0, 1). Binary classification output; interpretable as a probability.
  • Tanh: (e^x - e^(-x)) / (e^x + e^(-x)). Range (-1, 1). Zero-centered alternative to sigmoid; used in RNNs.
  • Softmax: e^(xi) / sum(e^(xj)). Range (0, 1), with outputs summing to 1. Multi-class classification output layer.

Rule of thumb: Use ReLU for hidden layers, sigmoid for binary classification outputs, and softmax for multi-class classification outputs. This covers the vast majority of everyday cases.
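The four functions above are short enough to implement directly with the standard library (a sketch for intuition, not a production implementation):

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def softmax(xs):
    # Subtract the max before exponentiating, for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0))                           # 0.0
print(round(sigmoid(0.0), 2))               # 0.5
print(round(sum(softmax([1.0, 2.0, 3.0])), 6))  # 1.0 (a probability distribution)
```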

Layers: Input, Hidden, and Output

Neural networks are organized into layers:

  • Input layer: Receives the raw data. Each neuron represents one feature. No computation happens here — it simply passes data forward.
  • Hidden layers: The computational core. Each layer transforms its input using weights, biases, and activation functions. "Deep" networks have many hidden layers.
  • Output layer: Produces the final prediction. The number of neurons and activation function depend on the task (1 neuron + sigmoid for binary classification, N neurons + softmax for N-class classification).

Forward Propagation

Forward propagation is the process of passing input data through the network, layer by layer, to produce an output. Each layer computes:

  1. Multiply inputs by weights and add biases: z = W*x + b
  2. Apply activation function: a = activation(z)
  3. Pass the result as input to the next layer

This process repeats until the output layer produces a prediction.
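The three steps above can be traced in plain Python for a tiny network with one hidden layer and one output neuron (the `dense` helper and all weights here are illustrative toy values):

```python
import math

def dense(inputs, weights, biases, activation):
    """One layer: z = W*x + b for each neuron, then a = activation(z)."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z))
    return outputs

relu = lambda z: max(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

x = [0.5, -1.0]                                   # input layer: 2 features
W1, b1 = [[0.1, -0.4], [-0.3, 0.2]], [0.0, 0.1]   # hidden layer: 2 neurons
W2, b2 = [[0.7, -0.5]], [0.05]                    # output layer: 1 neuron

hidden = dense(x, W1, b1, relu)          # steps 1-2 for the hidden layer
output = dense(hidden, W2, b2, sigmoid)  # step 3: feed the result onward
print(output)  # a single value between 0 and 1
```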

Loss Functions

A loss function measures how far the network's predictions are from the true values. The goal of training is to minimize this loss:

  • Mean Squared Error (MSE): For regression tasks. Measures average squared difference between predictions and targets.
  • Cross-Entropy Loss: For classification tasks. Measures the difference between predicted probability distributions and true labels.
  • Binary Cross-Entropy: For binary classification. A special case of cross-entropy for two classes.
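MSE and binary cross-entropy are each a one-liner; a sketch (the small `eps` guards against log(0) and is a common practical convention):

```python
import math

def mse(preds, targets):
    """Mean squared error, for regression."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def binary_cross_entropy(preds, targets, eps=1e-12):
    """Binary cross-entropy; preds are probabilities in (0, 1), targets are 0/1."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(preds, targets)) / len(preds)

print(mse([2.5, 0.0], [3.0, -0.5]))                  # 0.25
print(binary_cross_entropy([0.9, 0.1], [1, 0]))      # low loss: confident and correct
print(binary_cross_entropy([0.1, 0.9], [1, 0]))      # high loss: confident but wrong
```

Note how cross-entropy punishes confident wrong answers far more than hesitant ones; this is why it pairs well with sigmoid and softmax outputs.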

Backpropagation

Backpropagation is the algorithm that enables neural networks to learn. It computes how much each weight contributed to the error and adjusts weights accordingly:

  1. Forward pass

    Compute predictions by passing data through the network.

  2. Compute loss

    Measure the error between predictions and true labels using a loss function.

  3. Backward pass

    Compute the gradient of the loss with respect to each weight using the chain rule of calculus. Gradients flow backward from the output layer to the input layer.

  4. Update weights

    Adjust each weight in the direction that reduces the loss, proportional to its gradient.
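The four steps can be traced by hand for a single sigmoid neuron with squared-error loss (toy numbers; the gradient factors come straight from the chain rule):

```python
import math

# Step 1: forward pass (one input, one weight, one bias)
x, y_true = 2.0, 1.0
w, b = 0.5, 0.0
z = w * x + b
a = 1.0 / (1.0 + math.exp(-z))     # prediction

# Step 2: compute loss
loss = (a - y_true) ** 2

# Step 3: backward pass via the chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = 2 * (a - y_true)
da_dz = a * (1 - a)                # derivative of the sigmoid
dz_dw = x
grad_w = dL_da * da_dz * dz_dw

# Step 4: update the weight (learning rate 0.1)
w = w - 0.1 * grad_w
print(round(grad_w, 4))  # negative here: increasing w reduces the loss
```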

Gradient Descent

Gradient descent is the optimization algorithm that uses the gradients computed by backpropagation to update the weights:

Weight Update Rule
# Gradient descent update:
weight = weight - learning_rate * gradient

# learning_rate controls the step size
# Too large: overshoots the minimum
# Too small: training is very slow
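The update rule can be watched in action on a one-dimensional toy problem: minimizing loss(w) = (w - 3)^2, whose gradient is 2*(w - 3):

```python
def gradient(w):
    # Derivative of the loss (w - 3)^2 with respect to w
    return 2 * (w - 3)

w, learning_rate = 0.0, 0.1
for step in range(100):
    w = w - learning_rate * gradient(w)

print(round(w, 4))  # converges to the minimum at w = 3
```

Try learning_rate = 1.1 and the iterates diverge; try 0.0001 and 100 steps barely move w: the two failure modes noted in the comments above.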

Variants of gradient descent:

  • Batch Gradient Descent: Uses the entire dataset to compute gradients. Stable but slow for large datasets.
  • Stochastic Gradient Descent (SGD): Uses one sample at a time. Fast but noisy updates.
  • Mini-batch Gradient Descent: Uses small batches (e.g., 32 or 64 samples). The practical standard — balances speed and stability.
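The three variants differ only in how many samples feed each update. A mini-batch loop on a toy linear model y = w*x (learning the true slope 2 from generated data; dataset and hyperparameters are illustrative):

```python
import random

random.seed(0)
# Toy dataset sampled from y = 2x, with x in (0, 1]
data = [(i / 100, 2 * i / 100) for i in range(1, 101)]
w, lr, batch_size = 0.0, 0.5, 32

for epoch in range(50):
    random.shuffle(data)                   # fresh sample order each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Gradient of MSE for y_pred = w*x, averaged over the mini-batch
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w = w - lr * grad

print(round(w, 3))  # ≈ 2.0, the true slope
```

Setting batch_size = len(data) recovers batch gradient descent; batch_size = 1 recovers SGD.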

Building a Neural Network with PyTorch

Let's build a simple feedforward neural network for classifying handwritten digits (MNIST):

Python (PyTorch)
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# Define the neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers = nn.Sequential(
            nn.Linear(784, 128),  # Input: 28x28 = 784 pixels
            nn.ReLU(),            # Activation function
            nn.Linear(128, 64),   # Hidden layer
            nn.ReLU(),
            nn.Linear(64, 10),    # Output: 10 digit classes
        )

    def forward(self, x):
        x = self.flatten(x)
        return self.layers(x)

# Load MNIST dataset
transform = transforms.ToTensor()
train_data = datasets.MNIST('./data', train=True,
                            download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_data,
                                            batch_size=64,
                                            shuffle=True)

# Initialize model, loss, and optimizer
model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(5):
    total_loss = 0
    for images, labels in train_loader:
        optimizer.zero_grad()           # Clear gradients
        outputs = model(images)         # Forward pass
        loss = criterion(outputs, labels) # Compute loss
        loss.backward()                 # Backpropagation
        optimizer.step()                # Update weights
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader):.4f}")
Code breakdown: This network has 784 input neurons (28x28 pixel images flattened), two hidden layers with 128 and 64 neurons (both using ReLU activation), and an output layer with 10 neurons (one per digit). Cross-entropy loss and the Adam optimizer handle the training.