Neural Networks
Learn the building blocks of deep learning: artificial neurons, activation functions, network layers, forward propagation, backpropagation, and gradient descent.
Artificial Neurons and Perceptrons
An artificial neuron is the basic unit of a neural network, inspired loosely by biological neurons. It takes multiple inputs, applies weights to each, sums them up, adds a bias, and passes the result through an activation function:
```
# Neuron computation:
# output = activation(w1*x1 + w2*x2 + ... + wn*xn + bias)
# In vector notation:
# output = activation(W · X + b)
```
The perceptron, invented by Frank Rosenblatt in 1958, was the first artificial neuron. It uses a step function as its activation: output is 1 if the weighted sum exceeds a threshold, 0 otherwise. While limited (it can only learn linearly separable patterns), it laid the foundation for modern neural networks.
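The step-function perceptron described above can be sketched in a few lines. The weights and bias below are illustrative values chosen (hypothetically) to implement logical AND on binary inputs:

```python
import numpy as np

def perceptron(x, w, b):
    """Classic perceptron: weighted sum plus bias, then a step activation."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# Hand-picked weights implementing AND: fires only when both inputs are 1
w = np.array([1.0, 1.0])
b = -1.5

print(perceptron(np.array([1, 1]), w, b))  # 1
print(perceptron(np.array([0, 1]), w, b))  # 0
```

Because the decision boundary is a single line (w · x + b = 0), this unit can represent AND and OR but not XOR, which is not linearly separable.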
Activation Functions
Activation functions introduce non-linearity into the network, allowing it to learn complex patterns. Without them, stacking layers would be equivalent to a single linear transformation.
| Function | Formula | Range | Use Case |
|---|---|---|---|
| ReLU | max(0, x) | [0, +inf) | Default for hidden layers. Fast, avoids vanishing gradient. |
| Sigmoid | 1 / (1 + e^(-x)) | (0, 1) | Binary classification output. Probability interpretation. |
| Tanh | (e^x - e^(-x)) / (e^x + e^(-x)) | (-1, 1) | Zero-centered alternative to sigmoid. Used in RNNs. |
| Softmax | e^(xi) / sum(e^(xj)) | (0, 1), sums to 1 | Multi-class classification output layer. |
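The four functions in the table are one-liners in numpy. This sketch also applies the standard max-subtraction trick to softmax for numerical stability (a common practice, not mentioned in the table):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))           # [0. 0. 2.]
print(sigmoid(0.0))      # 0.5
print(softmax(x).sum())  # 1.0 — softmax outputs form a probability distribution
```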
Layers: Input, Hidden, and Output
Neural networks are organized into layers:
- Input layer: Receives the raw data. Each neuron represents one feature. No computation happens here — it simply passes data forward.
- Hidden layers: The computational core. Each layer transforms its input using weights, biases, and activation functions. "Deep" networks have many hidden layers.
- Output layer: Produces the final prediction. The number of neurons and activation function depend on the task (1 neuron + sigmoid for binary classification, N neurons + softmax for N-class classification).
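In PyTorch, these layer choices might look like the following (a hypothetical sketch with arbitrary sizes of 20 input features and 16 hidden neurons, not from the example later in this page):

```python
import torch
import torch.nn as nn

# Binary classification: 1 output neuron + sigmoid
binary_net = nn.Sequential(
    nn.Linear(20, 16), nn.ReLU(),    # hidden layer
    nn.Linear(16, 1), nn.Sigmoid(),  # output in (0, 1)
)

# 5-class classification: 5 output neurons, raw logits
# (softmax is applied internally by nn.CrossEntropyLoss)
multiclass_net = nn.Sequential(
    nn.Linear(20, 16), nn.ReLU(),
    nn.Linear(16, 5),
)

print(binary_net(torch.randn(4, 20)).shape)      # torch.Size([4, 1])
print(multiclass_net(torch.randn(4, 20)).shape)  # torch.Size([4, 5])
```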
Forward Propagation
Forward propagation is the process of passing input data through the network, layer by layer, to produce an output. Each layer computes:
- Multiply inputs by weights and add biases: `z = W*x + b`
- Apply the activation function: `a = activation(z)`
- Pass the result as input to the next layer
This process repeats until the output layer produces a prediction.
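The steps above can be traced by hand with a toy two-layer network. The weights here are arbitrary illustrative values, not trained parameters:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

x  = np.array([0.5, -1.0])       # input features
W1 = np.array([[1.0, -0.5],
               [0.3,  0.8]])     # hidden layer: 2 neurons
b1 = np.array([0.1, 0.0])
W2 = np.array([[0.7, -0.2]])     # output layer: 1 neuron
b2 = np.array([0.05])

z1 = W1 @ x + b1   # weighted sum for hidden layer -> [1.1, -0.65]
a1 = relu(z1)      # activation                   -> [1.1,  0.0]
z2 = W2 @ a1 + b2  # weighted sum for output layer -> [0.82]
print(z2)
```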
Loss Functions
A loss function measures how far the network's predictions are from the true values. The goal of training is to minimize this loss:
- Mean Squared Error (MSE): For regression tasks. Measures average squared difference between predictions and targets.
- Cross-Entropy Loss: For classification tasks. Measures the difference between predicted probability distributions and true labels.
- Binary Cross-Entropy: For binary classification. A special case of cross-entropy for two classes.
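MSE and binary cross-entropy are short enough to write directly. The small `eps` clip is a common guard against `log(0)` (an implementation detail, not part of the definitions above):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average squared difference."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p):
    """Binary cross-entropy between labels in {0, 1} and predicted probabilities."""
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.1, 0.8])
print(mse(y, p))                   # 0.02
print(binary_cross_entropy(y, p))  # small, since predictions are confident and correct
```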
Backpropagation
Backpropagation is the algorithm that enables neural networks to learn. It computes how much each weight contributed to the error and adjusts weights accordingly:
- Forward pass: Compute predictions by passing data through the network.
- Compute loss: Measure the error between predictions and true labels using a loss function.
- Backward pass: Compute the gradient of the loss with respect to each weight using the chain rule of calculus. Gradients flow backward from the output layer to the input layer.
- Update weights: Adjust each weight in the direction that reduces the loss, proportional to its gradient.
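The chain rule at the heart of the backward pass can be seen on the smallest possible case: one linear neuron with a squared-error loss. This is an illustrative sketch of the gradient computation, not the full layer-by-layer algorithm:

```python
x, y_true = 2.0, 1.0   # one training example
w, b = 0.5, 0.1        # current parameters

# Forward pass
y_pred = w * x + b               # prediction: 1.1
loss = (y_pred - y_true) ** 2    # squared error: 0.01

# Backward pass via the chain rule:
# dL/dw = dL/dy_pred * dy_pred/dw, and dy_pred/dw = x
dL_dy = 2 * (y_pred - y_true)    # 0.2
dL_dw = dL_dy * x                # 0.4
dL_db = dL_dy * 1.0              # 0.2

print(dL_dw, dL_db)
```

In a multi-layer network the same rule is applied repeatedly: each layer's gradient is the product of the upstream gradient and that layer's local derivative.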
Gradient Descent
Gradient descent is the optimization algorithm that uses the gradients computed by backpropagation to update the weights:
```
# Gradient descent update:
weight = weight - learning_rate * gradient

# learning_rate controls the step size:
#   too large: overshoots the minimum
#   too small: training is very slow
```
Variants of gradient descent:
- Batch Gradient Descent: Uses the entire dataset to compute gradients. Stable but slow for large datasets.
- Stochastic Gradient Descent (SGD): Uses one sample at a time. Fast but noisy updates.
- Mini-batch Gradient Descent: Uses small batches (e.g., 32 or 64 samples). The practical standard — balances speed and stability.
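The update rule above can be watched converging on a toy problem. Here gradient descent minimizes f(w) = (w - 3)^2, whose minimum is at w = 3 (a made-up example for illustration):

```python
def grad(w):
    return 2 * (w - 3)  # derivative of f(w) = (w - 3)^2

w, learning_rate = 0.0, 0.1
for step in range(100):
    w = w - learning_rate * grad(w)  # the gradient descent update rule

print(round(w, 4))  # close to 3.0, the minimum of f
```

Try `learning_rate = 1.1` to see the "too large" failure mode: the iterates oscillate and diverge instead of settling at the minimum.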
Building a Neural Network with PyTorch
Let's build a simple feedforward neural network for classifying handwritten digits (MNIST):
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# Define the neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers = nn.Sequential(
            nn.Linear(784, 128),  # Input: 28x28 = 784 pixels
            nn.ReLU(),            # Activation function
            nn.Linear(128, 64),   # Hidden layer
            nn.ReLU(),
            nn.Linear(64, 10),    # Output: 10 digit classes
        )

    def forward(self, x):
        x = self.flatten(x)
        return self.layers(x)

# Load MNIST dataset
transform = transforms.ToTensor()
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)

# Initialize model, loss, and optimizer
model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(5):
    total_loss = 0
    for images, labels in train_loader:
        optimizer.zero_grad()              # Clear gradients
        outputs = model(images)            # Forward pass
        loss = criterion(outputs, labels)  # Compute loss
        loss.backward()                    # Backpropagation
        optimizer.step()                   # Update weights
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader):.4f}")
```