Neural Networks Deep Dive
The foundation of modern deep learning — from the simplest perceptron to multi-layer networks that power image recognition, language models, and beyond.
Biological Inspiration
Neural networks are loosely inspired by the human brain. A biological neuron receives signals from other neurons through dendrites, processes them in the cell body, and sends output through its axon. An artificial neuron does something similar:
| Biological Neuron | Artificial Neuron |
|---|---|
| Dendrites (inputs) | Input features (x1, x2, ..., xn) |
| Synaptic weights | Learnable weights (w1, w2, ..., wn) |
| Cell body (processing) | Weighted sum + bias: z = SUM(wi*xi) + b |
| Firing threshold | Activation function: a = f(z) |
| Axon (output) | Output value sent to next layer |
The Perceptron
The simplest neural network: a single neuron that makes binary decisions.
Perceptron:
z = w1*x1 + w2*x2 + ... + wn*xn + b
output = 1 if z >= 0, else 0
Limitations:
- Can only learn LINEAR decision boundaries
- Cannot solve XOR problem
- This limitation motivated multi-layer networks
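The XOR limitation is easy to demonstrate. Below is a minimal NumPy sketch of the classic perceptron learning rule (the function name and hyperparameters are illustrative): trained on AND it converges to perfect accuracy, but on XOR it never can, because no single line separates the two classes.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=0.1):
    """Perceptron learning rule: w += lr * (y - prediction) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b >= 0 else 0
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
    preds = (X @ w + b >= 0).astype(int)
    return (preds == y).mean()  # training accuracy

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(train_perceptron(X, np.array([0, 0, 0, 1])))  # AND: 1.0 (separable)
print(train_perceptron(X, np.array([0, 1, 1, 0])))  # XOR: stuck below 1.0
```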
Multi-Layer Perceptron (MLP)
Stack multiple layers of neurons to learn complex, non-linear patterns:
Architecture:
Input Layer   →   Hidden Layer(s)    →   Output Layer
(features)        (learned repr.)        (predictions)
Example (for classification of 784-pixel images into 10 digits):
Input: 784 neurons (one per pixel)
Hidden 1: 256 neurons (ReLU activation)
Hidden 2: 128 neurons (ReLU activation)
Output: 10 neurons (Softmax activation)
Each connection has a learnable weight.
Each neuron has a learnable bias.
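Since every connection has a weight and every neuron a bias, the parameter count of the 784→256→128→10 example follows directly:

```python
# Each layer contributes (n_in * n_out) weights plus n_out biases
layers = [784, 256, 128, 10]
total = sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))
print(total)  # 235146 learnable parameters
```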
Activation Functions
Activation functions introduce non-linearity. Without them, stacking layers would still produce a linear model.
| Function | Formula | Range | Use Case | Pros/Cons |
|---|---|---|---|---|
| ReLU | max(0, z) | [0, inf) | Hidden layers (default) | Fast, no vanishing gradient. Dead neurons possible. |
| Sigmoid | 1/(1+e^(-z)) | (0, 1) | Binary output layer | Outputs probabilities. Vanishing gradient for large |z|. |
| Tanh | (e^z - e^(-z))/(e^z + e^(-z)) | (-1, 1) | Hidden layers (older networks) | Zero-centered. Still has vanishing gradient. |
| Softmax | e^(z_i) / SUM(e^(z_j)) | (0, 1), sums to 1 | Multi-class output layer | Outputs probability distribution over classes. |
| Leaky ReLU | max(0.01z, z) | (-inf, inf) | Hidden layers | Fixes dead neuron problem. Small negative slope. |
| GELU | z * Phi(z) | approx (-0.17, inf) | Transformers (modern) | Smooth approximation of ReLU. Used in BERT, GPT. |
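For reference, the formulas in the table translate directly to NumPy. The GELU here uses the common tanh approximation from the original paper rather than the exact normal CDF:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

def gelu(z):
    # tanh approximation of z * Phi(z)
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))           # [0. 0. 3.]
print(softmax(z).sum())  # ≈ 1.0
```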
Forward Propagation
Data flows forward through the network, layer by layer, to produce a prediction:
# For a 2-hidden-layer network:
# Layer 1: Input → Hidden 1
z1 = W1 @ X + b1 # linear transformation
a1 = relu(z1) # activation
# Layer 2: Hidden 1 → Hidden 2
z2 = W2 @ a1 + b2 # linear transformation
a2 = relu(z2) # activation
# Layer 3: Hidden 2 → Output
z3 = W3 @ a2 + b3 # linear transformation
y_hat = softmax(z3) # output probabilities
# Compute loss
loss = cross_entropy(y_true, y_hat)
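The pseudocode above can be turned into a runnable NumPy sketch. The layer sizes match the earlier 784→256→128→10 example, but the random weights are illustrative placeholders, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))  # stabilized
    return e / e.sum(axis=0, keepdims=True)

# Randomly initialized weights and biases (columns-as-samples convention)
W1, b1 = rng.normal(0, 0.01, (256, 784)), np.zeros((256, 1))
W2, b2 = rng.normal(0, 0.01, (128, 256)), np.zeros((128, 1))
W3, b3 = rng.normal(0, 0.01, (10, 128)), np.zeros((10, 1))

X = rng.normal(size=(784, 1))   # one input as a column vector
a1 = relu(W1 @ X + b1)          # Layer 1
a2 = relu(W2 @ a1 + b2)         # Layer 2
y_hat = softmax(W3 @ a2 + b3)   # Output probabilities

print(y_hat.shape)      # (10, 1)
print(y_hat.sum())      # ≈ 1.0
```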
Backpropagation and the Chain Rule
Backpropagation computes how much each weight contributed to the error, using the chain rule of calculus to propagate gradients backward:
Chain Rule (simplified):
dLoss/dW1 = dLoss/dy_hat * dy_hat/dz3 * dz3/da2 * da2/dz2 * dz2/da1 * da1/dz1 * dz1/dW1
This chains together:
1. How loss changes with output (dLoss/dy_hat)
2. How output changes with z3 (dy_hat/dz3) - softmax derivative
3. How z3 changes with a2 (dz3/da2) = W3
4. How a2 changes with z2 (da2/dz2) - ReLU derivative (0 or 1)
5. ... all the way back to W1
Steps:
1. Forward pass: compute all z's and a's
2. Compute loss
3. Backward pass: compute all gradients using chain rule
4. Update weights: W = W - learning_rate * gradient
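A good way to build intuition for step 3 is to check an analytic gradient against finite differences on a one-neuron network. For sigmoid plus binary cross-entropy, the chain rule collapses neatly to dL/dz = p - y:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One-neuron network: p = sigmoid(w*x + b), loss = binary cross-entropy
x, y, w, b = 1.5, 1.0, 0.2, -0.1

def loss(w, b):
    p = sigmoid(w * x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradients via the chain rule: dL/dz = p - y
p = sigmoid(w * x + b)
grad_w = (p - y) * x
grad_b = (p - y)

# Numerical check with central finite differences
eps = 1e-6
num_w = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
print(grad_w, num_w)  # the two values agree closely
```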
Loss Functions
| Task | Loss Function | Formula |
|---|---|---|
| Regression | MSE (Mean Squared Error) | (1/n) * SUM(y - y_hat)^2 |
| Binary Classification | Binary Cross-Entropy | -[y*log(p) + (1-y)*log(1-p)] |
| Multi-class Classification | Categorical Cross-Entropy | -SUM[y_k * log(p_k)] |
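Each loss formula can be evaluated directly; the targets and predictions below are arbitrary illustrative values:

```python
import numpy as np

# MSE for regression
y_true, y_hat = np.array([3.0, -0.5, 2.0]), np.array([2.5, 0.0, 2.0])
mse = np.mean((y_true - y_hat) ** 2)          # (0.25 + 0.25 + 0) / 3

# Binary cross-entropy: true label 1, predicted probability 0.9
y, p = 1, 0.9
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # = -log(0.9)

# Categorical cross-entropy: one-hot target, predicted distribution
y_onehot = np.array([0, 0, 1])
probs = np.array([0.1, 0.2, 0.7])
cce = -np.sum(y_onehot * np.log(probs))            # = -log(0.7)

print(round(mse, 4), round(bce, 4), round(cce, 4))
```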
Optimizers
| Optimizer | How It Works | When to Use |
|---|---|---|
| SGD | Basic gradient descent with optional momentum | Simple problems, when you want full control |
| SGD + Momentum | Adds velocity to escape local minima | Better convergence than vanilla SGD |
| Adam | Adaptive learning rates per parameter + momentum | Default choice. Works well for most problems. |
| AdamW | Adam with decoupled weight decay | Modern NLP/vision transformers |
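The momentum update can be sketched in a few lines. Note that libraries differ in the exact formulation (PyTorch, for example, folds the learning rate in differently); this is one common textbook version:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One step of SGD with momentum: v = beta*v - lr*grad; w = w + v."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

# Minimize f(w) = w^2 (gradient = 2w) starting from w = 1
w, v = np.array([1.0]), np.zeros(1)
for _ in range(3):
    grad = 2 * w
    w, v = sgd_momentum_step(w, grad, v)
print(w)  # w has moved toward the minimum at 0
```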
Key Concepts: Epochs, Batches, Learning Rate
Epoch:
One complete pass through the entire training dataset.
Typical: 10-100 epochs (with early stopping).
Batch Size:
Number of samples processed before updating weights.
- Batch GD: entire dataset (slow but stable)
- Mini-batch GD: 32-256 samples (best tradeoff)
- Stochastic GD: 1 sample (noisy but fast)
Common choices: 32, 64, 128, 256
Learning Rate:
How big each weight update step is.
- Too high: loss diverges (overshooting)
- Too low: training takes forever
- Common: 1e-3 (0.001) with Adam
- Use learning rate schedulers to decrease over time
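In PyTorch, a scheduler such as `StepLR` implements the "decrease over time" advice; the step size and decay factor below are arbitrary choices:

```python
import torch
import torch.optim as optim

model = torch.nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate every 10 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... forward pass, loss.backward() would go here ...
    optimizer.step()    # call before scheduler.step() in recent PyTorch
    scheduler.step()

print(optimizer.param_groups[0]['lr'])  # 1e-3 * 0.5**3 = 1.25e-4
```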
Universal Approximation Theorem
A feedforward network with a single hidden layer and a non-linear activation (e.g. sigmoid or tanh) can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden neurons (Cybenko 1989; Hornik 1991). Two caveats: the theorem says nothing about how many neurons are required, and it does not guarantee that gradient descent will actually find the approximating weights. In practice, deeper networks often reach the same accuracy with far fewer parameters than a single very wide layer.
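A small empirical illustration: a network with one tanh hidden layer fitting sin(x). The width, learning rate, and epoch count here are arbitrary choices, not tuned values:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-3.14, 3.14, 200).unsqueeze(1)
y = torch.sin(x)

# Single hidden layer: enough to approximate sin on this interval
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

initial = loss_fn(net(x), y).item()
for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    opt.step()
print(initial, loss.item())  # loss drops by orders of magnitude
```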
Complete PyTorch Training Loop
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
# --- Data Preparation ---
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Convert to PyTorch tensors
X_train_t = torch.FloatTensor(X_train)
y_train_t = torch.FloatTensor(y_train).unsqueeze(1) # shape: (n, 1)
X_test_t = torch.FloatTensor(X_test)
y_test_t = torch.FloatTensor(y_test).unsqueeze(1)
# Create DataLoader for mini-batch training
train_dataset = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
# --- Define the Neural Network ---
class NeuralNet(nn.Module):
def __init__(self, input_size):
super(NeuralNet, self).__init__()
self.network = nn.Sequential(
nn.Linear(input_size, 64), # Hidden layer 1
nn.ReLU(),
nn.Dropout(0.3), # Regularization
nn.Linear(64, 32), # Hidden layer 2
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(32, 1), # Output layer
nn.Sigmoid() # Binary classification
)
def forward(self, x):
return self.network(x)
# --- Initialize ---
model = NeuralNet(input_size=X_train.shape[1])
criterion = nn.BCELoss() # Binary Cross-Entropy
optimizer = optim.Adam(model.parameters(), lr=0.001)
print(f"Model architecture:\n{model}")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
# --- Training Loop ---
num_epochs = 100
train_losses = []
test_losses = []
for epoch in range(num_epochs):
model.train()
epoch_loss = 0
for batch_X, batch_y in train_loader:
# Forward pass
outputs = model(batch_X)
loss = criterion(outputs, batch_y)
# Backward pass
optimizer.zero_grad() # clear previous gradients
loss.backward() # compute gradients (backpropagation)
optimizer.step() # update weights
epoch_loss += loss.item()
# Track losses
avg_train_loss = epoch_loss / len(train_loader)
train_losses.append(avg_train_loss)
# Evaluate on test set
model.eval()
with torch.no_grad():
test_outputs = model(X_test_t)
test_loss = criterion(test_outputs, y_test_t).item()
test_losses.append(test_loss)
if (epoch + 1) % 20 == 0:
print(f"Epoch [{epoch+1}/{num_epochs}] "
f"Train Loss: {avg_train_loss:.4f} "
f"Test Loss: {test_loss:.4f}")
# --- Evaluation ---
model.eval()
with torch.no_grad():
predictions = model(X_test_t)
predicted_classes = (predictions >= 0.5).float()
accuracy = (predicted_classes == y_test_t).float().mean()
print(f"\nTest Accuracy: {accuracy:.4f}")
# --- Plot Training Curves ---
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
plt.plot(train_losses, label='Train Loss')
plt.plot(test_losses, label='Test Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss (Binary Cross-Entropy)')
plt.title('Training and Test Loss Over Epochs')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()