Advanced

Neural Networks Deep Dive

The foundation of modern deep learning — from the simplest perceptron to multi-layer networks that power image recognition, language models, and beyond.

Biological Inspiration

Neural networks are loosely inspired by the human brain. A biological neuron receives signals from other neurons through dendrites, processes them in the cell body, and sends output through its axon. An artificial neuron does something similar:

Biological Neuron          →  Artificial Neuron
─────────────────────────────────────────────────
Dendrites (inputs)         →  Input features (x1, x2, ..., xn)
Synaptic weights           →  Learnable weights (w1, w2, ..., wn)
Cell body (processing)     →  Weighted sum + bias: z = SUM(wi*xi) + b
Firing threshold           →  Activation function: a = f(z)
Axon (output)              →  Output value sent to next layer

The Perceptron

The simplest neural network: a single neuron that makes binary decisions.

Perceptron:
  z = w1*x1 + w2*x2 + ... + wn*xn + b
  output = 1 if z >= 0, else 0

Limitations:
  - Can only learn LINEAR decision boundaries
  - Cannot solve XOR problem
  - This limitation motivated multi-layer networks
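The XOR limitation is easy to verify empirically. Below is a minimal sketch of the classic perceptron learning rule (`w += lr * (y - y_hat) * x`), trained on AND (linearly separable) and XOR (not separable); the helper name `train_perceptron` and the hyperparameters are illustrative choices, not from the text above:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=0.1):
    """Perceptron learning rule: nudge weights toward misclassified points."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b >= 0 else 0
            w += lr * (yi - y_hat) * xi
            b += lr * (yi - y_hat)
    return np.array([1 if xi @ w + b >= 0 else 0 for xi in X])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
and_y = np.array([0, 0, 0, 1])   # linearly separable -> learnable
xor_y = np.array([0, 1, 1, 0])   # not linearly separable -> not learnable

print("AND learned:", (train_perceptron(X, and_y) == and_y).all())   # True
print("XOR learned:", (train_perceptron(X, xor_y) == xor_y).all())   # False
```

No amount of training fixes the XOR case: a single neuron can only draw one line through the plane, and no line separates {(0,1), (1,0)} from {(0,0), (1,1)}.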

Multi-Layer Perceptron (MLP)

Stack multiple layers of neurons to learn complex, non-linear patterns:

Architecture:
  Input Layer  →  Hidden Layer(s)  →  Output Layer
  (features)      (learned repr.)     (predictions)

Example (for classifying 784-pixel, i.e. 28x28, images into 10 digits):
  Input:    784 neurons (one per pixel)
  Hidden 1: 256 neurons (ReLU activation)
  Hidden 2: 128 neurons (ReLU activation)
  Output:   10 neurons  (Softmax activation)

Each connection has a learnable weight.
Each neuron has a learnable bias.
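Since every connection carries a weight and every neuron a bias, the parameter count of the 784 → 256 → 128 → 10 example above is easy to compute: each layer contributes (inputs × outputs) weights plus one bias per output neuron.

```python
# Parameter count for the 784 -> 256 -> 128 -> 10 MLP described above.
layers = [784, 256, 128, 10]
params = sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))
print(params)  # 235146 learnable parameters
```

Over 235,000 parameters for a small MLP on tiny images -- this is why larger inputs quickly push practitioners toward architectures (like CNNs) that share weights.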

Activation Functions

Activation functions introduce non-linearity. Without them, stacking layers would still produce a linear model.

Function     Formula                          Range                Use Case                         Pros / Cons
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
ReLU         max(0, z)                        [0, inf)             Hidden layers (default)          Fast; no vanishing gradient. Dead neurons possible.
Sigmoid      1/(1+e^(-z))                     (0, 1)               Binary output layer              Outputs probabilities. Vanishing gradient for large |z|.
Tanh         (e^z - e^(-z))/(e^z + e^(-z))    (-1, 1)              Hidden layers (older networks)   Zero-centered. Still has vanishing gradient.
Softmax      e^(z_i) / SUM(e^(z_j))           (0, 1), sums to 1    Multi-class output layer         Outputs probability distribution over classes.
Leaky ReLU   max(0.01z, z)                    (-inf, inf)          Hidden layers                    Fixes dead neuron problem. Small negative slope.
GELU         z * Phi(z)                       approx (-0.17, inf)  Transformers (modern)            Smooth approximation of ReLU. Used in BERT, GPT.
💡 Rule of thumb: Use ReLU for hidden layers (or Leaky ReLU if you have dead neuron issues). Use Sigmoid for binary classification output. Use Softmax for multi-class output. Use no activation (linear) for regression output.
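The defining properties in the table can be checked numerically. A short sketch (the max-subtraction in `softmax` is a standard numerical-stability trick, not part of the formula itself):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))           # negatives clipped to 0: [0. 0. 3.]
print(sigmoid(z))        # every value squashed into (0, 1)
print(softmax(z).sum())  # probabilities sum to 1
```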

Forward Propagation

Data flows forward through the network, layer by layer, to produce a prediction:

# For a 2-hidden-layer network:

# Layer 1: Input → Hidden 1
z1 = W1 @ X + b1         # linear transformation
a1 = relu(z1)             # activation

# Layer 2: Hidden 1 → Hidden 2
z2 = W2 @ a1 + b2         # linear transformation
a2 = relu(z2)             # activation

# Layer 3: Hidden 2 → Output
z3 = W3 @ a2 + b3         # linear transformation
y_hat = softmax(z3)       # output probabilities

# Compute loss
loss = cross_entropy(y_true, y_hat)
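The pseudocode above can be made concrete with NumPy. The layer sizes below (4 → 5 → 3 → 2) are arbitrary toy choices just to show that the shapes line up and the output is a valid probability distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

# Toy shapes: 4 input features, hidden sizes 5 and 3, 2 output classes.
X = rng.normal(size=(4, 1))                        # one sample, column vector
W1, b1 = rng.normal(size=(5, 4)), np.zeros((5, 1))
W2, b2 = rng.normal(size=(3, 5)), np.zeros((3, 1))
W3, b3 = rng.normal(size=(2, 3)), np.zeros((2, 1))

a1 = relu(W1 @ X + b1)          # Layer 1: Input -> Hidden 1
a2 = relu(W2 @ a1 + b2)         # Layer 2: Hidden 1 -> Hidden 2
y_hat = softmax(W3 @ a2 + b3)   # Layer 3: Hidden 2 -> Output
print(y_hat.sum())              # 1.0 -- a valid probability distribution
```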

Backpropagation and the Chain Rule

Backpropagation computes how much each weight contributed to the error, using the chain rule of calculus to propagate gradients backward:

Chain Rule (simplified):
  dLoss/dW1 = dLoss/dy_hat * dy_hat/dz3 * dz3/da2 * da2/dz2 * dz2/da1 * da1/dz1 * dz1/dW1

This chains together:
  1. How loss changes with output  (dLoss/dy_hat)
  2. How output changes with z3    (dy_hat/dz3) - softmax derivative
  3. How z3 changes with a2        (dz3/da2) = W3
  4. How a2 changes with z2        (da2/dz2) - ReLU derivative (0 or 1)
  5. ... all the way back to W1

Steps:
  1. Forward pass: compute all z's and a's
  2. Compute loss
  3. Backward pass: compute all gradients using chain rule
  4. Update weights: W = W - learning_rate * gradient
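The four steps are easiest to see on the smallest possible network: one neuron, one sample, squared-error loss. The gradients below come straight from the chain rule applied to loss = (w*x + b - y)^2; the values are arbitrary toy numbers:

```python
# One neuron, one sample: loss = (w*x + b - y)^2
x, y = 2.0, 3.0
w, b, lr = 0.5, 0.0, 0.1

for step in range(50):
    y_hat = w * x + b                 # 1. forward pass
    # 3. backward pass via chain rule:
    #    dLoss/dw = 2*(y_hat - y) * x,  dLoss/db = 2*(y_hat - y)
    grad_w = 2 * (y_hat - y) * x
    grad_b = 2 * (y_hat - y)
    w -= lr * grad_w                  # 4. update weights
    b -= lr * grad_b

print(w * x + b)  # converges to y = 3.0: the loss is driven to zero
```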

Loss Functions

Task                         Loss Function                Formula
──────────────────────────────────────────────────────────────────────────────────
Regression                   MSE (Mean Squared Error)     (1/n) * SUM(y - y_hat)^2
Binary Classification        Binary Cross-Entropy         -[y*log(p) + (1-y)*log(1-p)]
Multi-class Classification   Categorical Cross-Entropy    -SUM(y_k * log(p_k))
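Both formulas are one-liners in NumPy. A quick sketch with made-up values:

```python
import numpy as np

# MSE: average squared difference between targets and predictions.
y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
mse = np.mean((y_true - y_pred) ** 2)
print(mse)   # (0.25 + 0.25 + 0) / 3 ~= 0.167

# Binary cross-entropy: true label 1, predicted probability 0.9.
y, p = 1, 0.9
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(bce)   # -log(0.9) ~= 0.105: low loss for a confident correct prediction
```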

Optimizers

Optimizer        How It Works                                        When to Use
────────────────────────────────────────────────────────────────────────────────────────────────────
SGD              Basic gradient descent with optional momentum       Simple problems; when you want full control
SGD + Momentum   Adds velocity to escape local minima                Better convergence than vanilla SGD
Adam             Adaptive learning rates per parameter + momentum    Default choice; works well for most problems
AdamW            Adam with decoupled weight decay                    Modern NLP/vision transformers
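The momentum idea is a few lines of code. Below is one common formulation (velocity as an exponentially decaying sum of past gradients; PyTorch's internal update differs slightly), minimizing the toy function f(w) = w^2:

```python
# Minimize f(w) = w^2 with plain SGD vs SGD + momentum.
def grad(w):
    return 2 * w   # derivative of w^2

w_sgd, w_mom, v = 5.0, 5.0, 0.0
lr, beta = 0.1, 0.9

for _ in range(100):
    w_sgd -= lr * grad(w_sgd)          # plain SGD step
    v = beta * v - lr * grad(w_mom)    # velocity accumulates past gradients
    w_mom += v                         # momentum step

print(w_sgd, w_mom)  # both approach the minimum at w = 0
```

On this convex bowl both converge; momentum pays off on ravines and plateaus, where accumulated velocity carries the update through regions where the raw gradient is tiny or oscillating.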

Key Concepts: Epochs, Batches, Learning Rate

Epoch:
  One complete pass through the entire training dataset.
  Typical: 10-100 epochs (with early stopping).

Batch Size:
  Number of samples processed before updating weights.
  - Batch GD: entire dataset (slow but stable)
  - Mini-batch GD: 32-256 samples (best tradeoff)
  - Stochastic GD: 1 sample (noisy but fast)
  Common choices: 32, 64, 128, 256

Learning Rate:
  How big each weight update step is.
  - Too high: loss diverges (overshooting)
  - Too low: training takes forever
  - Common: 1e-3 (0.001) with Adam
  - Use learning rate schedulers to decrease over time
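These three quantities combine into simple arithmetic: the number of weight updates per epoch is the dataset size divided by the batch size, rounded up. Using the sizes from the PyTorch example below (455 training samples, batch size 32):

```python
import math

n_samples, batch_size = 455, 32
updates_per_epoch = math.ceil(n_samples / batch_size)
print(updates_per_epoch)  # 15 weight updates per epoch (last batch is partial)
```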

Universal Approximation Theorem

Universal Approximation Theorem: A feed-forward network with a single hidden layer, a non-linear activation, and a sufficient number of neurons can approximate any continuous function on a compact domain to arbitrary precision. This is why neural networks are so powerful — they are universal function approximators. Note that the theorem guarantees such a network exists, not that gradient descent will find it. In practice, deeper networks (more layers, fewer neurons per layer) are more parameter-efficient than very wide single-layer networks.

Complete PyTorch Training Loop

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# --- Data Preparation ---
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train_t = torch.FloatTensor(X_train)
y_train_t = torch.FloatTensor(y_train).unsqueeze(1)  # shape: (n, 1)
X_test_t = torch.FloatTensor(X_test)
y_test_t = torch.FloatTensor(y_test).unsqueeze(1)

# Create DataLoader for mini-batch training
train_dataset = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# --- Define the Neural Network ---
class NeuralNet(nn.Module):
    def __init__(self, input_size):
        super(NeuralNet, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, 64),   # Hidden layer 1
            nn.ReLU(),
            nn.Dropout(0.3),             # Regularization
            nn.Linear(64, 32),           # Hidden layer 2
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(32, 1),            # Output layer
            nn.Sigmoid()                 # Binary classification
        )

    def forward(self, x):
        return self.network(x)

# --- Initialize ---
model = NeuralNet(input_size=X_train.shape[1])
criterion = nn.BCELoss()                  # Binary Cross-Entropy (pairs with the
                                          # Sigmoid output; nn.BCEWithLogitsLoss
                                          # on raw logits is more numerically stable)
optimizer = optim.Adam(model.parameters(), lr=0.001)

print(f"Model architecture:\n{model}")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")

# --- Training Loop ---
num_epochs = 100
train_losses = []
test_losses = []

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0

    for batch_X, batch_y in train_loader:
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass
        optimizer.zero_grad()   # clear previous gradients
        loss.backward()         # compute gradients (backpropagation)
        optimizer.step()        # update weights

        epoch_loss += loss.item()

    # Track losses
    avg_train_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_train_loss)

    # Evaluate on test set
    model.eval()
    with torch.no_grad():
        test_outputs = model(X_test_t)
        test_loss = criterion(test_outputs, y_test_t).item()
        test_losses.append(test_loss)

    if (epoch + 1) % 20 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}] "
              f"Train Loss: {avg_train_loss:.4f} "
              f"Test Loss: {test_loss:.4f}")

# --- Evaluation ---
model.eval()
with torch.no_grad():
    predictions = model(X_test_t)
    predicted_classes = (predictions >= 0.5).float()
    accuracy = (predicted_classes == y_test_t).float().mean()
    print(f"\nTest Accuracy: {accuracy:.4f}")

# --- Plot Training Curves ---
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
plt.plot(train_losses, label='Train Loss')
plt.plot(test_losses, label='Test Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss (Binary Cross-Entropy)')
plt.title('Training and Test Loss Over Epochs')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Key considerations: Neural networks need (1) scaled features (always standardize), (2) enough data (thousands of samples or more), (3) GPU acceleration for large models, and (4) regularization (dropout, weight decay) to prevent overfitting. For small tabular datasets, gradient boosting almost always outperforms neural networks.