Gradient Descent (Beginner)
Gradient descent is the workhorse optimization algorithm of machine learning: nearly every model, from logistic regression to GPT, learns its parameters with some variant of it. The idea is beautifully simple: compute the gradient of the loss with respect to the parameters, then take a small step in the opposite direction.
The Algorithm
The update rule: θ ← θ − α ∇L(θ), where α is the learning rate and ∇L(θ) is the gradient of the loss at the current parameters θ.
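As a minimal illustration of this rule (a toy sketch, not code from the course), consider the one-dimensional loss L(θ) = θ², whose gradient is 2θ. Repeatedly applying the update drives θ toward the minimum at 0:

```python
def gradient_descent(grad_fn, theta0, lr=0.1, n_steps=50):
    """Repeatedly apply the update rule θ ← θ − α ∇L(θ)."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# L(θ) = θ² has gradient 2θ; with lr = 0.1 each step multiplies θ by 0.8,
# so starting from θ = 5 the iterate shrinks geometrically toward 0.
theta = gradient_descent(lambda t: 2 * t, theta0=5.0, lr=0.1)
print(theta)  # a tiny value near the minimum at 0
```

The same loop, applied to high-dimensional parameters and a loss averaged over data, is all that the mini-batch version below adds.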
```python
import numpy as np

# Full implementation of mini-batch gradient descent
def mini_batch_gd(X, y, lr=0.01, batch_size=32, epochs=100):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    losses = []
    for epoch in range(epochs):
        # Shuffle data each epoch
        indices = np.random.permutation(n_samples)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for i in range(0, n_samples, batch_size):
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]
            # Forward pass
            predictions = X_batch @ w + b
            error = predictions - y_batch
            # Compute gradients of the MSE loss
            dw = (2 / len(X_batch)) * X_batch.T @ error
            db = (2 / len(X_batch)) * np.sum(error)
            # Update parameters
            w -= lr * dw
            b -= lr * db
        # Track the full-dataset loss once per epoch
        loss = np.mean((X @ w + b - y) ** 2)
        losses.append(loss)
    return w, b, losses
```
Momentum
Momentum accelerates gradient descent by accumulating a velocity vector that smooths out oscillations and speeds up movement along consistent gradient directions:
```python
def sgd_momentum(gradient_fn, x0, lr=0.01, momentum=0.9, n_steps=100):
    x = x0.copy()
    velocity = np.zeros_like(x)
    for _ in range(n_steps):
        grad = gradient_fn(x)
        velocity = momentum * velocity - lr * grad  # Accumulate
        x = x + velocity                            # Update
    return x
```
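A quick way to see the benefit is an elongated quadratic bowl, the classic setting where plain gradient descent zig-zags across the narrow axis. The sketch below is self-contained (it repeats the definition above so it runs on its own); the quadratic f(x, y) = 5x² + y² and the step settings are illustrative choices, not from the course:

```python
import numpy as np

def sgd_momentum(gradient_fn, x0, lr=0.01, momentum=0.9, n_steps=100):
    x = x0.copy()
    velocity = np.zeros_like(x)
    for _ in range(n_steps):
        grad = gradient_fn(x)
        velocity = momentum * velocity - lr * grad  # Accumulate
        x = x + velocity                            # Update
    return x

# f(x, y) = 5x² + y² has gradient (10x, 2y): steep in x, shallow in y.
grad = lambda p: np.array([10 * p[0], 2 * p[1]])
x_min = sgd_momentum(grad, np.array([1.0, 1.0]), lr=0.02, momentum=0.9, n_steps=300)
print(x_min)  # both coordinates end up very close to the minimum at the origin
```

The velocity averages out the sign-flipping gradient along the steep axis while accumulating speed along the shallow one.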
Learning Rate Effects
| Learning Rate | Behavior | Typical Range |
|---|---|---|
| Too large (>0.1) | Diverges, loss increases or oscillates wildly | Rarely used without warmup |
| Large (0.01-0.1) | Fast convergence but may overshoot | SGD with momentum |
| Medium (1e-3 to 1e-2) | Good balance of speed and stability | Adam default: 1e-3 |
| Small (<1e-4) | Very stable but slow convergence | Fine-tuning pretrained models |
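The thresholds in the table depend on the curvature of the loss, so the exact numbers vary by problem, but the qualitative divergence-vs-convergence behavior is easy to reproduce on a toy quadratic (an illustrative sketch, not from the original):

```python
def final_theta(lr, theta0=1.0, n_steps=100):
    """Run gradient descent on L(θ) = θ² (gradient 2θ) and return the end point."""
    theta = theta0
    for _ in range(n_steps):
        theta -= lr * 2 * theta  # each step multiplies θ by (1 − 2·lr)
    return theta

print(final_theta(0.1))  # |1 − 0.2| < 1: shrinks geometrically toward 0
print(final_theta(1.5))  # |1 − 3.0| > 1: every step doubles |θ|, so it diverges
```

On this loss the stability boundary sits at lr = 1; on a real model it sits wherever the largest curvature of the loss puts it, which is why the table's ranges are rules of thumb rather than hard limits.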
Learning Rate Finder: Start with a very small learning rate and gradually increase it while tracking the loss. The best learning rate is typically just before the loss starts increasing rapidly. This technique was popularized by Leslie Smith and is built into libraries like fastai.
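A bare-bones version of this range test might look like the following (a hypothetical sketch on synthetic 1-D regression; the exponential schedule, the data, and the selection heuristic are invented for illustration, not fastai's implementation):

```python
import numpy as np

# Synthetic 1-D regression data: y ≈ 3x plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=256)
y = 3.0 * X + rng.normal(scale=0.1, size=256)

w, b = 0.0, 0.0
lrs, losses = [], []
lr = 1e-5
while lr < 10:
    # Record the loss at the current rate, then take one step at that rate
    err = w * X + b - y
    lrs.append(lr)
    losses.append(np.mean(err ** 2))
    w -= lr * 2 * np.mean(err * X)
    b -= lr * 2 * np.mean(err)
    lr *= 1.2  # exponentially increase the learning rate

# Heuristic: pick a rate somewhat below the one with the lowest recorded loss,
# i.e. just before the loss curve starts to blow up
best_lr = lrs[int(np.argmin(losses))] / 10
```

Plotting `losses` against `lrs` on a log axis gives the familiar range-test curve: flat at tiny rates, falling through the useful range, then exploding once the rate crosses the stability boundary.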
Next Up: Adam & Optimizers
Learn about adaptive optimizers that automatically adjust learning rates for each parameter.