Gradient Descent: A Beginner's Guide

Gradient descent is the foundational optimization algorithm of machine learning: nearly every model, from logistic regression to GPT, learns its parameters with some variant of it. The idea is simple: compute the gradient of the loss, then take a step in the opposite direction.

The Algorithm

The update rule is θ ← θ - α∇L(θ), where α is the learning rate (step size) and ∇L(θ) is the gradient of the loss with respect to the parameters θ.
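As a minimal, self-contained illustration of the update rule (the quadratic example and function name here are illustrative, not from the text):

```python
def gradient_descent(grad_fn, x0, lr=0.1, n_steps=100):
    """Repeatedly step against the gradient: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(n_steps):
        x = x - lr * grad_fn(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3); the minimum is at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```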

```python
import numpy as np

# Full implementation of mini-batch gradient descent
def mini_batch_gd(X, y, lr=0.01, batch_size=32, epochs=100):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    losses = []

    for epoch in range(epochs):
        # Shuffle data each epoch
        indices = np.random.permutation(n_samples)
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        for i in range(0, n_samples, batch_size):
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]

            # Forward pass
            predictions = X_batch @ w + b
            error = predictions - y_batch

            # Compute gradients (factor of 2 comes from the MSE derivative)
            dw = (2 / len(X_batch)) * X_batch.T @ error
            db = (2 / len(X_batch)) * np.sum(error)

            # Update parameters
            w -= lr * dw
            b -= lr * db

        # Track the full-dataset MSE once per epoch
        loss = np.mean((X @ w + b - y) ** 2)
        losses.append(loss)

    return w, b, losses
```
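A quick smoke test on synthetic noiseless linear data (the data, seed, and hyperparameters below are illustrative; the function is repeated in condensed form so the snippet runs on its own):

```python
import numpy as np

# Condensed copy of mini_batch_gd above, so this snippet is self-contained
def mini_batch_gd(X, y, lr=0.01, batch_size=32, epochs=100):
    n_samples, n_features = X.shape
    w, b, losses = np.zeros(n_features), 0.0, []
    for _ in range(epochs):
        idx = np.random.permutation(n_samples)
        Xs, ys = X[idx], y[idx]
        for i in range(0, n_samples, batch_size):
            Xb, yb = Xs[i:i + batch_size], ys[i:i + batch_size]
            error = Xb @ w + b - yb
            w -= lr * (2 / len(Xb)) * Xb.T @ error
            b -= lr * (2 / len(Xb)) * np.sum(error)
        losses.append(np.mean((X @ w + b - y) ** 2))
    return w, b, losses

# Illustrative synthetic data: a noiseless linear target (not from the text)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 3.0  # bias term of 3.0

w, b, losses = mini_batch_gd(X, y, lr=0.05, epochs=200)
# The loss should fall toward zero and (w, b) should approach (true_w, 3.0)
```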

Momentum

Momentum accelerates gradient descent by accumulating a velocity vector that smooths out oscillations and speeds up movement along consistent gradient directions:

```python
def sgd_momentum(gradient_fn, x0, lr=0.01, momentum=0.9, n_steps=100):
    x = x0.copy()
    velocity = np.zeros_like(x)

    for _ in range(n_steps):
        grad = gradient_fn(x)
        velocity = momentum * velocity - lr * grad  # Accumulate
        x = x + velocity                            # Update

    return x
```
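To see the effect, here is a sketch applying the routine on an ill-conditioned quadratic, where plain gradient descent with the same step size zig-zags along the steep axis (the test problem and numbers are illustrative; the function is repeated so the snippet runs on its own):

```python
import numpy as np

# Condensed copy of sgd_momentum above, so this snippet is self-contained
def sgd_momentum(gradient_fn, x0, lr=0.01, momentum=0.9, n_steps=100):
    x = x0.copy()
    velocity = np.zeros_like(x)
    for _ in range(n_steps):
        velocity = momentum * velocity - lr * gradient_fn(x)
        x = x + velocity
    return x

# f(x) = 0.5 * (100 * x0^2 + x1^2): steep in x0, shallow in x1
grad = lambda x: np.array([100.0, 1.0]) * x

# Momentum damps the oscillation on the steep axis while still
# making progress on the shallow one
x_final = sgd_momentum(grad, np.array([1.0, 1.0]), lr=0.01, n_steps=300)
```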

Learning Rate Effects

| Learning Rate | Behavior | Typical Use |
|---|---|---|
| Too large (>0.1) | Diverges; loss increases or oscillates wildly | Rarely used without warmup |
| Large (0.01-0.1) | Fast convergence but may overshoot | SGD with momentum |
| Medium (1e-3 to 1e-2) | Good balance of speed and stability | Adam default: 1e-3 |
| Small (<1e-4) | Very stable but slow convergence | Fine-tuning pretrained models |
Learning Rate Finder: Start with a very small learning rate and gradually increase it while tracking the loss. The best learning rate is typically just before the loss starts increasing rapidly. This technique was popularized by Leslie Smith and is built into libraries like fastai.
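The idea can be sketched in a few lines; the geometric sweep schedule, the quadratic test objective, and the function name below are illustrative assumptions, not Smith's exact procedure or the fastai API:

```python
import numpy as np

def lr_range_test(loss_and_grad, w0, lr_min=1e-6, lr_max=10.0, n_steps=100):
    """Sweep the learning rate geometrically from lr_min to lr_max,
    taking one gradient step per setting and recording (lr, loss)."""
    w = w0.copy()
    history = []
    for lr in np.geomspace(lr_min, lr_max, n_steps):
        loss, grad = loss_and_grad(w)
        history.append((lr, loss))
        w -= lr * grad
    return history

# Illustrative objective: loss = ||w||^2 with gradient 2w
history = lr_range_test(lambda w: (float(w @ w), 2 * w), np.ones(5))
# The loss falls through the useful lr range, then blows up once lr is too large;
# pick the lr just before the blow-up
```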

Next Up: Adam & Optimizers

Learn about adaptive optimizers that automatically adjust learning rates for each parameter.
