Advanced Optimization

Optimization is where calculus meets practice. Every trained ML model is the result of an optimization algorithm minimizing a loss function. Understanding optimization helps you choose the right algorithm, set learning rates, and diagnose training issues.

Gradient Descent

The simplest optimization: repeatedly move in the negative gradient direction.

Python
import numpy as np

def gradient_descent(gradient_fn, x0, lr=0.01, n_steps=100):
    x = x0.copy()
    history = [x.copy()]
    for _ in range(n_steps):
        grad = gradient_fn(x)
        x = x - lr * grad
        history.append(x.copy())
    return x, history

# Minimize f(x,y) = x^2 + 2y^2
grad_fn = lambda w: np.array([2*w[0], 4*w[1]])
result, _ = gradient_descent(grad_fn, np.array([5.0, 3.0]), lr=0.1)
print("Minimum at:", result)  # Close to [0, 0]

Gradient Descent Variants

Variant       | Batch Size     | Pros                               | Cons
--------------|----------------|------------------------------------|-------------------------------
Batch GD      | Full dataset   | Stable convergence                 | Slow for large datasets
Stochastic GD | 1 sample       | Fast updates, escapes local minima | Noisy, unstable
Mini-batch GD | 32-512 samples | Best of both worlds                | Batch size is a hyperparameter
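
As a sketch of the mini-batch variant, the loop below shuffles the data each epoch and updates on small slices. The quadratic least-squares setup and the helper name `minibatch_gd` are illustrative choices, not a fixed API:

```python
import numpy as np

def minibatch_gd(grad_fn, X, y, w0, lr=0.1, batch_size=32, n_epochs=50, seed=0):
    # Mini-batch gradient descent: shuffle once per epoch,
    # then step on each small batch in turn.
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(X)
    for _ in range(n_epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - lr * grad_fn(w, X[batch], y[batch])
    return w

# Least-squares gradient for a linear model y ≈ X @ w
grad_fn = lambda w, Xb, yb: 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([3.0, -1.5])
y = X @ true_w
w = minibatch_gd(grad_fn, X, y, w0=np.zeros(2))
print("Recovered weights:", w)  # close to [3.0, -1.5]
```

Because the labels here are noiseless, every batch gradient vanishes at the true weights, so the iterates settle on them; with real data, the batch noise keeps the updates jittering around the minimum instead.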

Learning Rate

The learning rate is the most important hyperparameter in optimization:

Learning Rate Effects:
  • Too large: Overshoots the minimum, loss diverges
  • Too small: Converges too slowly, may get stuck
  • Just right: Steady convergence to a good minimum
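
The three regimes are easy to see on the one-dimensional bowl f(x) = x², where each step multiplies x by (1 - 2·lr). This toy experiment (the function and step counts are illustrative) shows all three behaviors:

```python
def run_gd(lr, x0=5.0, n_steps=50):
    # Gradient descent on f(x) = x^2, whose gradient is 2x.
    # Each update is x <- x * (1 - 2*lr), so the step shrinks x
    # only when |1 - 2*lr| < 1, i.e. 0 < lr < 1.
    x = x0
    for _ in range(n_steps):
        x = x - lr * 2 * x
    return x

print(run_gd(1.1))    # too large: |1 - 2*lr| > 1, iterates blow up
print(run_gd(0.001))  # too small: barely moved toward 0 after 50 steps
print(run_gd(0.4))    # just right: essentially at the minimum x = 0
```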

Challenges in Optimization

  • Local minima: Non-convex loss surfaces have many local minima. SGD noise helps escape shallow ones.
  • Saddle points: In high dimensions, saddle points are more common than local minima. Momentum helps pass through them.
  • Plateaus: Flat regions where gradients are near zero. Adaptive methods like Adam handle these well.
  • Ill-conditioning: When the loss surface is much steeper in some directions than others. Preconditioning or adaptive rates help.
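
Momentum addresses several of these challenges at once: the velocity term averages out oscillations along steep directions and keeps the iterate moving across plateaus. Below is a minimal classical-momentum sketch on an ill-conditioned quadratic (the function, learning rate, and β are illustrative):

```python
import numpy as np

def momentum_gd(grad_fn, x0, lr=0.005, beta=0.9, n_steps=200):
    # Classical (heavy-ball) momentum: the velocity v accumulates
    # an exponentially decaying sum of past gradients, damping
    # zig-zags in steep directions while building speed in flat ones.
    x = x0.copy()
    v = np.zeros_like(x)
    for _ in range(n_steps):
        v = beta * v - lr * grad_fn(x)
        x = x + v
    return x

# Ill-conditioned quadratic: f(x, y) = 100*x^2 + y^2
# (the x-direction is 100x steeper than the y-direction)
grad_fn = lambda w: np.array([200 * w[0], 2 * w[1]])
result = momentum_gd(grad_fn, np.array([1.0, 1.0]))
print("Minimum at:", result)  # close to [0, 0]
```

Plain gradient descent on this surface must use a learning rate small enough for the steep direction and then crawls along the shallow one; momentum converges at a similar rate in both.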

Next Up: Best Practices

Learn practical tips for gradient checking, choosing learning rates, and avoiding common calculus pitfalls in ML.
