Advanced Optimization

Optimization is where calculus meets practice. Every trained ML model is the result of an optimization algorithm minimizing a loss function. Understanding optimization helps you choose the right algorithm, set learning rates, and diagnose training issues.

Gradient Descent

The simplest optimization: repeatedly move in the negative gradient direction.

Python
import numpy as np

def gradient_descent(gradient_fn, x0, lr=0.01, n_steps=100):
    x = x0.copy()
    history = [x.copy()]
    for _ in range(n_steps):
        grad = gradient_fn(x)
        x = x - lr * grad
        history.append(x.copy())
    return x, history

# Minimize f(x,y) = x^2 + 2y^2
grad_fn = lambda w: np.array([2*w[0], 4*w[1]])
result, _ = gradient_descent(grad_fn, np.array([5.0, 3.0]), lr=0.1)
print("Minimum at:", result)  # Close to [0, 0]

Gradient Descent Variants

Variant       | Batch Size     | Pros                               | Cons
--------------|----------------|------------------------------------|-------------------------------
Batch GD      | Full dataset   | Stable convergence                 | Slow for large datasets
Stochastic GD | 1 sample       | Fast updates, escapes local minima | Noisy, unstable
Mini-batch GD | 32-512 samples | Best of both worlds                | Batch size is a hyperparameter
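
As a sketch of the mini-batch variant, the loop below shuffles the data each epoch and updates on small slices. The quadratic least-squares setup and the helper name `minibatch_gd` are illustrative choices, not a fixed API:

```python
import numpy as np

def minibatch_gd(grad_fn, X, y, w0, lr=0.1, batch_size=32, n_epochs=50, seed=0):
    # Mini-batch gradient descent: shuffle once per epoch,
    # then step on each small batch in turn.
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(X)
    for _ in range(n_epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w = w - lr * grad_fn(w, X[batch], y[batch])
    return w

# Least-squares gradient for a linear model y ≈ X @ w
grad_fn = lambda w, Xb, yb: 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([3.0, -1.5])
y = X @ true_w
w = minibatch_gd(grad_fn, X, y, w0=np.zeros(2))
print("Recovered weights:", w)  # close to [3.0, -1.5]
```

Because the labels here are noiseless, every batch gradient vanishes at the true weights, so the iterates settle on them; with real data, the batch noise keeps the updates jittering around the minimum instead.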

Learning Rate

The learning rate is the most important hyperparameter in optimization:

Learning Rate Effects:
  • Too large: Overshoots the minimum, loss diverges
  • Too small: Converges too slowly, may get stuck
  • Just right: Steady convergence to a good minimum
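
The three regimes are easy to see on the one-dimensional bowl f(x) = x², where each step multiplies x by (1 - 2·lr). This toy experiment (the function and step counts are illustrative) shows all three behaviors:

```python
def run_gd(lr, x0=5.0, n_steps=50):
    # Gradient descent on f(x) = x^2, whose gradient is 2x.
    # Each update is x <- x * (1 - 2*lr), so the step shrinks x
    # only when |1 - 2*lr| < 1, i.e. 0 < lr < 1.
    x = x0
    for _ in range(n_steps):
        x = x - lr * 2 * x
    return x

print(run_gd(1.1))    # too large: |1 - 2*lr| > 1, iterates blow up
print(run_gd(0.001))  # too small: barely moved toward 0 after 50 steps
print(run_gd(0.4))    # just right: essentially at the minimum x = 0
```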

Challenges in Optimization

  • Local minima: Non-convex loss surfaces have many local minima. SGD noise helps escape shallow ones.
  • Saddle points: In high dimensions, saddle points are more common than local minima. Momentum helps pass through them.
  • Plateaus: Flat regions where gradients are near zero. Adaptive methods like Adam handle these well.
  • Ill-conditioning: When the loss surface is much steeper in some directions than others. Preconditioning or adaptive rates help.
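
Momentum addresses several of these challenges at once: the velocity term averages out oscillations along steep directions and keeps the iterate moving across plateaus. Below is a minimal classical-momentum sketch on an ill-conditioned quadratic (the function, learning rate, and β are illustrative):

```python
import numpy as np

def momentum_gd(grad_fn, x0, lr=0.005, beta=0.9, n_steps=200):
    # Classical (heavy-ball) momentum: the velocity v accumulates
    # an exponentially decaying sum of past gradients, damping
    # zig-zags in steep directions while building speed in flat ones.
    x = x0.copy()
    v = np.zeros_like(x)
    for _ in range(n_steps):
        v = beta * v - lr * grad_fn(x)
        x = x + v
    return x

# Ill-conditioned quadratic: f(x, y) = 100*x^2 + y^2
# (the x-direction is 100x steeper than the y-direction)
grad_fn = lambda w: np.array([200 * w[0], 2 * w[1]])
result = momentum_gd(grad_fn, np.array([1.0, 1.0]))
print("Minimum at:", result)  # close to [0, 0]
```

Plain gradient descent on this surface must use a learning rate small enough for the steep direction and then crawls along the shallow one; momentum converges at a similar rate in both.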

Next Up: Best Practices

Learn practical tips for gradient checking, choosing learning rates, and avoiding common calculus pitfalls in ML.
