Introduction to Optimization for ML Beginners
Optimization is the engine that powers all machine learning. When we say a model "learns," what we really mean is that an optimization algorithm adjusts the model's parameters to minimize a loss function. Understanding optimization is understanding how ML actually works.
The Optimization Problem in ML
Every ML training process solves the same fundamental problem: find parameters θ that minimize a loss function L(θ):
```python
import numpy as np

# The ML optimization problem in pseudocode:
#   theta* = argmin_theta L(theta)
#          = argmin_theta (1/N) * sum(loss(model(x_i, theta), y_i))

# In practice with PyTorch:
# for epoch in range(num_epochs):
#     for batch in dataloader:
#         predictions = model(batch.x)            # Forward pass
#         loss = criterion(predictions, batch.y)  # Compute loss
#         loss.backward()                         # Compute gradients
#         optimizer.step()                        # Update parameters
#         optimizer.zero_grad()                   # Reset gradients
```
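To make the abstract problem concrete, here is a minimal runnable sketch (my own illustration, not the course's code) that solves argmin_theta L(theta) with plain gradient descent on a least-squares loss. The synthetic data, learning rate, and iteration count are all assumed values chosen for the example.

```python
import numpy as np

# Synthetic regression data: y = X @ true_theta + small noise (assumed setup)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=100)

# Loss: L(theta) = (1/N) * ||X @ theta - y||^2
theta = np.zeros(3)
lr = 0.1  # learning rate (assumed value)
for _ in range(500):
    grad = 2 / len(y) * X.T @ (X @ theta - y)  # gradient of the mean squared error
    theta -= lr * grad                         # update: theta <- theta - lr * grad

print(theta)  # converges near true_theta
```

This is the same loop structure as the PyTorch pseudocode above, just with the forward pass, loss, gradient, and update written out by hand for a model we can solve exactly.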
Key Challenges
| Challenge | Description | Solutions |
|---|---|---|
| Non-convexity | Loss surfaces of neural networks have many local minima and saddle points | SGD noise, momentum, large batch training |
| High dimensionality | Modern models have billions of parameters | First-order methods (no Hessian needed) |
| Noisy gradients | Mini-batch gradients are noisy estimates of the true gradient | Momentum, adaptive learning rates |
| Ill-conditioning | Loss surface curvature varies dramatically across dimensions | Adam, preconditioning, normalization |
| Generalization | Optimizing training loss does not guarantee good test performance | Early stopping, regularization, dropout |
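The "noisy gradients" row above can be demonstrated in a few lines. The sketch below (my own illustration, with an assumed noise model) treats each mini-batch gradient as the true gradient plus Gaussian noise, and shows how a momentum-style exponential moving average damps that noise while preserving the true direction.

```python
import numpy as np

rng = np.random.default_rng(1)
true_grad = 1.0  # pretend the full-batch gradient is 1.0 (assumed value)
noisy = true_grad + rng.normal(scale=1.0, size=10_000)  # mini-batch estimates

beta, v = 0.9, 0.0
smoothed = []
for g in noisy:
    v = beta * v + (1 - beta) * g  # momentum-style EMA accumulator
    smoothed.append(v)
smoothed = np.array(smoothed)

# The averaged direction keeps the same mean but far lower variance
# than the raw mini-batch gradients.
print(noisy.std(), smoothed[100:].std())
```

This is why momentum appears twice in the table: the same averaging that accelerates progress along consistent directions also cancels much of the mini-batch noise.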
Course Roadmap
- **Gradient Descent**: the foundational algorithm and its batch, stochastic, mini-batch, and momentum variants.
- **Modern Optimizers**: adaptive methods (Adam, AdaGrad, RMSProp) that automatically tune learning rates per parameter.
- **Convex Optimization**: the theoretical foundation, and when we can guarantee finding the global minimum.
- **Hyperparameter Tuning**: systematic methods for finding the best training configuration.
Ready to Begin?
Let's start with the algorithm that started it all: gradient descent.
Next: Gradient Descent →
Lilly Tech Systems