Gradients (Intermediate)
The gradient is the vector of all partial derivatives of a function. It points in the direction of steepest increase. In ML, we move in the opposite direction of the gradient to minimize the loss — this is gradient descent, the engine of neural network training.
The Gradient Vector
For a function f(w1, w2, ..., wn), the gradient is the vector of all partial derivatives:
```python
import numpy as np

# Loss function: L(w1, w2) = w1^2 + w2^2
# Gradient: [dL/dw1, dL/dw2] = [2*w1, 2*w2]
def loss(w):
    return w[0]**2 + w[1]**2

def gradient(w):
    return np.array([2*w[0], 2*w[1]])

# Gradient descent
w = np.array([5.0, 3.0])
lr = 0.1
for i in range(50):
    grad = gradient(w)
    w = w - lr * grad  # Move opposite to gradient

print("Final w:", w)  # Close to [0, 0] (the minimum)
```
Key Property: The gradient always points in the direction of steepest ascent. To minimize a function, move in the negative gradient direction. The magnitude of the gradient tells you how steep the slope is.
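This steepest-descent property can be checked numerically: for the quadratic loss above, no direction decreases the loss more than the negative gradient direction. A minimal sketch (the helper names here are illustrative, not from any library):

```python
import numpy as np

# L(w) = w1^2 + w2^2 and its gradient
def loss(w):
    return w[0]**2 + w[1]**2

def gradient(w):
    return np.array([2*w[0], 2*w[1]])

w = np.array([5.0, 3.0])
step = 0.01
g = gradient(w)
descent_dir = -g / np.linalg.norm(g)  # unit vector opposite the gradient

# Compare the loss drop along -gradient to 100 random unit directions
rng = np.random.default_rng(0)
best_drop = loss(w) - loss(w + step * descent_dir)
for _ in range(100):
    d = rng.standard_normal(2)
    d /= np.linalg.norm(d)  # random unit direction
    drop = loss(w) - loss(w + step * d)
    assert drop <= best_drop  # no direction beats -gradient
```

For this convex quadratic the check holds exactly; for general functions it holds in the limit of small step sizes.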
The Jacobian Matrix
When a function maps vectors to vectors (like a neural network layer), its derivatives form the Jacobian matrix:
```python
import numpy as np

# f: R^2 -> R^2
# f(x1, x2) = [x1*x2, x1 + x2^2]
#
# Jacobian:
# | df1/dx1  df1/dx2 |   | x2   x1   |
# | df2/dx1  df2/dx2 | = | 1    2*x2 |
def jacobian(x1, x2):
    return np.array([[x2, x1],
                     [1.0, 2*x2]])
```
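A hand-derived Jacobian like this can be sanity-checked against central finite differences, where column j approximates df/dx_j. A small sketch (helper names are illustrative):

```python
import numpy as np

# The vector-valued function f: R^2 -> R^2 from above
def f(x):
    x1, x2 = x
    return np.array([x1 * x2, x1 + x2**2])

# Hand-derived Jacobian
def jacobian(x1, x2):
    return np.array([[x2, x1],
                     [1.0, 2*x2]])

x = np.array([1.5, -0.5])
eps = 1e-6
J_num = np.zeros((2, 2))
for j in range(2):
    e = np.zeros(2)
    e[j] = eps
    J_num[:, j] = (f(x + e) - f(x - e)) / (2 * eps)  # column j = df/dx_j

print(np.allclose(J_num, jacobian(*x), atol=1e-6))  # True
```

Autodiff frameworks build on this same object: each layer's Jacobian is what gradients are multiplied through during backpropagation.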
The Hessian Matrix
The Hessian is the matrix of second derivatives. It describes the curvature of the loss surface and is used in second-order optimization methods:
| Concept | Contains | ML Use |
|---|---|---|
| Gradient | First derivatives | Direction of steepest descent |
| Jacobian | First derivatives (vector functions) | Layer-wise gradient propagation |
| Hessian | Second derivatives | Curvature, Newton's method, learning rate selection |
Practical Note: Computing the full Hessian is O(n²) in memory and O(n³) to invert, making it impractical for large networks. Approximations like L-BFGS or diagonal Hessian estimates are used instead.
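For intuition, the Hessian of the quadratic loss L(w1, w2) = w1² + w2² is a constant matrix, and Newton's method — which steps by H⁻¹ times the gradient — reaches the minimum of a quadratic in a single step. A minimal sketch (function names are illustrative):

```python
import numpy as np

# Gradient of L(w) = w1^2 + w2^2
def gradient(w):
    return np.array([2*w[0], 2*w[1]])

# Hessian: second derivatives d^2L/dwi dwj — 2 on the diagonal, 0 elsewhere
def hessian(w):
    return np.array([[2.0, 0.0],
                     [0.0, 2.0]])

w = np.array([5.0, 3.0])
# Newton step: solve H step = grad instead of forming H^{-1} explicitly
step = np.linalg.solve(hessian(w), gradient(w))
w_new = w - step
print(w_new)  # lands exactly on the minimum [0, 0]
```

Using `np.linalg.solve` rather than inverting H is the standard numerical practice, though as noted above even this is too expensive for large networks, which is why quasi-Newton approximations exist.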
Next Up: Chain Rule
The chain rule is what makes backpropagation possible. Learn how derivatives flow through composed functions.
Lilly Tech Systems