Gradients (Intermediate)

The gradient is the vector of all partial derivatives of a function. It points in the direction of steepest increase. In ML, we move in the opposite direction of the gradient to minimize the loss — this is gradient descent, the engine behind virtually all neural network training.

The Gradient Vector

For a function f(w1, w2, ..., wn), the gradient is the vector of all partial derivatives: grad f = [df/dw1, df/dw2, ..., df/dwn].

Python
import numpy as np

# Loss function: L(w1, w2) = w1^2 + w2^2
# Gradient: [dL/dw1, dL/dw2] = [2*w1, 2*w2]

def loss(w):
    return w[0]**2 + w[1]**2

def gradient(w):
    return np.array([2*w[0], 2*w[1]])

# Gradient descent
w = np.array([5.0, 3.0])
lr = 0.1

for i in range(50):
    grad = gradient(w)
    w = w - lr * grad  # Move opposite to gradient

print("Final w:", w)  # Close to [0, 0] (the minimum)
Key Property: The gradient always points in the direction of steepest ascent. To minimize a function, move in the negative gradient direction. The magnitude of the gradient tells you how steep the slope is.
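Whenever you write a gradient by hand, it is worth checking it numerically. The sketch below compares the analytic gradient above against a central-difference approximation; the helper `numerical_gradient` is illustrative, not part of the original code.

```python
import numpy as np

def loss(w):
    return w[0]**2 + w[1]**2

def gradient(w):
    return np.array([2*w[0], 2*w[1]])

def numerical_gradient(f, w, eps=1e-6):
    """Central-difference approximation: (f(w+eps) - f(w-eps)) / (2*eps) per coordinate."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

w = np.array([5.0, 3.0])
print(gradient(w))                   # analytic: [10.  6.]
print(numerical_gradient(loss, w))   # numerical: close to [10.  6.]
```

This same finite-difference check ("gradcheck") is standard practice when implementing backpropagation by hand.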

The Jacobian Matrix

When a function maps vectors to vectors (like a neural network layer), its derivatives form the Jacobian matrix:

Python
# f: R^2 -> R^2
# f(x1, x2) = [x1*x2, x1+x2^2]
# Jacobian:
# | df1/dx1  df1/dx2 |   | x2    x1   |
# | df2/dx1  df2/dx2 | = | 1     2*x2 |

def jacobian(x1, x2):
    return np.array([[x2, x1],
                     [1, 2*x2]])

The Hessian Matrix

The Hessian is the matrix of all second partial derivatives. It describes the curvature of the loss surface and is used in second-order optimization methods:

| Concept  | Contains                               | ML Use                                              |
|----------|----------------------------------------|-----------------------------------------------------|
| Gradient | First derivatives                      | Direction of steepest descent                       |
| Jacobian | First derivatives (vector functions)   | Layer-wise gradient propagation                     |
| Hessian  | Second derivatives                     | Curvature, Newton's method, learning rate selection |
Practical Note: Computing the full Hessian is O(n²) in memory and O(n³) to invert, making it impractical for large networks. Approximations like L-BFGS or diagonal Hessian estimates are used instead.
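For small problems, though, Newton's method shows why curvature information is powerful. A sketch on the quadratic loss from earlier (whose Hessian is the constant matrix 2I): one Newton step, w <- w - H^{-1} grad, lands exactly on the minimum, versus the 50 gradient-descent iterations above.

```python
import numpy as np

def gradient(w):
    return np.array([2*w[0], 2*w[1]])

def hessian(w):
    # For L = w1^2 + w2^2 the Hessian is constant: 2 * identity
    return np.array([[2.0, 0.0],
                     [0.0, 2.0]])

w = np.array([5.0, 3.0])
# Newton step: solve H * step = grad instead of inverting H explicitly
w = w - np.linalg.solve(hessian(w), gradient(w))
print(w)  # [0. 0.] — one step reaches the minimum of a quadratic
```

Exact convergence in one step only happens because the loss is quadratic; for general losses Newton's method iterates, and for large networks the O(n²)/O(n³) costs noted above push practitioners toward approximations like L-BFGS.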

Next Up: Chain Rule

The chain rule is what makes backpropagation possible. Learn how derivatives flow through composed functions.

Next: Chain Rule →