Advanced Hyperparameter Tuning

Hyperparameters are the settings you choose before training begins: learning rate, batch size, network architecture, regularization strength. Unlike model parameters (learned by gradient descent), hyperparameters must be tuned through experimentation. The right hyperparameters can make the difference between a mediocre and a state-of-the-art model.

Key Hyperparameters

| Hyperparameter | Typical Range | Impact |
| --- | --- | --- |
| Learning rate | 1e-5 to 1e-1 | Most important; affects convergence speed and quality |
| Batch size | 16 to 512 | Affects generalization, training speed, memory |
| Weight decay | 1e-5 to 1e-1 | Regularization strength |
| Dropout rate | 0.1 to 0.5 | Prevents overfitting |
| Number of layers | Task-dependent | Model capacity |

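A common way to draw candidate values from ranges like these is to sample uniformly in log-space for scale-type hyperparameters (learning rate, weight decay) and uniformly for bounded ones (dropout). A minimal standard-library sketch, with the ranges taken from the table above; `log_uniform` is a helper defined here for illustration:

```python
import math
import random

random.seed(0)

def log_uniform(low, high):
    """Sample so that each order of magnitude is equally likely."""
    return 10 ** random.uniform(math.log10(low), math.log10(high))

config = {
    'learning_rate': log_uniform(1e-5, 1e-1),   # log-uniform over 1e-5 .. 1e-1
    'weight_decay':  log_uniform(1e-5, 1e-1),
    'batch_size':    random.choice([16, 32, 64, 128, 256, 512]),
    'dropout':       random.uniform(0.1, 0.5),  # plain uniform is fine here
}
```

Sampling the exponent rather than the value itself is what makes 1e-4 and 1e-2 equally likely draws; a plain uniform sample over 1e-5 to 1e-1 would almost never land below 1e-2.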
Search Methods

Python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Grid Search: exhaustive but expensive
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)

# Random Search: samples n_iter configurations; covers wide ranges more
# efficiently than an exhaustive grid. Note: learning_rate is a gradient
# boosting hyperparameter, so we switch estimators here (random forests
# have no learning rate).
from sklearn.ensemble import GradientBoostingClassifier

param_distributions = {
    'n_estimators': [100, 200, 500, 1000],
    'max_depth': [5, 10, 20, 50, None],
    'learning_rate': np.logspace(-4, -1, 20)  # Log-spaced candidates
}
random_search = RandomizedSearchCV(
    GradientBoostingClassifier(), param_distributions,
    n_iter=50, cv=5, random_state=42
)
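Both search objects are used the same way: call fit on training data, then read off best_params_ and best_score_. A quick end-to-end check on synthetic data (much smaller settings than above so it runs fast; make_classification is just a stand-in dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy binary classification problem
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [10, 50], 'max_depth': [3, None]},
    n_iter=4, cv=3, random_state=42,
)
search.fit(X, y)

print(search.best_params_)   # the winning configuration
print(search.best_score_)    # its mean cross-validated accuracy
```

The cv=3 argument means each candidate is scored by 3-fold cross-validation, so the reported score is an average over held-out folds rather than training accuracy.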

Learning Rate Schedules

Rather than using a fixed learning rate, schedules adjust the rate during training:

Python
import torch
import torch.optim as optim

model = torch.nn.Linear(10, 2)  # placeholder model for illustration
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Cosine Annealing (popular for vision models)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Step decay: multiply LR by gamma every step_size epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Warmup + Cosine (standard for transformers)
import numpy as np

def warmup_cosine(step, warmup_steps=1000, total_steps=10000):
    if step < warmup_steps:
        return step / warmup_steps  # Linear warmup from 0 to 1
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + np.cos(np.pi * progress))  # Cosine decay from 1 to 0

# Hook the function into PyTorch as a multiplicative LR factor
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
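To sanity-check the shape of a warmup-plus-cosine schedule, it helps to evaluate the multiplier at the boundary steps. This standalone sketch re-implements the same formula with the math module so it runs without PyTorch or NumPy:

```python
import math

def warmup_cosine(step, warmup_steps=1000, total_steps=10000):
    if step < warmup_steps:
        return step / warmup_steps                    # linear ramp 0 -> 1
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))   # cosine decay 1 -> 0

print(warmup_cosine(0))        # 0.0: start of warmup
print(warmup_cosine(500))      # 0.5: halfway through warmup
print(warmup_cosine(1000))     # 1.0: peak, warmup finished
print(warmup_cosine(10000))    # ~0.0: end of training
```

The multiplier ramps linearly to its peak, then decays smoothly to zero, which is exactly the LR trajectory the schedule produces when the base learning rate is multiplied by these values.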

Practical Advice: Start with random search over a log-uniform distribution for the learning rate. Sample 20-50 configurations, use the best result as a starting point, then do a finer search around it. For production, consider Bayesian optimization with Optuna or Weights & Biases Sweeps.
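The coarse-to-fine loop described above can be sketched with the standard library alone. Here score() is a hypothetical stand-in for training and validating a model at a given learning rate; in practice it would be your full training run:

```python
import math
import random

random.seed(42)

def score(lr):
    """Hypothetical stand-in for train-then-validate; peaks near lr = 3e-3."""
    return -abs(math.log10(lr) - math.log10(3e-3))

def log_uniform(low, high):
    return 10 ** random.uniform(math.log10(low), math.log10(high))

# Coarse pass: log-uniform samples over the full range
coarse = [log_uniform(1e-5, 1e-1) for _ in range(30)]
best = max(coarse, key=score)

# Fine pass: narrow to roughly one order of magnitude around the winner
fine = [log_uniform(best / 3, best * 3) for _ in range(20)]
best = max(fine + [best], key=score)

print(f"best lr ~ {best:.2e}")
```

The second pass reuses the same sampler over a tighter interval, which is usually cheaper and more reliable than trying to pick the exact value in a single wide search.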

Next Up: Best Practices

Learn the training recipes used by practitioners: warmup, weight decay, gradient clipping, and debugging tips.
