Advanced Hyperparameter Tuning
Hyperparameters are the settings you choose before training begins: learning rate, batch size, network architecture, regularization strength. Unlike model parameters (learned by gradient descent), hyperparameters must be tuned through experimentation. The right hyperparameters can make the difference between a mediocre and a state-of-the-art model.
Key Hyperparameters
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| Learning rate | 1e-5 to 1e-1 | Most important; affects convergence speed and quality |
| Batch size | 16 to 512 | Affects generalization, training speed, memory |
| Weight decay | 1e-5 to 1e-1 | Regularization strength |
| Dropout rate | 0.1 to 0.5 | Prevents overfitting |
| Number of layers | Task-dependent | Model capacity |
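Because the learning rate spans several orders of magnitude, it is usually sampled on a log scale rather than uniformly, so that values like 1e-4 and 1e-2 are equally likely. A minimal sketch using the table's range:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample 5 learning rates log-uniformly from 1e-5 to 1e-1:
# draw the exponent uniformly, then exponentiate
learning_rates = 10 ** rng.uniform(-5, -1, size=5)
print(learning_rates)
```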
Search Methods
```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import loguniform

# Grid search: exhaustive but expensive (3 * 4 * 3 = 36 fits per CV fold)
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)

# Random search: more efficient, better coverage of continuous ranges.
# Note: learning_rate is a gradient-boosting parameter, not a random-forest
# one, so the estimator is switched here.
param_distributions = {
    'n_estimators': [100, 200, 500, 1000],
    'max_depth': [5, 10, 20, 50, None],
    'learning_rate': loguniform(1e-4, 1e-1),  # log-uniform distribution
}
random_search = RandomizedSearchCV(
    GradientBoostingClassifier(), param_distributions,
    n_iter=50, cv=5, random_state=42,
)
```
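A quick usage sketch on synthetic data; the tiny grid and dataset here are chosen so the example runs fast, not as recommended settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy binary classification problem
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [10, 50], 'max_depth': [3, None]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_)       # best combination found
print(search.best_score_)        # mean cross-validated score
```

After fitting, `search.best_estimator_` is a model refit on the full data with the winning settings and can be used directly for prediction.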
Learning Rate Schedules
Rather than using a fixed learning rate, schedules adjust the rate during training:
```python
import numpy as np
import torch.optim as optim

# Cosine annealing (popular for vision models)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Step decay: multiply the LR by gamma every step_size epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Warmup + cosine (standard for transformers); returns an LR multiplier,
# e.g. for use with optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
def warmup_cosine(step, warmup_steps=1000, total_steps=10000):
    if step < warmup_steps:
        return step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + np.cos(np.pi * progress))  # cosine decay
```
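As a sanity check, a warmup+cosine multiplier should rise linearly to 1.0 over the warmup steps and then decay toward 0. A standalone restatement that can be checked without a model or optimizer:

```python
import numpy as np

def warmup_cosine(step, warmup_steps=1000, total_steps=10000):
    # Returns an LR multiplier in [0, 1]
    if step < warmup_steps:
        return step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + np.cos(np.pi * progress))  # cosine decay

print(warmup_cosine(500))    # halfway through warmup
print(warmup_cosine(1000))   # peak multiplier at end of warmup
print(warmup_cosine(10000))  # fully decayed at end of training
```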
Practical Advice: Start with random search over a log-uniform distribution for learning rate. Sample 20-50 configurations. Use the best result as a starting point, then do a finer search around it. For production, consider Bayesian optimization with Optuna or Weights & Biases Sweeps.
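The coarse-then-fine procedure above can be sketched in a few lines. Here `validation_score` is a hypothetical stand-in for "train a model and return its validation score" (its peak near lr = 3e-3 is invented for the demo); a real run would call your training loop instead:

```python
import numpy as np

rng = np.random.default_rng(42)

def validation_score(lr):
    # Hypothetical objective: peaks at log10(lr) = -2.5, i.e. lr ~ 3e-3
    return -(np.log10(lr) + 2.5) ** 2

# Coarse pass: log-uniform over the full range 1e-5 .. 1e-1
coarse = 10 ** rng.uniform(-5, -1, size=30)
best = max(coarse, key=validation_score)

# Fine pass: log-uniform within half an order of magnitude of the best
lo, hi = np.log10(best) - 0.5, np.log10(best) + 0.5
fine = 10 ** rng.uniform(lo, hi, size=30)
best = max([best, *fine], key=validation_score)
print(f"best lr ~ {best:.2e}")
```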
Next Up: Best Practices
Learn the training recipes used by practitioners: warmup, weight decay, gradient clipping, and debugging tips.