Intermediate
SageMaker Training
Master model training on SageMaker with built-in algorithms, custom training scripts, distributed training, and cost optimization with spot instances.
Built-in Algorithms
SageMaker provides 17+ optimized, built-in algorithms ready to use without writing training code:
| Algorithm | Type | Use Case |
|---|---|---|
| XGBoost | Classification/Regression | Tabular data, feature-rich datasets |
| Linear Learner | Classification/Regression | Linear relationships, high-dimensional data |
| K-Nearest Neighbors | Classification/Regression | Similarity-based prediction |
| Image Classification | Computer Vision | Image categorization with ResNet |
| Object Detection | Computer Vision | Locating objects in images |
| BlazingText | NLP | Text classification, word embeddings |
| DeepAR | Time Series | Forecasting with autoregressive models |
| K-Means | Clustering | Unsupervised grouping |
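As an illustrative sketch of launching a built-in algorithm through the SageMaker Python SDK (the role ARN, region, bucket paths, and version are placeholders — substitute your own):

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# Placeholder values -- replace with your own role ARN, region, and S3 paths.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
region = "us-east-1"

# Look up the managed container image for the built-in XGBoost algorithm.
image_uri = sagemaker.image_uris.retrieve("xgboost", region=region, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/xgboost/output",
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# Each named channel maps to /opt/ml/input/data/<channel> on the instance.
estimator.fit({
    "train": TrainingInput("s3://my-bucket/xgboost/train", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/xgboost/validation", content_type="text/csv"),
})
```

No training code is written here — the managed XGBoost container handles the training loop; you only supply data locations and hyperparameters.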
Custom Training Jobs
For custom models, SageMaker supports bringing your own training scripts with popular frameworks:
- Script mode: Provide a Python training script, and SageMaker handles the infrastructure
- Framework containers: Pre-built Docker containers for TensorFlow, PyTorch, Scikit-learn, Hugging Face, and XGBoost
- Custom containers: Build your own Docker container with any framework or dependencies
- Input channels: Data is automatically downloaded from S3 to the training instance at /opt/ml/input/data/
- Model output: Save model artifacts to /opt/ml/model/ and SageMaker uploads them to S3
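A minimal script-mode training script might look like the following. The "training" logic here is a toy stand-in (averaging a CSV column) to show the path conventions; the environment-variable defaults let the script also run outside a container:

```python
import json
import os

# SageMaker sets these env vars inside the training container; the defaults
# below match the standard paths and allow local testing.
TRAIN_DIR = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
MODEL_DIR = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")


def train(train_dir: str, model_dir: str) -> dict:
    """Toy 'training': average the first column of every CSV in the channel."""
    values = []
    for name in os.listdir(train_dir):
        with open(os.path.join(train_dir, name)) as f:
            for line in f:
                values.append(float(line.split(",")[0]))
    model = {"mean": sum(values) / len(values)}

    # Anything written under model_dir is packaged and uploaded to S3
    # by SageMaker when the job finishes.
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump(model, f)
    return model


if __name__ == "__main__":
    train(TRAIN_DIR, MODEL_DIR)
```

SageMaker runs this script inside a framework container; your code only reads from the input channel directories and writes to the model directory.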
Training workflow: SageMaker provisions instances, downloads your data from S3, runs your training script, saves the model to S3, and then terminates the instances. You only pay for the time the training job runs — billed per second.
Distributed Training
SageMaker simplifies distributed training across multiple instances and GPUs:
- Data parallelism: Split data across multiple GPUs/instances — each processes a subset and gradients are synchronized
- Model parallelism: Split large models across multiple GPUs when a model doesn't fit in a single GPU's memory
- SageMaker Distributed: Optimized libraries for both data and model parallelism with near-linear scaling
- Horovod support: Use Horovod for distributed TensorFlow and PyTorch training
- Multi-GPU instances: Use instances like ml.p3.16xlarge (8 V100 GPUs) or ml.p4d.24xlarge (8 A100 GPUs)
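As a configuration sketch (the entry-point script, role ARN, and bucket are placeholders), the SageMaker distributed data parallel library is enabled through the distribution argument of a framework estimator:

```python
from sagemaker.pytorch import PyTorch

# Placeholders: supply your own training script, role ARN, and S3 paths.
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    framework_version="2.0",
    py_version="py310",
    instance_count=2,                 # data parallelism across 2 nodes
    instance_type="ml.p4d.24xlarge",  # 8 A100 GPUs per node
    # Enable the SageMaker distributed data parallel library.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit({"train": "s3://my-bucket/train"})
```

With this setting, SageMaker launches the script once per GPU across all nodes and handles the inter-node gradient communication.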
Hyperparameter Tuning
SageMaker Automatic Model Tuning (AMT) finds optimal hyperparameters:
- Bayesian optimization: Intelligently explores the hyperparameter space based on previous results
- Random search: Explore hyperparameters randomly for broad coverage
- Grid search: Exhaustively test all combinations of specified values
- Warm start: Continue tuning from previous tuning job results
- Early stopping: Automatically stop poorly-performing training jobs to save resources
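A tuning job can be sketched as follows (the base estimator, metric name, and ranges are illustrative placeholders — a built-in algorithm like XGBoost emits metrics such as validation:auc automatically):

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Placeholder base estimator -- configure image_uri and role for your job.
estimator = Estimator(
    image_uri="<xgboost-image-uri>",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    strategy="Bayesian",          # or "Random" / "Grid"
    max_jobs=20,                  # total training jobs to run
    max_parallel_jobs=4,          # jobs running at once
    early_stopping_type="Auto",   # stop poorly-performing jobs early
)
tuner.fit({
    "train": "s3://my-bucket/train",
    "validation": "s3://my-bucket/validation",
})
```

The tuner launches up to max_jobs training jobs, using Bayesian optimization to pick each new hyperparameter combination from the results so far.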
Spot Instances
Managed Spot Training can reduce training costs by up to 90%:
- Automatic checkpointing: SageMaker saves training progress so jobs can resume if interrupted
- Transparent management: SageMaker handles spot instance acquisition and interruption automatically
- Max wait time: Set a maximum waiting time for spot capacity to become available
- Fallback: Optionally fall back to on-demand instances if spot isn't available within your time limit
Pro tip: Always enable managed spot training for non-urgent training jobs. Set use_spot_instances=True and max_wait in your Estimator configuration. The savings are substantial, and SageMaker handles all the complexity of checkpointing and resumption.
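A spot-training configuration might look like this sketch (image URI, role, and bucket are placeholders):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",  # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,
    max_run=3600,   # cap on actual training time, in seconds
    max_wait=7200,  # total time including waiting for spot capacity; must be >= max_run
    # Checkpoints written here let an interrupted job resume where it left off.
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",
)
```

The gap between max_wait and max_run is how long SageMaker may wait for spot capacity; if the job cannot finish within max_wait, it is stopped.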
Lilly Tech Systems