Intermediate
SageMaker Training
Master model training on SageMaker with built-in algorithms, custom training scripts, distributed training, and cost optimization with spot instances.
Built-in Algorithms
SageMaker provides 17+ optimized, built-in algorithms ready to use without writing training code:
| Algorithm | Type | Use Case |
|---|---|---|
| XGBoost | Classification/Regression | Tabular data, feature-rich datasets |
| Linear Learner | Classification/Regression | Linear relationships, high-dimensional data |
| K-Nearest Neighbors | Classification/Regression | Similarity-based prediction |
| Image Classification | Computer Vision | Image categorization with ResNet |
| Object Detection | Computer Vision | Locating objects in images |
| BlazingText | NLP | Text classification, word embeddings |
| DeepAR | Time Series | Forecasting with autoregressive models |
| K-Means | Clustering | Unsupervised grouping |
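As an illustrative sketch of launching a built-in algorithm through the SageMaker Python SDK (the role ARN, region, bucket paths, and version are placeholders — substitute your own):

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# Placeholder values -- replace with your own role ARN, region, and S3 paths.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
region = "us-east-1"

# Look up the managed container image for the built-in XGBoost algorithm.
image_uri = sagemaker.image_uris.retrieve("xgboost", region=region, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/xgboost/output",
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# Each named channel maps to /opt/ml/input/data/<channel> on the instance.
estimator.fit({
    "train": TrainingInput("s3://my-bucket/xgboost/train", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/xgboost/validation", content_type="text/csv"),
})
```

No training code is written here — the managed XGBoost container handles the training loop; you only supply data locations and hyperparameters.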
Custom Training Jobs
For custom models, SageMaker supports bringing your own training scripts with popular frameworks:
- Script mode: Provide a Python training script, and SageMaker handles the infrastructure
- Framework containers: Pre-built Docker containers for TensorFlow, PyTorch, Scikit-learn, Hugging Face, and XGBoost
- Custom containers: Build your own Docker container with any framework or dependencies
- Input channels: Data is automatically downloaded from S3 to the training instance at /opt/ml/input/data/
- Model output: Save model artifacts to /opt/ml/model/ and SageMaker uploads them to S3
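A minimal script-mode training script might look like the following. The "training" logic here is a toy stand-in (averaging a CSV column) to show the path conventions; the environment-variable defaults let the script also run outside a container:

```python
import json
import os

# SageMaker sets these env vars inside the training container; the defaults
# below match the standard paths and allow local testing.
TRAIN_DIR = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
MODEL_DIR = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")


def train(train_dir: str, model_dir: str) -> dict:
    """Toy 'training': average the first column of every CSV in the channel."""
    values = []
    for name in os.listdir(train_dir):
        with open(os.path.join(train_dir, name)) as f:
            for line in f:
                values.append(float(line.split(",")[0]))
    model = {"mean": sum(values) / len(values)}

    # Anything written under model_dir is packaged and uploaded to S3
    # by SageMaker when the job finishes.
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump(model, f)
    return model


if __name__ == "__main__":
    train(TRAIN_DIR, MODEL_DIR)
```

SageMaker runs this script inside a framework container; your code only reads from the input channel directories and writes to the model directory.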
Training workflow: SageMaker provisions instances, downloads your data from S3, runs your training script, saves the model to S3, and then terminates the instances. You only pay for the time the training job runs — billed per second.
Distributed Training
SageMaker simplifies distributed training across multiple instances and GPUs:
- Data parallelism: Split data across multiple GPUs/instances — each processes a subset and gradients are synchronized
- Model parallelism: Split large models across multiple GPUs when a model doesn't fit in a single GPU's memory
- SageMaker Distributed: Optimized libraries for both data and model parallelism with near-linear scaling
- Horovod support: Use Horovod for distributed TensorFlow and PyTorch training
- Multi-GPU instances: Use instances like ml.p3.16xlarge (8 V100 GPUs) or ml.p4d.24xlarge (8 A100 GPUs)
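As a configuration sketch (the entry-point script, role ARN, and bucket are placeholders), the SageMaker distributed data parallel library is enabled through the distribution argument of a framework estimator:

```python
from sagemaker.pytorch import PyTorch

# Placeholders: supply your own training script, role ARN, and S3 paths.
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    framework_version="2.0",
    py_version="py310",
    instance_count=2,                 # data parallelism across 2 nodes
    instance_type="ml.p4d.24xlarge",  # 8 A100 GPUs per node
    # Enable the SageMaker distributed data parallel library.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit({"train": "s3://my-bucket/train"})
```

With this setting, SageMaker launches the script once per GPU across all nodes and handles the inter-node gradient communication.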
Hyperparameter Tuning
SageMaker Automatic Model Tuning (AMT) finds optimal hyperparameters:
- Bayesian optimization: Intelligently explores the hyperparameter space based on previous results
- Random search: Explore hyperparameters randomly for broad coverage
- Grid search: Exhaustively test all combinations of specified values
- Warm start: Continue tuning from previous tuning job results
- Early stopping: Automatically stop poorly-performing training jobs to save resources
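A tuning job can be sketched as follows (the base estimator, metric name, and ranges are illustrative placeholders — a built-in algorithm like XGBoost emits metrics such as validation:auc automatically):

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Placeholder base estimator -- configure image_uri and role for your job.
estimator = Estimator(
    image_uri="<xgboost-image-uri>",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    strategy="Bayesian",          # or "Random" / "Grid"
    max_jobs=20,                  # total training jobs to run
    max_parallel_jobs=4,          # jobs running at once
    early_stopping_type="Auto",   # stop poorly-performing jobs early
)
tuner.fit({
    "train": "s3://my-bucket/train",
    "validation": "s3://my-bucket/validation",
})
```

The tuner launches up to max_jobs training jobs, using Bayesian optimization to pick each new hyperparameter combination from the results so far.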
Spot Instances
Managed Spot Training can reduce training costs by up to 90%:
- Automatic checkpointing: SageMaker saves training progress so jobs can resume if interrupted
- Transparent management: SageMaker handles spot instance acquisition and interruption automatically
- Max wait time: Set a maximum waiting time for spot capacity to become available
- Fallback: Optionally fall back to on-demand instances if spot isn't available within your time limit
Pro tip: Always enable managed spot training for non-urgent training jobs. Set use_spot_instances=True and max_wait in your Estimator configuration. The savings are substantial, and SageMaker handles all the complexity of checkpointing and resumption.
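A spot-training configuration might look like this sketch (image URI, role, and bucket are placeholders):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",  # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,
    max_run=3600,   # cap on actual training time, in seconds
    max_wait=7200,  # total time including waiting for spot capacity; must be >= max_run
    # Checkpoints written here let an interrupted job resume where it left off.
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",
)
```

The gap between max_wait and max_run is how long SageMaker may wait for spot capacity; if the job cannot finish within max_wait, it is stopped.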
Lilly Tech Systems