# Model Development
This lesson covers building and training ML models on Google Cloud, including framework selection, Vertex AI Training configuration, hyperparameter tuning, and distributed training strategies — all heavily tested on the exam.
## Framework Selection on GCP
The exam tests whether you can choose the right framework for the task. Here is the decision guide:
| Framework | Best For | GCP Integration |
|---|---|---|
| TensorFlow / Keras | Production ML, deep learning, TPU training | Native GCP support, TFX pipelines, SavedModel format |
| PyTorch | Research, NLP, computer vision prototyping | Vertex AI custom containers, TorchServe |
| XGBoost / scikit-learn | Tabular data, classical ML, fast iteration | Pre-built Vertex AI containers, BQML integration |
| JAX | High-performance numerical computing, custom gradients | TPU-native, used internally at Google |
## Vertex AI Training
Vertex AI Training is the primary service for training custom models. Know these configuration options:
### Pre-built Containers
Google provides pre-built Docker containers for common frameworks. Use these when possible to minimize setup:
- TensorFlow (CPU/GPU), PyTorch (CPU/GPU), XGBoost, scikit-learn
- Containers include the framework, CUDA drivers, and Vertex AI SDK pre-installed
- You provide your training script as a Python package
### Custom Containers
Use custom containers when you need specific dependencies or non-standard frameworks:
- Build your own Docker image and push it to Artifact Registry
- Your training code should read the `AIP_MODEL_DIR` environment variable and write exported model artifacts there
- If the job uses a Vertex AI managed dataset, read the data location from `AIP_TRAINING_DATA_URI`
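A minimal training-script entrypoint that reads these variables might look like the sketch below (the local fallback path is illustrative, for running the script outside Vertex AI):

```python
import os

def get_paths():
    """Read the Vertex AI-provided locations, with a local fallback for testing."""
    # Vertex AI sets AIP_MODEL_DIR to a Cloud Storage path (gs://...) at runtime;
    # the /tmp default here only applies when running outside a training job.
    model_dir = os.environ.get("AIP_MODEL_DIR", "/tmp/model")
    data_uri = os.environ.get("AIP_TRAINING_DATA_URI", "")
    return model_dir, data_uri

if __name__ == "__main__":
    model_dir, data_uri = get_paths()
    print(f"Writing model artifacts to: {model_dir}")
```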
### Machine Types and Accelerators
| Accelerator | Best For | Cost Level |
|---|---|---|
| CPU only (n1-standard) | Small models, tabular data, XGBoost | $ |
| NVIDIA T4 GPU | Inference, small-medium DL models | $$ |
| NVIDIA V100 GPU | Medium-large DL training | $$$ |
| NVIDIA A100 GPU | Large model training, multi-GPU | $$$$ |
| TPU v3 / v4 | Very large models, TensorFlow, JAX | $$$$ |
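As a sketch, this is roughly how a machine type and accelerator pair up in a Vertex AI CustomJob worker pool spec (Python dict form, as accepted by the google-cloud-aiplatform SDK); the project, repository, image name, and args are placeholders:

```python
# One worker pool: a single n1-standard-8 replica with one V100 GPU attached,
# running a custom training container. Image URI and args are hypothetical.
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": "NVIDIA_TESLA_V100",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-central1-docker.pkg.dev/my-project/my-repo/trainer:latest",
            "args": ["--epochs", "10"],
        },
    }
]
```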
## Hyperparameter Tuning with Vertex AI Vizier
Vertex AI supports automated hyperparameter tuning (HPT). Key concepts for the exam:
### Tuning Algorithms
- Bayesian optimization: Default and most efficient for small parameter spaces
- Grid search: Exhaustive search, good for discrete parameters
- Random search: Good baseline, surprisingly effective for large spaces
### Configuration
- Search space: Define parameter ranges (continuous, discrete, categorical)
- Objective metric: The metric to optimize (accuracy, loss, AUC)
- Max trials: Total number of parameter combinations to try
- Parallel trials: Number of trials to run simultaneously
- Early stopping: Terminate underperforming trials to save cost
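These knobs can be illustrated with a toy random-search loop in plain Python. The objective function and the early-stopping threshold are invented for illustration only; on Vertex AI, Vizier manages this loop (and the trial workers) for you:

```python
import random

def objective(lr, batch_size):
    """Toy stand-in for a real training run's validation score (peaks at lr=0.01, bs=64)."""
    return 1.0 - abs(lr - 0.01) * 10 - abs(batch_size - 64) / 1000

random.seed(0)
max_trials = 20                      # total parameter combinations to try
best = (-1.0, None)
for trial in range(max_trials):
    # Search space: log-uniform continuous learning rate, discrete batch size
    lr = 10 ** random.uniform(-4, -1)
    batch_size = random.choice([16, 32, 64, 128])
    score = objective(lr, batch_size)
    # Early-stopping analogue: abandon clearly underperforming trials early
    if score < 0.5:
        continue
    if score > best[0]:
        best = (score, {"lr": lr, "batch_size": batch_size})
print("best score:", round(best[0], 3), "params:", best[1])
```

In a real tuning job, parallel trials trade speed for search quality: Bayesian optimization learns from completed trials, so running all trials at once degenerates toward random search.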
## Distributed Training Strategies
The exam tests your knowledge of distributed training patterns. Know the differences:
| Strategy | How It Works | When to Use |
|---|---|---|
| Data Parallelism | Same model on each worker, different data batches. Gradients are averaged. | Most common. Data is large, model fits in one GPU's memory. |
| Model Parallelism | Different parts of the model on different workers. | Model is too large for a single GPU (LLMs, very deep networks). |
| MirroredStrategy | Synchronous data parallelism on multiple GPUs within one machine. | Multi-GPU training on a single machine. Most common TF strategy. |
| MultiWorkerMirroredStrategy | Synchronous data parallelism across multiple machines. | Dataset too large for one machine, need to scale horizontally. |
| TPUStrategy | Optimized for TPU pods with all-reduce communication. | Very large models trained on TPUs. |
| ParameterServerStrategy | Asynchronous training with parameter servers. | Very large embeddings, workers with variable speed. |
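The core idea of synchronous data parallelism (the first row above) can be sketched in plain Python: each worker computes a gradient on its own shard of the batch, the gradients are averaged, and one shared update is applied. This is a stand-in for the all-reduce step that MirroredStrategy performs across GPUs:

```python
# Linear model y = w * x with squared-error loss; the gradient of the loss
# with respect to w at a point (x, y) is 2 * x * (w * x - y).
def shard_gradient(w, shard):
    grads = [2 * x * (w * x - y) for x, y in shard]
    return sum(grads) / len(grads)

def sync_data_parallel_step(w, shards, lr=0.01):
    # Each "worker" holds an identical copy of w and a different data shard.
    worker_grads = [shard_gradient(w, s) for s in shards]
    # All-reduce analogue: average gradients across workers, then apply one update.
    avg_grad = sum(worker_grads) / len(worker_grads)
    return w - lr * avg_grad

# Data drawn from y = 3x, split across two workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = sync_data_parallel_step(w, shards)
print(round(w, 2))  # → 3.0, the true slope
```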
## TPU Training on GCP
TPU (Tensor Processing Unit) questions appear frequently. Key facts:
- TPUs are optimized for matrix operations and work best with TensorFlow and JAX
- Data must be in tf.data.Dataset or TFRecord format for optimal TPU performance
- Per-core batch size should be a multiple of 8, and feature dimensions work best as multiples of 128 (the width of the TPU's matrix unit)
- TPU v3 has 32 GB of HBM per chip; TPU v4 also has 32 GB per chip, with higher memory bandwidth and throughput
- Use Cloud TPU VMs for direct access to the TPU host machine
- Store training data in Cloud Storage (not local disk) for TPU training
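A small helper for the batch-size rule of thumb above might look like this (the multiples are illustrative defaults; check the recommendation for your actual TPU topology):

```python
def padded_batch_size(requested, multiple=8):
    """Round a batch size up to the nearest multiple (e.g. 8 per TPU v3 core)."""
    return -(-requested // multiple) * multiple  # ceiling division, then scale back

print(padded_batch_size(100))       # → 104
print(padded_batch_size(100, 128))  # → 128
```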
## Model Evaluation Metrics
Know which metrics to use for each problem type:
### Classification Metrics
- Accuracy: Overall correctness — misleading for imbalanced data
- Precision: Of predicted positives, how many are correct — minimize false positives
- Recall: Of actual positives, how many were found — minimize false negatives
- F1 Score: Harmonic mean of precision and recall — balanced metric
- AUC-ROC: Overall discrimination ability across all thresholds
- AUC-PR: Better than ROC for imbalanced datasets
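The threshold-dependent metrics above follow directly from confusion-matrix counts; a small worked example shows why accuracy misleads on imbalanced data:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Imbalanced example: 990 negatives, 10 positives; model predicts 8 positives, 6 correct.
m = classification_metrics(tp=6, fp=2, fn=4, tn=988)
print(m)  # accuracy ≈ 0.994 even though recall is only 0.6
```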
### Regression Metrics
- RMSE: Root mean squared error — penalizes large errors
- MAE: Mean absolute error — robust to outliers
- MAPE: Mean absolute percentage error — scale-independent
- R²: Proportion of variance explained — interpretable
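The four regression metrics can be computed from a list of true and predicted values in a few lines (note that MAPE is undefined when any true value is zero):

```python
import math

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, MAPE, and R² for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n * 100  # y_true must be nonzero
    mean_t = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2}

m = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print({k: round(v, 3) for k, v in m.items()})
```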
## Practice Questions
Question 1
A. `tf.distribute.MultiWorkerMirroredStrategy`
B. `tf.distribute.MirroredStrategy`
C. `tf.distribute.TPUStrategy`
D. `tf.distribute.ParameterServerStrategy`

Question 2
A. Set parallel trials to 100 to finish fastest
B. Set parallel trials to 20 (max_trials / 5)
C. Set parallel trials to 1 for pure sequential search
D. Set parallel trials to 50 for a balanced approach

Question 3
A. Accuracy
B. Precision
C. Recall
D. F1 Score

Question 4
A. Use a pre-built TensorFlow container and convert the model
B. Use a pre-built PyTorch container on Vertex AI
C. Build a custom container with PyTorch and the CUDA kernels, push to Artifact Registry, and configure a multi-worker training job
D. Use Vertex AI AutoML instead
Lilly Tech Systems