Intermediate

Model Development

This lesson covers building and training ML models on Google Cloud, including framework selection, Vertex AI Training configuration, hyperparameter tuning, and distributed training strategies — all heavily tested on the exam.

Framework Selection on GCP

The exam tests whether you can choose the right framework for the task. Here is the decision guide:

| Framework | Best For | GCP Integration |
| --- | --- | --- |
| TensorFlow / Keras | Production ML, deep learning, TPU training | Native GCP support, TFX pipelines, SavedModel format |
| PyTorch | Research, NLP, computer vision prototyping | Vertex AI custom containers, TorchServe |
| XGBoost / scikit-learn | Tabular data, classical ML, fast iteration | Pre-built Vertex AI containers, BQML integration |
| JAX | High-performance numerical computing, custom gradients | TPU-native, used internally at Google |
💡
Exam Tip: When a question mentions "TPU training" or "TFX pipeline integration," TensorFlow is almost always the correct choice. When it mentions "existing PyTorch codebase" or "research team prefers PyTorch," use Vertex AI custom containers with PyTorch.

Vertex AI Training

Vertex AI Training is the primary service for training custom models. Know these configuration options:

Pre-built Containers

Google provides pre-built Docker containers for common frameworks. Use these when possible to minimize setup:

  • TensorFlow (CPU/GPU), PyTorch (CPU/GPU), XGBoost, scikit-learn
  • Containers include the framework, CUDA drivers, and Vertex AI SDK pre-installed
  • You provide your training script as a Python package
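For custom training jobs, the machine shape, container image, and your packaged script come together in a worker pool spec. Below is a minimal sketch in the dict form the Vertex AI SDK accepts — the bucket, package name, and image URI are placeholders, and the exact pre-built image URI depends on your framework and version.

```python
# Hedged sketch: build one workerPoolSpecs entry for a Vertex AI CustomJob
# that runs a Python training package on a pre-built container.
def make_worker_pool_spec(package_uri, module_name,
                          machine_type="n1-standard-8",
                          accelerator_type=None, accelerator_count=0):
    """Assemble a single worker pool spec (dict form used by the SDK)."""
    machine_spec = {"machine_type": machine_type}
    if accelerator_type:
        machine_spec["accelerator_type"] = accelerator_type
        machine_spec["accelerator_count"] = accelerator_count
    return {
        "machine_spec": machine_spec,
        "replica_count": 1,
        "python_package_spec": {
            # Pre-built training image; exact URI varies by framework/version.
            "executor_image_uri":
                "us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12.py310:latest",
            "package_uris": [package_uri],       # your packaged training script
            "python_module": module_name,        # entry-point module to run
        },
    }

spec = make_worker_pool_spec("gs://example-bucket/trainer-0.1.tar.gz",
                             "trainer.task",
                             accelerator_type="NVIDIA_TESLA_T4",
                             accelerator_count=1)
```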

Custom Containers

Use custom containers when you need specific dependencies or non-standard frameworks:

  • Build your own Docker image and push to Artifact Registry
  • Must read the AIP_MODEL_DIR environment variable, which Vertex AI sets to the Cloud Storage path where the trained model should be written
  • When a managed dataset is attached, AIP_TRAINING_DATA_URI points to the training data location
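Inside the training code, these locations arrive as environment variables. A minimal sketch of reading them — the bucket paths used in the example are illustrative only:

```python
import os

def resolve_io_paths(environ=None):
    """Read the locations Vertex AI injects into a custom training container."""
    environ = os.environ if environ is None else environ
    # Where the service expects the trained model artifacts to be written.
    model_dir = environ.get("AIP_MODEL_DIR", "/tmp/model")
    # Set only when a Vertex AI managed dataset is attached to the job.
    data_uri = environ.get("AIP_TRAINING_DATA_URI", "")
    return model_dir, data_uri

# Simulate what Vertex AI would inject at runtime (illustrative paths):
model_dir, data_uri = resolve_io_paths({
    "AIP_MODEL_DIR": "gs://example-bucket/model/",
    "AIP_TRAINING_DATA_URI": "gs://example-bucket/data/train.csv",
})
```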

Machine Types and Accelerators

| Accelerator | Best For | Cost Level |
| --- | --- | --- |
| CPU only (n1-standard) | Small models, tabular data, XGBoost | $ |
| NVIDIA T4 GPU | Inference, small-medium DL models | $$ |
| NVIDIA V100 GPU | Medium-large DL training | $$$ |
| NVIDIA A100 GPU | Large model training, multi-GPU | $$$$ |
| TPU v3 / v4 | Very large models, TensorFlow, JAX | $$$$ |

Hyperparameter Tuning with Vertex AI Vizier

Vertex AI supports automated hyperparameter tuning (HPT). Key concepts for the exam:

📊

Tuning Algorithms

  • Bayesian optimization: Default and most efficient for small parameter spaces
  • Grid search: Exhaustive search, good for discrete parameters
  • Random search: Good baseline, surprisingly effective for large spaces

Configuration

  • Search space: Define parameter ranges (continuous, discrete, categorical)
  • Objective metric: The metric to optimize (accuracy, loss, AUC)
  • Max trials: Total number of parameter combinations to try
  • Parallel trials: Number of trials to run simultaneously
  • Early stopping: Terminate underperforming trials to save cost
Exam Trap: More parallel trials do NOT always mean faster tuning. Bayesian optimization benefits from sequential results to inform the next trial, so running too many trials in parallel reduces its effectiveness. Google recommends parallel trials ≤ max_trials / 5.
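The rule of thumb above is easy to encode. A tiny helper — the function name is ours, not part of the Vertex AI SDK:

```python
def recommended_parallel_trials(max_trials, divisor=5):
    """Google's rule of thumb: parallel trials <= max_trials / 5, so the
    Bayesian optimizer still sees enough completed trials sequentially."""
    return max(1, max_trials // divisor)

recommended_parallel_trials(100)  # -> 20 parallel trials for a 100-trial budget
```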

Distributed Training Strategies

The exam tests your knowledge of distributed training patterns. Know the differences:

| Strategy | How It Works | When to Use |
| --- | --- | --- |
| Data Parallelism | Same model on each worker, different data batches. Gradients are averaged. | Most common. Data is large, model fits in one GPU's memory. |
| Model Parallelism | Different parts of the model on different workers. | Model is too large for a single GPU (LLMs, very deep networks). |
| MirroredStrategy | Synchronous data parallelism on multiple GPUs within one machine. | Multi-GPU training on a single machine. Most common TF strategy. |
| MultiWorkerMirroredStrategy | Synchronous data parallelism across multiple machines. | Dataset too large for one machine, need to scale horizontally. |
| TPUStrategy | Optimized for TPU pods with all-reduce communication. | Very large models trained on TPUs. |
| ParameterServerStrategy | Asynchronous training with parameter servers. | Very large embeddings, workers with variable speed. |
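The data-parallel pattern in the first row can be sketched without any framework: each worker computes a gradient on its own data shard, the gradients are averaged (the all-reduce step that MirroredStrategy performs), and every replica applies the same update. The toy 1-D linear model below is ours, purely to keep the arithmetic visible.

```python
# Framework-agnostic sketch of synchronous data parallelism.
def local_gradient(w, batch):
    """Mean gradient of squared error for y ≈ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(grads):
    """Average gradients across workers (the all-reduce step)."""
    return sum(grads) / len(grads)

def train_step(w, shards, lr=0.1):
    # On real hardware each local_gradient call runs on its own replica.
    grads = [local_gradient(w, shard) for shard in shards]
    return w - lr * all_reduce_mean(grads)   # identical update on every replica

# Two "workers", each holding a shard of data generated from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(100):
    w = train_step(w, shards)
# w converges toward the true slope 3.0
```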

TPU Training on GCP

TPU (Tensor Processing Unit) questions appear frequently. Key facts:

  • TPUs are optimized for matrix operations and work best with TensorFlow and JAX
  • Data must be in tf.data.Dataset or TFRecord format for optimal TPU performance
  • Per-core batch sizes should be a multiple of 8; batch and feature dimensions that are multiples of 128 map best onto the TPU's 128×128 matrix units
  • TPU v3 and v4 each provide 32 GB of HBM per chip; v4 delivers substantially higher throughput per chip
  • Use Cloud TPU VMs for direct access to the TPU host machine
  • Store training data in Cloud Storage (not local disk) for TPU training
💡
Exam Tip: If a question mentions "cost optimization for training," look for answers involving: (1) preemptible/spot VMs for fault-tolerant jobs, (2) right-sizing machine types, (3) early stopping for hyperparameter tuning, (4) using managed datasets to avoid data duplication.

Model Evaluation Metrics

Know which metrics to use for each problem type:

📊

Classification Metrics

  • Accuracy: Overall correctness — misleading for imbalanced data
  • Precision: Of predicted positives, how many are correct — minimize false positives
  • Recall: Of actual positives, how many were found — minimize false negatives
  • F1 Score: Harmonic mean of precision and recall — balanced metric
  • AUC-ROC: Overall discrimination ability across all thresholds
  • AUC-PR: Better than ROC for imbalanced datasets
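The trade-offs above are easiest to see computed side by side on an imbalanced example — the confusion-matrix counts below are made up for illustration:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute headline classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # penalizes false positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # penalizes false negatives
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Imbalanced dataset: 1,000 samples, only 10 actual positives.
m = classification_metrics(tp=8, fp=40, fn=2, tn=950)
# Accuracy is 0.958 despite precision of only 8/48 — exactly why
# accuracy misleads on imbalanced data.
```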
📈

Regression Metrics

  • RMSE: Root mean squared error — penalizes large errors
  • MAE: Mean absolute error — robust to outliers
  • MAPE: Mean absolute percentage error — scale-independent
  • R²: Proportion of variance explained — interpretable
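The regression metrics follow directly from their definitions; the sample values in this sketch are illustrative:

```python
import math

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, MAPE, and R² for paired lists of targets and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)            # penalizes large errors
    mae = sum(abs(e) for e in errors) / n                        # robust to outliers
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n   # assumes no zero targets
    mean_t = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot                                     # variance explained
    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2}

m = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 305.0])
```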

Practice Questions

📝
Question 1: You are training a TensorFlow image classification model on a single machine with 4 NVIDIA V100 GPUs. Which distribution strategy should you use?

A. tf.distribute.MultiWorkerMirroredStrategy
B. tf.distribute.MirroredStrategy
C. tf.distribute.TPUStrategy
D. tf.distribute.ParameterServerStrategy
Answer: B. MirroredStrategy is designed for synchronous data-parallel training across multiple GPUs on a SINGLE machine. MultiWorkerMirrored (A) is for multiple machines. TPUStrategy (C) is for TPUs, not GPUs. ParameterServer (D) is for asynchronous multi-machine training with large embedding tables.
📝
Question 2: You are tuning hyperparameters for a model using Vertex AI. You have a budget of 100 trials and want to maximize tuning efficiency. How should you configure parallel trials?

A. Set parallel trials to 100 to finish fastest
B. Set parallel trials to 20 (max_trials / 5)
C. Set parallel trials to 1 for pure sequential search
D. Set parallel trials to 50 for a balanced approach
Answer: B. Google recommends parallel trials ≤ max_trials / 5 for Bayesian optimization. This leaves enough sequential results for the algorithm to learn which regions of the parameter space are most promising, while still parallelizing. Running all 100 trials in parallel (A) eliminates the Bayesian benefit and degenerates into random search. Pure sequential search (C) is too slow. 50 parallel trials (D) is still too many to preserve the sequential signal.
📝
Question 3: A fraud detection system needs to catch 99% of fraudulent transactions. The dataset is highly imbalanced (0.1% fraud). Which metric should you primarily optimize?

A. Accuracy
B. Precision
C. Recall
D. F1 Score
Answer: C. "Catch 99% of fraudulent transactions" directly describes recall (true positive rate). The requirement is to minimize false negatives (missed fraud). Accuracy (A) is misleading with 0.1% positive rate — a model predicting "not fraud" for everything achieves 99.9% accuracy. Precision (B) minimizes false positives. F1 (D) balances precision and recall but does not prioritize the 99% catch rate.
📝
Question 4: Your research team has an existing PyTorch model with custom CUDA kernels. They want to train it on Vertex AI using 8 A100 GPUs across 2 machines. What should you do?

A. Use a pre-built TensorFlow container and convert the model
B. Use a pre-built PyTorch container on Vertex AI
C. Build a custom container with PyTorch and the CUDA kernels, push to Artifact Registry, and configure a multi-worker training job
D. Use Vertex AI AutoML instead
Answer: C. Custom CUDA kernels require a custom container because pre-built containers do not include arbitrary custom extensions. The custom container is pushed to Artifact Registry and configured as a multi-worker Vertex AI training job. Converting to TF (A) would break custom CUDA kernels. Pre-built PyTorch container (B) would not include the custom kernels. AutoML (D) does not support custom architectures.