Intermediate

Model Development

This lesson covers building and training ML models on Google Cloud, including framework selection, Vertex AI Training configuration, hyperparameter tuning, and distributed training strategies — all heavily tested on the exam.

Framework Selection on GCP

The exam tests whether you can choose the right framework for the task. Here is the decision guide:

| Framework | Best For | GCP Integration |
| --- | --- | --- |
| TensorFlow / Keras | Production ML, deep learning, TPU training | Native GCP support, TFX pipelines, SavedModel format |
| PyTorch | Research, NLP, computer vision prototyping | Vertex AI custom containers, TorchServe |
| XGBoost / scikit-learn | Tabular data, classical ML, fast iteration | Pre-built Vertex AI containers, BQML integration |
| JAX | High-performance numerical computing, custom gradients | TPU-native, used internally at Google |
💡
Exam Tip: When a question mentions "TPU training" or "TFX pipeline integration," TensorFlow is almost always the correct choice. When it mentions "existing PyTorch codebase" or "research team prefers PyTorch," use Vertex AI custom containers with PyTorch.

Vertex AI Training

Vertex AI Training is the primary service for training custom models. Know these configuration options:

Pre-built Containers

Google provides pre-built Docker containers for common frameworks. Use these when possible to minimize setup:

  • TensorFlow (CPU/GPU), PyTorch (CPU/GPU), XGBoost, scikit-learn
  • Containers include the framework, CUDA drivers, and Vertex AI SDK pre-installed
  • You provide your training script as a Python package
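For custom training jobs, the machine shape, container image, and your packaged script come together in a worker pool spec. Below is a minimal sketch in the dict form the Vertex AI SDK accepts — the bucket, package name, and image URI are placeholders, and the exact pre-built image URI depends on your framework and version.

```python
# Hedged sketch: build one workerPoolSpecs entry for a Vertex AI CustomJob
# that runs a Python training package on a pre-built container.
def make_worker_pool_spec(package_uri, module_name,
                          machine_type="n1-standard-8",
                          accelerator_type=None, accelerator_count=0):
    """Assemble a single worker pool spec (dict form used by the SDK)."""
    machine_spec = {"machine_type": machine_type}
    if accelerator_type:
        machine_spec["accelerator_type"] = accelerator_type
        machine_spec["accelerator_count"] = accelerator_count
    return {
        "machine_spec": machine_spec,
        "replica_count": 1,
        "python_package_spec": {
            # Pre-built training image; exact URI varies by framework/version.
            "executor_image_uri":
                "us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-12.py310:latest",
            "package_uris": [package_uri],       # your packaged training script
            "python_module": module_name,        # entry-point module to run
        },
    }

spec = make_worker_pool_spec("gs://example-bucket/trainer-0.1.tar.gz",
                             "trainer.task",
                             accelerator_type="NVIDIA_TESLA_T4",
                             accelerator_count=1)
```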

Custom Containers

Use custom containers when you need specific dependencies or non-standard frameworks:

  • Build your own Docker image and push to Artifact Registry
  • Must read the AIP_MODEL_DIR environment variable, which Vertex AI sets to the Cloud Storage path where the trained model should be written
  • When a managed dataset is attached, AIP_TRAINING_DATA_URI points to the training data location
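Inside the training code, these locations arrive as environment variables. A minimal sketch of reading them — the bucket paths used in the example are illustrative only:

```python
import os

def resolve_io_paths(environ=None):
    """Read the locations Vertex AI injects into a custom training container."""
    environ = os.environ if environ is None else environ
    # Where the service expects the trained model artifacts to be written.
    model_dir = environ.get("AIP_MODEL_DIR", "/tmp/model")
    # Set only when a Vertex AI managed dataset is attached to the job.
    data_uri = environ.get("AIP_TRAINING_DATA_URI", "")
    return model_dir, data_uri

# Simulate what Vertex AI would inject at runtime (illustrative paths):
model_dir, data_uri = resolve_io_paths({
    "AIP_MODEL_DIR": "gs://example-bucket/model/",
    "AIP_TRAINING_DATA_URI": "gs://example-bucket/data/train.csv",
})
```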

Machine Types and Accelerators

| Accelerator | Best For | Cost Level |
| --- | --- | --- |
| CPU only (n1-standard) | Small models, tabular data, XGBoost | $ |
| NVIDIA T4 GPU | Inference, small-medium DL models | $$ |
| NVIDIA V100 GPU | Medium-large DL training | $$$ |
| NVIDIA A100 GPU | Large model training, multi-GPU | $$$$ |
| TPU v3 / v4 | Very large models, TensorFlow, JAX | $$$$ |

Hyperparameter Tuning with Vertex AI Vizier

Vertex AI supports automated hyperparameter tuning (HPT). Key concepts for the exam:

📊

Tuning Algorithms

  • Bayesian optimization: Default and most efficient for small parameter spaces
  • Grid search: Exhaustive search, good for discrete parameters
  • Random search: Good baseline, surprisingly effective for large spaces

Configuration

  • Search space: Define parameter ranges (continuous, discrete, categorical)
  • Objective metric: The metric to optimize (accuracy, loss, AUC)
  • Max trials: Total number of parameter combinations to try
  • Parallel trials: Number of trials to run simultaneously
  • Early stopping: Terminate underperforming trials to save cost
Exam Trap: More parallel trials do NOT always mean faster tuning. Bayesian optimization benefits from sequential results to inform the next trial, so running too many trials in parallel reduces its effectiveness. Google recommends parallel trials ≤ max_trials / 5.
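The rule of thumb above is easy to encode. A tiny helper — the function name is ours, not part of the Vertex AI SDK:

```python
def recommended_parallel_trials(max_trials, divisor=5):
    """Google's rule of thumb: parallel trials <= max_trials / 5, so the
    Bayesian optimizer still sees enough completed trials sequentially."""
    return max(1, max_trials // divisor)

recommended_parallel_trials(100)  # -> 20 parallel trials for a 100-trial budget
```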

Distributed Training Strategies

The exam tests your knowledge of distributed training patterns. Know the differences:

| Strategy | How It Works | When to Use |
| --- | --- | --- |
| Data Parallelism | Same model on each worker, different data batches. Gradients are averaged. | Most common. Data is large, model fits in one GPU's memory. |
| Model Parallelism | Different parts of the model on different workers. | Model is too large for a single GPU (LLMs, very deep networks). |
| MirroredStrategy | Synchronous data parallelism on multiple GPUs within one machine. | Multi-GPU training on a single machine. Most common TF strategy. |
| MultiWorkerMirroredStrategy | Synchronous data parallelism across multiple machines. | Dataset too large for one machine, need to scale horizontally. |
| TPUStrategy | Optimized for TPU pods with all-reduce communication. | Very large models trained on TPUs. |
| ParameterServerStrategy | Asynchronous training with parameter servers. | Very large embeddings, workers with variable speed. |
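The data-parallel pattern in the first row can be sketched without any framework: each worker computes a gradient on its own data shard, the gradients are averaged (the all-reduce step that MirroredStrategy performs), and every replica applies the same update. The toy 1-D linear model below is ours, purely to keep the arithmetic visible.

```python
# Framework-agnostic sketch of synchronous data parallelism.
def local_gradient(w, batch):
    """Mean gradient of squared error for y ≈ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(grads):
    """Average gradients across workers (the all-reduce step)."""
    return sum(grads) / len(grads)

def train_step(w, shards, lr=0.1):
    # On real hardware each local_gradient call runs on its own replica.
    grads = [local_gradient(w, shard) for shard in shards]
    return w - lr * all_reduce_mean(grads)   # identical update on every replica

# Two "workers", each holding a shard of data generated from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(100):
    w = train_step(w, shards)
# w converges toward the true slope 3.0
```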

TPU Training on GCP

TPU (Tensor Processing Unit) questions appear frequently. Key facts:

  • TPUs are optimized for matrix operations and work best with TensorFlow and JAX
  • Data must be in tf.data.Dataset or TFRecord format for optimal TPU performance
  • Per-core batch sizes should be a multiple of 8; batch and feature dimensions that are multiples of 128 map best onto the TPU's 128×128 matrix units
  • TPU v3 and v4 each provide 32 GB of HBM per chip; v4 delivers substantially higher throughput per chip
  • Use Cloud TPU VMs for direct access to the TPU host machine
  • Store training data in Cloud Storage (not local disk) for TPU training
💡
Exam Tip: If a question mentions "cost optimization for training," look for answers involving: (1) preemptible/spot VMs for fault-tolerant jobs, (2) right-sizing machine types, (3) early stopping for hyperparameter tuning, (4) using managed datasets to avoid data duplication.

Model Evaluation Metrics

Know which metrics to use for each problem type:

📊

Classification Metrics

  • Accuracy: Overall correctness — misleading for imbalanced data
  • Precision: Of predicted positives, how many are correct — minimize false positives
  • Recall: Of actual positives, how many were found — minimize false negatives
  • F1 Score: Harmonic mean of precision and recall — balanced metric
  • AUC-ROC: Overall discrimination ability across all thresholds
  • AUC-PR: Better than ROC for imbalanced datasets
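The trade-offs above are easiest to see computed side by side on an imbalanced example — the confusion-matrix counts below are made up for illustration:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute headline classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # penalizes false positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # penalizes false negatives
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Imbalanced dataset: 1,000 samples, only 10 actual positives.
m = classification_metrics(tp=8, fp=40, fn=2, tn=950)
# Accuracy is 0.958 despite precision of only 8/48 — exactly why
# accuracy misleads on imbalanced data.
```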
📈

Regression Metrics

  • RMSE: Root mean squared error — penalizes large errors
  • MAE: Mean absolute error — robust to outliers
  • MAPE: Mean absolute percentage error — scale-independent
  • R²: Proportion of variance explained — interpretable
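The regression metrics follow directly from their definitions; the sample values in this sketch are illustrative:

```python
import math

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, MAPE, and R² for paired lists of targets and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)            # penalizes large errors
    mae = sum(abs(e) for e in errors) / n                        # robust to outliers
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n   # assumes no zero targets
    mean_t = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot                                     # variance explained
    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2}

m = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 305.0])
```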

Practice Questions

📝
Question 1: You are training a TensorFlow image classification model on a single machine with 4 NVIDIA V100 GPUs. Which distribution strategy should you use?

A. tf.distribute.MultiWorkerMirroredStrategy
B. tf.distribute.MirroredStrategy
C. tf.distribute.TPUStrategy
D. tf.distribute.ParameterServerStrategy
Answer: B. MirroredStrategy is designed for synchronous data-parallel training across multiple GPUs on a SINGLE machine. MultiWorkerMirrored (A) is for multiple machines. TPUStrategy (C) is for TPUs, not GPUs. ParameterServer (D) is for asynchronous multi-machine training with large embedding tables.
📝
Question 2: You are tuning hyperparameters for a model using Vertex AI. You have a budget of 100 trials and want to maximize tuning efficiency. How should you configure parallel trials?

A. Set parallel trials to 100 to finish fastest
B. Set parallel trials to 20 (max_trials / 5)
C. Set parallel trials to 1 for pure sequential search
D. Set parallel trials to 50 for a balanced approach
Answer: B. Google recommends parallel trials ≤ max_trials / 5 for Bayesian optimization. This leaves enough sequential results for the algorithm to learn which regions of the parameter space are most promising, while still parallelizing. Running all 100 trials in parallel (A) eliminates the Bayesian benefit and degenerates into random search. Pure sequential search (C) is too slow. 50 parallel trials (D) is still too many to preserve the sequential signal.
📝
Question 3: A fraud detection system needs to catch 99% of fraudulent transactions. The dataset is highly imbalanced (0.1% fraud). Which metric should you primarily optimize?

A. Accuracy
B. Precision
C. Recall
D. F1 Score
Answer: C. "Catch 99% of fraudulent transactions" directly describes recall (true positive rate). The requirement is to minimize false negatives (missed fraud). Accuracy (A) is misleading with 0.1% positive rate — a model predicting "not fraud" for everything achieves 99.9% accuracy. Precision (B) minimizes false positives. F1 (D) balances precision and recall but does not prioritize the 99% catch rate.
📝
Question 4: Your research team has an existing PyTorch model with custom CUDA kernels. They want to train it on Vertex AI using 8 A100 GPUs across 2 machines. What should you do?

A. Use a pre-built TensorFlow container and convert the model
B. Use a pre-built PyTorch container on Vertex AI
C. Build a custom container with PyTorch and the CUDA kernels, push to Artifact Registry, and configure a multi-worker training job
D. Use Vertex AI AutoML instead
Answer: C. Custom CUDA kernels require a custom container because pre-built containers do not include arbitrary custom extensions. The custom container is pushed to Artifact Registry and configured as a multi-worker Vertex AI training job. Converting to TF (A) would break custom CUDA kernels. Pre-built PyTorch container (B) would not include the custom kernels. AutoML (D) does not support custom architectures.