Deploying Models on Vertex AI
Once your model is trained, Vertex AI provides managed endpoints for serving predictions. This lesson covers deploying models to endpoints, configuring online and batch predictions, using Model Garden for pre-trained models, and implementing traffic splitting for A/B testing.
Deploying to an Endpoint
An endpoint is a managed HTTPS service that hosts your model and serves predictions. You can deploy one or more models to a single endpoint with configurable traffic splitting.
```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Get the trained model
model = aiplatform.Model("projects/my-project/locations/us-central1/models/MODEL_ID")

# Create an endpoint
endpoint = aiplatform.Endpoint.create(display_name="my-endpoint")

# Deploy the model to the endpoint
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="my-model-v1",
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=5,
    traffic_percentage=100,
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```
Online Prediction
Online prediction provides low-latency, real-time predictions for individual instances or small batches:
```python
# Make a prediction
instances = [
    {"feature_1": 1.5, "feature_2": "category_a", "feature_3": 42},
    {"feature_1": 2.3, "feature_2": "category_b", "feature_3": 17},
]

prediction = endpoint.predict(instances=instances)
print(prediction.predictions)
```
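Vertex AI returns predictions in the same order as the submitted instances, so pairing inputs with outputs is a simple zip. The helper below is a hypothetical client-side sketch (the `predictions` list is a mocked response shaped like `endpoint.predict(...).predictions`, not real model output):

```python
def zip_predictions(instances, predictions):
    """Attach each prediction to the instance that produced it."""
    if len(instances) != len(predictions):
        raise ValueError("Instance and prediction counts must match")
    return [
        {"instance": inst, "prediction": pred}
        for inst, pred in zip(instances, predictions)
    ]

instances = [
    {"feature_1": 1.5, "feature_2": "category_a", "feature_3": 42},
    {"feature_1": 2.3, "feature_2": "category_b", "feature_3": 17},
]

# Mocked scores standing in for `prediction.predictions`
predictions = [[0.91], [0.12]]

paired = zip_predictions(instances, predictions)
```

Keeping inputs and outputs together this way makes it easier to log or audit individual predictions downstream.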
Batch Prediction
For processing large volumes of data, batch prediction is more cost-effective than online prediction. It reads data from Cloud Storage and writes results back:
```python
# Run batch prediction
batch_prediction_job = model.batch_predict(
    job_display_name="my-batch-job",
    gcs_source="gs://my-bucket/input-data.jsonl",
    gcs_destination_prefix="gs://my-bucket/predictions/",
    machine_type="n1-standard-4",
    starting_replica_count=2,
    max_replica_count=10,
)

batch_prediction_job.wait()
print(f"Output: {batch_prediction_job.output_info}")
```
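Batch prediction input in JSONL format is one JSON object per line, each object being a single instance. This sketch writes a local JSONL file you would then upload to the Cloud Storage path passed as `gcs_source` (the filename here is a placeholder):

```python
import json

instances = [
    {"feature_1": 1.5, "feature_2": "category_a", "feature_3": 42},
    {"feature_1": 2.3, "feature_2": "category_b", "feature_3": 17},
]

# One JSON object per line, no trailing commas or wrapping array
with open("input-data.jsonl", "w") as f:
    for instance in instances:
        f.write(json.dumps(instance) + "\n")
```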
Traffic Splitting & A/B Testing
Deploy multiple model versions to the same endpoint and split traffic for A/B testing or gradual rollouts:
```python
# Deploy a new model version with traffic splitting
new_model = aiplatform.Model("projects/my-project/locations/us-central1/models/NEW_MODEL_ID")

endpoint.deploy(
    model=new_model,
    deployed_model_display_name="my-model-v2",
    machine_type="n1-standard-4",
    traffic_percentage=20,  # 20% of traffic goes to v2
)

# Update the traffic split later
endpoint.update(traffic_split={
    "deployed-model-id-v1": 50,
    "deployed-model-id-v2": 50,
})
```
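For a gradual rollout, the traffic split is typically ramped in stages after validating metrics at each step. The helper below is a hypothetical sketch that builds the `traffic_split` dict Vertex AI expects (deployed-model IDs mapped to percentages that sum to 100); the IDs and ramp schedule are illustrative:

```python
def make_traffic_split(stable_id, canary_id, canary_pct):
    """Return a split sending canary_pct% of traffic to the new model."""
    if not 0 <= canary_pct <= 100:
        raise ValueError("canary_pct must be between 0 and 100")
    return {stable_id: 100 - canary_pct, canary_id: canary_pct}

# Each stage's split would be applied with
# endpoint.update(traffic_split=split) once the canary looks healthy.
for pct in (10, 25, 50, 100):
    split = make_traffic_split("deployed-model-id-v1", "deployed-model-id-v2", pct)
```

Building the dict in one place guards against splits that do not sum to 100, which the Vertex AI API rejects.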
Model Garden
Model Garden provides a curated collection of pre-trained and foundation models that you can deploy directly or fine-tune:
| Model Category | Examples | Use Cases |
|---|---|---|
| Foundation Models | Gemini, PaLM 2, Codey | Text generation, code generation, chat |
| Image Models | Imagen, Stable Diffusion | Image generation, editing |
| Open Source | Llama, Mistral, FLAN-T5 | Text, code, multi-modal tasks |
| Task-specific | BERT, ResNet, EfficientNet | Classification, detection, embeddings |
Tip: Set `min_replica_count=0` for endpoints with sporadic traffic. The endpoint scales to zero replicas when not in use, eliminating idle compute costs (note: cold-start latency applies to the first request after scale-down).
Models Deployed!
Your models are now serving predictions. In the next lesson, you will learn how to build end-to-end ML pipelines with Vertex AI Pipelines and Kubeflow.