Deploying Models on Vertex AI

Once your model is trained, Vertex AI provides managed endpoints for serving predictions. This lesson covers deploying models to endpoints, configuring online and batch predictions, using Model Garden for pre-trained models, and implementing traffic splitting for A/B testing.

Deploying to an Endpoint

An endpoint is a managed HTTPS service that hosts your model and serves predictions. You can deploy one or more models to a single endpoint with configurable traffic splitting.

Python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Get the trained model
model = aiplatform.Model("projects/my-project/locations/us-central1/models/MODEL_ID")

# Create an endpoint
endpoint = aiplatform.Endpoint.create(display_name="my-endpoint")

# Deploy the model to the endpoint
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="my-model-v1",
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=5,
    traffic_percentage=100,
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1
)

Online Prediction

Online prediction provides low-latency, real-time predictions for individual instances or small batches:

Python
# Make a prediction
instances = [
    {"feature_1": 1.5, "feature_2": "category_a", "feature_3": 42},
    {"feature_1": 2.3, "feature_2": "category_b", "feature_3": 17}
]

prediction = endpoint.predict(instances=instances)
print(prediction.predictions)
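Online endpoints can return transient errors under load, so production callers often wrap `predict` in a small retry loop. A minimal sketch, where `predict_fn` stands in for `endpoint.predict` (the helper itself is hypothetical, not part of the SDK):

```python
import time

def predict_with_retry(predict_fn, instances, retries=3, backoff_s=1.0):
    """Call a prediction function, retrying transient failures.

    predict_fn stands in for endpoint.predict; the delay doubles after
    each failed attempt (simple exponential backoff).
    """
    delay = backoff_s
    for attempt in range(retries):
        try:
            return predict_fn(instances=instances)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts; surface the error
            time.sleep(delay)
            delay *= 2
```

Usage would look like `predict_with_retry(endpoint.predict, instances)`. In practice you would catch only retryable exceptions (e.g. deadline or unavailable errors) rather than bare `Exception`.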

Batch Prediction

For processing large volumes of data, batch prediction is more cost-effective than online prediction. It reads data from Cloud Storage and writes results back:

Python
# Run batch prediction
batch_prediction_job = model.batch_predict(
    job_display_name="my-batch-job",
    gcs_source="gs://my-bucket/input-data.jsonl",
    gcs_destination_prefix="gs://my-bucket/predictions/",
    machine_type="n1-standard-4",
    starting_replica_count=2,
    max_replica_count=10
)

batch_prediction_job.wait()
print(f"Output: {batch_prediction_job.output_info}")

Traffic Splitting & A/B Testing

Deploy multiple model versions to the same endpoint and split traffic for A/B testing or gradual rollouts:

Python
# Deploy a new model version with traffic splitting
new_model = aiplatform.Model("projects/my-project/locations/us-central1/models/NEW_MODEL_ID")

endpoint.deploy(
    model=new_model,
    deployed_model_display_name="my-model-v2",
    machine_type="n1-standard-4",
    traffic_percentage=20  # 20% of traffic goes to v2
)

# Update the traffic split later. The keys are the numeric IDs of the
# deployed models (check endpoint.traffic_split for the current IDs),
# and the percentages must sum to 100.
endpoint.update(traffic_split={
    "DEPLOYED_MODEL_ID_V1": 50,
    "DEPLOYED_MODEL_ID_V2": 50
})
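A gradual rollout is usually a sequence of such updates, shifting traffic toward the new version in stages. A sketch of that idea (`rollout_schedule` is a hypothetical helper; each yielded dict could be passed to `endpoint.update(traffic_split=...)` once the new version looks healthy at the previous stage):

```python
def rollout_schedule(new_id, old_id, steps=(20, 50, 100)):
    """Yield traffic_split dicts for a staged rollout to new_id.

    Hypothetical helper: new_id and old_id are deployed-model IDs; each
    split routes `pct` percent to the new version and the rest to the old.
    """
    for pct in steps:
        split = {new_id: pct, old_id: 100 - pct}
        # Vertex AI rejects splits that do not sum to exactly 100.
        assert sum(split.values()) == 100
        yield split
```

Between stages you would watch endpoint metrics (latency, error rate) and only advance if the new version behaves as expected.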

Model Garden

Model Garden provides a curated collection of pre-trained and foundation models that you can deploy directly or fine-tune:

Model Category    | Examples                   | Use Cases
Foundation Models | Gemini, PaLM 2, Codey      | Text generation, code generation, chat
Image Models      | Imagen, Stable Diffusion   | Image generation, editing
Open Source       | Llama, Mistral, FLAN-T5    | Text, code, multi-modal tasks
Task-specific     | BERT, ResNet, EfficientNet | Classification, detection, embeddings

Cost Optimization: For endpoints with sporadic traffic, configure autoscaling with the lowest min_replica_count your endpoint type supports. Where scale-to-zero is available, the endpoint scales down when idle, eliminating idle costs (note: cold-start latency applies on the next request).

Models Deployed!

Your models are now serving predictions. In the next lesson, you will learn how to build end-to-end ML pipelines with Vertex AI Pipelines and Kubeflow.

Next: Pipelines →