Intermediate

Google Cloud Run for AI Inference

Serve machine learning models on Cloud Run with GPU acceleration, concurrent request handling, minimum instances, and seamless Vertex AI integration.

Why Cloud Run for ML?

Cloud Run is uniquely positioned for AI inference because it supports GPU acceleration, handles multiple concurrent requests per instance, and allows up to 32 GB of memory. Unlike pure function-as-a-service platforms, Cloud Run runs full containers, giving you complete control over your inference runtime.

💡
Key advantage: Cloud Run supports NVIDIA L4 GPUs in serverless mode. This means you can serve GPU-accelerated models that scale to zero when idle, paying only for actual inference time. This is a game-changer for bursty GPU inference workloads.

Deploying an ML Model on Cloud Run

Python - FastAPI Inference Server
from fastapi import FastAPI
import torch
from transformers import pipeline

app = FastAPI()

# Load model at startup (shared across requests)
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0 if torch.cuda.is_available() else -1
)

# Sync handler: FastAPI runs it in a threadpool, so the blocking
# model call doesn't stall the event loop under concurrent load
@app.post("/predict")
def predict(request: dict):
    result = classifier(request["text"])
    return {"prediction": result}
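Before deploying, the server has to be containerized. Below is a minimal Dockerfile sketch; the file name (`main.py`) and base image are illustrative, and for GPU serving you would start from a CUDA-enabled base image instead of the slim Python image shown here.

```dockerfile
# Illustrative base image; swap in a CUDA-enabled image
# (e.g. a pytorch/pytorch CUDA tag) when deploying with --gpu
FROM python:3.11-slim
WORKDIR /app
RUN pip install --no-cache-dir fastapi "uvicorn[standard]" torch transformers
# Assumes the FastAPI app above is saved as main.py
COPY main.py .
# Cloud Run sends traffic to the container's port (8080 by default)
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
```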

gcloud - Deploy with GPU
# Deploy to Cloud Run with GPU
gcloud run deploy ml-inference \
  --image gcr.io/my-project/ml-inference:latest \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --no-cpu-throttling \
  --memory 16Gi \
  --cpu 8 \
  --concurrency 10 \
  --min-instances 0 \
  --max-instances 20 \
  --port 8080 \
  --region us-central1

Concurrency Optimization

Unlike AWS Lambda, which processes one request per instance, Cloud Run can route multiple concurrent requests to the same instance (controlled by --concurrency). Cloud Run doesn't batch those requests for you, but because they share one process, your server can group in-flight requests into batches for more efficient GPU utilization.
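One way to exploit per-instance concurrency is micro-batching: collect requests that arrive within a short window and run them through the model as a single batched call. The sketch below is an illustrative asyncio implementation (the class name, batch size, and wait window are assumptions, not a Cloud Run or FastAPI API); it assumes the model function takes a list of inputs and returns outputs in the same order, as Hugging Face pipelines do.

```python
import asyncio


class MicroBatcher:
    """Collect concurrent requests into batches for one model call.

    Assumes `model_fn` accepts a list of inputs and returns a list
    of outputs in the same order (true for Hugging Face pipelines).
    """

    def __init__(self, model_fn, max_batch_size=8, max_wait_ms=10):
        self.model_fn = model_fn
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000
        self.queue = asyncio.Queue()
        self._worker = None

    async def predict(self, item):
        # Lazily start the background batching worker
        if self._worker is None:
            self._worker = asyncio.create_task(self._run())
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def _run(self):
        while True:
            # Block until at least one request arrives
            item, fut = await self.queue.get()
            batch, futures = [item], [fut]
            loop = asyncio.get_running_loop()
            deadline = loop.time() + self.max_wait
            # Fill the batch until it's full or the wait window closes
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self.queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
                futures.append(fut)
            # One batched model call for the whole group
            results = self.model_fn(batch)
            for f, r in zip(futures, results):
                f.set_result(r)


if __name__ == "__main__":
    # Demo with a stand-in "model" that uppercases text
    async def demo():
        batcher = MicroBatcher(lambda texts: [t.upper() for t in texts])
        return await asyncio.gather(*(batcher.predict(t) for t in ["a", "b"]))

    print(asyncio.run(demo()))  # ['A', 'B']
```

In an inference server, each request handler would `await batcher.predict(...)` instead of calling the model directly, so concurrent requests naturally coalesce into GPU-sized batches.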

Cloud Run vs Cloud Functions for ML

| Feature | Cloud Run | Cloud Functions |
| --- | --- | --- |
| GPU support | Yes (NVIDIA L4) | No |
| Max memory | 32 GiB | 32 GiB (2nd gen) |
| Concurrency | Up to 1000 per instance | 1 per instance (1st gen) |
| Container support | Full Docker | Source-based or Docker |
| Startup time | Slower (full container) | Faster (lightweight) |

Best practice: Set --concurrency based on your model's memory usage per request. For GPU models, start with concurrency equal to your batch size. For CPU models, set it to match your vCPU count. Monitor and tune based on latency percentiles.