Intermediate
Pipeline Components
Create reusable pipeline components, leverage pre-built Google Cloud components, and package custom components as container images for production use.
Lightweight Python Components
The simplest components are Python functions with the @dsl.component decorator:
from kfp import dsl
from kfp.dsl import Dataset, Output

@dsl.component(base_image="python:3.11", packages_to_install=["requests"])
def fetch_data(url: str, output: Output[Dataset]):
    # Imports live inside the function because the body runs
    # in its own container at pipeline execution time.
    import requests

    response = requests.get(url)
    response.raise_for_status()
    with open(output.path, "w") as f:
        f.write(response.text)
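Conceptually, @dsl.component wraps a plain function with metadata about the environment it should run in, without changing how the function is called. A toy sketch of that idea (this is an illustrative stand-in, not the real KFP implementation):

```python
import functools

def component(base_image="python:3.11", packages_to_install=None):
    """Toy stand-in for @dsl.component: records the execution
    environment on the function without altering its behavior."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        wrapper.base_image = base_image
        wrapper.packages_to_install = packages_to_install or []
        return wrapper
    return decorator

@component(base_image="python:3.11", packages_to_install=["requests"])
def fetch_data(url: str) -> str:
    # Stand-in body; the real component would fetch and persist data.
    return f"fetched {url}"

print(fetch_data.base_image)   # python:3.11
print(fetch_data("https://example.com"))
```

The real decorator goes further, compiling the function into a component specification, but the principle is the same: the function stays ordinary Python while the decorator carries the runtime configuration.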
Container Components
For complex components with custom dependencies, build a dedicated Docker image:
from kfp import dsl

@dsl.container_component
def gpu_training():
    return dsl.ContainerSpec(
        image="my-registry/gpu-trainer:v1.0",
        command=["python", "train.py"],
        args=["--epochs", "100", "--batch-size", "32"],
    )
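A ContainerSpec is essentially an image plus a command line: the command and args concatenate into the process launched inside the container. A minimal sketch of that mapping (the dataclass and helper below are hypothetical, not part of KFP):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ContainerSpec:
    # Mirrors the three fields used in the example above.
    image: str
    command: List[str] = field(default_factory=list)
    args: List[str] = field(default_factory=list)

def to_docker_invocation(spec: ContainerSpec) -> List[str]:
    # Hypothetical helper: shows how command and args combine
    # into the container's entrypoint invocation.
    return ["docker", "run", spec.image, *spec.command, *spec.args]

spec = ContainerSpec(
    image="my-registry/gpu-trainer:v1.0",
    command=["python", "train.py"],
    args=["--epochs", "100", "--batch-size", "32"],
)
print(" ".join(to_docker_invocation(spec)))
```

Because everything the component needs is baked into the image, the pipeline only has to know the image reference and the command line, which is what makes container components a good fit for heavy custom dependencies such as GPU libraries.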
Pre-Built Components
Google Cloud provides pre-built components for common ML tasks:
from kfp import dsl
from google_cloud_pipeline_components.v1.dataset import TabularDatasetCreateOp
from google_cloud_pipeline_components.v1.automl.training_job import (
    AutoMLTabularTrainingJobRunOp,
)

@dsl.pipeline(name="vertex-pipeline")
def vertex_training():
    dataset = TabularDatasetCreateOp(
        display_name="my-dataset",
        bq_source="bq://project.dataset.table",
    )
    training = AutoMLTabularTrainingJobRunOp(
        display_name="my-model",
        dataset=dataset.outputs["dataset"],
    )
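Passing dataset.outputs["dataset"] into the training op is what creates the dependency edge: a task that consumes another task's output implicitly runs after it. A toy sketch of how output references imply execution order (hypothetical classes, not the KFP internals):

```python
class OutputRef:
    """A handle to one named output of a task."""
    def __init__(self, task, key):
        self.task, self.key = task, key

class Task:
    def __init__(self, name, inputs=None):
        self.name = name
        self.inputs = inputs or {}  # values may be OutputRefs
        # Every task exposes named output handles.
        self.outputs = {k: OutputRef(self, k) for k in ("dataset", "model")}

def upstream(task):
    # A task depends on every task whose output it consumes.
    return [v.task for v in task.inputs.values() if isinstance(v, OutputRef)]

dataset = Task("create-dataset")
training = Task("train-model", inputs={"dataset": dataset.outputs["dataset"]})
print([t.name for t in upstream(training)])   # ['create-dataset']
```

This is why no explicit "run after" call is needed in the pipeline above: the DAG is recovered entirely from which outputs feed which inputs.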
Component Best Practices
- Single responsibility: Each component should do one thing well (load data, train, evaluate, deploy).
- Type annotations: Always use typed inputs and outputs for validation and artifact tracking.
- Versioned images: Pin container image versions to ensure reproducibility across runs.
- Small base images: Use minimal base images to reduce component startup time and storage costs.
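The type-annotation guideline above can be checked mechanically before a pipeline is ever compiled. A small sketch (a hypothetical lint helper, assuming components are plain Python functions) that flags untyped parameters:

```python
import inspect

def untyped_params(fn):
    """Return the names of parameters missing type annotations."""
    sig = inspect.signature(fn)
    return [name for name, p in sig.parameters.items()
            if p.annotation is inspect.Parameter.empty]

# Illustrative component signatures, not real components.
def good(url: str, epochs: int) -> str: ...
def bad(url, epochs: int): ...

print(untyped_params(good))  # []
print(untyped_params(bad))   # ['url']
```

Running a check like this in CI catches missing annotations early, before the SDK rejects the component or, worse, silently loses artifact lineage.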
- Sharing components: Package reusable components as Python packages and publish them to your organization's package registry. This enables consistent ML operations across teams and projects.