Intermediate

Data Preparation & Processing

Data quality determines model quality. This lesson covers the GCP services used for data ingestion, transformation, feature engineering, and validation — all heavily tested on the exam.

GCP Data Processing Services Comparison

The exam frequently asks you to choose between these three data processing services. Know their strengths:

| Service  | Best For                                | Programming Model         | Key Characteristics                               |
|----------|-----------------------------------------|---------------------------|---------------------------------------------------|
| BigQuery | SQL-based analytics and transformations | SQL                       | Serverless, petabyte-scale, built-in ML (BQML)    |
| Dataflow | Streaming and batch ETL pipelines       | Apache Beam (Python/Java) | Serverless, auto-scaling, exactly-once processing |
| Dataproc | Existing Spark/Hadoop workloads         | PySpark, Spark SQL        | Managed clusters, pay-per-use, lift-and-shift     |
💡 Exam Decision Rule: If the question mentions "streaming data" or "real-time processing," the answer is almost always Dataflow. If it mentions "existing Spark jobs" or "Hadoop migration," choose Dataproc. If it mentions "SQL transformations" or "data already in BigQuery," choose BigQuery.

Dataflow for ML Data Pipelines

Dataflow (Apache Beam) is Google's preferred service for building ML data pipelines. Key concepts for the exam:

  • Unified batch and streaming: Same pipeline code handles both batch (bounded) and streaming (unbounded) data
  • Windowing: Fixed windows, sliding windows, and session windows for aggregating streaming data
  • Watermarks: Handle late-arriving data with configurable allowed lateness
  • Side inputs: Join streaming data with static lookup tables (e.g., feature enrichment)
  • TFX integration: Dataflow can run the Beam-based TFX components (ExampleGen, StatisticsGen, Transform) at scale
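To make the windowing concept above concrete, here is a stdlib-only Python sketch of how fixed windows group an event stream. This is not the Beam API; the function name, the 10-unit window size, and the clickstream tuples are illustrative assumptions:

```python
from collections import defaultdict

def assign_fixed_windows(events, window_size):
    """Group (timestamp, value) events into fixed, non-overlapping windows.

    Each event lands in the window starting at
    (timestamp // window_size) * window_size -- the same idea behind
    Beam's FixedWindows, minus watermarks and triggers.
    """
    windows = defaultdict(list)
    for ts, value in events:
        start = (ts // window_size) * window_size
        windows[start].append(value)
    return dict(windows)

clicks = [(1, "home"), (12, "search"), (14, "cart"), (31, "checkout")]
print(assign_fixed_windows(clicks, 10))
# windows starting at 0, 10, and 30
```

Sliding and session windows differ only in the assignment rule: sliding windows place each event in several overlapping windows, and session windows close after a gap of inactivity.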

Feature Engineering on GCP

Feature engineering is a major exam topic. Know these techniques and when to apply them:

📊 Numerical Features

  • Normalization: Scale to [0,1] range — use when features have different scales
  • Standardization: Zero mean, unit variance — better for algorithms assuming normal distribution
  • Bucketization: Convert continuous to categorical (age ranges) — captures non-linear relationships
  • Log transform: Handle skewed distributions (income, prices)
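The four numerical techniques above can be sketched in a few lines of plain Python (function names and bucket boundaries are illustrative, not from any GCP API):

```python
import math

def min_max_normalize(xs):
    # Scale a feature column to the [0, 1] range
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    # Zero mean, unit variance (population std)
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

def bucketize(x, boundaries):
    # e.g. age 34 with boundaries [18, 35, 65] -> bucket 1
    for i, b in enumerate(boundaries):
        if x < b:
            return i
    return len(boundaries)

def log_transform(x):
    # log1p(x) = log(1 + x), safe for skewed values starting at 0
    return math.log1p(x)
```

In production these statistics (min/max, mean/std) must be computed on the training set and reused at serving time, never recomputed on serving data.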
📌 Categorical Features

  • One-hot encoding: For low-cardinality categories (color, gender)
  • Feature hashing: For high-cardinality categories (user IDs, product IDs)
  • Embedding: For very high cardinality — learned dense representations
  • Cross features: Combine two categorical features (city + device_type)
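A minimal sketch of three of the categorical techniques above, in plain Python (embeddings are learned inside the model, so they are omitted; all names here are illustrative). Note the use of a deterministic hash rather than Python's built-in `hash`, which varies across processes:

```python
import hashlib

def one_hot(value, vocab):
    # Low cardinality: one slot per known category
    return [1 if value == v else 0 for v in vocab]

def hash_bucket(value, num_buckets):
    # High cardinality: hash into a fixed number of buckets;
    # occasional collisions are the price of a bounded feature space
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % num_buckets

def cross_feature(a, b):
    # Combine two categoricals into one (e.g. city x device_type),
    # typically followed by hashing or embedding
    return f"{a}_x_{b}"
```

For example, `cross_feature("NYC", "ios")` yields a single category `"NYC_x_ios"` that lets a linear model learn city-by-device interactions.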
🛠 Temporal Features

  • Time extraction: Hour, day-of-week, month, quarter from timestamps
  • Lag features: Previous values for time series (t-1, t-7, t-30)
  • Rolling aggregations: Moving averages, rolling sums over windows
  • Time since event: Days since last purchase, hours since last login
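The temporal techniques above can be sketched with the standard library alone (function names, lag choices, and window size are illustrative assumptions):

```python
from datetime import datetime

def time_features(ts):
    # Extract calendar components from a timestamp
    return {"hour": ts.hour, "day_of_week": ts.weekday(),
            "month": ts.month, "quarter": (ts.month - 1) // 3 + 1}

def lag_features(series, lags=(1, 7)):
    # Lag k at step i reads series[i - k]; early steps have no history
    return {f"lag_{k}": [None] * k + series[:-k] for k in lags}

def rolling_mean(series, window):
    # Moving average over the trailing `window` values
    return [sum(series[i - window + 1:i + 1]) / window
            if i >= window - 1 else None
            for i in range(len(series))]

def days_since(event_ts, now):
    # e.g. days since last purchase
    return (now - event_ts).days
```

Lag and rolling features must only look backward in time; reading forward from `series[i]` would leak future information into the training example.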

Vertex AI Feature Store

Feature Store is GCP's centralized repository for managing, sharing, and serving ML features. Key concepts:

  • Feature consistency: Ensures the same feature definitions are used in training and serving (prevents training-serving skew)
  • Online serving: Low-latency feature lookups for real-time predictions (backed by Bigtable)
  • Offline serving: Bulk feature retrieval for training datasets (backed by BigQuery)
  • Point-in-time lookups: Retrieve features as they were at a specific timestamp (prevents data leakage)
  • Feature sharing: Teams can discover and reuse features across projects
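The point-in-time lookup above is worth internalizing. Here is a stdlib-only sketch of the idea, assuming a feature's history is stored as a timestamp-sorted list of `(timestamp, value)` pairs (the data and function name are made up, not the Feature Store API):

```python
import bisect

def point_in_time_lookup(history, as_of):
    """Return the feature value as it existed at `as_of`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Using the *latest* value instead would leak future information into
    training examples whose labels were observed at `as_of`.
    """
    timestamps = [ts for ts, _ in history]
    i = bisect.bisect_right(timestamps, as_of)
    if i == 0:
        return None  # feature did not exist yet at that time
    return history[i - 1][1]

purchases_30d = [(100, 2), (200, 5), (300, 9)]  # (event_time, value)
print(point_in_time_lookup(purchases_30d, 250))  # 5, not the later 9
```

This is exactly the guarantee point-in-time retrieval gives you when assembling a training set: each row sees only the feature values available when its label was generated.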
Training-Serving Skew: This is a critical exam topic. Skew occurs when features are computed differently during training vs. serving. Solutions: (1) Use Feature Store for both training and serving, (2) Use TFX Transform to generate a transform graph that is applied consistently, (3) Version your feature engineering code.
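As a toy illustration of the consistency idea behind solution (2), here is a plain-Python analogue of what a serialized transform gives you: statistics fitted once on training data, shipped alongside the model, and the same apply function run at serving time (all names here are invented, not the TFX API):

```python
import json
import math

def fit_transform_params(train_values):
    """Compute transform statistics ONCE, on training data only."""
    mean = sum(train_values) / len(train_values)
    std = math.sqrt(sum((x - mean) ** 2 for x in train_values)
                    / len(train_values))
    return {"mean": mean, "std": std}

def apply_transform(x, params):
    """The SAME function runs at training time and at serving time."""
    return (x - params["mean"]) / params["std"]

params = fit_transform_params([10.0, 20.0, 30.0])
blob = json.dumps(params)           # ship the params with the model...
serving_params = json.loads(blob)   # ...and reload them in the server
assert apply_transform(40.0, serving_params) == apply_transform(40.0, params)
```

Skew appears when the serving path reimplements the transform (or refits the statistics on serving data) instead of reusing the serialized parameters.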

TFX (TensorFlow Extended) Pipeline Components

TFX is tested heavily. Know what each component does:

| Component        | Purpose                     | Output                             |
|------------------|-----------------------------|------------------------------------|
| ExampleGen       | Ingests and splits data     | tf.Example records                 |
| StatisticsGen    | Computes dataset statistics | Data statistics (TFDV)             |
| SchemaGen        | Infers data schema          | Schema definition                  |
| ExampleValidator | Detects data anomalies      | Anomaly report                     |
| Transform        | Feature engineering         | Transform graph + transformed data |
| Trainer          | Model training              | Trained model                      |
| Evaluator        | Model evaluation            | Evaluation metrics                 |
| Pusher           | Deploys model to serving    | Deployed model                     |

Data Validation with TFDV

TensorFlow Data Validation (TFDV) helps catch data issues before they reach your model:

  • Schema validation: Detect unexpected feature types, missing features, or out-of-range values
  • Distribution drift: Compare training data statistics against serving data or new batches
  • Skew detection: Identify differences between training and serving data distributions
  • Anomaly detection: Flag unexpected values, new categories, or data quality issues
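A tiny plain-Python analogue of the schema-validation check above (not the TFDV API; the schema dict and batch are made-up examples). Given a schema that records each categorical feature's allowed domain, validation flags any value outside it:

```python
def validate_against_schema(batch, schema):
    """Flag values outside each feature's allowed domain -- a minimal
    analogue of ExampleValidator checking data against a TFDV schema."""
    anomalies = []
    for feature, domain in schema.items():
        for value in batch.get(feature, []):
            if value not in domain:
                anomalies.append((feature, value))
    return anomalies

schema = {"payment_method": {"card", "cash", "paypal"}}
serving_batch = {"payment_method": ["card", "crypto", "cash"]}
print(validate_against_schema(serving_batch, schema))
# [('payment_method', 'crypto')]
```

This is precisely the failure mode in Question 3 below: a new category ("crypto") appears in serving data, and schema validation surfaces it as an anomaly before it silently degrades the model.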

Practice Questions

📝 Question 1: Your team receives streaming clickstream data from a web application. You need to compute session-level features (pages per session, time on site) and store them for both online prediction and offline training. Which architecture should you use?

A. Pub/Sub → Dataflow → BigQuery (offline) + Bigtable (online)
B. Pub/Sub → Dataproc → Cloud Storage → BigQuery
C. Pub/Sub → Dataflow → Vertex AI Feature Store
D. Pub/Sub → Cloud Functions → Firestore
Answer: C. Vertex AI Feature Store is designed exactly for this use case: it provides both online serving (low-latency lookups) and offline serving (bulk retrieval for training) from a single source of truth, with Dataflow handling the streaming session aggregation. Option A could work but requires managing two separate stores and keeping them consistent. Dataproc (B) is not ideal for streaming. Cloud Functions (D) does not provide session windowing.
📝 Question 2: You are building an ML pipeline that processes 10 TB of log data daily. The data scientists use PySpark scripts they wrote for an on-premises Hadoop cluster. You need to migrate to GCP with minimal code changes. Which service should you use?

A. Dataflow with Apache Beam
B. Dataproc with PySpark
C. BigQuery with SQL
D. Cloud Composer with Airflow
Answer: B. Dataproc is Google's managed Spark/Hadoop service, designed for lift-and-shift migration of existing PySpark workloads. "Minimal code changes" is the key constraint. Dataflow (A) would require rewriting all PySpark code in Apache Beam. BigQuery (C) would require rewriting in SQL. Cloud Composer (D) is an orchestrator, not a data processing engine.
📝 Question 3: Your model's accuracy has degraded in production. Investigation reveals that a categorical feature "payment_method" now includes a new value "crypto" that was not in the training data. Which TFX component would have caught this issue?

A. StatisticsGen
B. ExampleValidator
C. Transform
D. Evaluator
Answer: B. ExampleValidator compares incoming data against the established schema. A new categorical value ("crypto") not in the schema would be flagged as an anomaly. StatisticsGen (A) computes statistics but does not flag anomalies. Transform (C) applies feature engineering, not validation. Evaluator (D) evaluates model quality, not data quality.
📝 Question 4: You notice that your model performs well in training but poorly in production. The same features are used, but the prediction accuracy drops by 15%. You suspect training-serving skew. What is the most likely cause and fix?

A. The model is overfitting — add regularization
B. Features are computed differently in training vs. serving — use TFX Transform
C. The serving infrastructure is underpowered — increase CPU/memory
D. The training data is too old — retrain with recent data
Answer: B. Training-serving skew is most commonly caused by inconsistent feature computation. TFX Transform creates a serialized transform graph that ensures identical feature transformations during both training and serving. Overfitting (A) would show low training loss but high validation loss. Infrastructure (C) affects latency, not accuracy. Stale data (D) causes data drift, a different problem from skew.