Intermediate

Data Preparation & Processing

Data quality determines model quality. This lesson covers the GCP services used for data ingestion, transformation, feature engineering, and validation — all heavily tested on the exam.

GCP Data Processing Services Comparison

The exam frequently asks you to choose between these three data processing services. Know their strengths:

| Service  | Best For                                | Programming Model         | Key Characteristics                               |
|----------|-----------------------------------------|---------------------------|---------------------------------------------------|
| BigQuery | SQL-based analytics and transformations | SQL                       | Serverless, petabyte-scale, built-in ML (BQML)    |
| Dataflow | Streaming and batch ETL pipelines       | Apache Beam (Python/Java) | Serverless, auto-scaling, exactly-once processing |
| Dataproc | Existing Spark/Hadoop workloads         | PySpark, Spark SQL        | Managed clusters, pay-per-use, lift-and-shift     |
💡 Exam Decision Rule: If the question mentions "streaming data" or "real-time processing," the answer is almost always Dataflow. If it mentions "existing Spark jobs" or "Hadoop migration," choose Dataproc. If it mentions "SQL transformations" or "data already in BigQuery," choose BigQuery.

Dataflow for ML Data Pipelines

Dataflow (Apache Beam) is Google's preferred service for building ML data pipelines. Key concepts for the exam:

  • Unified batch and streaming: Same pipeline code handles both batch (bounded) and streaming (unbounded) data
  • Windowing: Fixed windows, sliding windows, and session windows for aggregating streaming data
  • Watermarks: Handle late-arriving data with configurable allowed lateness
  • Side inputs: Join streaming data with static lookup tables (e.g., feature enrichment)
  • TFX integration: Dataflow can run the Beam-based TFX components (ExampleGen, StatisticsGen, Transform) at scale
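To make the windowing concept above concrete, here is a stdlib-only Python sketch of how fixed windows group an event stream. This is not the Beam API; the function name, the 10-unit window size, and the clickstream tuples are illustrative assumptions:

```python
from collections import defaultdict

def assign_fixed_windows(events, window_size):
    """Group (timestamp, value) events into fixed, non-overlapping windows.

    Each event lands in the window starting at
    (timestamp // window_size) * window_size -- the same idea behind
    Beam's FixedWindows, minus watermarks and triggers.
    """
    windows = defaultdict(list)
    for ts, value in events:
        start = (ts // window_size) * window_size
        windows[start].append(value)
    return dict(windows)

clicks = [(1, "home"), (12, "search"), (14, "cart"), (31, "checkout")]
print(assign_fixed_windows(clicks, 10))
# windows starting at 0, 10, and 30
```

Sliding and session windows differ only in the assignment rule: sliding windows place each event in several overlapping windows, and session windows close after a gap of inactivity.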

Feature Engineering on GCP

Feature engineering is a major exam topic. Know these techniques and when to apply them:

📊 Numerical Features

  • Normalization: Scale to [0,1] range — use when features have different scales
  • Standardization: Zero mean, unit variance — better for algorithms assuming normal distribution
  • Bucketization: Convert continuous to categorical (age ranges) — captures non-linear relationships
  • Log transform: Handle skewed distributions (income, prices)
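The four numerical techniques above can be sketched in a few lines of plain Python (function names and bucket boundaries are illustrative, not from any GCP API):

```python
import math

def min_max_normalize(xs):
    # Scale a feature column to the [0, 1] range
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    # Zero mean, unit variance (population std)
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

def bucketize(x, boundaries):
    # e.g. age 34 with boundaries [18, 35, 65] -> bucket 1
    for i, b in enumerate(boundaries):
        if x < b:
            return i
    return len(boundaries)

def log_transform(x):
    # log1p(x) = log(1 + x), safe for skewed values starting at 0
    return math.log1p(x)
```

In production these statistics (min/max, mean/std) must be computed on the training set and reused at serving time, never recomputed on serving data.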
📌 Categorical Features

  • One-hot encoding: For low-cardinality categories (color, gender)
  • Feature hashing: For high-cardinality categories (user IDs, product IDs)
  • Embedding: For very high cardinality — learned dense representations
  • Cross features: Combine two categorical features (city + device_type)
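A minimal sketch of three of the categorical techniques above, in plain Python (embeddings are learned inside the model, so they are omitted; all names here are illustrative). Note the use of a deterministic hash rather than Python's built-in `hash`, which varies across processes:

```python
import hashlib

def one_hot(value, vocab):
    # Low cardinality: one slot per known category
    return [1 if value == v else 0 for v in vocab]

def hash_bucket(value, num_buckets):
    # High cardinality: hash into a fixed number of buckets;
    # occasional collisions are the price of a bounded feature space
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % num_buckets

def cross_feature(a, b):
    # Combine two categoricals into one (e.g. city x device_type),
    # typically followed by hashing or embedding
    return f"{a}_x_{b}"
```

For example, `cross_feature("NYC", "ios")` yields a single category `"NYC_x_ios"` that lets a linear model learn city-by-device interactions.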
🛠 Temporal Features

  • Time extraction: Hour, day-of-week, month, quarter from timestamps
  • Lag features: Previous values for time series (t-1, t-7, t-30)
  • Rolling aggregations: Moving averages, rolling sums over windows
  • Time since event: Days since last purchase, hours since last login
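The temporal techniques above can be sketched with the standard library alone (function names, lag choices, and window size are illustrative assumptions):

```python
from datetime import datetime

def time_features(ts):
    # Extract calendar components from a timestamp
    return {"hour": ts.hour, "day_of_week": ts.weekday(),
            "month": ts.month, "quarter": (ts.month - 1) // 3 + 1}

def lag_features(series, lags=(1, 7)):
    # Lag k at step i reads series[i - k]; early steps have no history
    return {f"lag_{k}": [None] * k + series[:-k] for k in lags}

def rolling_mean(series, window):
    # Moving average over the trailing `window` values
    return [sum(series[i - window + 1:i + 1]) / window
            if i >= window - 1 else None
            for i in range(len(series))]

def days_since(event_ts, now):
    # e.g. days since last purchase
    return (now - event_ts).days
```

Lag and rolling features must only look backward in time; reading forward from `series[i]` would leak future information into the training example.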

Vertex AI Feature Store

Feature Store is GCP's centralized repository for managing, sharing, and serving ML features. Key concepts:

  • Feature consistency: Ensures the same feature definitions are used in training and serving (prevents training-serving skew)
  • Online serving: Low-latency feature lookups for real-time predictions (backed by Bigtable)
  • Offline serving: Bulk feature retrieval for training datasets (backed by BigQuery)
  • Point-in-time lookups: Retrieve features as they were at a specific timestamp (prevents data leakage)
  • Feature sharing: Teams can discover and reuse features across projects
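The point-in-time lookup above is worth internalizing. Here is a stdlib-only sketch of the idea, assuming a feature's history is stored as a timestamp-sorted list of `(timestamp, value)` pairs (the data and function name are made up, not the Feature Store API):

```python
import bisect

def point_in_time_lookup(history, as_of):
    """Return the feature value as it existed at `as_of`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Using the *latest* value instead would leak future information into
    training examples whose labels were observed at `as_of`.
    """
    timestamps = [ts for ts, _ in history]
    i = bisect.bisect_right(timestamps, as_of)
    if i == 0:
        return None  # feature did not exist yet at that time
    return history[i - 1][1]

purchases_30d = [(100, 2), (200, 5), (300, 9)]  # (event_time, value)
print(point_in_time_lookup(purchases_30d, 250))  # 5, not the later 9
```

This is exactly the guarantee point-in-time retrieval gives you when assembling a training set: each row sees only the feature values available when its label was generated.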
Training-Serving Skew: This is a critical exam topic. Skew occurs when features are computed differently during training vs. serving. Solutions: (1) Use Feature Store for both training and serving, (2) Use TFX Transform to generate a transform graph that is applied consistently, (3) Version your feature engineering code.
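As a toy illustration of the consistency idea behind solution (2), here is a plain-Python analogue of what a serialized transform gives you: statistics fitted once on training data, shipped alongside the model, and the same apply function run at serving time (all names here are invented, not the TFX API):

```python
import json
import math

def fit_transform_params(train_values):
    """Compute transform statistics ONCE, on training data only."""
    mean = sum(train_values) / len(train_values)
    std = math.sqrt(sum((x - mean) ** 2 for x in train_values)
                    / len(train_values))
    return {"mean": mean, "std": std}

def apply_transform(x, params):
    """The SAME function runs at training time and at serving time."""
    return (x - params["mean"]) / params["std"]

params = fit_transform_params([10.0, 20.0, 30.0])
blob = json.dumps(params)           # ship the params with the model...
serving_params = json.loads(blob)   # ...and reload them in the server
assert apply_transform(40.0, serving_params) == apply_transform(40.0, params)
```

Skew appears when the serving path reimplements the transform (or refits the statistics on serving data) instead of reusing the serialized parameters.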

TFX (TensorFlow Extended) Pipeline Components

TFX is tested heavily. Know what each component does:

| Component        | Purpose                     | Output                             |
|------------------|-----------------------------|------------------------------------|
| ExampleGen       | Ingests and splits data     | tf.Example records                 |
| StatisticsGen    | Computes dataset statistics | Data statistics (TFDV)             |
| SchemaGen        | Infers data schema          | Schema definition                  |
| ExampleValidator | Detects data anomalies      | Anomaly report                     |
| Transform        | Feature engineering         | Transform graph + transformed data |
| Trainer          | Model training              | Trained model                      |
| Evaluator        | Model evaluation            | Evaluation metrics                 |
| Pusher           | Deploys model to serving    | Deployed model                     |

Data Validation with TFDV

TensorFlow Data Validation (TFDV) helps catch data issues before they reach your model:

  • Schema validation: Detect unexpected feature types, missing features, or out-of-range values
  • Distribution drift: Compare training data statistics against serving data or new batches
  • Skew detection: Identify differences between training and serving data distributions
  • Anomaly detection: Flag unexpected values, new categories, or data quality issues
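A tiny plain-Python analogue of the schema-validation check above (not the TFDV API; the schema dict and batch are made-up examples). Given a schema that records each categorical feature's allowed domain, validation flags any value outside it:

```python
def validate_against_schema(batch, schema):
    """Flag values outside each feature's allowed domain -- a minimal
    analogue of ExampleValidator checking data against a TFDV schema."""
    anomalies = []
    for feature, domain in schema.items():
        for value in batch.get(feature, []):
            if value not in domain:
                anomalies.append((feature, value))
    return anomalies

schema = {"payment_method": {"card", "cash", "paypal"}}
serving_batch = {"payment_method": ["card", "crypto", "cash"]}
print(validate_against_schema(serving_batch, schema))
# [('payment_method', 'crypto')]
```

This is precisely the failure mode in Question 3 below: a new category ("crypto") appears in serving data, and schema validation surfaces it as an anomaly before it silently degrades the model.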

Practice Questions

📝 Question 1: Your team receives streaming clickstream data from a web application. You need to compute session-level features (pages per session, time on site) and store them for both online prediction and offline training. Which architecture should you use?

A. Pub/Sub → Dataflow → BigQuery (offline) + Bigtable (online)
B. Pub/Sub → Dataproc → Cloud Storage → BigQuery
C. Pub/Sub → Dataflow → Vertex AI Feature Store
D. Pub/Sub → Cloud Functions → Firestore
Answer: C. Vertex AI Feature Store is designed exactly for this use case: it provides both online serving (low-latency lookups) and offline serving (bulk retrieval for training) from a single source of truth, with Dataflow handling the streaming session aggregation. Option A could work but requires managing two separate stores and keeping them consistent. Dataproc (B) is not ideal for streaming. Cloud Functions (D) does not provide session windowing.
📝 Question 2: You are building an ML pipeline that processes 10 TB of log data daily. The data scientists use PySpark scripts they wrote for an on-premises Hadoop cluster. You need to migrate to GCP with minimal code changes. Which service should you use?

A. Dataflow with Apache Beam
B. Dataproc with PySpark
C. BigQuery with SQL
D. Cloud Composer with Airflow
Answer: B. Dataproc is Google's managed Spark/Hadoop service, designed for lift-and-shift migration of existing PySpark workloads. "Minimal code changes" is the key constraint. Dataflow (A) would require rewriting all PySpark code in Apache Beam. BigQuery (C) would require rewriting in SQL. Cloud Composer (D) is an orchestrator, not a data processing engine.
📝 Question 3: Your model's accuracy has degraded in production. Investigation reveals that a categorical feature "payment_method" now includes a new value "crypto" that was not in the training data. Which TFX component would have caught this issue?

A. StatisticsGen
B. ExampleValidator
C. Transform
D. Evaluator
Answer: B. ExampleValidator compares incoming data against the established schema. A new categorical value ("crypto") not in the schema would be flagged as an anomaly. StatisticsGen (A) computes statistics but does not flag anomalies. Transform (C) applies feature engineering, not validation. Evaluator (D) evaluates model quality, not data quality.
📝 Question 4: You notice that your model performs well in training but poorly in production. The same features are used, but the prediction accuracy drops by 15%. You suspect training-serving skew. What is the most likely cause and fix?

A. The model is overfitting — add regularization
B. Features are computed differently in training vs. serving — use TFX Transform
C. The serving infrastructure is underpowered — increase CPU/memory
D. The training data is too old — retrain with recent data
Answer: B. Training-serving skew is most commonly caused by inconsistent feature computation. TFX Transform creates a serialized transform graph that ensures identical feature transformations during both training and serving. Overfitting (A) would show low training loss but high validation loss. Infrastructure (C) affects latency, not accuracy. Stale data (D) causes data drift, a different problem from skew.