Data Engineering for ML
These 10 questions cover data engineering topics critical for MLOps interviews. Data is the foundation of every ML system — interviewers want to know you can build reliable, scalable data pipelines that feed models the right data at the right time.
Q1: How do you design a data pipeline for ML training?
An ML training data pipeline has six stages, each with its own quality gates:
- Data extraction: Pull from source systems (databases, APIs, event streams, data lakes). Use incremental extraction (only new/changed data) to avoid reprocessing everything. Record extraction timestamps for data lineage.
- Data cleaning: Handle missing values, remove duplicates, fix encoding issues, normalize formats. Log every transformation applied so it is reproducible. Never modify source data — clean into a staging area.
- Data validation: Run schema checks (column types, required fields), statistical checks (distribution within expected ranges), and business rule checks (no negative prices, dates in the future). Use Great Expectations or TensorFlow Data Validation. Fail the pipeline if validation fails.
- Feature engineering: Compute derived features from raw data. Use the same feature computation code as the serving path (or a feature store) to prevent training-serving skew. Log feature statistics (mean, std, percentiles).
- Dataset creation: Split into train/validation/test sets. Ensure no data leakage across splits (especially for time-series data — split by time, not randomly). Version the final dataset with DVC or Delta Lake.
- Data storage: Store processed datasets in a format optimized for ML training: Parquet (columnar, compressed, fast reads) or TFRecord (TensorFlow-native). Partition by date for easy incremental processing.
Orchestration: Use Apache Airflow or Prefect to schedule and orchestrate pipeline stages. Each stage is idempotent (can be re-run safely). Include retry logic with exponential backoff for transient failures.
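The two orchestration properties above — idempotent stages and retry with exponential backoff — can be sketched in plain Python. This is a minimal illustration, not Airflow code (Airflow gives you the retry behavior declaratively via task-level `retries` settings); `write_partition` and the delay values are hypothetical:

```python
import time

def run_with_retries(stage, *args, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Run one pipeline stage, retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return stage(*args)
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure and alert
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s, ...

def write_partition(store, date, rows):
    """Idempotent stage: re-running it for the same date leaves the same state.

    Overwriting the partition (upsert semantics) instead of appending means a
    retried run cannot create duplicate rows."""
    store[date] = sorted(set(rows))
    return store[date]
```

Because every stage is safe to re-run, the orchestrator can simply retry a failed task without any cleanup step.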
Q2: How do you implement data quality checks for ML?
Data quality for ML goes beyond traditional data engineering checks. ML-specific data quality means the data is suitable for model training and will produce reliable predictions:
- Schema validation: Column names, types, and ordering match the expected schema. New columns do not appear unexpectedly. Required columns are not missing. Use tools like Pandera or Great Expectations.
- Completeness checks: Missing value rates for each feature must be below a threshold (e.g., <5%). If a critical feature suddenly has 50% nulls, the upstream data source likely broke.
- Distribution checks: Compare the distribution of each feature against a reference baseline (the data the production model was trained on). Use PSI or KS test. If distributions shift significantly, the data may no longer be suitable for the current model.
- Freshness checks: Verify that data is not stale. If the pipeline runs daily, the most recent record should be from today or yesterday. Stale data means the upstream source stopped sending data.
- Consistency checks: Cross-column validation. Example: if "country" is "US", then "currency" should be "USD". If "age" is 5, then "has_driver_license" should be false. These catch upstream data corruption.
- Label quality: For supervised learning, check label distribution (sudden class imbalance changes), label noise rate (random sample verification), and label freshness (stale labels may not reflect current reality).
```python
# Great Expectations example
import great_expectations as gx

class DataQualityError(Exception):
    """Raised when a validation run fails, so the pipeline halts."""

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("training_data.csv")

# Define expectations
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
validator.expect_column_values_to_be_in_set("country", ["US", "UK", "CA", "DE", "FR"])
validator.expect_column_mean_to_be_between("purchase_amount", min_value=10, max_value=500)
validator.expect_column_proportion_of_unique_values_to_be_between("user_id", min_value=0.9)

results = validator.validate()
if not results.success:
    raise DataQualityError(f"Data validation failed: {results}")
```
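For the distribution checks mentioned above, here is a minimal PSI sketch in plain Python. The bin edges come from the baseline sample, the `1e-6` floor avoids `log(0)`, and the usual 0.1/0.25 interpretation thresholds are industry conventions rather than part of any particular library:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a current sample.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def frac(values):
        counts = [0] * bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        return [max(c / len(values), 1e-6) for c in counts]  # floor avoids log(0)

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A pipeline would compute this per feature against the training baseline and alert (or block retraining) when the index crosses the chosen threshold.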
Q3: How do you handle feature engineering at scale?
Feature engineering at scale requires handling three challenges: compute efficiency, consistency, and discovery.
- Batch feature computation: Use distributed processing frameworks (Apache Spark, Dask, or BigQuery) to compute features over large datasets. Optimize with column pruning (only read needed columns), predicate pushdown (filter early), and caching intermediate results.
- Real-time feature computation: Some features must be computed at inference time (e.g., "user's last 5 minutes of activity"). Use streaming frameworks (Kafka Streams, Flink) for real-time aggregations. Store results in a low-latency online store (Redis) for serving.
- Feature store pattern: Compute each feature once in a canonical pipeline. Store in both offline (warehouse) and online (key-value) stores. Both training and serving read from the same store, eliminating training-serving skew.
- Feature catalog: Maintain a registry of all features with documentation: name, description, owner, computation logic, data source, freshness SLA, and which models use it. This prevents teams from independently building the same feature differently.
Common pitfalls:
- Feature leakage: Using information not available at prediction time. Example: computing "average transaction amount this month" using future transactions. Always use point-in-time correct joins.
- Cardinality explosion: One-hot encoding a feature with 100K categories creates 100K columns. Use embedding layers, target encoding, or frequency encoding instead.
- Stale features: A feature computed daily might be 23 hours old at serving time. If freshness matters, compute it in real-time or update more frequently.
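The point-in-time correct join that prevents feature leakage can be sketched as follows. The record shape (`entity`, `ts`, `value` fields) is hypothetical; feature stores like Feast implement this join for you at scale:

```python
def point_in_time_join(label_rows, feature_rows):
    """Attach to each label the latest feature value known at label time.

    Joining on entity alone (ignoring timestamps) would let a feature computed
    from future data leak into the training example."""
    feats = sorted(feature_rows, key=lambda r: r["ts"])
    joined = []
    for lab in label_rows:
        visible = [f for f in feats
                   if f["entity"] == lab["entity"] and f["ts"] <= lab["ts"]]
        joined.append({**lab, "feature": visible[-1]["value"] if visible else None})
    return joined
```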
Q4: What is data versioning and why is it critical for ML?
Data versioning tracks changes to datasets over time, enabling reproducibility, debugging, and rollback. It is essential because ML models are a function of both code and data — changing either changes the model.
Why it matters:
- Reproducibility: To reproduce a model, you need the exact data it was trained on. Without data versioning, "retrain on the same data" is impossible because the data source has been updated since then.
- Debugging: When a model degrades, you need to compare the current training data with the data that produced the good model. What changed? Was bad data introduced? Were labels corrupted?
- Compliance: Regulated industries (finance, healthcare) require audit trails. You must be able to show exactly what data a model was trained on and when.
- Rollback: If new training data is corrupted, roll back to the previous data version and retrain.
Approaches:
- DVC (Data Version Control): Git-like versioning for data files. Stores data in remote storage (S3, GCS) and metadata (hashes, paths) in Git. Track dataset versions alongside code versions. Best for file-based datasets.
- Delta Lake / Apache Iceberg: Table formats that support time-travel queries on data warehouses. Query data as it existed at any point in time. Best for large-scale structured data in data lakes.
- LakeFS: Git-like branching and versioning for data lakes. Create branches for experiments, merge data changes through PRs. Best for data lake workflows.
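The core idea behind file-based versioning can be sketched with content addressing. This mirrors what DVC stores in Git — per-file hashes plus a manifest — while the bytes themselves live in remote storage; it is an illustration of the concept, not DVC's actual on-disk format:

```python
import hashlib
import json

def dataset_version(files):
    """Content-address a dataset: hash each file's bytes, then hash the manifest.

    `files` maps path -> bytes. Identical data always yields the same version;
    any changed byte yields a new one, making versions reproducible pointers."""
    manifest = {path: hashlib.sha256(data).hexdigest()
                for path, data in sorted(files.items())}
    version = hashlib.sha256(
        json.dumps(manifest, sort_keys=True).encode()).hexdigest()[:12]
    return version, manifest
```

Recording this version ID alongside the Git commit of the training code is what makes "retrain on the exact same data" possible later.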
Q5: When should you use streaming vs batch processing for ML?
| Aspect | Batch Processing | Streaming Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Data Volume | Process entire datasets | Process individual events |
| Complexity | Simpler (MapReduce, SQL) | More complex (windowing, watermarks, state management) |
| Cost | Lower (ephemeral compute) | Higher (always-on infrastructure) |
| Tools | Spark, dbt, Airflow | Kafka, Flink, Spark Streaming |
| ML Use Cases | Daily model training, batch predictions, offline feature computation | Real-time features, live anomaly detection, online model updates |
Decision framework:
- If freshness requirement is hours/days: use batch. Examples: nightly retraining, daily recommendation refresh, weekly churn prediction.
- If freshness requirement is seconds: use streaming. Examples: fraud detection at transaction time, real-time personalization, live content moderation.
- If you need both: use the Lambda architecture (batch layer for completeness + speed layer for low latency) or the Kappa architecture (everything through streaming, with replay capability).
Production reality: Most ML systems use a hybrid approach. Features like "user's lifetime purchase count" are computed in batch (daily). Features like "user's click count in the last 5 minutes" are computed via streaming. Both are stored in the feature store and served together at inference time.
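A real-time feature like "click count in the last 5 minutes" reduces to a sliding-window aggregation. In production this would run in Flink or Kafka Streams with results written to Redis; here is an in-process sketch of the eviction logic:

```python
from collections import deque

class SlidingWindowCount:
    """Sliding-window event count for one entity (e.g., clicks per user)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # event timestamps, oldest first

    def add(self, ts):
        self.events.append(ts)

    def count(self, now):
        # Evict events that have aged out of the window before counting.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)
```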
Q6: How do you handle data labeling at scale?
Data labeling is often the bottleneck for ML projects. Scaling it requires a combination of human effort and automation:
- Labeling platforms: Use managed services (Labelbox, Scale AI, Amazon SageMaker Ground Truth) or open-source tools (Label Studio, Prodigy). These provide task management, quality control, and labeler interfaces.
- Active learning: Instead of labeling random samples, use the model to identify examples it is most uncertain about. Label those first for maximum learning efficiency. This can reduce labeling effort by 50–80% compared to random sampling.
- Weak supervision: Use programmatic labeling functions (Snorkel) that encode heuristics, regular expressions, and knowledge bases to generate approximate labels. Combine multiple weak labelers to produce training labels without manual annotation.
- Quality control: Include known-answer questions (gold standards) in the labeling stream. If a labeler gets gold standards wrong, flag their other labels for review. Use inter-annotator agreement (Cohen's kappa, Fleiss' kappa) to measure consistency. For subjective tasks, have 3+ labelers per example and use majority voting.
- Semi-supervised learning: Train on a small labeled dataset, then use the model to pseudo-label a large unlabeled dataset. Iterate: retrain on labeled + pseudo-labeled data, then re-pseudo-label. Effective when unlabeled data is abundant and labeled data is scarce.
Cost optimization: Tier your labeling: use LLMs (GPT-4, Claude) for initial bulk labeling ($0.01–0.05 per label), human reviewers for quality assurance ($0.10–0.50 per label), and domain experts for edge cases ($1–5 per label). Route easy examples to cheaper labeling methods and hard examples to experts.
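The active-learning step above — route the most uncertain examples to labelers first — can be sketched with entropy-based uncertainty sampling. `predict_proba` here stands in for whatever scoring function your model exposes:

```python
import math

def entropy(probs):
    """Prediction uncertainty: highest when class probabilities are equal."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool, predict_proba, budget):
    """Uncertainty sampling: pick the `budget` examples the model is least sure of."""
    scored = sorted(((entropy(predict_proba(x)), i) for i, x in enumerate(pool)),
                    reverse=True)
    return [i for _, i in scored[:budget]]
```

Other uncertainty measures (margin between top-2 classes, least-confidence) slot into the same loop; entropy is just the simplest to show.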
Q7: What is data lineage and why do MLOps teams need it?
Data lineage tracks the origin, transformations, and destinations of data as it flows through the ML system. It answers: "Where did this data come from, what happened to it, and where did it go?"
Why it matters for ML:
- Root cause analysis: When a model degrades, lineage lets you trace back from the model's predictions to the exact data pipeline, transformation, and source table that might have caused the issue.
- Impact analysis: Before changing a data source or transformation, lineage shows which downstream models and features will be affected. "If I change this column name, which 15 models break?"
- Compliance: GDPR requires knowing where personal data is used. Lineage answers: "This model was trained on data that includes user X's information. To honor a deletion request, we need to retrain without that data."
- Debugging training-serving skew: Lineage shows if the training pipeline and serving pipeline use the same data transformations. If they diverge at any point, you have found the source of skew.
Implementation:
- Metadata catalogs: Use OpenMetadata, DataHub, or Amundsen to catalog datasets, tables, and features with ownership, schema, and lineage information.
- Pipeline-level tracking: Orchestrators like Airflow and Kubeflow can emit lineage events that link inputs to outputs for each pipeline step.
- Column-level lineage: Track not just table-to-table lineage but column-to-column. "The 'user_age' feature was derived from the 'birth_date' column in the 'users' table." dbt provides this natively.
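Pipeline-level lineage tracking boils down to emitting one event per step that links inputs to outputs, then querying those events. A toy sketch (in a real system these events would go to a metadata service via a standard like OpenLineage; the dataset names are made up):

```python
LINEAGE = []  # in production: emitted to a metadata service, not a module global

def record_step(step, inputs, outputs):
    """Emit one lineage event linking a pipeline step's inputs to its outputs."""
    LINEAGE.append({"step": step, "inputs": list(inputs), "outputs": list(outputs)})

def impacted_by(dataset):
    """Impact analysis: which downstream steps consume this dataset?"""
    return [e["step"] for e in LINEAGE if dataset in e["inputs"]]
```

With events like these accumulated across every run, "which 15 models break if I rename this column?" becomes a graph query instead of tribal knowledge.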
Q8: How do you handle data skew and class imbalance in production ML?
Class imbalance (e.g., 99.5% non-fraud, 0.5% fraud) is the norm in production ML, not the exception. Here is how to handle it:
- Data-level approaches:
- Oversampling minority: SMOTE (Synthetic Minority Oversampling Technique) generates synthetic minority examples by interpolating between existing ones. Better than random duplication because it increases diversity.
- Undersampling majority: Randomly remove majority class examples. Simple but wastes data. Use informed undersampling (Tomek links, neighborhood cleaning rule) to remove only redundant or noisy majority examples.
- Stratified sampling: Ensure train/validation/test splits maintain the same class ratio as the full dataset.
- Algorithm-level approaches:
- Class weights: Assign higher loss weight to minority class examples. Most frameworks support this natively (scikit-learn's class_weight="balanced", PyTorch's CrossEntropyLoss with weight parameter).
- Focal loss: Down-weights well-classified examples and focuses on hard examples. Originally designed for object detection but works well for any imbalanced classification.
- Threshold tuning: Instead of using the default 0.5 threshold, tune the decision threshold on the validation set to optimize for your business metric (e.g., maximize recall at 95% precision).
- Evaluation: Never use accuracy for imbalanced datasets. A model that always predicts "not fraud" achieves 99.5% accuracy but is useless. Use precision, recall, F1, AUC-ROC, and precision-recall curves instead.
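Threshold tuning from the list above can be sketched directly: scan candidate thresholds from low to high and return the first one whose precision meets the business target, which keeps recall as high as possible among qualifying thresholds. A minimal pure-Python version (scikit-learn's `precision_recall_curve` does the heavy lifting in practice):

```python
def tune_threshold(scores, labels, min_precision=0.95):
    """Lowest decision threshold whose precision meets the target -> max recall."""
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p and y)
        fp = sum(1 for p, y in zip(preds, labels) if p and not y)
        if tp + fp == 0:
            break  # nothing predicted positive beyond this point
        if tp / (tp + fp) >= min_precision:
            return t  # first (lowest) qualifying threshold keeps recall highest
    return None  # target precision unreachable on this validation set
```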
Q9: How do you build data pipelines that are resilient to upstream failures?
Upstream data sources will fail. Your pipeline must handle this gracefully without corrupting models or stopping predictions:
- Data contracts: Define explicit contracts with upstream data producers: schema, freshness SLA, volume expectations, and data quality guarantees. Document these contracts and set up automated validation.
- Idempotent operations: Every pipeline step should be safely re-runnable. Processing the same data twice should produce the same result. Use upsert (insert or update) instead of append to prevent duplicates.
- Retry with backoff: When an upstream API or database is temporarily unavailable, retry with exponential backoff (1s, 2s, 4s, 8s, max 5 minutes). Set a maximum retry count and alert after exhaustion.
- Dead letter queues: When a record cannot be processed (malformed, violates schema), route it to a dead letter queue for investigation instead of failing the entire pipeline. Process valid records; handle bad records separately.
- Fallback strategies: If fresh data is unavailable, fall back to the last known good dataset. The model continues serving with slightly stale data rather than going offline entirely. Set freshness alerts so the team knows data is stale.
- Circuit breakers: If an upstream source has been failing for more than N minutes, stop retrying and switch to fallback mode. This prevents cascade failures where your pipeline overwhelms a struggling upstream service with retries.
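The circuit-breaker and fallback patterns above can be combined in a small sketch. The thresholds and the injectable clock are illustrative choices; libraries like pybreaker implement the full state machine (closed/open/half-open):

```python
import time

class CircuitBreaker:
    """Stop hammering a failing upstream; serve the fallback until it recovers."""

    def __init__(self, failure_threshold=3, reset_after=60.0, clock=time.monotonic):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_after = reset_after
        self.opened_at = None  # None -> closed (normal operation)
        self.clock = clock

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: skip the struggling upstream entirely
            self.opened_at, self.failures = None, 0  # half-open: probe again
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            return fallback()  # e.g., last known good dataset
```

Note the fallback path is exactly the "last known good dataset" strategy: predictions keep flowing on slightly stale data while the breaker protects the upstream.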
Testing resilience: Periodically inject failures into your pipeline (chaos engineering). Simulate upstream API timeout, data schema change, null-filled data, and late-arriving data. Verify that your pipeline handles each scenario correctly.
Q10: How do you handle PII (personally identifiable information) in ML data pipelines?
PII handling is a legal and ethical requirement. Getting it wrong can result in regulatory fines (GDPR: up to 4% of global revenue) and loss of user trust:
- Data classification: Tag every column in your data catalog as PII, quasi-PII, or non-PII. PII includes: names, email addresses, phone numbers, SSNs, IP addresses. Quasi-PII includes: zip codes, birth dates, gender (can identify individuals when combined).
- Anonymization techniques:
- Hashing: Replace PII with a one-way hash (SHA-256). Preserves the ability to join datasets without exposing raw values. Use a salt to prevent rainbow table attacks.
- Tokenization: Replace PII with a random token. A separate secure lookup table maps tokens back to real values. Only authorized services can access the lookup table.
- K-anonymity: Generalize quasi-identifiers (e.g., replace exact age with age ranges) so every record is indistinguishable from at least k-1 others.
- Differential privacy: Add calibrated noise to aggregated data or model outputs. Provides mathematical guarantees that individual records cannot be inferred. Used by Apple, Google, and the US Census Bureau.
- Access controls: Restrict PII access to essential personnel only. Use column-level encryption for PII fields in data warehouses. Log all PII access for audit trails.
- Data deletion: GDPR right to deletion means you must be able to remove a specific user's data from training datasets and retrain models. Design your pipeline so this is operationally feasible (hint: use user IDs that can be filtered, not raw PII baked into features).
- Training on PII: Whenever possible, train models on anonymized data. If PII features are necessary (e.g., a named-entity recognizer), use federated learning or train in a secure enclave where data never leaves the trusted environment.
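Two of the anonymization techniques above — salted hashing and k-anonymity-style generalization — fit in a few lines. The salt value here is a placeholder; in practice it would be loaded from a secret manager and rotated per policy:

```python
import hashlib
import hmac

SALT = b"load-from-a-secret-manager"  # placeholder: never hard-code a real salt

def pseudonymize(value: str) -> str:
    """Salted one-way hash (HMAC-SHA256): rows stay joinable across tables,
    values are not recoverable, and the salt defeats rainbow-table lookups."""
    return hmac.new(SALT, value.encode(), hashlib.sha256).hexdigest()

def generalize_age(age: int, bucket: int = 10) -> str:
    """K-anonymity-style generalization: exact age -> coarse age range."""
    lo = (age // bucket) * bucket
    return f"{lo}-{lo + bucket - 1}"
```

Because the hash is deterministic for a given salt, `pseudonymize` preserves joins on user ID across datasets without ever storing the raw identifier in the feature tables.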
Lilly Tech Systems