# Prepare Data for Modeling (Intermediate)
Data preparation is where most real-world ML time is spent. This lesson covers data pipelines in Azure ML: working with datastores and data assets, data profiling, handling missing and inconsistent data, and detecting data drift, all of which are tested on the DP-100 exam.
## Working with Datastores
Datastores are references to storage services. They store connection information so you do not need to hard-code credentials in your scripts.
```python
# Register an Azure Blob Storage datastore
from azure.ai.ml.entities import AzureBlobDatastore, AccountKeyConfiguration

blob_datastore = AzureBlobDatastore(
    name="training-data-store",
    description="Blob storage for training datasets",
    account_name="dp100storageaccount",
    container_name="training-data",
    credentials=AccountKeyConfiguration(
        account_key="your-account-key"  # Stored in Key Vault in production
    )
)
ml_client.datastores.create_or_update(blob_datastore)

# Register an Azure Data Lake Storage Gen2 datastore
from azure.ai.ml.entities import AzureDataLakeGen2Datastore

adls_datastore = AzureDataLakeGen2Datastore(
    name="lake-data-store",
    description="Data Lake for large-scale datasets",
    account_name="dp100datalake",
    filesystem="ml-data"
    # Uses workspace managed identity for auth (recommended)
)
ml_client.datastores.create_or_update(adls_datastore)
```
## Data Versioning
Azure ML data assets support versioning, enabling you to track which data was used for each experiment.
```python
# Create versioned data assets
from azure.ai.ml import Input
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Version 1 - initial dataset
data_v1 = Data(
    name="customer-data",
    version="1",
    description="Customer dataset - January 2026",
    path="azureml://datastores/training-data-store/paths/customer/jan2026.csv",
    type=AssetTypes.URI_FILE
)
ml_client.data.create_or_update(data_v1)

# Version 2 - updated dataset
data_v2 = Data(
    name="customer-data",
    version="2",
    description="Customer dataset - February 2026 (added new features)",
    path="azureml://datastores/training-data-store/paths/customer/feb2026.csv",
    type=AssetTypes.URI_FILE
)
ml_client.data.create_or_update(data_v2)

# Reference a specific version in a job
job_input = Input(
    type="uri_file",
    path="azureml:customer-data:2"  # Use version 2
)
```
## MLTable for Structured Data
MLTable defines schema, transformations, and column types for tabular data. It is especially useful when working with AutoML and pipeline components.
```yaml
# MLTable definition file, placed in the data folder
# File: ./data/churn-mltable/MLTable
paths:
  - file: ../churn-data.csv
transformations:
  - read_delimited:
      delimiter: ","
      header: all_files_same_headers
      encoding: utf8
  - convert_column_types:
      - columns: age
        column_type: int
      - columns: monthly_charges
        column_type: float
      - columns: churn
        column_type: boolean
  - drop_columns:
      - customer_id    # Not useful for modeling
      - phone_number   # PII - should not be in training data
```

```python
# Load and use MLTable in Python
import mltable

# Load from a local path
tbl = mltable.load("./data/churn-mltable/")
df = tbl.to_pandas_dataframe()
print(df.head())
print(df.dtypes)

# Load from a registered data asset
data_asset = ml_client.data.get("customer-churn-mltable", version="1")
tbl = mltable.load(data_asset.path)
df = tbl.to_pandas_dataframe()
```
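For intuition, the MLTable transformations above map onto plain pandas operations. The sketch below reproduces them on an invented inline CSV sample (the column names match the churn example; nothing here is Azure-specific):

```python
import io

import pandas as pd

# Hypothetical sample matching the churn schema used above
csv_data = io.StringIO(
    "customer_id,phone_number,age,monthly_charges,churn\n"
    "c1,555-0100,34,29.99,True\n"
    "c2,555-0101,51,84.50,False\n"
)

# read_delimited: comma-delimited, header row, utf-8
df = pd.read_csv(csv_data, delimiter=",", encoding="utf-8")

# convert_column_types: enforce the declared column types
df = df.astype({"age": "int64", "monthly_charges": "float64", "churn": "bool"})

# drop_columns: remove the identifier and the PII column
df = df.drop(columns=["customer_id", "phone_number"])

print(df.dtypes)
```

The advantage of putting these steps in the MLTable file instead is that every consumer of the data asset (AutoML, pipeline components, notebooks) gets the same schema without repeating this code.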
## Data Preparation in Pipelines
Production data preparation should be part of a pipeline, not a manual notebook step. This ensures reproducibility and automation.
```python
# Data preparation component for a pipeline
# File: src/prep_data.py
import argparse
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-data", type=str)
    parser.add_argument("--output-train", type=str)
    parser.add_argument("--output-test", type=str)
    parser.add_argument("--test-size", type=float, default=0.2)
    args = parser.parse_args()

    # Load data
    df = pd.read_csv(args.input_data)
    print(f"Input shape: {df.shape}")

    # Handle missing values
    # Numeric: fill with median
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if df[col].isnull().sum() > 0:
            median_val = df[col].median()
            df[col] = df[col].fillna(median_val)
            print(f"  Filled {col} nulls with median: {median_val}")

    # Categorical: fill with mode
    cat_cols = df.select_dtypes(include=["object"]).columns
    for col in cat_cols:
        if df[col].isnull().sum() > 0:
            mode_val = df[col].mode()[0]
            df[col] = df[col].fillna(mode_val)
            print(f"  Filled {col} nulls with mode: {mode_val}")

    # Remove duplicates
    before = len(df)
    df.drop_duplicates(inplace=True)
    print(f"  Removed {before - len(df)} duplicate rows")

    # Encode categorical variables
    df = pd.get_dummies(df, drop_first=True)

    # Split train/test
    train_df, test_df = train_test_split(
        df, test_size=args.test_size, random_state=42
    )

    # Save outputs
    Path(args.output_train).mkdir(parents=True, exist_ok=True)
    Path(args.output_test).mkdir(parents=True, exist_ok=True)
    train_df.to_csv(f"{args.output_train}/train.csv", index=False)
    test_df.to_csv(f"{args.output_test}/test.csv", index=False)
    print(f"Train: {train_df.shape}, Test: {test_df.shape}")


if __name__ == "__main__":
    main()
```
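To see the imputation and encoding steps of the script in isolation, here is a minimal sketch on a toy frame (the data and column names are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "tenure": [1.0, np.nan, 10.0, 4.0],
    "plan": ["basic", "premium", None, "basic"],
})

# Numeric columns: fill missing values with the column median
for col in df.select_dtypes(include=[np.number]).columns:
    df[col] = df[col].fillna(df[col].median())

# Categorical columns: fill missing values with the most frequent value
for col in df.select_dtypes(include=["object"]).columns:
    df[col] = df[col].fillna(df[col].mode()[0])

# One-hot encode, dropping the first level to avoid redundant columns
df = pd.get_dummies(df, drop_first=True)
print(df)
```

The median of `[1, 10, 4]` is 4.0, so the missing `tenure` becomes 4.0; the missing `plan` becomes the mode, `basic`; and `get_dummies(drop_first=True)` leaves a single `plan_premium` indicator.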
## Data Drift Detection
Data drift occurs when production data differs from training data. Azure ML provides built-in data drift monitoring for deployed models.
| Drift Type | What Changes | Impact | Detection Method |
|---|---|---|---|
| Data drift | Input feature distributions | Model predictions become unreliable | Statistical tests (KS test, PSI) |
| Concept drift | Relationship between features and target | Model accuracy degrades | Monitor prediction quality over time |
| Upstream data changes | Data pipeline outputs | Missing features, schema changes | Schema validation, data quality checks |
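As an illustration of the statistical tests mentioned in the table, a Population Stability Index (PSI) check for one numeric feature can be sketched in a few lines of NumPy. The 0.2 threshold and 10-bin convention are common rules of thumb, not Azure ML defaults, and the distributions are simulated:

```python
import numpy as np

def population_stability_index(reference, production, bins=10):
    """PSI between a reference sample and a production sample of one feature."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    # Floor the proportions to avoid division by zero and log(0)
    ref_pct = np.clip(ref_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))

rng = np.random.default_rng(42)
train = rng.normal(50, 10, 5000)    # training-time distribution
drifted = rng.normal(58, 10, 5000)  # shifted production distribution

psi = population_stability_index(train, drifted)
print(f"PSI: {psi:.3f}")
```

A PSI above roughly 0.2 is conventionally treated as significant drift; the shifted sample above lands well past that, while two samples from the same distribution score near zero.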
```python
# Configure data drift monitoring
from azure.ai.ml import Input
from azure.ai.ml.entities import (
    MonitoringTarget,
    MonitorDefinition,
    MonitorSchedule,
    DataDriftSignal,
    ProductionData,
    ReferenceData,
    RecurrenceTrigger
)

# Define what to monitor
monitor_definition = MonitorDefinition(
    compute="dp100-cluster",
    monitoring_target=MonitoringTarget(
        endpoint_deployment_id="/subscriptions/.../endpoints/churn-endpoint/deployments/blue"
    ),
    monitoring_signals={
        "data_drift": DataDriftSignal(
            production_data=ProductionData(
                input_data=Input(
                    type="uri_folder",
                    path="azureml://datastores/workspaceblobstore/paths/production-data/"
                ),
                data_context="model_inputs"
            ),
            reference_data=ReferenceData(
                input_data=Input(
                    type="mltable",
                    path="azureml:customer-churn-mltable:1"
                ),
                data_context="training"
            ),
            metric_thresholds={
                "numerical": {"jensen_shannon_distance": 0.1},
                "categorical": {"pearsons_chi_squared_test": 0.05}
            }
        )
    }
)

# Schedule the monitor to run weekly
monitor_schedule = MonitorSchedule(
    name="churn-data-drift-monitor",
    trigger=RecurrenceTrigger(frequency="week", interval=1),
    create_monitor=monitor_definition
)
ml_client.schedules.begin_create_or_update(monitor_schedule)
```
## Practice Questions
**Question 1.** Your training scripts need to read data from an Azure storage account, and security policy forbids exposing credentials in code. What should you do?

A. Store credentials in the training script as environment variables
B. Register an Azure ML datastore with Key Vault-backed credentials
C. Hard-code the storage account key in the pipeline YAML
D. Use a shared access signature (SAS) URL directly in the script
Show Answer
B. Register an Azure ML datastore with Key Vault-backed credentials. Datastores abstract away credential management. When you register a datastore, Azure ML stores the credentials in the associated Key Vault. Your training scripts reference the datastore name, never touching credentials directly. This is the recommended and most secure approach.
**Question 2.** Which Azure ML data asset type lets you define a schema with column types and transformations, such as dropping columns, for tabular data?

A. URI File
B. URI Folder
C. MLTable
D. Azure Open Dataset
Show Answer
C. MLTable. MLTable supports schema definition including column types, transformations like dropping columns, and reading options. URI File and URI Folder are references to raw files without schema information. MLTable is specifically designed for structured tabular data with transformations.
**Question 3.** A deployed model's accuracy is degrading over time, but the distributions of the input features remain statistically similar to the training data. What is occurring?

A. Data drift
B. Concept drift
C. Upstream data changes
D. Feature drift
Show Answer
B. Concept drift. When input features remain statistically similar but model accuracy declines, the relationship between features and the target variable has changed. This is concept drift. Data drift would show statistical differences in the input features themselves. Upstream data changes would manifest as schema issues or missing features.
**Question 4.** You need to be able to identify exactly which dataset was used for each experiment so that results are reproducible. What should you do?

A. Include the file path in the experiment description
B. Use versioned data assets and reference them by name and version in the job
C. Copy the dataset into the experiment output folder
D. Log the file hash as a custom metric
Show Answer
B. Use versioned data assets and reference them by name and version in the job. Azure ML data assets support versioning natively. When you reference azureml:customer-data:2 in a job input, Azure ML automatically records this lineage. This is the standard approach for data versioning and reproducibility in Azure ML.
**Question 5.** What is the main benefit of authenticating an Azure ML datastore with the workspace managed identity instead of an account key?

A. Faster data access speeds
B. No credentials to store or rotate
C. Automatic data encryption
D. Support for larger file sizes
Show Answer
B. No credentials to store or rotate. Managed identity authentication eliminates the need to store secrets in Key Vault or manage credential rotation. The workspace's managed identity is granted access to the storage account via Azure RBAC. This is the most secure and lowest-maintenance authentication method for Azure ML datastores.
Lilly Tech Systems