Prepare Data for Modeling (Intermediate)

Data preparation is where most real-world ML time is spent. This lesson covers data pipelines in Azure ML, working with datastores and datasets, data profiling, handling missing and inconsistent data, and detecting data drift — all tested on the DP-100 exam.

Working with Datastores

Datastores are references to storage services. They store connection information so you do not need to hard-code credentials in your scripts.
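The examples in this lesson assume an authenticated `MLClient` named `ml_client`. A minimal setup sketch — the subscription ID, resource group, and workspace names below are placeholders you would replace with your own:

```python
# Minimal MLClient setup (azure-ai-ml and azure-identity packages).
# Subscription, resource group, and workspace names are placeholders.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)
```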

# Register an Azure Blob Storage datastore
from azure.ai.ml.entities import AzureBlobDatastore, AccountKeyConfiguration

blob_datastore = AzureBlobDatastore(
    name="training-data-store",
    description="Blob storage for training datasets",
    account_name="dp100storageaccount",
    container_name="training-data",
    credentials=AccountKeyConfiguration(
        account_key="your-account-key"   # Stored in Key Vault in production
    )
)

ml_client.datastores.create_or_update(blob_datastore)

# Register Azure Data Lake Gen2 datastore
from azure.ai.ml.entities import AzureDataLakeGen2Datastore

adls_datastore = AzureDataLakeGen2Datastore(
    name="lake-data-store",
    description="Data Lake for large-scale datasets",
    account_name="dp100datalake",
    filesystem="ml-data"
    # Uses workspace managed identity for auth (recommended)
)

ml_client.datastores.create_or_update(adls_datastore)

Exam Tip: Know the authentication methods for datastores. Account key: simplest, stored in Key Vault. SAS token: scoped access with expiration. Service principal: for ADLS Gen2 and enterprise scenarios. Managed identity: recommended for production, no credentials to manage. The exam often tests which auth method is appropriate for a given security requirement.
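As one illustration of these auth methods, a datastore can be registered with service principal credentials instead of an account key. A sketch — the tenant ID, client ID, and secret are placeholders, and in practice the secret comes from Key Vault:

```python
# Sketch: ADLS Gen2 datastore using service principal authentication.
# Tenant ID, client ID, and secret below are placeholders.
from azure.ai.ml.entities import (
    AzureDataLakeGen2Datastore,
    ServicePrincipalConfiguration,
)

sp_datastore = AzureDataLakeGen2Datastore(
    name="lake-sp-store",
    description="Data Lake access via service principal",
    account_name="dp100datalake",
    filesystem="ml-data",
    credentials=ServicePrincipalConfiguration(
        tenant_id="<tenant-id>",
        client_id="<client-id>",
        client_secret="<client-secret>",  # retrieve from Key Vault, never hard-code
    ),
)

ml_client.datastores.create_or_update(sp_datastore)
```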

Data Versioning

Azure ML data assets support versioning, enabling you to track which data was used for each experiment.

# Create versioned data assets
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Version 1 - initial dataset
data_v1 = Data(
    name="customer-data",
    version="1",
    description="Customer dataset - January 2026",
    path="azureml://datastores/training-data-store/paths/customer/jan2026.csv",
    type=AssetTypes.URI_FILE
)
ml_client.data.create_or_update(data_v1)

# Version 2 - updated dataset
data_v2 = Data(
    name="customer-data",
    version="2",
    description="Customer dataset - February 2026 (added new features)",
    path="azureml://datastores/training-data-store/paths/customer/feb2026.csv",
    type=AssetTypes.URI_FILE
)
ml_client.data.create_or_update(data_v2)

# Reference specific version in a job
from azure.ai.ml import Input
job_input = Input(
    type="uri_file",
    path="azureml:customer-data:2"   # Use version 2
)
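To see how version pinning feeds into lineage tracking, here is a sketch of a command job consuming version 2 of the asset — the script, environment, and compute names are assumptions for illustration:

```python
# Sketch: a training job pinned to version 2 of the data asset.
# Script path, environment, and compute names are placeholders.
from azure.ai.ml import command, Input

job = command(
    code="./src",
    command="python train.py --data ${{inputs.training_data}}",
    inputs={
        "training_data": Input(type="uri_file", path="azureml:customer-data:2"),
    },
    environment="azureml:sklearn-env:1",
    compute="dp100-cluster",
)

# Azure ML records the exact asset version in the job's lineage
returned_job = ml_client.jobs.create_or_update(job)
```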

MLTable for Structured Data

MLTable defines schema, transformations, and column types for tabular data. It is especially useful when working with AutoML and pipeline components.

# MLTable definition file (MLTable file in the data folder)
# File: ./data/churn-mltable/MLTable
paths:
  - file: ../churn-data.csv

transformations:
  - read_delimited:
      delimiter: ","
      header: all_files_same_headers
      encoding: utf8
  - convert_column_types:
      - columns: age
        column_type: int
      - columns: monthly_charges
        column_type: float
      - columns: churn
        column_type: boolean
  - drop_columns:
      - customer_id           # Not useful for modeling
      - phone_number          # PII - should not be in training data

# Load and use MLTable in Python
import mltable

# Load from local path
tbl = mltable.load("./data/churn-mltable/")
df = tbl.to_pandas_dataframe()
print(df.head())
print(df.dtypes)

# Load from registered data asset
data_asset = ml_client.data.get("customer-churn-mltable", version="1")
tbl = mltable.load(data_asset.path)
df = tbl.to_pandas_dataframe()
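The registered asset loaded above has to be created first. A sketch of registering the MLTable folder as a versioned data asset (the asset name mirrors the one referenced in the snippet above):

```python
# Sketch: register the MLTable folder as a versioned data asset so it
# can later be retrieved by name and version.
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

mltable_asset = Data(
    name="customer-churn-mltable",
    version="1",
    description="Churn data with schema and transformations",
    path="./data/churn-mltable",   # folder containing the MLTable file
    type=AssetTypes.MLTABLE,
)

ml_client.data.create_or_update(mltable_asset)
```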

Data Preparation in Pipelines

Production data preparation should be part of a pipeline, not a manual notebook step. This ensures reproducibility and automation.

# Data preparation component for a pipeline
# File: src/prep_data.py
import argparse
import pandas as pd
import numpy as np
from pathlib import Path

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-data", type=str)
    parser.add_argument("--output-train", type=str)
    parser.add_argument("--output-test", type=str)
    parser.add_argument("--test-size", type=float, default=0.2)
    args = parser.parse_args()

    # Load data
    df = pd.read_csv(args.input_data)
    print(f"Input shape: {df.shape}")

    # Handle missing values
    # Numeric: fill with median
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if df[col].isnull().sum() > 0:
            median_val = df[col].median()
            df[col] = df[col].fillna(median_val)
            print(f"  Filled {col} nulls with median: {median_val}")

    # Categorical: fill with mode
    cat_cols = df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        if df[col].isnull().sum() > 0:
            mode_val = df[col].mode()[0]
            df[col] = df[col].fillna(mode_val)
            print(f"  Filled {col} nulls with mode: {mode_val}")

    # Remove duplicates
    before = len(df)
    df.drop_duplicates(inplace=True)
    print(f"  Removed {before - len(df)} duplicate rows")

    # Encode categorical variables
    df = pd.get_dummies(df, drop_first=True)

    # Split train/test
    from sklearn.model_selection import train_test_split
    train_df, test_df = train_test_split(
        df, test_size=args.test_size, random_state=42
    )

    # Save outputs
    Path(args.output_train).mkdir(parents=True, exist_ok=True)
    Path(args.output_test).mkdir(parents=True, exist_ok=True)
    train_df.to_csv(f"{args.output_train}/train.csv", index=False)
    test_df.to_csv(f"{args.output_test}/test.csv", index=False)
    print(f"Train: {train_df.shape}, Test: {test_df.shape}")

if __name__ == "__main__":
    main()
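To make this script part of a pipeline rather than a manual step, it can be wrapped as a command step. A sketch — the environment and compute names are placeholders, not values from this lesson:

```python
# Sketch: wire prep_data.py into a pipeline as a command step.
# Environment and compute names are placeholders.
from azure.ai.ml import command, dsl, Input, Output

prep_step = command(
    code="./src",
    command=(
        "python prep_data.py --input-data ${{inputs.raw_data}} "
        "--output-train ${{outputs.train_data}} "
        "--output-test ${{outputs.test_data}}"
    ),
    inputs={"raw_data": Input(type="uri_file")},
    outputs={
        "train_data": Output(type="uri_folder"),
        "test_data": Output(type="uri_folder"),
    },
    environment="azureml:sklearn-env:1",
)

@dsl.pipeline(compute="dp100-cluster")
def prep_pipeline(raw_data):
    prep = prep_step(raw_data=raw_data)
    return {"train": prep.outputs.train_data, "test": prep.outputs.test_data}

pipeline_job = prep_pipeline(
    raw_data=Input(type="uri_file", path="azureml:customer-data:2")
)
ml_client.jobs.create_or_update(pipeline_job)
```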

Data Drift Detection

Data drift occurs when production data differs from training data. Azure ML provides built-in data drift monitoring for deployed models.

| Drift Type | What Changes | Impact | Detection Method |
| --- | --- | --- | --- |
| Data drift | Input feature distributions | Model predictions become unreliable | Statistical tests (KS test, PSI) |
| Concept drift | Relationship between features and target | Model accuracy degrades | Monitor prediction quality over time |
| Upstream data changes | Data pipeline outputs | Missing features, schema changes | Schema validation, data quality checks |

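To build intuition for what a drift metric measures, here is a simplified Population Stability Index (PSI) sketch — not Azure ML's internal implementation. PSI bins a reference sample, compares bin proportions against production data, and sums the weighted log-ratios; identical distributions score near 0, shifted ones score higher (a common rule of thumb flags PSI above ~0.2):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    Bins are derived from the expected (reference) sample; a small
    epsilon keeps empty bins from causing division by zero.
    """
    lo, hi = min(expected), max(expected)

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / (hi - lo) * bins) if hi > lo else 0
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        eps = 1e-6
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical distributions give PSI near 0; a shifted one scores higher
reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(psi(reference, reference))  # 0.0
print(psi(reference, shifted))    # well above the ~0.2 alert threshold
```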
# Configure data drift monitoring
from azure.ai.ml import Input
from azure.ai.ml.entities import (
    MonitoringTarget,
    MonitorDefinition,
    MonitorSchedule,
    RecurrenceTrigger,
    DataDriftSignal,
    ProductionData,
    ReferenceData
)

# Define what to monitor
monitor_definition = MonitorDefinition(
    compute="dp100-cluster",
    monitoring_target=MonitoringTarget(
        endpoint_deployment_id="/subscriptions/.../endpoints/churn-endpoint/deployments/blue"
    ),
    monitoring_signals={
        "data_drift": DataDriftSignal(
            production_data=ProductionData(
                input_data=Input(
                    type="uri_folder",
                    path="azureml://datastores/workspaceblobstore/paths/production-data/"
                ),
                data_context="model_inputs"
            ),
            reference_data=ReferenceData(
                input_data=Input(
                    type="mltable",
                    path="azureml:customer-churn-mltable:1"
                ),
                data_context="training"
            ),
            metric_thresholds={
                "numerical": {"jensen_shannon_distance": 0.1},
                "categorical": {"pearsons_chi_squared_test": 0.05}
            }
        )
    }
)

# Schedule monitoring
monitor_schedule = MonitorSchedule(
    name="churn-data-drift-monitor",
    trigger=RecurrenceTrigger(frequency="week", interval=1),
    create_monitor=monitor_definition
)

ml_client.schedules.begin_create_or_update(monitor_schedule)

Practice Questions

Question 1: You have a dataset stored in Azure Blob Storage. You need to ensure that the connection credentials are stored securely and not embedded in training scripts. What should you use?

A. Store credentials in the training script as environment variables
B. Register an Azure ML datastore with Key Vault-backed credentials
C. Hard-code the storage account key in the pipeline YAML
D. Use a shared access signature (SAS) URL directly in the script

Answer

B. Register an Azure ML datastore with Key Vault-backed credentials. Datastores abstract away credential management. When you register a datastore, Azure ML stores the credentials in the associated Key Vault. Your training scripts reference the datastore name, never touching credentials directly. This is the recommended and most secure approach.

Question 2: You need to define a tabular dataset with specific column types and drop certain columns before training. Which Azure ML data asset type should you use?

A. URI File
B. URI Folder
C. MLTable
D. Azure Open Dataset

Answer

C. MLTable. MLTable supports schema definition including column types, transformations like dropping columns, and reading options. URI File and URI Folder are references to raw files without schema information. MLTable is specifically designed for structured tabular data with transformations.

Question 3: Your deployed model's prediction accuracy has been declining over the past month, but the input features look statistically similar to training data. Which type of drift is most likely occurring?

A. Data drift
B. Concept drift
C. Upstream data changes
D. Feature drift

Answer

B. Concept drift. When input features remain statistically similar but model accuracy declines, the relationship between features and the target variable has changed. This is concept drift. Data drift would show statistical differences in the input features themselves. Upstream data changes would manifest as schema issues or missing features.

Question 4: You have multiple versions of a training dataset. You need to ensure that each experiment records which version was used. What should you do?

A. Include the file path in the experiment description
B. Use versioned data assets and reference them by name and version in the job
C. Copy the dataset into the experiment output folder
D. Log the file hash as a custom metric

Answer

B. Use versioned data assets and reference them by name and version in the job. Azure ML data assets support versioning natively. When you reference azureml:customer-data:2 in a job input, Azure ML automatically records this lineage. This is the standard approach for data versioning and reproducibility in Azure ML.

Question 5: You want to use managed identity authentication for a Data Lake Gen2 datastore instead of account keys. What is the primary benefit?

A. Faster data access speeds
B. No credentials to store or rotate
C. Automatic data encryption
D. Support for larger file sizes

Answer

B. No credentials to store or rotate. Managed identity authentication eliminates the need to store secrets in Key Vault or manage credential rotation. The workspace's managed identity is granted access to the storage account via Azure RBAC. This is the most secure and lowest-maintenance authentication method for Azure ML datastores.