Beginner

Data Quality Fundamentals

Core principles of data quality for ML. Part of the Data Validation & Testing course at AI School by Lilly Tech Systems.

The Foundation of AI Quality: Data

In machine learning, the quality of your model is directly bounded by the quality of your data. No amount of sophisticated modeling can compensate for dirty, incomplete, or biased data. Data quality testing is the first and most impactful layer of any AI testing strategy. This lesson establishes the fundamental concepts you need to build a robust data quality practice.

Data quality issues are the number one cause of production ML failures. They are also the hardest to detect because they often do not cause crashes or errors — they silently degrade model performance. A data quality testing framework catches these issues before they reach your model.

The Six Dimensions of Data Quality

Data quality is not a single metric. It encompasses six distinct dimensions, each requiring different testing approaches:

  1. Completeness — Is all required data present? Are there missing values, gaps in time series, or absent categories?
  2. Accuracy — Does the data reflect reality? Are values correct and free from errors?
  3. Consistency — Is data consistent across sources? Do definitions match? Are units uniform?
  4. Timeliness — Is the data fresh enough for its intended use? Is there problematic lag?
  5. Validity — Does the data conform to its defined schema, types, and business rules?
  6. Uniqueness — Is data free from unwanted duplicates that could bias your model?
These dimensions can be checked programmatically. The function below assesses completeness and uniqueness for a pandas DataFrame:

import pandas as pd
from dataclasses import dataclass
from typing import List

@dataclass
class DataQualityResult:
    dimension: str
    metric: str
    value: float
    threshold: float
    passed: bool

def assess_data_quality(df: pd.DataFrame) -> List[DataQualityResult]:
    results = []

    # Completeness: check null ratios
    for col in df.columns:
        null_ratio = df[col].isnull().mean()
        results.append(DataQualityResult(
            dimension="completeness",
            metric=f"{col}_null_ratio",
            value=null_ratio,
            threshold=0.05,
            passed=null_ratio <= 0.05
        ))

    # Uniqueness: check duplicate rows
    dup_ratio = df.duplicated().mean()
    results.append(DataQualityResult(
        dimension="uniqueness",
        metric="duplicate_ratio",
        value=dup_ratio,
        threshold=0.01,
        passed=dup_ratio <= 0.01
    ))

    return results
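For example, running assess_data_quality on a small DataFrame with missing ages and a duplicated row flags both dimensions. The same two ratios can also be computed directly (the DataFrame here is illustrative):

```python
import pandas as pd

# Illustrative data: two missing ages and one fully duplicated row
df = pd.DataFrame({
    "user_id": [1, 2, 2, 4],
    "age": [34.0, None, None, 41.0],
})

null_ratio = df["age"].isnull().mean()  # 0.5, far above the 5% threshold
dup_ratio = df.duplicated().mean()      # 0.25, above the 1% threshold
print(null_ratio <= 0.05, dup_ratio <= 0.01)  # False False
```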

Building a Data Quality Culture

Technical solutions alone are insufficient. Building a data quality culture means making every team member responsible for data quality. Data engineers own pipeline quality, data scientists own feature quality, and ML engineers own serving data quality. Define clear ownership boundaries and escalation paths.

Data Quality SLAs

Establish service-level agreements for data quality. Define what completeness ratio, freshness latency, and accuracy level are acceptable for each data source. Monitor these SLAs continuously and alert when they are violated.
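One lightweight way to encode such SLAs is a table of per-source thresholds checked on every pipeline run. The sketch below assumes two hypothetical sources, "orders" and "clickstream", with made-up thresholds:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class DataQualitySLA:
    source: str
    min_completeness: float   # minimum fraction of non-null values
    max_freshness: timedelta  # maximum age of the newest record

# Hypothetical SLAs per data source
SLAS = {
    "orders": DataQualitySLA("orders", 0.99, timedelta(hours=1)),
    "clickstream": DataQualitySLA("clickstream", 0.95, timedelta(minutes=5)),
}

def check_sla(source: str, completeness: float, newest_record: datetime) -> list:
    """Return a list of SLA violations for one data source."""
    sla = SLAS[source]
    violations = []
    if completeness < sla.min_completeness:
        violations.append(f"{source}: completeness {completeness:.3f} < {sla.min_completeness}")
    if datetime.now(timezone.utc) - newest_record > sla.max_freshness:
        violations.append(f"{source}: newest record older than {sla.max_freshness}")
    return violations
```

A monitoring job would call check_sla on a schedule and route any returned violations to an alerting channel.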

💡
Key insight: The best time to catch a data quality issue is at the point of data entry or ingestion. The further downstream an issue propagates, the more expensive it is to detect and fix. Invest heavily in upstream data validation.

Data Quality Testing in the ML Lifecycle

Data quality tests should run at multiple points in the ML lifecycle:

  • At ingestion — Validate raw data as it enters your system
  • After transformation — Verify that preprocessing preserved data integrity
  • Before training — Confirm that the training dataset meets all quality requirements
  • During serving — Monitor that real-time input data matches the training distribution
  • After prediction — Validate that output data is within expected bounds
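These checkpoints can be wired up as named lifecycle stages, each running only the validations that matter there. The stage names and checks below are illustrative, not a prescribed set:

```python
import pandas as pd

# Hypothetical per-stage checks: each returns True when the data passes
STAGE_CHECKS = {
    "ingestion": [
        lambda df: not df.empty,                    # raw data actually arrived
        lambda df: df["event_time"].notna().all(),  # timestamps present
    ],
    "pre_training": [
        lambda df: df["label"].isin([0, 1]).all(),  # labels are binary
        lambda df: df.duplicated().mean() <= 0.01,  # few duplicate rows
    ],
}

def run_stage_checks(stage: str, df: pd.DataFrame) -> bool:
    """Run every check registered for a lifecycle stage."""
    return all(check(df) for check in STAGE_CHECKS[stage])
```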

Tools for Data Quality Testing

Several mature tools exist for data quality testing. Great Expectations is the most popular Python framework, offering a declarative approach to data validation. Deequ provides data quality validation for Spark. Pandera offers lightweight schema validation for pandas DataFrames. Each tool has different strengths, and we will explore Great Expectations in detail in a later lesson.

Warning: Do not rely solely on schema validation for data quality. A column can have the correct type and still contain nonsensical values. Statistical tests and business rule validation are equally important layers in your data quality testing stack.
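For instance, an age column can pass an integer-type check while containing values like -5 or 250. Range rules and distribution checks catch what the schema cannot (the thresholds below are illustrative):

```python
import pandas as pd

ages = pd.Series([34, 29, 41, 250, -5, 38])  # valid integer dtype, nonsensical values

# Business rule: ages must fall in a plausible range
out_of_range = ~ages.between(0, 120)
print(out_of_range.sum(), "value(s) violate the business rule")  # 2

# Statistical check: flag outliers with the 1.5 * IQR rule
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]  # 250 and -5
```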