Data Quality Fundamentals
Core principles of data quality for ML. Part of the Data Validation & Testing course at AI School by Lilly Tech Systems.
The Foundation of AI Quality: Data
In machine learning, the quality of your model is directly bounded by the quality of your data. No amount of sophisticated modeling can compensate for dirty, incomplete, or biased data. Data quality testing is the first and most impactful layer of any AI testing strategy. This lesson establishes the fundamental concepts you need to build a robust data quality practice.
Data quality issues are among the most common causes of production ML failures. They are also among the hardest to detect because they rarely cause crashes or errors — they silently degrade model performance. A data quality testing framework catches these issues before they reach your model.
The Six Dimensions of Data Quality
Data quality is not a single metric. It encompasses six distinct dimensions, each requiring different testing approaches:
- Completeness — Is all required data present? Are there missing values, gaps in time series, or absent categories?
- Accuracy — Does the data reflect reality? Are values correct and free from errors?
- Consistency — Is data consistent across sources? Do definitions match? Are units uniform?
- Timeliness — Is the data fresh enough for its intended use? Is there problematic lag?
- Validity — Does the data conform to its defined schema, types, and business rules?
- Uniqueness — Is data free from unwanted duplicates that could bias your model?
The first two dimensions translate directly into code. A minimal sketch of completeness and uniqueness checks over a pandas DataFrame:

```python
import pandas as pd
from dataclasses import dataclass
from typing import List


@dataclass
class DataQualityResult:
    dimension: str
    metric: str
    value: float
    threshold: float
    passed: bool


def assess_data_quality(df: pd.DataFrame) -> List[DataQualityResult]:
    results = []

    # Completeness: flag any column whose null ratio exceeds 5%
    for col in df.columns:
        null_ratio = df[col].isnull().mean()
        results.append(DataQualityResult(
            dimension="completeness",
            metric=f"{col}_null_ratio",
            value=null_ratio,
            threshold=0.05,
            passed=null_ratio <= 0.05,
        ))

    # Uniqueness: flag the dataset if more than 1% of rows are exact duplicates
    dup_ratio = df.duplicated().mean()
    results.append(DataQualityResult(
        dimension="uniqueness",
        metric="duplicate_ratio",
        value=dup_ratio,
        threshold=0.01,
        passed=dup_ratio <= 0.01,
    ))
    return results
```
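Validity checks follow the same pattern. The sketch below assumes a hypothetical schema dictionary (the column names, dtypes, and ranges are illustrative, not from any particular dataset):

```python
import pandas as pd

# Hypothetical schema: expected dtype and value range per column (illustrative).
SCHEMA = {
    "age": {"dtype": "int64", "min": 0, "max": 120},
    "income": {"dtype": "float64", "min": 0.0, "max": None},
}

def check_validity(df: pd.DataFrame, schema: dict) -> list:
    """Return a list of human-readable validity violations found in df."""
    violations = []
    for col, rules in schema.items():
        if col not in df.columns:
            violations.append(f"{col}: missing column")
            continue
        if str(df[col].dtype) != rules["dtype"]:
            violations.append(f"{col}: expected {rules['dtype']}, got {df[col].dtype}")
        if rules.get("min") is not None and (df[col] < rules["min"]).any():
            violations.append(f"{col}: values below minimum {rules['min']}")
        if rules.get("max") is not None and (df[col] > rules["max"]).any():
            violations.append(f"{col}: values above maximum {rules['max']}")
    return violations
```

An empty result means the DataFrame conforms to the schema; each violation string names the offending column and rule, which makes the report easy to log or alert on.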
Building a Data Quality Culture
Technical solutions alone are insufficient. Building a data quality culture means making every team member responsible for data quality. Data engineers own pipeline quality, data scientists own feature quality, and ML engineers own serving data quality. Define clear ownership boundaries and escalation paths.
Data Quality SLAs
Establish service-level agreements for data quality. Define what completeness ratio, freshness latency, and accuracy level are acceptable for each data source. Monitor these SLAs continuously and alert when they are violated.
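An SLA can be encoded as data and checked mechanically. A minimal sketch, assuming a hypothetical `DataQualitySLA` record with completeness and freshness thresholds (both names and thresholds are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical SLA definition for one data source; thresholds are illustrative.
@dataclass
class DataQualitySLA:
    source: str
    min_completeness: float   # minimum fraction of non-null values
    max_staleness: timedelta  # maximum age of the newest record

def check_sla(sla: DataQualitySLA, completeness: float,
              last_updated: datetime) -> list:
    """Return a list of SLA violations for one data source."""
    violations = []
    if completeness < sla.min_completeness:
        violations.append(
            f"{sla.source}: completeness {completeness:.2%} "
            f"below SLA minimum {sla.min_completeness:.2%}")
    staleness = datetime.now(timezone.utc) - last_updated
    if staleness > sla.max_staleness:
        violations.append(
            f"{sla.source}: data is {staleness} old, "
            f"SLA allows {sla.max_staleness}")
    return violations
```

Run this on a schedule and route any non-empty result to your alerting system; the per-source SLA objects double as documentation of what each team has agreed to uphold.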
Data Quality Testing in the ML Lifecycle
Data quality tests should run at multiple points in the ML lifecycle:
- At ingestion — Validate raw data as it enters your system
- After transformation — Verify that preprocessing preserved data integrity
- Before training — Confirm that the training dataset meets all quality requirements
- During serving — Monitor that real-time input data matches the training distribution
- After prediction — Validate that output data is within expected bounds
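One way to wire these checkpoints into a pipeline is a reusable quality gate that fails fast when any check at a given stage does not pass. A sketch, with hypothetical check functions for illustration:

```python
import pandas as pd

# Hypothetical quality gate: run each check and fail fast on the first violation.
def quality_gate(stage: str, df: pd.DataFrame, checks: list) -> None:
    """Raise ValueError naming the failed check and lifecycle stage."""
    for check in checks:
        if not check(df):
            raise ValueError(
                f"Data quality gate failed at stage '{stage}': {check.__name__}")

# Example checks for the ingestion and pre-training stages (illustrative).
def no_empty_frame(df):
    return len(df) > 0

def no_all_null_columns(df):
    return not df.isnull().all().any()

def labels_present(df):
    return "label" in df.columns and df["label"].notnull().all()
```

At ingestion you might run `quality_gate("ingestion", raw, [no_empty_frame, no_all_null_columns])`, then add `labels_present` to the list before training; raising an exception stops the pipeline before bad data reaches the model.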
Tools for Data Quality Testing
Several mature tools exist for data quality testing. Great Expectations is the most popular Python framework, offering a declarative approach to data validation. Deequ provides data quality validation for Spark. Pandera offers lightweight schema validation for pandas DataFrames. Each tool has different strengths, and we will explore Great Expectations in detail in a later lesson.
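To give a flavor of the declarative style these tools share, here is a plain-Python sketch of named expectations evaluated against a DataFrame. This only mimics the style; it is not the actual API of Great Expectations or Pandera, and the column names are illustrative:

```python
import pandas as pd

# Illustrative declarative expectations as (name, predicate) pairs.
# This mimics the style of tools like Great Expectations; it is NOT their API.
EXPECTATIONS = [
    ("age_not_null", lambda df: df["age"].notnull().all()),
    ("age_in_range", lambda df: df["age"].between(0, 120).all()),
    ("id_unique", lambda df: df["id"].is_unique),
]

def run_expectations(df: pd.DataFrame) -> dict:
    """Evaluate every expectation and report pass/fail by name."""
    return {name: bool(pred(df)) for name, pred in EXPECTATIONS}
```

The appeal of the declarative approach is that expectations read as documentation of the data contract, and the same suite can run at ingestion, before training, and during serving.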