Components of ML Systems
A comprehensive guide to the components of ML systems within the context of AI architecture fundamentals.
The Components of Production ML Systems
A production machine learning system is composed of many interacting components, each with specific responsibilities. Understanding these components and their interactions is essential for designing systems that are reliable, maintainable, and scalable. This lesson maps out the complete landscape of ML system components.
High-Level Component Map
At the highest level, an ML system consists of five major subsystems:
- Data management — Everything related to acquiring, storing, validating, and versioning data
- Feature engineering — Transforming raw data into features suitable for model consumption
- Model development — Training, evaluating, tuning, and selecting models
- Model deployment — Getting models into production and serving predictions
- Operations and monitoring — Keeping the system running and detecting problems
Data Management Components
Data management forms the foundation of any ML system. Without reliable, high-quality data, even the best model architecture will produce poor results.
Data Ingestion
The data ingestion layer handles acquiring data from various sources: databases, APIs, file systems, streaming platforms, and third-party providers. It must handle different formats (JSON, CSV, Parquet, Avro), different delivery mechanisms (push vs. pull), and different update frequencies (real-time, hourly, daily).
# Example: data ingestion with validation. A sketch assuming a
# Great Expectations project with a pre-configured "data_quality"
# checkpoint pointed at the ingested batch; _create_source is not shown.
import great_expectations as gx

class DataQualityError(Exception):
    """Raised when an ingested batch fails data quality checks."""

class DataIngestionPipeline:
    def __init__(self, source_config):
        self.source = self._create_source(source_config)
        # Load the Great Expectations project configuration
        self.context = gx.get_context()

    def ingest(self):
        raw_data = self.source.read()
        # Validate before proceeding: run the configured checkpoint
        validation_result = self.context.run_checkpoint(
            checkpoint_name="data_quality"
        )
        if not validation_result.success:
            raise DataQualityError(validation_result)
        return raw_data
Data Storage
ML systems typically use multiple storage systems optimized for different access patterns:
- Object storage (S3, GCS, ADLS) — Raw data, training datasets, model artifacts (a short write sketch follows this list)
- Data warehouses (BigQuery, Redshift, Snowflake) — Structured analytical queries, reporting
- Data lakes (Delta Lake, Iceberg, Hudi) — Large-scale analytics with ACID transactions
- Feature stores (Feast, Tecton) — Low-latency feature serving for real-time inference
- Vector databases (Pinecone, Weaviate, Milvus) — Embedding similarity search for RAG systems
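For instance, persisting a versioned training snapshot to object storage can be as simple as the sketch below. The bucket and path are hypothetical, and writing Parquet directly to an s3:// path assumes pandas with s3fs installed.

# Sketch: writing a training snapshot to object storage as Parquet
# (bucket and path are hypothetical; assumes pandas plus s3fs)
import pandas as pd

snapshot = pd.DataFrame({"user_id": [1, 2], "label": [0, 1]})
# Date-partitioned paths keep training datasets versioned and reproducible
snapshot.to_parquet("s3://ml-data/training/date=2024-01-01/part-0.parquet")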
Feature Engineering Components
Feature engineering transforms raw data into the numerical representations that models consume. This layer is critical because features directly determine what patterns the model can learn.
Feature Computation
Feature computation can be batch (processing large datasets periodically) or real-time (computing features on-demand for each prediction request). Most production systems use a combination of both approaches, often mediated by a feature store.
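To make the batch side concrete, here is a minimal sketch that computes a 30-day purchase-count feature with pandas; the event schema (user_id, event_time) is an assumption. The real-time path would run equivalent logic over recent events at request time.

# Sketch: batch computation of a 30-day purchase-count feature
# (the user_id/event_time schema is an assumption)
import pandas as pd

def purchase_count_30d(events: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Count each user's purchase events in the 30 days before as_of."""
    window = events[
        (events["event_time"] > as_of - pd.Timedelta(days=30))
        & (events["event_time"] <= as_of)
    ]
    return (
        window.groupby("user_id")
        .size()
        .rename("purchase_count_30d")
        .reset_index()
    )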
Feature Store
A feature store serves as the central repository for features, providing consistent feature values for both training and serving. It typically has two components: an offline store for batch access during training and an online store for low-latency access during inference.
# Feature store usage pattern (Feast); entity_dataframe is assumed to
# hold entity keys plus event timestamps for point-in-time joins
from feast import FeatureStore

store = FeatureStore(repo_path="./feature_repo")

# Training: get historical features (offline store)
training_df = store.get_historical_features(
    entity_df=entity_dataframe,
    features=[
        "user_features:purchase_count_30d",
        "user_features:avg_session_duration",
    ],
).to_df()

# Serving: get real-time features (online store)
features = store.get_online_features(
    features=["user_features:purchase_count_30d"],
    entity_rows=[{"user_id": "12345"}],
).to_dict()
Model Development Components
Model development requires infrastructure for training at scale, tracking experiments, tuning hyperparameters, and evaluating model quality.
Experiment Tracking
Experiment tracking systems (MLflow, Weights & Biases, Neptune) record every training run with its hyperparameters, metrics, artifacts, and code version. This enables reproducibility and makes it possible to compare runs systematically.
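With MLflow, for example, recording a run takes only a few calls; the parameter, metric, and artifact names below are placeholders.

# Sketch: recording a training run with MLflow
# (parameter, metric, and artifact names are placeholders)
import mlflow

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)   # hyperparameters
    mlflow.log_metric("val_accuracy", 0.93)   # evaluation metrics
    mlflow.log_artifact("model.pkl")          # trained artifact on disk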
Model Registry
The model registry stores trained model artifacts along with metadata including training data version, performance metrics, approval status, and deployment history. It serves as the single source of truth for which models exist and which are approved for production use.
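Continuing the MLflow example, registering a model version and pointing a production alias at it might look like the sketch below; the model name and run ID are hypothetical, and the alias API assumes MLflow 2.3 or later.

# Sketch: registering and promoting a model version with MLflow
# (model name and run ID are hypothetical)
import mlflow
from mlflow import MlflowClient

model_version = mlflow.register_model(
    model_uri="runs:/<run_id>/model",  # artifact logged by a tracked run
    name="churn-classifier",
)
# Point the "production" alias at the newly registered version
client = MlflowClient()
client.set_registered_model_alias(
    "churn-classifier", "production", model_version.version
)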
Deployment Components
Model deployment components handle getting trained models into production and serving predictions reliably.
- Model serving framework — TensorFlow Serving, TorchServe, Triton, or custom REST/gRPC APIs (a minimal custom-API sketch follows this list)
- Container orchestration — Kubernetes with GPU support for scaling model servers
- API gateway — Request routing, rate limiting, authentication, and model versioning
- Load balancer — Distributing inference requests across model replicas
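As a sketch of the custom-API option, a minimal FastAPI prediction endpoint might look like the following; the request schema and scoring stub are hypothetical stand-ins for a real model.

# Sketch: a minimal custom REST serving endpoint with FastAPI
# (request schema and scoring logic are hypothetical stand-ins)
from fastapi import FastAPI
from pydantic import BaseModel

class PredictionRequest(BaseModel):
    user_id: str
    features: list[float]

app = FastAPI()

@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    # A real server would load the model once at startup and call
    # model.predict(); a trivial stub stands in here
    score = sum(request.features) / max(len(request.features), 1)
    return {"user_id": request.user_id, "score": score}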
Operations and Monitoring
ML operations (MLOps) components ensure the system remains healthy and performant over time.
- Data drift detection — Monitoring for changes in input data distributions (see the drift-test sketch after this list)
- Model performance tracking — Tracking prediction accuracy against ground truth labels
- System metrics — Latency, throughput, error rates, resource utilization
- Alerting — Automated notifications when metrics breach thresholds
- Logging — Structured logs for debugging and auditing prediction decisions
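One common way to implement drift detection is a two-sample statistical test comparing the training-time distribution of a feature against recent serving traffic. The sketch below uses SciPy's Kolmogorov-Smirnov test on synthetic data; the p-value threshold is illustrative.

# Sketch: data drift detection via a two-sample Kolmogorov-Smirnov test
# (synthetic data; the p-value threshold is illustrative)
import numpy as np
from scipy.stats import ks_2samp

reference = np.random.normal(0.0, 1.0, size=10_000)   # training distribution
production = np.random.normal(0.3, 1.0, size=10_000)  # recent serving traffic

statistic, p_value = ks_2samp(reference, production)
if p_value < 0.01:
    # Distributions differ significantly: flag possible drift for review
    print(f"Drift detected: KS statistic={statistic:.3f}, p={p_value:.2e}")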
Understanding these components gives you the vocabulary and mental model needed for the rest of this course. In the next lesson, we will learn how to document architecture decisions using Architecture Decision Records.