Cost Architecture for AI Systems
AI systems are among the most expensive software systems to build and operate. GPU compute for training, storage for massive datasets, serving infrastructure for low-latency inference, and the engineering talent to build it all add up quickly. Cost architecture is the discipline of designing AI systems to maximize the value delivered per dollar spent.
Unlike traditional software where compute costs are relatively predictable, AI costs can spike unexpectedly. A hyperparameter search that spawns 100 training runs overnight. A data pipeline bug that reprocesses 6 months of data. A model that requires 8 A100 GPUs when the previous version ran on 2. Cost architecture builds guardrails and visibility to prevent these surprises.
The Major Cost Drivers
Compute Costs
GPU compute is typically the largest cost in AI systems. Training costs depend on model size, dataset size, and the number of experiments. Serving costs depend on traffic volume, latency requirements, and model complexity.
- Training compute — GPU hours for model training and hyperparameter tuning
- Inference compute — GPU/CPU resources for serving predictions in production
- Feature compute — Processing resources for feature engineering pipelines
- Data processing — Spark/Beam clusters for ETL and data transformation
```python
# Cost estimation for a training job
def estimate_training_cost(
    gpu_type="A100",
    num_gpus=4,
    training_hours=8,
    experiments=10,
):
    gpu_prices = {
        "A100": 3.50,  # $/hr on-demand
        "V100": 2.48,
        "T4": 0.526,
        "A10G": 1.006,
    }
    cost_per_experiment = gpu_prices[gpu_type] * num_gpus * training_hours
    total = cost_per_experiment * experiments
    spot_total = total * 0.3  # ~70% savings with spot instances
    print(f"On-demand: ${total:.2f}")
    print(f"Spot instances: ${spot_total:.2f}")
    return total

# Example: 10 experiments on 4x A100 for 8 hours each
estimate_training_cost()
# On-demand: $1120.00
# Spot instances: $336.00
```
Storage Costs
AI systems accumulate data rapidly: raw datasets, processed features, model artifacts, experiment logs, and prediction logs. Implement tiered storage strategies:
- Hot storage — Recent data and active model artifacts (SSD, high IOPS)
- Warm storage — Historical training data, older model versions (standard S3/GCS)
- Cold storage — Archived data for compliance (Glacier, Archive storage)
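The tiers above can be turned into a rough monthly estimate. As a sketch, the per-GB prices below are illustrative placeholders (roughly in line with public object-storage list prices), not quotes, and archive retrieval fees are ignored:

```python
# Illustrative $/GB-month prices per tier (placeholders, not quotes)
TIER_PRICE_PER_GB_MONTH = {
    "hot": 0.023,    # standard object storage, SSD-backed serving copies extra
    "warm": 0.0125,  # infrequent-access class
    "cold": 0.004,   # archive class; excludes retrieval fees
}

def monthly_storage_cost(gb_by_tier):
    """Sum monthly cost across tiers for a {tier: GB} mapping."""
    return sum(TIER_PRICE_PER_GB_MONTH[tier] * gb for tier, gb in gb_by_tier.items())

# Example: 2 TB hot, 20 TB warm, 100 TB cold
cost = monthly_storage_cost({"hot": 2_000, "warm": 20_000, "cold": 100_000})
print(f"${cost:.2f}/month")  # $696.00/month
```

Note how cold storage dominates by volume but not by cost, which is exactly why lifecycle policies that demote aging data pay off.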
Cost Allocation and Chargebacks
In organizations with multiple ML teams, each team should be accountable for their infrastructure costs. Implement tagging strategies that associate every resource with a team, project, and environment. Use these tags for cost reporting and to identify teams whose spending is growing faster than their value delivery.
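As a sketch of what tag-driven reporting enables, the snippet below aggregates billing line items by a `team` tag. The line-item shape here is an assumption for illustration; real billing exports (e.g., AWS Cost and Usage Reports, GCP billing export) have richer schemas:

```python
from collections import defaultdict

def cost_by_team(line_items):
    """Aggregate cost by the 'team' tag; untagged spend gets its own bucket."""
    totals = defaultdict(float)
    for item in line_items:
        team = item.get("tags", {}).get("team", "untagged")
        totals[team] += item["cost"]
    return dict(totals)

items = [
    {"cost": 120.0, "tags": {"team": "ranking", "project": "search", "env": "prod"}},
    {"cost": 80.0, "tags": {"team": "ranking", "project": "search", "env": "dev"}},
    {"cost": 55.0, "tags": {}},  # untagged resources surface as their own line
]
print(cost_by_team(items))  # {'ranking': 200.0, 'untagged': 55.0}
```

Surfacing the "untagged" bucket explicitly is the useful part: it shows how much spend the tagging policy has not yet captured.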
Budget Alerts and Guardrails
- Set daily and monthly spending alerts per team and per project
- Implement GPU quotas to prevent a single experiment from consuming all available GPUs
- Require approval for training jobs estimated to cost more than a threshold (e.g., $500)
- Auto-terminate idle GPU instances after 30 minutes of inactivity
- Review the top 10 most expensive jobs weekly
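A minimal sketch of the approval guardrail above, using the same $500 threshold. The `TrainingJob` shape, function name, and price table are illustrative assumptions, not a real scheduler API:

```python
from dataclasses import dataclass

APPROVAL_THRESHOLD_USD = 500.0
GPU_PRICES = {"A100": 3.50, "V100": 2.48, "T4": 0.526}  # $/hr, illustrative

@dataclass
class TrainingJob:
    gpu_type: str
    num_gpus: int
    estimated_hours: float

def requires_approval(job: TrainingJob) -> bool:
    """Flag jobs whose estimated cost exceeds the approval threshold."""
    estimate = GPU_PRICES[job.gpu_type] * job.num_gpus * job.estimated_hours
    return estimate > APPROVAL_THRESHOLD_USD

print(requires_approval(TrainingJob("A100", 8, 24)))  # 8 x $3.50 x 24h = $672 -> True
print(requires_approval(TrainingJob("T4", 2, 10)))    # ~ $10.52 -> False
```

In practice this check would run in the job-submission path, before any GPUs are allocated.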
Optimizing Inference Costs
For many AI systems, inference costs exceed training costs because inference runs continuously while training is periodic. Key optimization strategies include model distillation (training a smaller model to mimic a larger one), quantization (reducing precision from FP32 to INT8), request batching, and intelligent caching.
```python
# Example: Caching frequent predictions
import hashlib
from functools import lru_cache

class CachedPredictor:
    def __init__(self, model, cache_size=10000):
        self.model = model
        self._features = {}  # feature hash -> feature array, for cached lookups
        # Wrap the implementation so the LRU cache keys on the feature hash
        self.predict_cached = lru_cache(maxsize=cache_size)(self._predict_impl)

    def _predict_impl(self, feature_hash):
        return self.model.predict(self._features[feature_hash])

    def predict(self, features):
        # Hash the raw bytes of the feature array (assumes a NumPy array)
        key = hashlib.md5(features.tobytes()).hexdigest()
        self._features[key] = features
        return self.predict_cached(key)
```

Note that `_features` grows without bound here; a production version would evict feature entries alongside the LRU cache.
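Request batching, mentioned above alongside caching, can be sketched as a micro-batcher that accumulates requests and serves them in one model call. The `predict_batch` interface and the synchronous flow are assumptions; a real server would hand out futures instead of returning `None`:

```python
import time

class MicroBatcher:
    """Accumulate requests, then run one batched model call.

    Batching amortizes per-call overhead (kernel launches, data
    transfer) across many requests, cutting cost per prediction.
    """
    def __init__(self, model, max_batch=32, max_wait_s=0.01):
        self.model = model
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._pending = []
        self._first_arrival = None

    def submit(self, features):
        if not self._pending:
            self._first_arrival = time.monotonic()
        self._pending.append(features)
        waited = time.monotonic() - self._first_arrival
        if len(self._pending) >= self.max_batch or waited >= self.max_wait_s:
            return self.flush()
        return None  # request is queued; results arrive on a later flush

    def flush(self):
        batch, self._pending = self._pending, []
        return self.model.predict_batch(batch)  # one GPU call for many requests
```

The `max_wait_s` knob is the cost-latency dial: a longer wait yields fuller batches and cheaper predictions at the price of added tail latency.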
Cost-Performance Trade-offs
Every architecture decision involves a cost-performance trade-off. Using a larger model improves accuracy but increases serving costs. Real-time features improve relevance but are more expensive than batch features. The key is making these trade-offs explicitly and measuring the business impact of each decision.
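One way to make the trade-off explicit is to express each candidate model as cost per 1,000 predictions and weigh that against its accuracy. The throughput figures below are hypothetical inputs for illustration, not benchmarks:

```python
def cost_per_1k_predictions(gpu_price_per_hr, throughput_per_s):
    """Convert GPU price and sustained throughput into $ per 1,000 predictions."""
    preds_per_hour = throughput_per_s * 3600
    return gpu_price_per_hr / preds_per_hour * 1000

# Hypothetical: a large model at 120 preds/s vs. a distilled one at 900 preds/s,
# both on an A100 at $3.50/hr
large = cost_per_1k_predictions(gpu_price_per_hr=3.50, throughput_per_s=120)
small = cost_per_1k_predictions(gpu_price_per_hr=3.50, throughput_per_s=900)
print(f"large: ${large:.4f}/1k  small: ${small:.4f}/1k")
```

With both options on a per-1,000-predictions basis, the question becomes concrete: is the larger model's accuracy gain worth several times the serving cost at your traffic volume?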
The final lesson covers how to document all of these architecture decisions, components, and trade-offs in clear, maintainable architecture documentation.
Lilly Tech Systems