Cost Architecture for AI Systems
AI systems are among the most expensive software systems to build and operate. GPU compute for training, storage for massive datasets, serving infrastructure for low-latency inference, and the engineering talent to build it all add up quickly. Cost architecture is the discipline of designing AI systems to maximize the value delivered per dollar spent.
Unlike traditional software where compute costs are relatively predictable, AI costs can spike unexpectedly. A hyperparameter search that spawns 100 training runs overnight. A data pipeline bug that reprocesses 6 months of data. A model that requires 8 A100 GPUs when the previous version ran on 2. Cost architecture builds guardrails and visibility to prevent these surprises.
The Major Cost Drivers
Compute Costs
GPU compute is typically the largest cost in AI systems. Training costs depend on model size, dataset size, and the number of experiments. Serving costs depend on traffic volume, latency requirements, and model complexity.
- Training compute — GPU hours for model training and hyperparameter tuning
- Inference compute — GPU/CPU resources for serving predictions in production
- Feature compute — Processing resources for feature engineering pipelines
- Data processing — Spark/Beam clusters for ETL and data transformation
```python
# Cost estimation for a training job
def estimate_training_cost(
    gpu_type="A100",
    num_gpus=4,
    training_hours=8,
    experiments=10,
):
    gpu_prices = {
        "A100": 3.50,  # $/hr on-demand
        "V100": 2.48,
        "T4": 0.526,
        "A10G": 1.006,
    }
    cost_per_experiment = gpu_prices[gpu_type] * num_gpus * training_hours
    total = cost_per_experiment * experiments
    spot_total = total * 0.3  # ~70% savings with spot instances
    print(f"On-demand: ${total:.2f}")
    print(f"Spot instances: ${spot_total:.2f}")
    return total

# Example: 10 experiments on 4x A100 for 8 hours each
estimate_training_cost()
# On-demand: $1120.00
# Spot instances: $336.00
```
Storage Costs
AI systems accumulate data rapidly: raw datasets, processed features, model artifacts, experiment logs, and prediction logs. Implement tiered storage strategies:
- Hot storage — Recent data and active model artifacts (SSD, high IOPS)
- Warm storage — Historical training data, older model versions (standard S3/GCS)
- Cold storage — Archived data for compliance (Glacier, Archive storage)
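The tiers above can be turned into a rough monthly estimate. As a sketch, the per-GB prices below are illustrative placeholders (roughly in line with public object-storage list prices), not quotes, and archive retrieval fees are ignored:

```python
# Illustrative $/GB-month prices per tier (placeholders, not quotes)
TIER_PRICE_PER_GB_MONTH = {
    "hot": 0.023,    # standard object storage, SSD-backed serving copies extra
    "warm": 0.0125,  # infrequent-access class
    "cold": 0.004,   # archive class; excludes retrieval fees
}

def monthly_storage_cost(gb_by_tier):
    """Sum monthly cost across tiers for a {tier: GB} mapping."""
    return sum(TIER_PRICE_PER_GB_MONTH[tier] * gb for tier, gb in gb_by_tier.items())

# Example: 2 TB hot, 20 TB warm, 100 TB cold
cost = monthly_storage_cost({"hot": 2_000, "warm": 20_000, "cold": 100_000})
print(f"${cost:.2f}/month")  # $696.00/month
```

Note how cold storage dominates by volume but not by cost, which is exactly why lifecycle policies that demote aging data pay off.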
Cost Allocation and Chargebacks
In organizations with multiple ML teams, each team should be accountable for their infrastructure costs. Implement tagging strategies that associate every resource with a team, project, and environment. Use these tags for cost reporting and to identify teams whose spending is growing faster than their value delivery.
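As a sketch of what tag-driven reporting enables, the snippet below aggregates billing line items by a `team` tag. The line-item shape here is an assumption for illustration; real billing exports (e.g., AWS Cost and Usage Reports, GCP billing export) have richer schemas:

```python
from collections import defaultdict

def cost_by_team(line_items):
    """Aggregate cost by the 'team' tag; untagged spend gets its own bucket."""
    totals = defaultdict(float)
    for item in line_items:
        team = item.get("tags", {}).get("team", "untagged")
        totals[team] += item["cost"]
    return dict(totals)

items = [
    {"cost": 120.0, "tags": {"team": "ranking", "project": "search", "env": "prod"}},
    {"cost": 80.0, "tags": {"team": "ranking", "project": "search", "env": "dev"}},
    {"cost": 55.0, "tags": {}},  # untagged resources surface as their own line
]
print(cost_by_team(items))  # {'ranking': 200.0, 'untagged': 55.0}
```

Surfacing the "untagged" bucket explicitly is the useful part: it shows how much spend the tagging policy has not yet captured.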
Budget Alerts and Guardrails
- Set daily and monthly spending alerts per team and per project
- Implement GPU quotas to prevent a single experiment from consuming all available GPUs
- Require approval for training jobs estimated to cost more than a threshold (e.g., $500)
- Auto-terminate idle GPU instances after 30 minutes of inactivity
- Review the top 10 most expensive jobs weekly
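A minimal sketch of the approval guardrail above, using the same $500 threshold. The `TrainingJob` shape, function name, and price table are illustrative assumptions, not a real scheduler API:

```python
from dataclasses import dataclass

APPROVAL_THRESHOLD_USD = 500.0
GPU_PRICES = {"A100": 3.50, "V100": 2.48, "T4": 0.526}  # $/hr, illustrative

@dataclass
class TrainingJob:
    gpu_type: str
    num_gpus: int
    estimated_hours: float

def requires_approval(job: TrainingJob) -> bool:
    """Flag jobs whose estimated cost exceeds the approval threshold."""
    estimate = GPU_PRICES[job.gpu_type] * job.num_gpus * job.estimated_hours
    return estimate > APPROVAL_THRESHOLD_USD

print(requires_approval(TrainingJob("A100", 8, 24)))  # 8 x $3.50 x 24h = $672 -> True
print(requires_approval(TrainingJob("T4", 2, 10)))    # ~ $10.52 -> False
```

In practice this check would run in the job-submission path, before any GPUs are allocated.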
Optimizing Inference Costs
For many AI systems, inference costs exceed training costs because inference runs continuously while training is periodic. Key optimization strategies include model distillation (training a smaller model to mimic a larger one), quantization (reducing precision from FP32 to INT8), request batching, and intelligent caching.
```python
# Example: Caching frequent predictions
import hashlib
from functools import lru_cache

class CachedPredictor:
    def __init__(self, model, cache_size=10000):
        self.model = model
        self._features = {}  # feature hash -> feature array, for cached lookups
        # Wrap the implementation so the LRU cache keys on the feature hash
        self.predict_cached = lru_cache(maxsize=cache_size)(self._predict_impl)

    def _predict_impl(self, feature_hash):
        return self.model.predict(self._features[feature_hash])

    def predict(self, features):
        # Hash the raw bytes of the feature array (assumes a NumPy array)
        key = hashlib.md5(features.tobytes()).hexdigest()
        self._features[key] = features
        return self.predict_cached(key)
```

Note that `_features` grows without bound here; a production version would evict feature entries alongside the LRU cache.
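Request batching, mentioned above alongside caching, can be sketched as a micro-batcher that accumulates requests and serves them in one model call. The `predict_batch` interface and the synchronous flow are assumptions; a real server would hand out futures instead of returning `None`:

```python
import time

class MicroBatcher:
    """Accumulate requests, then run one batched model call.

    Batching amortizes per-call overhead (kernel launches, data
    transfer) across many requests, cutting cost per prediction.
    """
    def __init__(self, model, max_batch=32, max_wait_s=0.01):
        self.model = model
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._pending = []
        self._first_arrival = None

    def submit(self, features):
        if not self._pending:
            self._first_arrival = time.monotonic()
        self._pending.append(features)
        waited = time.monotonic() - self._first_arrival
        if len(self._pending) >= self.max_batch or waited >= self.max_wait_s:
            return self.flush()
        return None  # request is queued; results arrive on a later flush

    def flush(self):
        batch, self._pending = self._pending, []
        return self.model.predict_batch(batch)  # one GPU call for many requests
```

The `max_wait_s` knob is the cost-latency dial: a longer wait yields fuller batches and cheaper predictions at the price of added tail latency.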
Cost-Performance Trade-offs
Every architecture decision involves a cost-performance trade-off. Using a larger model improves accuracy but increases serving costs. Real-time features improve relevance but are more expensive than batch features. The key is making these trade-offs explicitly and measuring the business impact of each decision.
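One way to make the trade-off explicit is to express each candidate model as cost per 1,000 predictions and weigh that against its accuracy. The throughput figures below are hypothetical inputs for illustration, not benchmarks:

```python
def cost_per_1k_predictions(gpu_price_per_hr, throughput_per_s):
    """Convert GPU price and sustained throughput into $ per 1,000 predictions."""
    preds_per_hour = throughput_per_s * 3600
    return gpu_price_per_hr / preds_per_hour * 1000

# Hypothetical: a large model at 120 preds/s vs. a distilled one at 900 preds/s,
# both on an A100 at $3.50/hr
large = cost_per_1k_predictions(gpu_price_per_hr=3.50, throughput_per_s=120)
small = cost_per_1k_predictions(gpu_price_per_hr=3.50, throughput_per_s=900)
print(f"large: ${large:.4f}/1k  small: ${small:.4f}/1k")
```

With both options on a per-1,000-predictions basis, the question becomes concrete: is the larger model's accuracy gain worth several times the serving cost at your traffic volume?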
The final lesson covers how to document all of these architecture decisions, components, and trade-offs in clear, maintainable architecture documentation.
Lilly Tech Systems