Production Deployment Patterns (Advanced)
Deploying a feature store to production is where architecture meets reality. This lesson covers the patterns used by companies serving millions of feature lookups per second: high-availability design, cross-region replication, performance benchmarking methodology, cost optimization strategies, and the monitoring infrastructure that keeps it all running.
High-Availability Architecture
Architecture
Production HA Feature Store (99.99% SLA Target)
=================================================
Region: us-east-1
+----------------------------------------------------------+
| AZ-a AZ-b AZ-c |
| +----------------+ +----------------+ +--------+ |
| | Redis Primary |<--->| Redis Replica |<--->| Replica| |
| | (3 shards) | | (3 shards) | | | |
| +----------------+ +----------------+ +--------+ |
| ^ ^ |
| | | |
| +----------------+ +----------------+ |
| | Feature Server | | Feature Server | |
| | (gRPC, 4 pods) | | (gRPC, 4 pods) | |
| +----------------+ +----------------+ |
| ^ ^ |
| | Load Balancer | |
| +----------+------------+ |
| | |
| +----------------+----------------+ |
| | Materialization | Registry | |
| | Workers (K8s) | (PostgreSQL | |
| | (Airflow DAGs) | HA pair) | |
| +-----------------+---------------+ |
+----------------------------------------------------------+
|
| Cross-region replication (async)
v
Region: us-west-2 (DR / read replicas)
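The read path in the diagram can be sketched as a router that prefers the replica in the caller's availability zone and falls back to the primary when the local replica is unhealthy. This is an illustrative sketch, not tied to a specific client library; the endpoint names (`redis-primary.az-a`, etc.) are hypothetical.

```python
class AZAwareRouter:
    """Prefers the replica in the caller's availability zone for reads."""

    def __init__(self, primary: str, replicas: dict):
        self.primary = primary      # e.g. "redis-primary.az-a"
        self.replicas = replicas    # {"az-b": "redis-replica.az-b", ...}
        self.unhealthy = set()      # endpoints currently failing health checks

    def endpoint_for_read(self, az: str) -> str:
        """Same-AZ replica if healthy, otherwise the primary (cross-AZ hop)."""
        replica = self.replicas.get(az)
        if replica and replica not in self.unhealthy:
            return replica
        return self.primary

    def endpoint_for_write(self) -> str:
        """All writes go to the primary shard set."""
        return self.primary

router = AZAwareRouter(
    primary="redis-primary.az-a",
    replicas={"az-b": "redis-replica.az-b", "az-c": "redis-replica.az-c"},
)
print(router.endpoint_for_read("az-b"))   # redis-replica.az-b (same-AZ read)
router.unhealthy.add("redis-replica.az-b")
print(router.endpoint_for_read("az-b"))   # redis-primary.az-a (failover)
```

In production this routing logic typically lives in the client library or a sidecar, driven by the health checks that the load balancer already performs.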
Feature Serving API Design
Python
# Production gRPC feature serving endpoint
import grpc
from concurrent import futures
import time
from prometheus_client import Histogram, Counter

# Metrics
FEATURE_LATENCY = Histogram(
    "feature_serving_latency_seconds",
    "Feature serving latency",
    ["feature_view", "status"],
    buckets=[0.001, 0.002, 0.005, 0.01, 0.025, 0.05, 0.1],
)
FEATURE_REQUESTS = Counter(
    "feature_requests_total",
    "Total feature requests",
    ["feature_view", "status"],
)

class FeatureServingService:
    """Production feature serving with circuit breaker and fallbacks."""

    def __init__(self, online_store, fallback_store=None):
        self.store = online_store
        self.fallback = fallback_store
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5,
            recovery_timeout=30,
        )

    def get_features(self, request):
        start = time.monotonic()
        feature_view = request.feature_view
        try:
            if self.circuit_breaker.is_open:
                # Primary store is down, use fallback
                return self._fallback_read(request)
            features = self.store.batch_read(
                feature_view=feature_view,
                entity_keys=request.entity_keys,
                feature_names=request.feature_names,
            )
            latency = time.monotonic() - start
            FEATURE_LATENCY.labels(feature_view, "success").observe(latency)
            FEATURE_REQUESTS.labels(feature_view, "success").inc()
            return features
        except Exception:
            self.circuit_breaker.record_failure()
            FEATURE_REQUESTS.labels(feature_view, "error").inc()
            # Return default values rather than failing the prediction
            return self._default_features(request)

    def _fallback_read(self, request):
        """Read from the secondary store if configured, else serve defaults."""
        if self.fallback is not None:
            return self.fallback.batch_read(
                feature_view=request.feature_view,
                entity_keys=request.entity_keys,
                feature_names=request.feature_names,
            )
        return self._default_features(request)

    def _default_features(self, request):
        """Return population-level defaults when store is unavailable."""
        defaults = self.store.get_feature_defaults(request.feature_view)
        return [
            {name: defaults.get(name, 0.0) for name in request.feature_names}
            for _ in request.entity_keys
        ]

class CircuitBreaker:
    """Prevents cascading failures when online store is down."""

    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0

    @property
    def is_open(self):
        if self.failure_count >= self.threshold:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.failure_count = 0  # Half-open: try again
                return False
            return True
        return False

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
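The open/half-open transition is easiest to see in a quick demo. This sketch repeats a minimal copy of the circuit breaker above so it runs standalone; the short `recovery_timeout` is only for demonstration.

```python
import time

class CircuitBreaker:
    """Minimal copy of the circuit breaker above, for a standalone demo."""

    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0.0

    @property
    def is_open(self):
        if self.failure_count >= self.threshold:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.failure_count = 0  # half-open: allow a trial request
                return False
            return True
        return False

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()

cb = CircuitBreaker(failure_threshold=2, recovery_timeout=0.1)
cb.record_failure()
cb.record_failure()
assert cb.is_open           # open: reads are routed to fallback/defaults
time.sleep(0.15)
assert not cb.is_open       # recovery window elapsed: half-open, retry primary
```

Note the asymmetry: a single successful trial request fully closes the breaker here (the count resets to zero), which is a deliberate simplification over stateful half-open implementations that require several consecutive successes.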
Cross-Region Deployment
Python
# Cross-region Redis replication for global feature serving
import redis

class MultiRegionFeatureStore:
    """Routes feature reads to nearest region, writes to primary."""

    def __init__(self, regions: dict):
        # regions = {"us-east-1": "redis://...", "eu-west-1": "redis://..."}
        self.clients = {
            region: redis.RedisCluster.from_url(url)
            for region, url in regions.items()
        }
        self.primary_region = "us-east-1"

    def read_features(self, entity_key: str, feature_names: list, region: str = None):
        """Read from nearest region for lowest latency."""
        client = self.clients.get(region, self.clients[self.primary_region])
        return client.hmget(f"fs:{entity_key}", feature_names)

    def write_features(self, entity_key: str, features: dict):
        """Write to primary. Replication handles distribution."""
        # AWS ElastiCache Global Datastore or Redis Enterprise
        # Active-Passive replication handles cross-region sync
        # Replication lag: typically 100-500ms cross-region
        primary = self.clients[self.primary_region]
        primary.hset(f"fs:{entity_key}", mapping=features)

# Cross-region replication options:
# 1. AWS ElastiCache Global Datastore (Redis) - managed, <1s lag
# 2. DynamoDB Global Tables - managed, ~1s lag, multi-active
# 3. Custom CDC pipeline (Debezium) - flexible but complex
Performance Benchmarking
Python
# Feature store load testing framework
import asyncio
import time
import numpy as np
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    qps: float           # Queries per second achieved
    p50_ms: float        # Median latency
    p95_ms: float        # 95th percentile latency
    p99_ms: float        # 99th percentile latency
    error_rate: float    # Fraction of failed requests
    total_requests: int

async def benchmark_feature_store(
    store,
    entity_keys: list,
    feature_names: list,
    target_qps: int = 10000,
    duration_seconds: int = 60,
) -> BenchmarkResult:
    """Load test the online feature store."""
    latencies = []
    errors = 0
    start_time = time.monotonic()
    interval = 1.0 / target_qps

    while time.monotonic() - start_time < duration_seconds:
        entity_key = np.random.choice(entity_keys)
        req_start = time.monotonic()
        try:
            await store.async_read_features(entity_key, feature_names)
            latencies.append((time.monotonic() - req_start) * 1000)
        except Exception:
            errors += 1
        # Pace requests toward the target QPS
        await asyncio.sleep(max(0, interval - (time.monotonic() - req_start)))

    latency_array = np.array(latencies)
    elapsed = time.monotonic() - start_time
    return BenchmarkResult(
        qps=len(latencies) / elapsed,
        p50_ms=np.percentile(latency_array, 50),
        p95_ms=np.percentile(latency_array, 95),
        p99_ms=np.percentile(latency_array, 99),
        error_rate=errors / (len(latencies) + errors),
        total_requests=len(latencies) + errors,
    )

# Target benchmarks for production feature stores:
#   Redis:    p50 < 0.5ms, p99 < 2ms,  10K+ QPS per shard
#   DynamoDB: p50 < 2ms,   p99 < 5ms,  scales with RCU provisioning
#   Bigtable: p50 < 3ms,   p99 < 10ms, scales with node count
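To try the measurement loop without a real store, you can run a compressed version of it against an in-memory mock. `MockStore` and `run_mini_benchmark` are hypothetical stand-ins for this sketch; the simulated 0.5ms lookup is an assumption, not a measured number.

```python
import asyncio
import time
import numpy as np

class MockStore:
    """In-memory stand-in for the online store (illustrative only)."""
    async def async_read_features(self, entity_key, feature_names):
        await asyncio.sleep(0.0005)   # simulate ~0.5ms lookup
        return {name: 1.0 for name in feature_names}

async def run_mini_benchmark(n_requests: int = 200):
    """Same measurement pattern as the full benchmark, minus pacing."""
    store = MockStore()
    latencies = []
    for _ in range(n_requests):
        t0 = time.monotonic()
        await store.async_read_features("user:1", ["f1", "f2"])
        latencies.append((time.monotonic() - t0) * 1000)
    return {
        "p50_ms": float(np.percentile(latencies, 50)),
        "p99_ms": float(np.percentile(latencies, 99)),
        "total": len(latencies),
    }

result = asyncio.run(run_mini_benchmark())
print(result["total"])   # 200
```

Swapping `MockStore` for a real client turns this into a smoke test you can run in CI before the full 60-second load test.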
Cost Optimization Strategies
| Strategy | Savings | Trade-off |
|---|---|---|
| TTL-based eviction | 30-50% memory | Features may be missing for inactive users |
| Binary encoding (protobuf) | 40-60% memory | Slightly more CPU for encoding/decoding |
| Tiered storage | 50-70% cost | Higher latency for cold-tier features (Redis on Flash) |
| Feature vector packing | 20-30% memory | Must read all features even when only a subset is needed |
| DynamoDB on-demand mode | Variable (for bursty traffic) | Higher per-request cost vs provisioned at steady-state |
| Selective materialization | 30-50% compute | Only high-value features in online store; others computed at request time |
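The TTL-based eviction row above deserves a concrete sketch: tie each feature view's TTL to its refresh cadence so inactive entities age out on their own. The view names, cadences, and the 2x-refresh-interval rule of thumb here are illustrative assumptions.

```python
# Hypothetical per-feature-view refresh intervals, in seconds
REFRESH_INTERVAL_S = {
    "user_activity_hourly": 3600,
    "user_profile_daily": 86400,
}

def ttl_for(feature_view: str, safety_factor: int = 2) -> int:
    """TTL = safety_factor * refresh interval, so a key survives one
    missed materialization run before it is evicted."""
    return REFRESH_INTERVAL_S[feature_view] * safety_factor

# With a real Redis client, the write path becomes:
#   r.hset(key, mapping=features)
#   r.expire(key, ttl_for(view))
print(ttl_for("user_activity_hourly"))  # 7200
```

The safety factor is the knob to watch: too low and a single delayed materialization run evicts features for active users; too high and the memory savings evaporate.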
Monitoring Dashboard
YAML
# Grafana dashboard configuration for feature store monitoring
# Key panels to include:
panels:
  # 1. Feature Serving Latency (p50, p95, p99)
  - title: Feature Serving Latency
    query: |
      histogram_quantile(0.99, rate(feature_serving_latency_seconds_bucket[5m]))
    thresholds:
      - value: 0.005   # 5ms - warning
        color: yellow
      - value: 0.010   # 10ms - critical (SLA breach)
        color: red

  # 2. Feature Freshness
  - title: Feature Freshness (Time Since Last Materialization)
    query: |
      time() - feature_last_materialized_timestamp
    alert: "> 2h for hourly features, > 25h for daily features"

  # 3. Online Store Hit Rate
  - title: Cache Hit Rate
    query: |
      rate(feature_requests_total{status="success"}[5m])
        / rate(feature_requests_total[5m])
    threshold: "< 99% = investigate"

  # 4. Null Feature Rate
  - title: Null Feature Rate by Feature View
    query: |
      rate(feature_null_total[5m]) / rate(feature_requests_total[5m])
    alert: "> 5% nulls = data pipeline issue"

  # 5. Materialization Pipeline Health
  - title: Materialization Job Status
    query: airflow_task_status{dag="feature_materialization"}
    alert: "2+ consecutive failures"

  # 6. Online Store Resource Utilization
  - title: Redis Memory / DynamoDB RCU Usage
    threshold: "> 80% = scale up"
SLA Tip: Set your feature store SLA at 99.95% availability (about 22 minutes of downtime per month). Achieving 99.99% typically requires a multi-region active-active deployment, which raises infrastructure cost 3-4x. Most ML models degrade gracefully with default feature values, so brief outages have limited impact on end-user experience.
Ready for Best Practices?
The final lesson covers build vs buy decisions, migration strategies, team ownership models, and a comprehensive FAQ.
Next: Best Practices & Checklist →
Lilly Tech Systems