Production Deployment Patterns (Advanced)
Deploying a feature store to production is where architecture meets reality. This lesson covers the patterns used by companies serving millions of feature lookups per second: high-availability design, cross-region replication, performance benchmarking methodology, cost optimization strategies, and the monitoring infrastructure that keeps it all running.
High-Availability Architecture
Architecture
Production HA Feature Store (99.99% SLA Target)
=================================================
Region: us-east-1
+----------------------------------------------------------+
| AZ-a AZ-b AZ-c |
| +----------------+ +----------------+ +--------+ |
| | Redis Primary |<--->| Redis Replica |<--->| Replica| |
| | (3 shards) | | (3 shards) | | | |
| +----------------+ +----------------+ +--------+ |
| ^ ^ |
| | | |
| +----------------+ +----------------+ |
| | Feature Server | | Feature Server | |
| | (gRPC, 4 pods) | | (gRPC, 4 pods) | |
| +----------------+ +----------------+ |
| ^ ^ |
| | Load Balancer | |
| +----------+------------+ |
| | |
| +----------------+----------------+ |
| | Materialization | Registry | |
| | Workers (K8s) | (PostgreSQL | |
| | (Airflow DAGs) | HA pair) | |
| +-----------------+---------------+ |
+----------------------------------------------------------+
|
| Cross-region replication (async)
v
Region: us-west-2 (DR / read replicas)
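The read path in the diagram can be sketched as a router that prefers the replica in the caller's availability zone and falls back to the primary when the local replica is unhealthy. This is an illustrative sketch, not tied to a specific client library; the endpoint names (`redis-primary.az-a`, etc.) are hypothetical.

```python
class AZAwareRouter:
    """Prefers the replica in the caller's availability zone for reads."""

    def __init__(self, primary: str, replicas: dict):
        self.primary = primary      # e.g. "redis-primary.az-a"
        self.replicas = replicas    # {"az-b": "redis-replica.az-b", ...}
        self.unhealthy = set()      # endpoints currently failing health checks

    def endpoint_for_read(self, az: str) -> str:
        """Same-AZ replica if healthy, otherwise the primary (cross-AZ hop)."""
        replica = self.replicas.get(az)
        if replica and replica not in self.unhealthy:
            return replica
        return self.primary

    def endpoint_for_write(self) -> str:
        """All writes go to the primary shard set."""
        return self.primary

router = AZAwareRouter(
    primary="redis-primary.az-a",
    replicas={"az-b": "redis-replica.az-b", "az-c": "redis-replica.az-c"},
)
print(router.endpoint_for_read("az-b"))   # redis-replica.az-b (same-AZ read)
router.unhealthy.add("redis-replica.az-b")
print(router.endpoint_for_read("az-b"))   # redis-primary.az-a (failover)
```

In production this routing logic typically lives in the client library or a sidecar, driven by the health checks that the load balancer already performs.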
Feature Serving API Design
Python
# Production gRPC feature serving endpoint
import grpc
from concurrent import futures
import time
from prometheus_client import Histogram, Counter

# Metrics
FEATURE_LATENCY = Histogram(
    "feature_serving_latency_seconds",
    "Feature serving latency",
    ["feature_view", "status"],
    buckets=[0.001, 0.002, 0.005, 0.01, 0.025, 0.05, 0.1],
)
FEATURE_REQUESTS = Counter(
    "feature_requests_total",
    "Total feature requests",
    ["feature_view", "status"],
)

class FeatureServingService:
    """Production feature serving with circuit breaker and fallbacks."""

    def __init__(self, online_store, fallback_store=None):
        self.store = online_store
        self.fallback = fallback_store
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5,
            recovery_timeout=30,
        )

    def get_features(self, request):
        start = time.monotonic()
        feature_view = request.feature_view
        try:
            if self.circuit_breaker.is_open:
                # Primary store is down, use fallback
                return self._fallback_read(request)
            features = self.store.batch_read(
                feature_view=feature_view,
                entity_keys=request.entity_keys,
                feature_names=request.feature_names,
            )
            latency = time.monotonic() - start
            FEATURE_LATENCY.labels(feature_view, "success").observe(latency)
            FEATURE_REQUESTS.labels(feature_view, "success").inc()
            return features
        except Exception:
            self.circuit_breaker.record_failure()
            FEATURE_REQUESTS.labels(feature_view, "error").inc()
            # Return default values rather than failing the prediction
            return self._default_features(request)

    def _fallback_read(self, request):
        """Read from the secondary store if configured, else serve defaults."""
        if self.fallback is not None:
            return self.fallback.batch_read(
                feature_view=request.feature_view,
                entity_keys=request.entity_keys,
                feature_names=request.feature_names,
            )
        return self._default_features(request)

    def _default_features(self, request):
        """Return population-level defaults when store is unavailable."""
        defaults = self.store.get_feature_defaults(request.feature_view)
        return [
            {name: defaults.get(name, 0.0) for name in request.feature_names}
            for _ in request.entity_keys
        ]

class CircuitBreaker:
    """Prevents cascading failures when online store is down."""

    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0

    @property
    def is_open(self):
        if self.failure_count >= self.threshold:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.failure_count = 0  # Half-open: try again
                return False
            return True
        return False

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
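The open/half-open transition is easiest to see in a quick demo. This sketch repeats a minimal copy of the circuit breaker above so it runs standalone; the short `recovery_timeout` is only for demonstration.

```python
import time

class CircuitBreaker:
    """Minimal copy of the circuit breaker above, for a standalone demo."""

    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0.0

    @property
    def is_open(self):
        if self.failure_count >= self.threshold:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.failure_count = 0  # half-open: allow a trial request
                return False
            return True
        return False

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()

cb = CircuitBreaker(failure_threshold=2, recovery_timeout=0.1)
cb.record_failure()
cb.record_failure()
assert cb.is_open           # open: reads are routed to fallback/defaults
time.sleep(0.15)
assert not cb.is_open       # recovery window elapsed: half-open, retry primary
```

Note the asymmetry: a single successful trial request fully closes the breaker here (the count resets to zero), which is a deliberate simplification over stateful half-open implementations that require several consecutive successes.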
Cross-Region Deployment
Python
# Cross-region Redis replication for global feature serving
import redis

class MultiRegionFeatureStore:
    """Routes feature reads to nearest region, writes to primary."""

    def __init__(self, regions: dict):
        # regions = {"us-east-1": "redis://...", "eu-west-1": "redis://..."}
        self.clients = {
            region: redis.RedisCluster.from_url(url)
            for region, url in regions.items()
        }
        self.primary_region = "us-east-1"

    def read_features(self, entity_key: str, feature_names: list, region: str = None):
        """Read from nearest region for lowest latency."""
        client = self.clients.get(region, self.clients[self.primary_region])
        return client.hmget(f"fs:{entity_key}", feature_names)

    def write_features(self, entity_key: str, features: dict):
        """Write to primary. Replication handles distribution."""
        # AWS ElastiCache Global Datastore or Redis Enterprise
        # Active-Passive replication handles cross-region sync
        # Replication lag: typically 100-500ms cross-region
        primary = self.clients[self.primary_region]
        primary.hset(f"fs:{entity_key}", mapping=features)

# Cross-region replication options:
# 1. AWS ElastiCache Global Datastore (Redis) - managed, <1s lag
# 2. DynamoDB Global Tables - managed, ~1s lag, multi-active
# 3. Custom CDC pipeline (Debezium) - flexible but complex
Performance Benchmarking
Python
# Feature store load testing framework
import asyncio
import time
import numpy as np
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    qps: float           # Queries per second achieved
    p50_ms: float        # Median latency
    p95_ms: float        # 95th percentile latency
    p99_ms: float        # 99th percentile latency
    error_rate: float    # Fraction of failed requests
    total_requests: int

async def benchmark_feature_store(
    store,
    entity_keys: list,
    feature_names: list,
    target_qps: int = 10000,
    duration_seconds: int = 60,
) -> BenchmarkResult:
    """Load test the online feature store."""
    latencies = []
    errors = 0
    start_time = time.monotonic()
    interval = 1.0 / target_qps

    while time.monotonic() - start_time < duration_seconds:
        entity_key = np.random.choice(entity_keys)
        req_start = time.monotonic()
        try:
            await store.async_read_features(entity_key, feature_names)
            latencies.append((time.monotonic() - req_start) * 1000)
        except Exception:
            errors += 1
        # Pace requests toward the target QPS
        await asyncio.sleep(max(0, interval - (time.monotonic() - req_start)))

    latency_array = np.array(latencies)
    elapsed = time.monotonic() - start_time
    return BenchmarkResult(
        qps=len(latencies) / elapsed,
        p50_ms=np.percentile(latency_array, 50),
        p95_ms=np.percentile(latency_array, 95),
        p99_ms=np.percentile(latency_array, 99),
        error_rate=errors / (len(latencies) + errors),
        total_requests=len(latencies) + errors,
    )

# Target benchmarks for production feature stores:
#   Redis:    p50 < 0.5ms, p99 < 2ms,  10K+ QPS per shard
#   DynamoDB: p50 < 2ms,   p99 < 5ms,  scales with RCU provisioning
#   Bigtable: p50 < 3ms,   p99 < 10ms, scales with node count
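To try the measurement loop without a real store, you can run a compressed version of it against an in-memory mock. `MockStore` and `run_mini_benchmark` are hypothetical stand-ins for this sketch; the simulated 0.5ms lookup is an assumption, not a measured number.

```python
import asyncio
import time
import numpy as np

class MockStore:
    """In-memory stand-in for the online store (illustrative only)."""
    async def async_read_features(self, entity_key, feature_names):
        await asyncio.sleep(0.0005)   # simulate ~0.5ms lookup
        return {name: 1.0 for name in feature_names}

async def run_mini_benchmark(n_requests: int = 200):
    """Same measurement pattern as the full benchmark, minus pacing."""
    store = MockStore()
    latencies = []
    for _ in range(n_requests):
        t0 = time.monotonic()
        await store.async_read_features("user:1", ["f1", "f2"])
        latencies.append((time.monotonic() - t0) * 1000)
    return {
        "p50_ms": float(np.percentile(latencies, 50)),
        "p99_ms": float(np.percentile(latencies, 99)),
        "total": len(latencies),
    }

result = asyncio.run(run_mini_benchmark())
print(result["total"])   # 200
```

Swapping `MockStore` for a real client turns this into a smoke test you can run in CI before the full 60-second load test.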
Cost Optimization Strategies
| Strategy | Savings | Trade-off |
|---|---|---|
| TTL-based eviction | 30-50% memory | Features may be missing for inactive users |
| Binary encoding (protobuf) | 40-60% memory | Slightly more CPU for encoding/decoding |
| Tiered storage | 50-70% cost | Higher latency for cold-tier features (Redis on Flash) |
| Feature vector packing | 20-30% memory | Must read all features even when only a subset is needed |
| DynamoDB on-demand mode | Variable (for bursty traffic) | Higher per-request cost vs provisioned at steady-state |
| Selective materialization | 30-50% compute | Only high-value features in online store; others computed at request time |
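The TTL-based eviction row above deserves a concrete sketch: tie each feature view's TTL to its refresh cadence so inactive entities age out on their own. The view names, cadences, and the 2x-refresh-interval rule of thumb here are illustrative assumptions.

```python
# Hypothetical per-feature-view refresh intervals, in seconds
REFRESH_INTERVAL_S = {
    "user_activity_hourly": 3600,
    "user_profile_daily": 86400,
}

def ttl_for(feature_view: str, safety_factor: int = 2) -> int:
    """TTL = safety_factor * refresh interval, so a key survives one
    missed materialization run before it is evicted."""
    return REFRESH_INTERVAL_S[feature_view] * safety_factor

# With a real Redis client, the write path becomes:
#   r.hset(key, mapping=features)
#   r.expire(key, ttl_for(view))
print(ttl_for("user_activity_hourly"))  # 7200
```

The safety factor is the knob to watch: too low and a single delayed materialization run evicts features for active users; too high and the memory savings evaporate.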
Monitoring Dashboard
YAML
# Grafana dashboard configuration for feature store monitoring
# Key panels to include:
panels:
  # 1. Feature Serving Latency (p50, p95, p99)
  - title: Feature Serving Latency
    query: |
      histogram_quantile(0.99, rate(feature_serving_latency_seconds_bucket[5m]))
    thresholds:
      - value: 0.005   # 5ms - warning
        color: yellow
      - value: 0.010   # 10ms - critical (SLA breach)
        color: red

  # 2. Feature Freshness
  - title: Feature Freshness (Time Since Last Materialization)
    query: |
      time() - feature_last_materialized_timestamp
    alert: "> 2h for hourly features, > 25h for daily features"

  # 3. Online Store Hit Rate
  - title: Cache Hit Rate
    query: |
      rate(feature_requests_total{status="success"}[5m])
        / rate(feature_requests_total[5m])
    threshold: "< 99% = investigate"

  # 4. Null Feature Rate
  - title: Null Feature Rate by Feature View
    query: |
      rate(feature_null_total[5m]) / rate(feature_requests_total[5m])
    alert: "> 5% nulls = data pipeline issue"

  # 5. Materialization Pipeline Health
  - title: Materialization Job Status
    query: airflow_task_status{dag="feature_materialization"}
    alert: "2+ consecutive failures"

  # 6. Online Store Resource Utilization
  - title: Redis Memory / DynamoDB RCU Usage
    threshold: "> 80% = scale up"
SLA Tip: Set your feature store SLA at 99.95% availability (about 22 minutes of downtime per month). Achieving 99.99% typically requires a multi-region active-active deployment, which raises infrastructure cost 3-4x. Most ML models degrade gracefully with default feature values, so brief outages have limited impact on end-user experience.
Ready for Best Practices?
The final lesson covers build vs buy decisions, migration strategies, team ownership models, and a comprehensive FAQ.
Next: Best Practices & Checklist →
Lilly Tech Systems