Multi-Provider Routing
The routing engine is the brain of your AI gateway. It decides which provider handles each request based on cost, latency, availability, and model capabilities. A well-designed router can cut costs by 40%, improve reliability to 99.99%, and route requests to the best model for each task — all transparently to the application.
Routing Strategy Overview
Production gateways combine multiple routing strategies. Each strategy serves a different goal:
| Strategy | Goal | When to Use |
|---|---|---|
| Round Robin | Even distribution | Multiple accounts with same provider to increase rate limits |
| Weighted | Proportional traffic split | A/B testing models, gradual migration between providers |
| Latency-Based | Fastest response | User-facing real-time applications |
| Cost-Based | Cheapest provider | Batch processing, internal tools, non-latency-sensitive tasks |
| Fallback Chain | Maximum reliability | Every production deployment (always have a fallback) |
| Capability Match | Right model for the task | Routing vision requests, long context, structured output |
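The table above can be encoded directly as a policy map from request class to strategy. A minimal sketch (the request-class names are illustrative assumptions; the strategy names match the table):

```python
# Illustrative policy map: request class -> routing strategy.
# The request-class keys are made up for this example; the strategy
# values are the ones described in the table above.
ROUTING_POLICY = {
    "user-facing": "latency",   # real-time chat: lowest response time wins
    "batch": "cost",            # overnight summarization: cheapest wins
    "experiment": "weighted",   # A/B testing a new model
    "default": "fallback",      # everything else: reliability first
}

def strategy_for(request_class: str) -> str:
    """Pick a routing strategy, falling back to the default policy."""
    return ROUTING_POLICY.get(request_class, ROUTING_POLICY["default"])
```

Keeping the policy in data rather than code means new request classes can be added without touching the router itself.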
Production Router Implementation
Here is a complete multi-strategy router that you can deploy in production. It supports fallback chains, latency tracking, cost optimization, and model capability matching:
import asyncio
import time
import random
from dataclasses import dataclass, field
from typing import Optional
from collections import deque
import httpx
@dataclass
class ProviderEndpoint:
"""Represents one AI provider endpoint."""
name: str # "openai-1", "anthropic-prod"
provider: str # "openai", "anthropic", "google", "local"
base_url: str
api_key: str
models: list[str] # Models available at this endpoint
cost_per_1k_input: float # Cost per 1K input tokens
cost_per_1k_output: float # Cost per 1K output tokens
max_context: int = 128000 # Max context window
supports_vision: bool = False
supports_streaming: bool = True
supports_json_mode: bool = True
weight: float = 1.0 # For weighted routing
priority: int = 1 # For fallback ordering (1 = highest)
# Runtime health tracking
is_healthy: bool = True
consecutive_failures: int = 0
latency_history: deque = field(default_factory=lambda: deque(maxlen=100))
@property
def avg_latency_ms(self) -> float:
        if not self.latency_history:
            return float("inf")  # no samples yet: sorts last in latency ordering
        return sum(self.latency_history) / len(self.latency_history)
class AIRouter:
"""Multi-strategy router for AI gateway."""
def __init__(self, endpoints: list[ProviderEndpoint]):
self.endpoints = endpoints
self.round_robin_index = 0
def select(
self,
model: str,
strategy: str = "fallback",
requirements: Optional[dict] = None
) -> list[ProviderEndpoint]:
"""
Select ordered list of endpoints to try.
Returns list for fallback - try first, fall back to rest.
"""
requirements = requirements or {}
# Step 1: Filter to endpoints that support the requested model
candidates = [ep for ep in self.endpoints if self._matches(ep, model, requirements)]
if not candidates:
raise ValueError(f"No healthy endpoint supports model={model}")
# Step 2: Order by strategy
if strategy == "cost":
return self._sort_by_cost(candidates)
elif strategy == "latency":
return self._sort_by_latency(candidates)
elif strategy == "weighted":
return self._sort_by_weight(candidates)
elif strategy == "round-robin":
return self._round_robin(candidates)
else: # "fallback" - default
return self._sort_by_priority(candidates)
    def _matches(self, ep: ProviderEndpoint, model: str, reqs: dict) -> bool:
        """Check if endpoint supports model and requirements."""
        if not ep.is_healthy:
            return False
        # "auto" skips the model filter so capability requirements alone decide
        if model != "auto":
            if model not in ep.models and not self._model_alias_match(ep, model):
                return False
        if reqs.get("vision") and not ep.supports_vision:
            return False
        if reqs.get("json_mode") and not ep.supports_json_mode:
            return False
        if reqs.get("min_context", 0) > ep.max_context:
            return False
        return True
    def _model_alias_match(self, ep: ProviderEndpoint, model: str) -> bool:
        """Handle model aliases: "gpt-4o" matches dated variants like "gpt-4o-2024-11-20".

        Note: bidirectional substring matching is loose ("gpt-4" would also
        match "gpt-4o"); use an explicit alias table in production.
        """
        return any(model in m or m in model for m in ep.models)
def _sort_by_cost(self, eps: list) -> list:
return sorted(eps, key=lambda e: e.cost_per_1k_input + e.cost_per_1k_output)
def _sort_by_latency(self, eps: list) -> list:
return sorted(eps, key=lambda e: e.avg_latency_ms)
def _sort_by_priority(self, eps: list) -> list:
return sorted(eps, key=lambda e: e.priority)
    def _sort_by_weight(self, eps: list) -> list:
        """Weighted random ordering: higher weight = more likely to come first.

        Uses the Efraimidis-Spirakis key trick, so a weight-2 endpoint is
        twice as likely as a weight-1 endpoint to head the list.
        Weights must be positive.
        """
        return sorted(eps, key=lambda e: random.random() ** (1.0 / e.weight), reverse=True)
    def _round_robin(self, eps: list) -> list:
        # Read the index before advancing it so the first call starts at 0
        idx = self.round_robin_index % len(eps)
        self.round_robin_index += 1
        return eps[idx:] + eps[:idx]
def record_success(self, endpoint: ProviderEndpoint, latency_ms: float):
"""Update endpoint health after successful request."""
endpoint.latency_history.append(latency_ms)
endpoint.consecutive_failures = 0
endpoint.is_healthy = True
    def record_failure(self, endpoint: ProviderEndpoint):
        """Update endpoint health after failed request."""
        endpoint.consecutive_failures += 1
        if endpoint.consecutive_failures >= 3:
            endpoint.is_healthy = False
            # Schedule health check to re-enable. _health_check is a coroutine,
            # and call_later only accepts plain callables, so wrap it in a task.
            asyncio.get_running_loop().call_later(
                30.0, lambda: asyncio.create_task(self._health_check(endpoint))
            )
async def _health_check(self, endpoint: ProviderEndpoint):
"""Try a lightweight request to see if endpoint recovered."""
try:
async with httpx.AsyncClient() as client:
resp = await client.get(
f"{endpoint.base_url}/models",
headers={"Authorization": f"Bearer {endpoint.api_key}"},
timeout=5.0
)
if resp.status_code == 200:
endpoint.is_healthy = True
endpoint.consecutive_failures = 0
        except Exception:
            # Still unhealthy, check again later (again wrapping the
            # coroutine in a task, since call_later needs a plain callable)
            asyncio.get_running_loop().call_later(
                60.0, lambda: asyncio.create_task(self._health_check(endpoint))
            )
Fallback Chains
Fallback chains are the most critical routing pattern. When your primary provider has an outage, the gateway automatically routes to the next provider with zero application changes:
async def execute_with_fallback(
router: AIRouter,
request_body: dict,
strategy: str = "fallback"
) -> dict:
"""Execute request with automatic fallback across providers."""
model = request_body["model"]
endpoints = router.select(model, strategy=strategy)
last_error = None
    for endpoint in endpoints:
        try:
            start = time.time()
            async with httpx.AsyncClient() as client:
                # Translate request format if needed (OpenAI -> Anthropic/Google)
                translated = translate_request(request_body, endpoint.provider)
                # Each provider exposes a different completion path and auth scheme
                if endpoint.provider == "anthropic":
                    url = f"{endpoint.base_url}/messages"
                    headers = {
                        "x-api-key": endpoint.api_key,
                        "anthropic-version": "2023-06-01",
                        "Content-Type": "application/json",
                    }
                elif endpoint.provider == "google":
                    url = (f"{endpoint.base_url}/models/"
                           f"{request_body['model']}:generateContent"
                           f"?key={endpoint.api_key}")
                    headers = {"Content-Type": "application/json"}
                else:  # OpenAI-compatible (openai, local)
                    url = f"{endpoint.base_url}/chat/completions"
                    headers = {
                        "Authorization": f"Bearer {endpoint.api_key}",
                        "Content-Type": "application/json",
                    }
                response = await client.post(
                    url, headers=headers, json=translated, timeout=120.0
                )
if response.status_code == 200:
latency_ms = (time.time() - start) * 1000
router.record_success(endpoint, latency_ms)
result = response.json()
# Normalize response to OpenAI format
return normalize_response(result, endpoint.provider)
elif response.status_code == 429:
# Rate limited - try next provider immediately
router.record_failure(endpoint)
last_error = f"{endpoint.name}: rate limited"
continue
elif response.status_code >= 500:
# Server error - try next provider
router.record_failure(endpoint)
last_error = f"{endpoint.name}: {response.status_code}"
continue
                else:
                    # Client error (400, 401) - retrying another provider
                    # won't help, so return the provider's error body as-is
                    return response.json()
except httpx.TimeoutException:
router.record_failure(endpoint)
last_error = f"{endpoint.name}: timeout"
continue
except Exception as e:
router.record_failure(endpoint)
last_error = f"{endpoint.name}: {str(e)}"
continue
raise Exception(f"All providers failed. Last error: {last_error}")
def translate_request(body: dict, provider: str) -> dict:
"""Translate OpenAI format to provider-specific format."""
if provider == "openai":
return body
    elif provider == "anthropic":
        # Anthropic takes the system prompt as a top-level field, not a message
        system = "\n".join(
            m["content"] for m in body["messages"] if m["role"] == "system"
        )
        translated = {
            "model": body["model"],
            "max_tokens": body.get("max_tokens", 4096),
            "messages": [m for m in body["messages"] if m["role"] != "system"],
            "temperature": body.get("temperature", 1.0),
        }
        if system:
            translated["system"] = system
        return translated
    elif provider == "google":
        # Gemini uses "model" instead of "assistant" as the role name;
        # system messages are folded in as user turns here (production code
        # should move them to systemInstruction instead)
        return {
            "contents": [
                {
                    "role": "model" if msg["role"] == "assistant" else "user",
                    "parts": [{"text": msg["content"]}],
                }
                for msg in body["messages"]
            ],
            "generationConfig": {
                "temperature": body.get("temperature", 1.0),
                "maxOutputTokens": body.get("max_tokens", 4096),
            }
        }
return body
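The fallback loop above also calls a normalize_response helper that the listing does not define. A minimal sketch of the Anthropic case follows; the field names track the public Anthropic and OpenAI response shapes, but treat this as an illustration rather than a complete mapping (streaming, tool calls, and multi-block content are omitted):

```python
def normalize_response(result: dict, provider: str) -> dict:
    """Map a provider-native response back to the OpenAI chat-completion shape."""
    if provider in ("openai", "local"):
        return result  # already OpenAI-shaped
    if provider == "anthropic":
        # Anthropic returns content as a list of blocks; join the text blocks
        text = "".join(
            block.get("text", "")
            for block in result.get("content", [])
            if block.get("type") == "text"
        )
        return {
            "id": result.get("id"),
            "object": "chat.completion",
            "model": result.get("model"),
            "choices": [{
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": result.get("stop_reason"),
            }],
            "usage": {
                "prompt_tokens": result.get("usage", {}).get("input_tokens"),
                "completion_tokens": result.get("usage", {}).get("output_tokens"),
            },
        }
    return result  # unknown provider: pass through unchanged
```

A Google branch would follow the same pattern, lifting text out of `candidates[0].content.parts`.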
Cost-Based Routing Configuration
Cost-based routing can save 30-60% on API spend by routing non-critical requests to cheaper providers or models:
# Provider pricing configuration (as of early 2026)
PROVIDER_CONFIG = [
ProviderEndpoint(
name="openai-gpt4o",
provider="openai",
base_url="https://api.openai.com/v1",
api_key="sk-...",
models=["gpt-4o", "gpt-4o-2024-11-20"],
        cost_per_1k_input=0.0025,   # $2.50 per 1M input tokens
        cost_per_1k_output=0.010,   # $10.00 per 1M output tokens
supports_vision=True,
priority=1,
),
ProviderEndpoint(
name="anthropic-sonnet",
provider="anthropic",
base_url="https://api.anthropic.com/v1",
api_key="sk-ant-...",
models=["claude-sonnet-4-20250514"],
        cost_per_1k_input=0.003,    # $3.00 per 1M input tokens
        cost_per_1k_output=0.015,   # $15.00 per 1M output tokens
supports_vision=True,
priority=2,
),
ProviderEndpoint(
name="google-flash",
provider="google",
base_url="https://generativelanguage.googleapis.com/v1beta",
api_key="AIza...",
models=["gemini-2.0-flash"],
        cost_per_1k_input=0.0001,   # $0.10 per 1M input tokens - 25x cheaper than GPT-4o
        cost_per_1k_output=0.0004,  # $0.40 per 1M output tokens
supports_vision=True,
max_context=1000000,
priority=3,
),
ProviderEndpoint(
name="local-llama",
provider="local",
base_url="http://gpu-server.internal:8080/v1",
api_key="local-key",
models=["llama-3.1-70b"],
cost_per_1k_input=0.0, # Free - your own GPU
cost_per_1k_output=0.0,
supports_vision=False,
max_context=128000,
priority=4,
),
]
# Route by use case
router = AIRouter(PROVIDER_CONFIG)
# Critical user-facing: fastest provider first
user_facing = router.select("gpt-4o", strategy="latency")
# Batch summarization: cheapest provider first
batch_job = router.select("gemini-2.0-flash", strategy="cost")
# Internal tools: free local model (note a cloud endpoint would also need to
# list llama-3.1-70b among its models to act as a fallback here)
internal = router.select("llama-3.1-70b", strategy="fallback")
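To sanity-check a cost-based decision, it helps to price a concrete request. A small helper, using GPT-4o's roughly $2.50/1M and Gemini 2.0 Flash's roughly $0.10/1M input pricing expressed per 1K tokens (the 2,000/500 token counts are made up for illustration):

```python
def request_cost(cost_per_1k_input: float, cost_per_1k_output: float,
                 input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at per-1K-token rates."""
    return ((input_tokens / 1000) * cost_per_1k_input
            + (output_tokens / 1000) * cost_per_1k_output)

# A request with 2,000 input tokens and 500 output tokens:
gpt4o = request_cost(0.0025, 0.010, 2000, 500)    # $0.005 + $0.005  = $0.010
flash = request_cost(0.0001, 0.0004, 2000, 500)   # $0.0002 + $0.0002 = $0.0004
assert round(gpt4o / flash) == 25  # matches the "25x cheaper" note above
```

At a few thousand batch requests per day, that ratio is the difference between tens of dollars and around a dollar of spend.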
Model Capability Matching
Not every model supports every feature. The router must match request requirements to model capabilities:
# Automatic capability-based routing
requirements = {
"vision": True, # Request includes images
"json_mode": True, # Needs structured JSON output
"min_context": 200000, # Needs 200K+ context window
}
# Router automatically filters to endpoints matching ALL requirements
# In this case: only Google Gemini supports vision + 200K+ context
endpoints = router.select(
model="auto", # Let gateway pick best model
strategy="cost",
requirements=requirements
)
# Returns: [google-flash] (only one matching all requirements)
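The requirement checks are simple enough to exercise in isolation. A self-contained sketch of the same filtering (the capability dicts mirror the PROVIDER_CONFIG above, but the helper and its names are illustrative):

```python
def matches(caps: dict, reqs: dict) -> bool:
    """True if an endpoint's capabilities satisfy every stated requirement."""
    if reqs.get("vision") and not caps.get("vision"):
        return False
    if reqs.get("json_mode") and not caps.get("json_mode"):
        return False
    if reqs.get("min_context", 0) > caps.get("max_context", 0):
        return False
    return True

# Capability flags mirroring the earlier provider configuration
endpoints = {
    "openai-gpt4o": {"vision": True, "json_mode": True, "max_context": 128_000},
    "google-flash": {"vision": True, "json_mode": True, "max_context": 1_000_000},
    "local-llama":  {"vision": False, "json_mode": True, "max_context": 128_000},
}
reqs = {"vision": True, "json_mode": True, "min_context": 200_000}
survivors = [name for name, caps in endpoints.items() if matches(caps, reqs)]
assert survivors == ["google-flash"]  # only Gemini clears vision + 200K context
```

Because every requirement must hold, adding a new capability flag is a one-line change to both the endpoint dicts and the filter.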
Key Takeaways
- Combine multiple routing strategies: fallback for reliability, cost-based for batch jobs, latency-based for user-facing requests.
- Always configure at least two providers in fallback chains to handle outages transparently.
- Track endpoint health with consecutive failure counts and automatic circuit breaking after 3 failures.
- Routing non-critical tasks from GPT-4o to Gemini Flash can cut input-token costs by roughly 25x.
- Capability matching prevents runtime errors by filtering endpoints based on vision, context length, and JSON mode support.
What's Next
In the next lesson, we will build rate limiting and quota management — preventing any single team from exhausting your organization's API limits while ensuring fair allocation across all consumers.
Lilly Tech Systems