ML Inference Architecture Overview
Understand the three fundamental inference patterns — batch, near-real-time, and real-time — learn the latency budgets that production systems actually enforce, and survey the inference server landscape so you can make informed architecture decisions from day one.
What Is ML Inference?
ML inference is the process of running a trained model on new input data to produce predictions. Training happens once (or periodically); inference happens millions of times per day. In most production systems, inference accounts for 80-90% of total ML compute cost.
The engineering challenge is not just getting predictions — it is getting them fast enough, reliably enough, and cheaply enough to meet your product's requirements. A recommendation model that takes 5 seconds to respond is useless in a real-time feed. A fraud detection model that runs once per hour misses transactions happening right now.
Three Inference Patterns
Every ML serving system falls into one of three categories based on latency requirements and how requests arrive.
1. Batch Inference
Run predictions on a large dataset all at once, typically on a schedule (hourly, daily). Results are stored in a database or data warehouse for later consumption.
- Latency: Minutes to hours (not user-facing)
- Examples: Nightly product recommendations, weekly churn predictions, daily fraud scoring of all transactions
- Infrastructure: Spark, AWS Batch, Kubernetes Jobs, or simple cron + GPU instances
- When to use: Predictions do not need to reflect real-time data; result freshness of hours is acceptable
```python
# Batch inference with PyTorch - process entire dataset
import torch
from torch.utils.data import DataLoader

model = torch.load("model.pt")
model.eval().to("cuda")  # move model to GPU before feeding it GPU tensors

dataset = load_dataset("s3://data/daily-users.parquet")
loader = DataLoader(dataset, batch_size=256, num_workers=4)

results = []
with torch.no_grad():
    for batch in loader:
        predictions = model(batch.to("cuda"))
        results.extend(predictions.cpu().tolist())

# Write results to database for downstream consumers
write_to_postgres(results, table="daily_predictions")
```
2. Near-Real-Time Inference (Streaming)
Process events as they arrive through a message queue or stream, with latency requirements in the range of 100ms to a few seconds. Results may be written to a database or pushed to a downstream service.
- Latency: 100ms – 5 seconds
- Examples: Transaction fraud scoring, content moderation on upload, real-time anomaly detection on metrics
- Infrastructure: Kafka + model server, AWS Lambda + SageMaker endpoint, Flink + embedded model
- When to use: Events arrive continuously; you need fresh predictions but can tolerate sub-second to low-second delays
```python
# Near-real-time inference with Kafka consumer
from kafka import KafkaConsumer, KafkaProducer
import requests, json

consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    txn = message.value
    # Call model server with <100ms timeout
    resp = requests.post(
        "http://triton:8000/v2/models/fraud/infer",
        json={"inputs": [{"name": "features", "data": txn["features"]}]},
        timeout=0.1,
    )
    score = resp.json()["outputs"][0]["data"][0]
    txn["fraud_score"] = score
    txn["is_fraud"] = score > 0.85
    producer.send("fraud-results", value=txn)
```
3. Real-Time Inference (Online)
Synchronous request-response serving where the user is actively waiting. The model prediction is part of the critical path of a user-facing API call. This is the hardest pattern to get right.
- Latency: 1ms – 1 second (depends on use case)
- Examples: Ad click prediction, search ranking, autocomplete, chatbot responses, real-time translation
- Infrastructure: Dedicated model servers (Triton, TorchServe, vLLM) behind load balancers, with GPU pools
- When to use: The prediction is on the critical user-facing path; every millisecond of latency impacts user experience or revenue
```python
# Real-time inference API with FastAPI + model server
from fastapi import FastAPI, HTTPException
import httpx, time

app = FastAPI()
MODEL_URL = "http://triton:8000/v2/models/ranker/infer"  # Triton HTTP endpoint

@app.post("/api/search")
async def search(query: str, user_id: str):
    start = time.monotonic()
    # Step 1: Retrieve candidates (target: <20ms)
    candidates = await retrieve_candidates(query, top_k=100)
    # Step 2: Score with ML model (target: <30ms)
    try:
        async with httpx.AsyncClient() as client:
            resp = await client.post(MODEL_URL, json={
                "inputs": [{
                    "name": "features",
                    "shape": [len(candidates), 128],
                    "datatype": "FP32",
                    "data": [c.features for c in candidates],
                }]
            }, timeout=0.05)
    except httpx.TimeoutException:
        # Fail fast rather than blow the latency budget
        raise HTTPException(status_code=504, detail="model server timeout")
    scores = resp.json()["outputs"][0]["data"]
    # Step 3: Re-rank and return (target: <5ms)
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    latency_ms = (time.monotonic() - start) * 1000
    if latency_ms > 100:
        log_slow_request(query, latency_ms)  # Alert on SLA breach
    return {"results": [c.to_dict() for c, _ in ranked[:10]]}
```
Latency Requirements by Use Case
Different products have dramatically different latency budgets. The following table shows real-world targets that production systems at major tech companies enforce. These numbers drive every architectural decision in your inference pipeline.
| Use Case | P50 Latency | P99 Latency | Why This Target |
|---|---|---|---|
| Ad Click Prediction | 5-10ms | 20ms | Ad auctions run in 10-100ms total; ML is one step in the pipeline |
| Search Ranking | 20-50ms | 100ms | Users expect instant results; Amazon famously linked 100ms of added latency to roughly 1% in lost sales |
| Recommendation Feed | 50-100ms | 200ms | Feed must feel instant on scroll; pre-fetch helps but ranking is real-time |
| Fraud Detection | 50-100ms | 500ms | Must complete before payment authorization timeout (typically 2-3s) |
| Chatbot / LLM | 200ms TTFT | 1s TTFT | Time-to-first-token matters more than total latency for streaming UX |
| Image Generation | 2-10s | 30s | Users accept longer waits for creative tasks; show progress indicator |
| Content Moderation | 100-500ms | 2s | Must complete before content is visible to other users |
| Autonomous Driving | <10ms | <30ms | Safety-critical; every millisecond affects stopping distance at speed |
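The P50/P99 targets above are tracked as percentiles over live traffic, not averages. As a minimal sketch (the function is illustrative, not from any particular metrics library), a nearest-rank percentile can be computed from a window of latency samples:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

# Example: a window of per-request latencies in milliseconds
latencies_ms = [12, 9, 15, 11, 230, 10, 13, 14, 12, 11]
p50 = percentile(latencies_ms, 50)  # -> 12
p99 = percentile(latencies_ms, 99)  # -> 230
```

Note how a single slow request dominates P99 while leaving P50 untouched, which is why SLAs are written against tail percentiles. Production systems typically stream samples through histograms (Prometheus, HDRHistogram-style) rather than sorting raw windows.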
The Inference Server Landscape
Choosing the right inference server is one of the most impactful decisions you will make. Here is a practical comparison of the major options as of 2025.
| Server | Best For | GPU Support | Key Features | Limitations |
|---|---|---|---|---|
| NVIDIA Triton | Multi-framework, multi-model serving | NVIDIA GPUs (CUDA) | Dynamic batching, model ensembles, concurrent model execution, multiple backends (ONNX, TensorRT, PyTorch, TF) | Complex configuration, NVIDIA-only GPU support |
| TorchServe | PyTorch models, teams already on PyTorch | NVIDIA, some CPU optimization | Native PyTorch integration, model versioning, built-in REST/gRPC, custom handlers | PyTorch-only, less optimized than Triton for multi-model |
| vLLM | LLM serving (text generation) | NVIDIA, AMD (ROCm) | PagedAttention for memory efficiency, continuous batching, OpenAI-compatible API, tensor parallelism | LLM-only, not for vision or embedding models |
| TGI (Text Generation Inference) | HuggingFace LLM serving | NVIDIA, AMD, Intel Gaudi | Flash Attention, quantization built-in, token streaming, speculative decoding | HuggingFace ecosystem dependency, LLM-focused |
| TensorFlow Serving | TensorFlow/Keras models | NVIDIA, TPU | Mature, battle-tested at Google, model versioning, batching | TensorFlow-only, ecosystem is shrinking |
| ONNX Runtime | Cross-framework optimized inference | NVIDIA, AMD, Intel, ARM, DirectML | Framework-agnostic, graph optimizations, widest hardware support | Requires ONNX export step, some ops may not convert |
| Ray Serve | Complex inference graphs, multi-step pipelines | Any (uses underlying framework) | Composable inference graphs, autoscaling, Python-native, good for ensembles | Higher overhead per request, Ray cluster management |
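Dynamic batching, listed above as a Triton feature, is worth understanding on its own: the server briefly holds incoming requests so it can run one larger, more GPU-efficient forward pass. A toy sketch of the core collection loop (this illustrates the idea only; it is not Triton's actual implementation):

```python
import queue
import time

def collect_batch(request_q, max_batch_size=8, max_wait_ms=5):
    """Block for the first request, then keep accepting requests until
    the batch is full or the oldest request has waited max_wait_ms."""
    batch = [request_q.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_ms / 1000
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # oldest request has waited long enough: flush
        try:
            batch.append(request_q.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived in time
    return batch
```

The `max_wait_ms` knob trades tail latency for throughput: every queued request can pay up to that delay, so it has to fit inside the latency budgets in the table above.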
Anatomy of an Inference Request
Understanding where time is spent in a typical inference request helps you identify optimization targets. Here is the breakdown for a real-time image classification request:
```python
# Typical latency breakdown for a single inference request
# Total budget: 50ms (P99)

# 1. Network hop: client -> load balancer -> model server
#    Latency: 1-5ms (depends on region, VPC setup)
# 2. Request deserialization (JSON/protobuf -> tensors)
#    Latency: 0.5-2ms
# 3. Preprocessing (resize, normalize, tokenize)
#    Latency: 1-10ms (CPU-bound, can be a bottleneck)
# 4. GPU transfer (CPU RAM -> GPU VRAM)
#    Latency: 0.1-1ms (PCIe bandwidth ~12 GB/s)
# 5. Model forward pass (the actual computation)
#    Latency: 2-20ms (depends on model size, batch size, GPU)
# 6. GPU transfer back (GPU VRAM -> CPU RAM)
#    Latency: 0.1-0.5ms
# 7. Postprocessing (softmax, top-k, formatting)
#    Latency: 0.5-2ms
# 8. Response serialization
#    Latency: 0.5-1ms

# Total: ~6-42ms for a well-optimized pipeline
# Key insight: preprocessing and model forward pass dominate
```
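To find where your own pipeline spends its budget, instrument each stage explicitly. A minimal sketch using a context manager (the stage names and the stand-in work are illustrative, not tied to any specific model):

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(timings, name):
    """Record wall-clock milliseconds for a named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

timings = {}
with stage(timings, "preprocess"):
    data = [x / 255.0 for x in range(1000)]  # stand-in for resize/normalize
with stage(timings, "forward"):
    score = sum(data)                         # stand-in for the model call
with stage(timings, "postprocess"):
    result = round(score, 4)

# timings now maps stage name -> elapsed ms, ready to export as metrics
```

Exporting these per-stage numbers as histograms to your metrics system turns the breakdown above from a rule of thumb into something you can alert on.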
Choosing Your Inference Pattern
Use this decision framework to select the right inference pattern for your use case:
| Question | If Yes | If No |
|---|---|---|
| Is the user actively waiting for the result? | Real-time inference | Continue below |
| Do you need predictions within seconds of an event? | Near-real-time (streaming) | Continue below |
| Can you precompute predictions on a schedule? | Batch inference | Re-evaluate requirements |
| Do you need both precomputed and real-time? | Hybrid: batch + real-time fallback | Pick one pattern |
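The hybrid row above is common in practice: serve precomputed batch results when they exist, and fall back to real-time scoring on cache misses such as brand-new users. A minimal sketch, where the cache and model interfaces are illustrative stand-ins:

```python
def recommend(user_id, batch_cache, realtime_score, features):
    """Prefer last night's precomputed recommendations; fall back to
    real-time scoring when the user is missing from the batch output."""
    cached = batch_cache.get(user_id)
    if cached is not None:
        return cached, "batch"
    return realtime_score(features), "realtime"

# Usage with stand-ins for a Redis/DB cache and a model-server call
batch_cache = {"u1": ["item-9", "item-3"]}
realtime_score = lambda feats: ["item-%d" % (sum(feats) % 10)]

recommend("u1", batch_cache, realtime_score, [1, 2])  # batch hit
recommend("u2", batch_cache, realtime_score, [1, 2])  # realtime fallback
```

The returned source tag ("batch" vs "realtime") is useful in production: logging it tells you your cache hit rate and how much load the real-time path actually carries.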
What Is Next
Now that you understand the three inference patterns, latency requirements, and the server landscape, the next lesson dives deep into Model Server Design. You will learn how TorchServe, Triton, vLLM, and TGI actually work internally — model loading, GPU memory management, multi-model serving, and how to write production Triton configurations.
Lilly Tech Systems