Beginner

ML Inference Architecture Overview

Understand the three fundamental inference patterns — batch, near-real-time, and real-time — learn the latency budgets that production systems actually enforce, and survey the inference server landscape so you can make informed architecture decisions from day one.

What Is ML Inference?

ML inference is the process of running a trained model on new input data to produce predictions. Training happens once (or periodically); inference happens millions of times per day. In many production systems, inference accounts for the majority of total ML compute cost — figures of 80-90% are commonly cited.

The engineering challenge is not just getting predictions — it is getting them fast enough, reliably enough, and cheaply enough to meet your product's requirements. A recommendation model that takes 5 seconds to respond is useless in a real-time feed. A fraud detection model that runs once per hour misses transactions happening right now.

💡
Production reality: Most ML engineering effort goes into inference, not training. You train a model once over days or weeks, but you serve it millions of times per day for months or years. Inference system design is where most of the real engineering complexity lives.

Three Inference Patterns

Every ML serving system falls into one of three categories based on latency requirements and how requests arrive.

1. Batch Inference

Run predictions on a large dataset all at once, typically on a schedule (hourly, daily). Results are stored in a database or data warehouse for later consumption.

  • Latency: Minutes to hours (not user-facing)
  • Examples: Nightly product recommendations, weekly churn predictions, daily fraud scoring of all transactions
  • Infrastructure: Spark, AWS Batch, Kubernetes Jobs, or simple cron + GPU instances
  • When to use: Predictions do not need to reflect real-time data; result freshness of hours is acceptable
# Batch inference with PyTorch - process entire dataset
import torch
from torch.utils.data import DataLoader

model = torch.load("model.pt", map_location="cuda")  # load weights onto the GPU
model.eval()

# load_dataset / write_to_postgres are stand-ins for your data-access layer
dataset = load_dataset("s3://data/daily-users.parquet")
loader = DataLoader(dataset, batch_size=256, num_workers=4)

results = []
with torch.no_grad():
    for batch in loader:
        predictions = model(batch.to("cuda"))
        results.extend(predictions.cpu().tolist())

# Write results to database for downstream consumers
write_to_postgres(results, table="daily_predictions")

2. Near-Real-Time Inference (Streaming)

Process events as they arrive through a message queue or stream, with latency requirements in the range of 100ms to a few seconds. Results may be written to a database or pushed to a downstream service.

  • Latency: 100ms – 5 seconds
  • Examples: Transaction fraud scoring, content moderation on upload, real-time anomaly detection on metrics
  • Infrastructure: Kafka + model server, AWS Lambda + SageMaker endpoint, Flink + embedded model
  • When to use: Events arrive continuously; you need fresh predictions but can tolerate sub-second to low-second delays
# Near-real-time inference with Kafka consumer
from kafka import KafkaConsumer, KafkaProducer
import requests, json

consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8"))
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8")
)

for message in consumer:
    txn = message.value
    try:
        # Call model server with a 100ms timeout so one slow request
        # cannot stall the whole consumer
        resp = requests.post(
            "http://triton:8000/v2/models/fraud/infer",
            json={"inputs": [{
                "name": "features",
                "shape": [1, len(txn["features"])],
                "datatype": "FP32",
                "data": txn["features"]
            }]},
            timeout=0.1
        )
        score = resp.json()["outputs"][0]["data"][0]
    except requests.exceptions.RequestException:
        score = None  # Route to manual review rather than drop the event
    txn["fraud_score"] = score
    txn["is_fraud"] = score is not None and score > 0.85
    producer.send("fraud-results", value=txn)

3. Real-Time Inference (Online)

Synchronous request-response serving where the user is actively waiting. The model prediction is part of the critical path of a user-facing API call. This is the hardest pattern to get right.

  • Latency: 1ms – 1 second (depends on use case)
  • Examples: Ad click prediction, search ranking, autocomplete, chatbot responses, real-time translation
  • Infrastructure: Dedicated model servers (Triton, TorchServe, vLLM) behind load balancers, with GPU pools
  • When to use: The prediction is on the critical user-facing path; every millisecond of latency impacts user experience or revenue
# Real-time inference API with FastAPI + model server
from fastapi import FastAPI, HTTPException
import httpx, time

app = FastAPI()
MODEL_URL = "http://triton:8000/v2/models/ranker/infer"  # 8000 = Triton HTTP port
# Reuse one client so each request does not pay connection-setup cost
client = httpx.AsyncClient()

@app.post("/api/search")
async def search(query: str, user_id: str):
    start = time.monotonic()

    # Step 1: Retrieve candidates (target: <20ms)
    candidates = await retrieve_candidates(query, top_k=100)

    # Step 2: Score with ML model (target: <30ms)
    try:
        resp = await client.post(MODEL_URL, json={
            "inputs": [{
                "name": "features",
                "shape": [len(candidates), 128],
                "datatype": "FP32",
                "data": [c.features for c in candidates]
            }]
        }, timeout=0.05)
    except httpx.TimeoutException:
        raise HTTPException(status_code=503, detail="Ranking model timed out")

    scores = resp.json()["outputs"][0]["data"]

    # Step 3: Re-rank and return (target: <5ms)
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])

    latency_ms = (time.monotonic() - start) * 1000
    if latency_ms > 100:
        log_slow_request(query, latency_ms)  # Alert on SLA breach

    return {"results": [c.to_dict() for c, _ in ranked[:10]]}

Latency Requirements by Use Case

Different products have dramatically different latency budgets. The following table shows real-world targets that production systems at major tech companies enforce. These numbers drive every architectural decision in your inference pipeline.

| Use Case | P50 Latency | P99 Latency | Why This Target |
|---|---|---|---|
| Ad Click Prediction | 5-10ms | 20ms | Ad auctions run in 10-100ms total; ML is one step in the pipeline |
| Search Ranking | 20-50ms | 100ms | Users expect instant results; Amazon famously linked 100ms of extra latency to ~1% of sales |
| Recommendation Feed | 50-100ms | 200ms | Feed must feel instant on scroll; pre-fetch helps but ranking is real-time |
| Fraud Detection | 50-100ms | 500ms | Must complete before payment authorization timeout (typically 2-3s) |
| Chatbot / LLM | 200ms TTFT | 1s TTFT | Time-to-first-token matters more than total latency for streaming UX |
| Image Generation | 2-10s | 30s | Users accept longer waits for creative tasks; show progress indicator |
| Content Moderation | 100-500ms | 2s | Must complete before content is visible to other users |
| Autonomous Driving | <10ms | <30ms | Safety-critical; every millisecond affects stopping distance at speed |
💡
P99 matters more than P50: Your average latency can look great while 1% of users experience terrible performance. Always design for P99 (or P99.9 for critical paths). A search engine serving 10,000 QPS at P99 = 200ms means 100 users every second wait over 200ms.
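To see why tail latency dominates, it helps to compute percentiles over a latency sample yourself. A minimal sketch in plain Python — the latency numbers are synthetic, chosen to show a healthy-looking mean hiding a bad tail:

```python
# Compute P50 / P99 over a sample of request latencies (milliseconds).
# The sample is synthetic: mostly fast requests plus a slow tail.
def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p% of the sample."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

latencies_ms = [20] * 950 + [400] * 50  # 95% fast, 5% very slow

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)

print(f"mean={mean:.0f}ms p50={p50}ms p99={p99}ms")
# The mean (39ms) and P50 (20ms) look healthy, while P99 (400ms)
# reveals that 5% of requests are 20x slower than the median.
```

This is exactly the trap the callout describes: dashboards that only chart the mean or P50 will show this system as fast.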

The Inference Server Landscape

Choosing the right inference server is one of the most impactful decisions you will make. Here is a practical comparison of the major options as of 2025.

| Server | Best For | GPU Support | Key Features | Limitations |
|---|---|---|---|---|
| NVIDIA Triton | Multi-framework, multi-model serving | NVIDIA GPUs (CUDA) | Dynamic batching, model ensembles, concurrent model execution, multiple backends (ONNX, TensorRT, PyTorch, TF) | Complex configuration, NVIDIA-only GPU support |
| TorchServe | PyTorch models, teams already on PyTorch | NVIDIA, some CPU optimization | Native PyTorch integration, model versioning, built-in REST/gRPC, custom handlers | PyTorch-only, less optimized than Triton for multi-model |
| vLLM | LLM serving (text generation) | NVIDIA, AMD (ROCm) | PagedAttention for memory efficiency, continuous batching, OpenAI-compatible API, tensor parallelism | LLM-only, not for vision or embedding models |
| TGI (Text Generation Inference) | HuggingFace LLM serving | NVIDIA, AMD, Intel Gaudi | Flash Attention, quantization built-in, token streaming, speculative decoding | HuggingFace ecosystem dependency, LLM-focused |
| TensorFlow Serving | TensorFlow/Keras models | NVIDIA, TPU | Mature, battle-tested at Google, model versioning, batching | TensorFlow-only, ecosystem is shrinking |
| ONNX Runtime | Cross-framework optimized inference | NVIDIA, AMD, Intel, ARM, DirectML | Framework-agnostic, graph optimizations, widest hardware support | Requires ONNX export step, some ops may not convert |
| Ray Serve | Complex inference graphs, multi-step pipelines | Any (uses underlying framework) | Composable inference graphs, autoscaling, Python-native, good for ensembles | Higher overhead per request, Ray cluster management |
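Several of these servers (Triton among them) speak the KServe v2 inference protocol over HTTP: the request body lists named input tensors, each with a shape and datatype. As a sketch, here is a small helper that builds such a payload — the tensor name, values, and endpoint in the comment are hypothetical:

```python
import json

def build_v2_infer_request(input_name, rows, datatype="FP32"):
    """Build a KServe v2 inference request body for a 2-D batch.

    `rows` is a list of equal-length feature rows; the tensor shape
    is inferred as [num_rows, num_features].
    """
    return {
        "inputs": [{
            "name": input_name,
            "shape": [len(rows), len(rows[0])],
            "datatype": datatype,
            "data": rows,
        }]
    }

# Hypothetical 2-row batch of 3 features each
payload = build_v2_infer_request("features", [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
body = json.dumps(payload)
# POST `body` to http://<host>:8000/v2/models/<model>/infer
print(payload["inputs"][0]["shape"])  # [2, 3]
```

Because the protocol is shared, code like this works against any v2-compatible server, which makes it easier to swap serving backends later.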

Anatomy of an Inference Request

Understanding where time is spent in a typical inference request helps you identify optimization targets. Here is the breakdown for a real-time image classification request:

# Typical latency breakdown for a single inference request
# Total budget: 50ms (P99)

# 1. Network hop: client -> load balancer -> model server
#    Latency: 1-5ms (depends on region, VPC setup)

# 2. Request deserialization (JSON/protobuf -> tensors)
#    Latency: 0.5-2ms

# 3. Preprocessing (resize, normalize, tokenize)
#    Latency: 1-10ms (CPU-bound, can be a bottleneck)

# 4. GPU transfer (CPU RAM -> GPU VRAM)
#    Latency: 0.1-1ms (effective PCIe bandwidth roughly 12-25 GB/s,
#    depending on PCIe generation and lane count)

# 5. Model forward pass (the actual computation)
#    Latency: 2-20ms (depends on model size, batch size, GPU)

# 6. GPU transfer back (GPU VRAM -> CPU RAM)
#    Latency: 0.1-0.5ms

# 7. Postprocessing (softmax, top-k, formatting)
#    Latency: 0.5-2ms

# 8. Response serialization
#    Latency: 0.5-1ms

# Total: ~6-42ms for a well-optimized pipeline
# Key insight: preprocessing and model forward pass dominate
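A simple way to get a breakdown like this for your own pipeline is to wrap each stage in a timer. A minimal standard-library sketch — the three stages here are toy placeholders for your real preprocessing, forward pass, and postprocessing:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    """Record wall-clock time for one pipeline stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

# Placeholder stages standing in for real work
with stage("preprocess"):
    x = [i / 255 for i in range(1000)]       # e.g. normalize pixels
with stage("forward"):
    y = sum(v * 0.5 for v in x)              # e.g. model forward pass
with stage("postprocess"):
    result = round(y, 3)                     # e.g. format the output

for name, ms in timings.items():
    print(f"{name:12s} {ms:8.3f} ms")
```

In a real service you would export these per-stage timings as histogram metrics rather than printing them, so P50/P99 per stage show up on a dashboard.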
💡
Optimization priority: Always profile before optimizing. In practice, the biggest wins come from (1) batching multiple requests into a single forward pass, (2) model optimization (quantization, compilation), and (3) preprocessing optimization. Network and serialization are rarely the bottleneck.
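Batching deserves a closer look, since it is usually the single biggest win: much of a model server's per-request cost is fixed overhead, so scoring many requests in one forward pass amortizes it. A toy synchronous sketch of the idea — `dummy_model` fakes a forward pass with a fixed 5ms per-invocation cost; real servers such as Triton do this asynchronously with a configurable queue delay:

```python
import time

def dummy_model(batch):
    """Stand-in for a forward pass; cost is mostly per-call, not per-item."""
    time.sleep(0.005)  # fixed 5ms overhead per invocation
    return [x * 2 for x in batch]

def serve_unbatched(requests):
    # One model invocation per request: 64 requests -> 64 x 5ms overhead
    return [dummy_model([r])[0] for r in requests]

def serve_batched(requests, max_batch=32):
    # Group requests into batches: 64 requests -> 2 x 5ms overhead
    results = []
    for i in range(0, len(requests), max_batch):
        results.extend(dummy_model(requests[i:i + max_batch]))
    return results

reqs = list(range(64))

t0 = time.perf_counter()
serve_unbatched(reqs)
unbatched = time.perf_counter() - t0

t0 = time.perf_counter()
serve_batched(reqs)
batched = time.perf_counter() - t0

print(f"unbatched: {unbatched*1000:.0f}ms, batched: {batched*1000:.0f}ms")
```

The trade-off in real systems is that waiting to fill a batch adds queueing latency, which is why production servers cap both the batch size and the maximum time a request may wait.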

Choosing Your Inference Pattern

Use this decision framework to select the right inference pattern for your use case:

| Question | If Yes | If No |
|---|---|---|
| Is the user actively waiting for the result? | Real-time inference | Continue below |
| Do you need predictions within seconds of an event? | Near-real-time (streaming) | Continue below |
| Can you precompute predictions on a schedule? | Batch inference | Re-evaluate requirements |
| Do you need both precomputed and real-time? | Hybrid: batch + real-time fallback | Pick one pattern |
💡
Hybrid is common: Many production systems use batch inference for the majority of predictions (cheap, simple) and fall back to real-time inference for cold-start users or items that were not in the batch. Netflix, for example, pre-computes recommendations but ranks them in real-time based on the current session.
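At its core, the hybrid pattern is just a lookup with a fallback. A sketch, where a dict stands in for the table your nightly batch job writes and `realtime_score` stands in for a call to a model server:

```python
# Precomputed nightly batch results, keyed by user (stand-in for a DB table)
batch_predictions = {"user_1": [101, 102, 103], "user_2": [201, 202]}

def realtime_score(user_id):
    """Stand-in for a real-time model-server call (cold-start path)."""
    return [900, 901]  # e.g. popularity- or session-based recommendations

def get_recommendations(user_id):
    # Fast path: serve precomputed batch results when available
    cached = batch_predictions.get(user_id)
    if cached is not None:
        return cached, "batch"
    # Slow path: user was not in last night's batch run (cold start)
    return realtime_score(user_id), "realtime"

print(get_recommendations("user_1"))    # ([101, 102, 103], 'batch')
print(get_recommendations("new_user"))  # ([900, 901], 'realtime')
```

Tracking how often each path is taken is worth doing in production: a rising "realtime" share means your batch job is covering less of your traffic.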

What Is Next

Now that you understand the three inference patterns, latency requirements, and the server landscape, the next lesson dives deep into Model Server Design. You will learn how TorchServe, Triton, vLLM, and TGI actually work internally — model loading, GPU memory management, multi-model serving, and how to write production Triton configurations.