Beginner

ML Inference Architecture Overview

Understand the three fundamental inference patterns — batch, near-real-time, and real-time — learn the latency budgets that production systems actually enforce, and survey the inference server landscape so you can make informed architecture decisions from day one.

What Is ML Inference?

ML inference is the process of running a trained model on new input data to produce predictions. Training happens once (or periodically); inference happens millions of times per day. In many production systems, inference accounts for the majority of total ML compute cost — figures of 80-90% are commonly cited.

The engineering challenge is not just getting predictions — it is getting them fast enough, reliably enough, and cheaply enough to meet your product's requirements. A recommendation model that takes 5 seconds to respond is useless in a real-time feed. A fraud detection model that runs once per hour misses transactions happening right now.

💡
Production reality: Most ML engineering effort goes into inference, not training. You train a model once over days or weeks, but you serve it millions of times per day for months or years. Inference system design is where most of the real engineering complexity lives.

Three Inference Patterns

Every ML serving system falls into one of three categories based on latency requirements and how requests arrive.

1. Batch Inference

Run predictions on a large dataset all at once, typically on a schedule (hourly, daily). Results are stored in a database or data warehouse for later consumption.

  • Latency: Minutes to hours (not user-facing)
  • Examples: Nightly product recommendations, weekly churn predictions, daily fraud scoring of all transactions
  • Infrastructure: Spark, AWS Batch, Kubernetes Jobs, or simple cron + GPU instances
  • When to use: Predictions do not need to reflect real-time data; result freshness of hours is acceptable
# Batch inference with PyTorch - process entire dataset
import torch
from torch.utils.data import DataLoader

model = torch.load("model.pt", map_location="cuda")  # load weights onto the GPU
model.eval()

# load_dataset / write_to_postgres are stand-ins for your data-access layer
dataset = load_dataset("s3://data/daily-users.parquet")
loader = DataLoader(dataset, batch_size=256, num_workers=4)

results = []
with torch.no_grad():
    for batch in loader:
        predictions = model(batch.to("cuda"))
        results.extend(predictions.cpu().tolist())

# Write results to database for downstream consumers
write_to_postgres(results, table="daily_predictions")

2. Near-Real-Time Inference (Streaming)

Process events as they arrive through a message queue or stream, with latency requirements in the range of 100ms to a few seconds. Results may be written to a database or pushed to a downstream service.

  • Latency: 100ms – 5 seconds
  • Examples: Transaction fraud scoring, content moderation on upload, real-time anomaly detection on metrics
  • Infrastructure: Kafka + model server, AWS Lambda + SageMaker endpoint, Flink + embedded model
  • When to use: Events arrive continuously; you need fresh predictions but can tolerate sub-second to low-second delays
# Near-real-time inference with Kafka consumer
from kafka import KafkaConsumer, KafkaProducer
import requests, json

consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8"))
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8")
)

for message in consumer:
    txn = message.value
    try:
        # Call model server with a 100ms timeout so one slow request
        # cannot stall the whole consumer
        resp = requests.post(
            "http://triton:8000/v2/models/fraud/infer",
            json={"inputs": [{
                "name": "features",
                "shape": [1, len(txn["features"])],
                "datatype": "FP32",
                "data": txn["features"]
            }]},
            timeout=0.1
        )
        score = resp.json()["outputs"][0]["data"][0]
    except requests.exceptions.RequestException:
        score = None  # Route to manual review rather than drop the event
    txn["fraud_score"] = score
    txn["is_fraud"] = score is not None and score > 0.85
    producer.send("fraud-results", value=txn)

3. Real-Time Inference (Online)

Synchronous request-response serving where the user is actively waiting. The model prediction is part of the critical path of a user-facing API call. This is the hardest pattern to get right.

  • Latency: 1ms – 1 second (depends on use case)
  • Examples: Ad click prediction, search ranking, autocomplete, chatbot responses, real-time translation
  • Infrastructure: Dedicated model servers (Triton, TorchServe, vLLM) behind load balancers, with GPU pools
  • When to use: The prediction is on the critical user-facing path; every millisecond of latency impacts user experience or revenue
# Real-time inference API with FastAPI + model server
from fastapi import FastAPI, HTTPException
import httpx, time

app = FastAPI()
MODEL_URL = "http://triton:8000/v2/models/ranker/infer"  # 8000 = Triton HTTP port
# Reuse one client so each request does not pay connection-setup cost
client = httpx.AsyncClient()

@app.post("/api/search")
async def search(query: str, user_id: str):
    start = time.monotonic()

    # Step 1: Retrieve candidates (target: <20ms)
    candidates = await retrieve_candidates(query, top_k=100)

    # Step 2: Score with ML model (target: <30ms)
    try:
        resp = await client.post(MODEL_URL, json={
            "inputs": [{
                "name": "features",
                "shape": [len(candidates), 128],
                "datatype": "FP32",
                "data": [c.features for c in candidates]
            }]
        }, timeout=0.05)
    except httpx.TimeoutException:
        raise HTTPException(status_code=503, detail="Ranking model timed out")

    scores = resp.json()["outputs"][0]["data"]

    # Step 3: Re-rank and return (target: <5ms)
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])

    latency_ms = (time.monotonic() - start) * 1000
    if latency_ms > 100:
        log_slow_request(query, latency_ms)  # Alert on SLA breach

    return {"results": [c.to_dict() for c, _ in ranked[:10]]}

Latency Requirements by Use Case

Different products have dramatically different latency budgets. The following table shows real-world targets that production systems at major tech companies enforce. These numbers drive every architectural decision in your inference pipeline.

| Use Case | P50 Latency | P99 Latency | Why This Target |
|---|---|---|---|
| Ad Click Prediction | 5-10ms | 20ms | Ad auctions run in 10-100ms total; ML is one step in the pipeline |
| Search Ranking | 20-50ms | 100ms | Users expect instant results; Amazon famously linked 100ms of extra latency to ~1% of sales |
| Recommendation Feed | 50-100ms | 200ms | Feed must feel instant on scroll; pre-fetch helps but ranking is real-time |
| Fraud Detection | 50-100ms | 500ms | Must complete before payment authorization timeout (typically 2-3s) |
| Chatbot / LLM | 200ms TTFT | 1s TTFT | Time-to-first-token matters more than total latency for streaming UX |
| Image Generation | 2-10s | 30s | Users accept longer waits for creative tasks; show progress indicator |
| Content Moderation | 100-500ms | 2s | Must complete before content is visible to other users |
| Autonomous Driving | <10ms | <30ms | Safety-critical; every millisecond affects stopping distance at speed |
💡
P99 matters more than P50: Your average latency can look great while 1% of users experience terrible performance. Always design for P99 (or P99.9 for critical paths). A search engine serving 10,000 QPS at P99 = 200ms means 100 users every second wait over 200ms.
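To see why tail latency dominates, it helps to compute percentiles over a latency sample yourself. A minimal sketch in plain Python — the latency numbers are synthetic, chosen to show a healthy-looking mean hiding a bad tail:

```python
# Compute P50 / P99 over a sample of request latencies (milliseconds).
# The sample is synthetic: mostly fast requests plus a slow tail.
def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p% of the sample."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

latencies_ms = [20] * 950 + [400] * 50  # 95% fast, 5% very slow

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)

print(f"mean={mean:.0f}ms p50={p50}ms p99={p99}ms")
# The mean (39ms) and P50 (20ms) look healthy, while P99 (400ms)
# reveals that 5% of requests are 20x slower than the median.
```

This is exactly the trap the callout describes: dashboards that only chart the mean or P50 will show this system as fast.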

The Inference Server Landscape

Choosing the right inference server is one of the most impactful decisions you will make. Here is a practical comparison of the major options as of 2025.

| Server | Best For | GPU Support | Key Features | Limitations |
|---|---|---|---|---|
| NVIDIA Triton | Multi-framework, multi-model serving | NVIDIA GPUs (CUDA) | Dynamic batching, model ensembles, concurrent model execution, multiple backends (ONNX, TensorRT, PyTorch, TF) | Complex configuration, NVIDIA-only GPU support |
| TorchServe | PyTorch models, teams already on PyTorch | NVIDIA, some CPU optimization | Native PyTorch integration, model versioning, built-in REST/gRPC, custom handlers | PyTorch-only, less optimized than Triton for multi-model |
| vLLM | LLM serving (text generation) | NVIDIA, AMD (ROCm) | PagedAttention for memory efficiency, continuous batching, OpenAI-compatible API, tensor parallelism | LLM-only, not for vision or embedding models |
| TGI (Text Generation Inference) | HuggingFace LLM serving | NVIDIA, AMD, Intel Gaudi | Flash Attention, quantization built-in, token streaming, speculative decoding | HuggingFace ecosystem dependency, LLM-focused |
| TensorFlow Serving | TensorFlow/Keras models | NVIDIA, TPU | Mature, battle-tested at Google, model versioning, batching | TensorFlow-only, ecosystem is shrinking |
| ONNX Runtime | Cross-framework optimized inference | NVIDIA, AMD, Intel, ARM, DirectML | Framework-agnostic, graph optimizations, widest hardware support | Requires ONNX export step, some ops may not convert |
| Ray Serve | Complex inference graphs, multi-step pipelines | Any (uses underlying framework) | Composable inference graphs, autoscaling, Python-native, good for ensembles | Higher overhead per request, Ray cluster management |
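Several of these servers (Triton among them) speak the KServe v2 inference protocol over HTTP: the request body lists named input tensors, each with a shape and datatype. As a sketch, here is a small helper that builds such a payload — the tensor name, values, and endpoint in the comment are hypothetical:

```python
import json

def build_v2_infer_request(input_name, rows, datatype="FP32"):
    """Build a KServe v2 inference request body for a 2-D batch.

    `rows` is a list of equal-length feature rows; the tensor shape
    is inferred as [num_rows, num_features].
    """
    return {
        "inputs": [{
            "name": input_name,
            "shape": [len(rows), len(rows[0])],
            "datatype": datatype,
            "data": rows,
        }]
    }

# Hypothetical 2-row batch of 3 features each
payload = build_v2_infer_request("features", [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
body = json.dumps(payload)
# POST `body` to http://<host>:8000/v2/models/<model>/infer
print(payload["inputs"][0]["shape"])  # [2, 3]
```

Because the protocol is shared, code like this works against any v2-compatible server, which makes it easier to swap serving backends later.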

Anatomy of an Inference Request

Understanding where time is spent in a typical inference request helps you identify optimization targets. Here is the breakdown for a real-time image classification request:

# Typical latency breakdown for a single inference request
# Total budget: 50ms (P99)

# 1. Network hop: client -> load balancer -> model server
#    Latency: 1-5ms (depends on region, VPC setup)

# 2. Request deserialization (JSON/protobuf -> tensors)
#    Latency: 0.5-2ms

# 3. Preprocessing (resize, normalize, tokenize)
#    Latency: 1-10ms (CPU-bound, can be a bottleneck)

# 4. GPU transfer (CPU RAM -> GPU VRAM)
#    Latency: 0.1-1ms (effective PCIe bandwidth roughly 12-25 GB/s,
#    depending on PCIe generation and lane count)

# 5. Model forward pass (the actual computation)
#    Latency: 2-20ms (depends on model size, batch size, GPU)

# 6. GPU transfer back (GPU VRAM -> CPU RAM)
#    Latency: 0.1-0.5ms

# 7. Postprocessing (softmax, top-k, formatting)
#    Latency: 0.5-2ms

# 8. Response serialization
#    Latency: 0.5-1ms

# Total: ~6-42ms for a well-optimized pipeline
# Key insight: preprocessing and model forward pass dominate
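A simple way to get a breakdown like this for your own pipeline is to wrap each stage in a timer. A minimal standard-library sketch — the three stages here are toy placeholders for your real preprocessing, forward pass, and postprocessing:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    """Record wall-clock time for one pipeline stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

# Placeholder stages standing in for real work
with stage("preprocess"):
    x = [i / 255 for i in range(1000)]       # e.g. normalize pixels
with stage("forward"):
    y = sum(v * 0.5 for v in x)              # e.g. model forward pass
with stage("postprocess"):
    result = round(y, 3)                     # e.g. format the output

for name, ms in timings.items():
    print(f"{name:12s} {ms:8.3f} ms")
```

In a real service you would export these per-stage timings as histogram metrics rather than printing them, so P50/P99 per stage show up on a dashboard.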
💡
Optimization priority: Always profile before optimizing. In practice, the biggest wins come from (1) batching multiple requests into a single forward pass, (2) model optimization (quantization, compilation), and (3) preprocessing optimization. Network and serialization are rarely the bottleneck.
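Batching deserves a closer look, since it is usually the single biggest win: much of a model server's per-request cost is fixed overhead, so scoring many requests in one forward pass amortizes it. A toy synchronous sketch of the idea — `dummy_model` fakes a forward pass with a fixed 5ms per-invocation cost; real servers such as Triton do this asynchronously with a configurable queue delay:

```python
import time

def dummy_model(batch):
    """Stand-in for a forward pass; cost is mostly per-call, not per-item."""
    time.sleep(0.005)  # fixed 5ms overhead per invocation
    return [x * 2 for x in batch]

def serve_unbatched(requests):
    # One model invocation per request: 64 requests -> 64 x 5ms overhead
    return [dummy_model([r])[0] for r in requests]

def serve_batched(requests, max_batch=32):
    # Group requests into batches: 64 requests -> 2 x 5ms overhead
    results = []
    for i in range(0, len(requests), max_batch):
        results.extend(dummy_model(requests[i:i + max_batch]))
    return results

reqs = list(range(64))

t0 = time.perf_counter()
serve_unbatched(reqs)
unbatched = time.perf_counter() - t0

t0 = time.perf_counter()
serve_batched(reqs)
batched = time.perf_counter() - t0

print(f"unbatched: {unbatched*1000:.0f}ms, batched: {batched*1000:.0f}ms")
```

The trade-off in real systems is that waiting to fill a batch adds queueing latency, which is why production servers cap both the batch size and the maximum time a request may wait.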

Choosing Your Inference Pattern

Use this decision framework to select the right inference pattern for your use case:

| Question | If Yes | If No |
|---|---|---|
| Is the user actively waiting for the result? | Real-time inference | Continue below |
| Do you need predictions within seconds of an event? | Near-real-time (streaming) | Continue below |
| Can you precompute predictions on a schedule? | Batch inference | Re-evaluate requirements |
| Do you need both precomputed and real-time? | Hybrid: batch + real-time fallback | Pick one pattern |
💡
Hybrid is common: Many production systems use batch inference for the majority of predictions (cheap, simple) and fall back to real-time inference for cold-start users or items that were not in the batch. Netflix, for example, pre-computes recommendations but ranks them in real-time based on the current session.
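At its core, the hybrid pattern is just a lookup with a fallback. A sketch, where a dict stands in for the table your nightly batch job writes and `realtime_score` stands in for a call to a model server:

```python
# Precomputed nightly batch results, keyed by user (stand-in for a DB table)
batch_predictions = {"user_1": [101, 102, 103], "user_2": [201, 202]}

def realtime_score(user_id):
    """Stand-in for a real-time model-server call (cold-start path)."""
    return [900, 901]  # e.g. popularity- or session-based recommendations

def get_recommendations(user_id):
    # Fast path: serve precomputed batch results when available
    cached = batch_predictions.get(user_id)
    if cached is not None:
        return cached, "batch"
    # Slow path: user was not in last night's batch run (cold start)
    return realtime_score(user_id), "realtime"

print(get_recommendations("user_1"))    # ([101, 102, 103], 'batch')
print(get_recommendations("new_user"))  # ([900, 901], 'realtime')
```

Tracking how often each path is taken is worth doing in production: a rising "realtime" share means your batch job is covering less of your traffic.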

What Is Next

Now that you understand the three inference patterns, latency requirements, and the server landscape, the next lesson dives deep into Model Server Design. You will learn how TorchServe, Triton, vLLM, and TGI actually work internally — model loading, GPU memory management, multi-model serving, and how to write production Triton configurations.