Beginner

Edge AI Architecture

Running AI models directly on edge devices — phones, cameras, sensors, drones — eliminates the round trip to the cloud. The result is inference latency of a few milliseconds, data that never leaves the device, near-zero bandwidth costs, and systems that keep working with no internet connection. This lesson covers when and why to deploy at the edge, how to choose hardware, and the architecture patterns that make it work in production.

Why Edge AI

Cloud inference has served AI well, but five forces are pushing inference to the edge:

| Factor | Cloud Inference | Edge Inference |
|---|---|---|
| Latency | 50-500ms round trip (network + inference) | 1-50ms local inference, no network hop |
| Privacy | Data leaves the device, regulatory risk | Data stays on device, GDPR/HIPAA friendly |
| Bandwidth | Streaming video/audio to cloud is expensive | Only send results, not raw data (~100x savings) |
| Cost | Per-request API costs scale linearly | Fixed hardware cost, zero per-inference cost |
| Availability | No internet = no AI | Works offline, no uptime dependency on the network |
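The bandwidth factor is easy to quantify. A rough sketch with hypothetical numbers for a single 1080p camera (the stream and event rates are illustrative assumptions, not measurements):

```python
def monthly_gb(rate_mbps: float) -> float:
    """Convert a constant data rate in Mbps to GB transferred per 30-day month."""
    seconds = 30 * 24 * 3600
    return rate_mbps / 8 * seconds / 1000  # Mbps -> MB/s -> GB

# Streaming raw 1080p video to the cloud (~4 Mbps) vs. sending only
# compact detection events (~0.04 Mbps, i.e. a few KB/s of JSON).
raw = monthly_gb(4.0)
events = monthly_gb(0.04)
print(f"raw stream: {raw:.0f} GB/month, events only: {events:.2f} GB/month")
print(f"savings factor: {raw / events:.0f}x")
```

At these assumed rates the edge device sends roughly 1% of the data the raw stream would, which is where the ~100x bandwidth savings figure comes from.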

Edge vs Cloud vs Hybrid: Decision Framework

Not everything belongs at the edge. Use this decision framework to choose the right deployment model:

# Edge AI deployment decision framework

from dataclasses import dataclass

@dataclass
class UseCase:
    name: str
    max_latency_ms: float
    data_type: str
    data_rate_mbps: float
    model_params: int
    connectivity: str

def choose_deployment(use_case: UseCase) -> str:
    """
    Returns: 'edge', 'cloud', or 'hybrid'.
    """
    # Factor 1: Latency requirement
    if use_case.max_latency_ms < 50:
        deployment = "edge"  # Cloud can't meet this consistently

    # Factor 2: Data sensitivity
    elif use_case.data_type in ["medical_images", "biometrics", "financial"]:
        deployment = "edge"  # Regulatory: data must not leave device

    # Factor 3: Bandwidth constraints
    elif use_case.data_rate_mbps > 10:
        deployment = "edge"  # Too expensive to stream to cloud

    # Factor 4: Model complexity
    elif use_case.model_params > 1_000_000_000:  # >1B parameters
        deployment = "cloud"  # Too large for edge hardware

    # Factor 5: Connectivity
    elif use_case.connectivity == "intermittent":
        deployment = "hybrid"  # Edge primary, cloud when available

    else:
        deployment = "cloud"  # Default: simpler to manage

    return deployment

# Real examples:
print(choose_deployment(UseCase(
    name="Factory defect detection",
    max_latency_ms=20,        # Assembly line speed
    data_type="camera_feed",
    data_rate_mbps=25,        # 4K camera stream
    model_params=5_000_000,   # MobileNet-V3
    connectivity="reliable"
)))  # -> "edge" (latency + bandwidth)

print(choose_deployment(UseCase(
    name="Document summarization",
    max_latency_ms=5000,      # User can wait
    data_type="text",
    data_rate_mbps=0.01,      # Small text payloads
    model_params=7_000_000_000,  # 7B LLM
    connectivity="reliable"
)))  # -> "cloud" (model too large for most edge devices)

Edge Hardware Landscape

Choosing the right hardware is the first architecture decision. Here is the current landscape with real benchmarks:

| Device | AI Performance | Power | Price | Best For |
|---|---|---|---|---|
| NVIDIA Jetson Orin Nano | 40 TOPS (INT8) | 7-15W | $199 | Multi-camera vision, robotics, complex models |
| Google Coral Dev Board | 4 TOPS (INT8) | 2-4W | $129 | Single-model classification, low-power deployments |
| Raspberry Pi 5 + Hailo-8L | 13 TOPS (INT8) | 5-12W | $100 | Prototyping, hobbyist, cost-sensitive production |
| iPhone 15 (A16 Neural Engine) | 17 TOPS | 1-3W (AI) | N/A | On-device mobile AI, CoreML models |
| Qualcomm QCS6490 | 12 TOPS | 5-10W | $50-80 | Smart cameras, always-on AI, industrial IoT |
| ESP32-S3 (MCU) | ~0.01 TOPS | 0.1-0.5W | $3 | Keyword detection, simple sensor classification |
💡
Apply at work: Start prototyping on a Raspberry Pi 5 with Hailo-8L ($100 total). If your model runs there, it will run on any production edge hardware. Upgrade to Jetson Orin only if you need multi-stream video processing or models larger than 50MB.
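TOPS numbers let you sanity-check whether a model can hit your latency budget before buying hardware. A back-of-the-envelope sketch: divide the model's operation count by the accelerator's effective throughput. This gives only a compute-bound floor — real latency also depends on memory bandwidth, quantization, and runtime overhead, so the only reliable number is a benchmark on the actual device. The 30% utilization factor and the MobileNet figure below are assumptions for illustration:

```python
def estimate_latency_ms(model_gmacs: float, device_tops: float,
                        utilization: float = 0.3) -> float:
    """Rough lower-bound inference latency from compute alone.

    model_gmacs: multiply-accumulates per inference, in billions (GMACs)
    device_tops: peak INT8 throughput of the accelerator (TOPS)
    utilization: fraction of peak actually achieved (0.1-0.5 is typical)
    """
    ops = model_gmacs * 2e9                        # 1 MAC = 2 ops (multiply + add)
    effective_ops_per_s = device_tops * 1e12 * utilization
    return ops / effective_ops_per_s * 1000

# MobileNet-V3 Large is roughly 0.22 GMACs per 224x224 image.
# On a 13 TOPS Hailo-8L at an assumed 30% utilization:
print(f"{estimate_latency_ms(0.22, 13):.3f} ms")  # compute floor ~0.11 ms
```

If the compute floor alone already exceeds your latency budget, the model will not fit that device no matter how well the runtime is tuned.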

Edge AI Architecture Patterns

Production edge AI systems follow one of three architecture patterns. Each handles the relationship between the edge device and cloud differently:

# Pattern 1: Full Edge (no cloud dependency)
# Use when: privacy-critical, no connectivity, latency < 10ms
#
# [Camera] -> [Edge Device] -> [Local Action]
#                |
#          [Model + Logic]
#          [Local Storage]

class FullEdgeArchitecture:
    def __init__(self, model_path: str):
        self.model = load_model(model_path)  # TFLite or ONNX
        self.local_db = SQLite("edge_results.db")

    def process(self, frame):
        result = self.model.predict(frame)      # 5-20ms
        self.local_db.store(result)             # Local logging
        if result.confidence > 0.9:
            self.trigger_action(result)          # GPIO, alert, etc.
        return result

# Pattern 2: Edge + Cloud Sync (hybrid)
# Use when: edge for real-time, cloud for analytics/retraining
#
# [Camera] -> [Edge Device] -> [Local Action]
#                |                    |
#          [Model + Logic]      [Data Queue]
#                                     |
#                              [Cloud Sync] (periodic)
#                                     |
#                              [Cloud Analytics]

import time

class HybridEdgeArchitecture:
    def __init__(self, model_path: str, device_id: str, sync_interval: int = 300):
        self.model = load_model(model_path)
        self.queue = PersistentQueue("sync_queue.db")
        self.device_id = device_id          # Identifies this device in cloud sync payloads
        self.sync_interval = sync_interval  # seconds

    def process(self, frame):
        result = self.model.predict(frame)
        self.trigger_action(result)

        # Queue metadata for cloud sync (not raw data)
        self.queue.push({
            "timestamp": time.time(),
            "prediction": result.label,
            "confidence": result.confidence,
            "device_id": self.device_id
        })
        return result

    async def sync_to_cloud(self):
        """Called every sync_interval seconds when online."""
        batch = self.queue.pop_batch(max_size=1000)
        if batch:
            await cloud_api.upload(batch)

# Pattern 3: Edge Preprocessing + Cloud Inference
# Use when: model too large for edge, but want to reduce bandwidth
#
# [Camera] -> [Edge Device] -> [Cloud API] -> [Result]
#                |
#          [Preprocessing]
#          (resize, crop, filter)

class EdgePreprocessArchitecture:
    def __init__(self):
        self.preprocessor = EdgePreprocessor()  # Runs on device

    def process(self, frame):
        # Edge: reduce 4K frame (12MB) to 224x224 crop (50KB)
        processed = self.preprocessor.run(frame)  # 2ms
        # Cloud: run large model on small payload
        result = cloud_api.predict(processed)      # 100-200ms
        return result
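The hybrid pattern above leans on a PersistentQueue that survives reboots and extended offline periods. One possible implementation — a minimal sketch using Python's built-in sqlite3; the class name and methods mirror the pattern code, but any durable queue works:

```python
import json
import sqlite3

class PersistentQueue:
    """Durable FIFO queue backed by SQLite, so queued results
    survive device reboots and network outages."""

    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
        )
        self.db.commit()

    def push(self, item: dict) -> None:
        self.db.execute("INSERT INTO queue (payload) VALUES (?)",
                        (json.dumps(item),))
        self.db.commit()

    def pop_batch(self, max_size: int = 1000) -> list[dict]:
        rows = self.db.execute(
            "SELECT id, payload FROM queue ORDER BY id LIMIT ?",
            (max_size,)).fetchall()
        if rows:
            # Delete only after reading; a crash between read and delete
            # re-sends the batch, so the cloud side should deduplicate
            # (e.g. by device_id + timestamp).
            self.db.execute("DELETE FROM queue WHERE id <= ?", (rows[-1][0],))
            self.db.commit()
        return [json.loads(payload) for _, payload in rows]

# Usage: PersistentQueue(":memory:") for testing,
# or a real file path like "sync_queue.db" on the device.
```

The at-least-once delivery choice here is deliberate: for edge telemetry, re-uploading a batch after a crash is cheaper than silently losing it.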

Real-World Use Cases

Edge AI is already in production across these industries. Each use case maps to a specific architecture pattern:

| Industry | Use Case | Pattern | Hardware | Why Edge |
|---|---|---|---|---|
| Manufacturing | Defect detection on assembly line | Full Edge | Jetson Orin | 20ms latency at line speed, no cloud dependency |
| Retail | Shelf inventory monitoring | Hybrid | Coral Dev Board | Real-time alerts + nightly cloud sync for analytics |
| Healthcare | Patient fall detection | Full Edge | Qualcomm SoC | HIPAA: video never leaves facility |
| Agriculture | Crop disease classification | Hybrid | RPi 5 + Hailo | No WiFi in fields, sync when in range |
| Automotive | Driver drowsiness detection | Full Edge | Qualcomm SoC | Safety-critical: cannot depend on connectivity |
| Smart Home | Person/pet detection on doorbell | Edge Preprocess | Custom ASIC | Privacy: video processed locally, only events sent |
📝
Production reality: The biggest challenge in edge AI is not the model — it is the deployment pipeline. Getting a model to run on a laptop is easy. Getting it to run reliably on 10,000 devices with OTA updates, monitoring, and rollback is what separates prototypes from production. We cover this in Lessons 4-6.

Key Takeaways

  • Edge AI eliminates cloud round trips, providing sub-50ms inference latency, on-device data privacy, near-zero bandwidth costs, and offline operation.
  • Use the decision framework: edge for latency-critical, privacy-sensitive, or bandwidth-heavy workloads; cloud for large models or simple deployments; hybrid when you need both.
  • Hardware ranges from $3 microcontrollers (keyword detection) to $199 Jetson Orin (40 TOPS multi-camera vision). Start with RPi 5 + Hailo for prototyping.
  • Three architecture patterns: Full Edge (no cloud), Hybrid (edge real-time + cloud analytics), and Edge Preprocessing (reduce data before cloud inference).
  • The deployment pipeline (updates, monitoring, rollback) is harder than the model itself — plan for it from day one.

What Is Next

In the next lesson, we will cover model optimization for edge — how to take a cloud-sized model and make it 4-10x smaller using quantization, pruning, and knowledge distillation while keeping accuracy within 1-2% of the original.