# Edge AI Architecture
Running AI models directly on edge devices — phones, cameras, sensors, drones — eliminates the round trip to the cloud. The result is inference latency measured in milliseconds rather than hundreds of milliseconds, data that never leaves the device, zero bandwidth costs, and systems that work with no internet connection. This lesson covers when and why to deploy at the edge, how to choose hardware, and the architecture patterns that make it work in production.
## Why Edge AI
Cloud inference has served AI well, but five forces are pushing inference to the edge:
| Factor | Cloud Inference | Edge Inference |
|---|---|---|
| Latency | 50-500ms round trip (network + inference) | 1-50ms local inference, no network hop |
| Privacy | Data leaves the device, regulatory risk | Data stays on device, GDPR/HIPAA friendly |
| Bandwidth | Streaming video/audio to cloud is expensive | Only send results, not raw data (100x savings) |
| Cost | Per-request API costs scale linearly | Fixed hardware cost, zero per-inference cost |
| Availability | No internet = no AI | Works offline, 100% uptime |
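To make the cost row concrete, here is a back-of-envelope breakeven calculation. All prices are illustrative assumptions, not quotes from any specific provider:

```python
def breakeven_requests(cloud_cost_per_1k: float, hardware_cost: float) -> int:
    """Number of inferences at which a fixed-cost edge device beats
    per-request cloud pricing (ignoring power and maintenance)."""
    return int(hardware_cost * 1000 / cloud_cost_per_1k)

# Illustrative numbers: $0.50 per 1K cloud inferences vs a $199 edge board
n = breakeven_requests(cloud_cost_per_1k=0.50, hardware_cost=199.0)  # 398,000 inferences

# A single camera running 10 inferences/sec crosses breakeven in ~11 hours
seconds_to_breakeven = n / 10  # 39,800 seconds
```

At continuous video rates, per-request cloud pricing is overtaken by edge hardware within the first day of operation, which is why the table calls the edge cost "fixed".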
## Edge vs Cloud vs Hybrid: Decision Framework
Not everything belongs at the edge. Use this decision framework to choose the right deployment model:
```python
# Edge AI deployment decision framework
from dataclasses import dataclass

@dataclass
class UseCase:
    name: str
    max_latency_ms: float
    data_type: str
    data_rate_mbps: float
    model_params: int
    connectivity: str

def choose_deployment(use_case: UseCase) -> str:
    """
    Returns: 'edge', 'cloud', or 'hybrid' with rationale.
    """
    # Factor 1: Latency requirement
    if use_case.max_latency_ms < 50:
        deployment = "edge"    # Cloud can't meet this consistently
    # Factor 2: Data sensitivity
    elif use_case.data_type in ["medical_images", "biometrics", "financial"]:
        deployment = "edge"    # Regulatory: data must not leave device
    # Factor 3: Bandwidth constraints
    elif use_case.data_rate_mbps > 10:
        deployment = "edge"    # Too expensive to stream to cloud
    # Factor 4: Model complexity
    elif use_case.model_params > 1_000_000_000:  # >1B parameters
        deployment = "cloud"   # Too large for edge hardware
    # Factor 5: Connectivity
    elif use_case.connectivity == "intermittent":
        deployment = "hybrid"  # Edge primary, cloud when available
    else:
        deployment = "cloud"   # Default: simpler to manage
    return deployment

# Real examples:
print(choose_deployment(UseCase(
    name="Factory defect detection",
    max_latency_ms=20,         # Assembly line speed
    data_type="camera_feed",
    data_rate_mbps=25,         # 4K camera stream
    model_params=5_000_000,    # MobileNet-v3
    connectivity="reliable",
)))  # -> "edge" (latency + bandwidth)

print(choose_deployment(UseCase(
    name="Document summarization",
    max_latency_ms=5000,             # User can wait
    data_type="text",
    data_rate_mbps=0.01,             # Small text payloads
    model_params=7_000_000_000,      # 7B LLM
    connectivity="reliable",
)))  # -> "cloud" (model too large for most edge devices)
```
## Edge Hardware Landscape
Choosing the right hardware is the first architecture decision. Here is the current landscape with representative specifications:
| Device | AI Performance | Power | Price | Best For |
|---|---|---|---|---|
| NVIDIA Jetson Orin Nano | 40 TOPS (INT8) | 7-15W | $199 | Multi-camera vision, robotics, complex models |
| Google Coral Dev Board | 4 TOPS (INT8) | 2-4W | $129 | Single-model classification, low-power deployments |
| Raspberry Pi 5 + Hailo-8L | 13 TOPS (INT8) | 5-12W | $100 | Prototyping, hobbyist, cost-sensitive production |
| iPhone 15 (A16 Neural Engine) | 17 TOPS | 1-3W (AI) | N/A | On-device mobile AI, CoreML models |
| Qualcomm QCS6490 | 12 TOPS | 5-10W | $50-80 | Smart cameras, always-on AI, industrial IoT |
| ESP32-S3 (MCU) | ~0.01 TOPS | 0.1-0.5W | $3 | Keyword detection, simple sensor classification |
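A quick way to sanity-check a device against the table is to estimate the compute your workload actually needs. This is a rough sizing sketch, not a benchmark: the 30% utilization factor is a common rule of thumb (accelerators rarely sustain their peak TOPS), and the MobileNetV3 figure is an approximate published number:

```python
def required_tops(model_gflops_per_inference: float, fps: float,
                  utilization: float = 0.3) -> float:
    """Estimate the accelerator TOPS a workload needs:
    (GFLOPs per inference x inferences/sec), derated by realistic utilization."""
    ops_per_sec = model_gflops_per_inference * 1e9 * fps
    return ops_per_sec / utilization / 1e12

# MobileNetV3-Large is roughly 0.22 GFLOPs per 224x224 image
tops_needed = required_tops(0.22, fps=30)  # ~0.022 TOPS for one 30 FPS stream
```

Even with the derating, a single MobileNet-class stream needs a tiny fraction of a 4 TOPS Coral board, which is why the table reserves the 40 TOPS Jetson class for multi-camera or larger-model workloads.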
## Edge AI Architecture Patterns
Production edge AI systems follow one of three architecture patterns. Each handles the relationship between the edge device and the cloud differently:
```python
# Pattern 1: Full Edge (no cloud dependency)
# Use when: privacy-critical, no connectivity, latency < 10ms
#
# [Camera] -> [Edge Device] -> [Local Action]
#                  |
#           [Model + Logic]
#           [Local Storage]

class FullEdgeArchitecture:
    def __init__(self, model_path: str):
        self.model = load_model(model_path)        # TFLite or ONNX runtime wrapper
        self.local_db = SQLite("edge_results.db")  # On-device storage

    def process(self, frame):
        result = self.model.predict(frame)   # 5-20ms
        self.local_db.store(result)          # Local logging
        if result.confidence > 0.9:
            self.trigger_action(result)      # GPIO, alert, etc.
        return result
```
```python
# Pattern 2: Edge + Cloud Sync (hybrid)
# Use when: edge for real-time, cloud for analytics/retraining
#
# [Camera] -> [Edge Device] -> [Local Action]
#                  |
#           [Model + Logic] -> [Data Queue]
#                                   |
#                             [Cloud Sync] (periodic)
#                                   |
#                            [Cloud Analytics]

import time

class HybridEdgeArchitecture:
    def __init__(self, model_path: str, device_id: str, sync_interval: int = 300):
        self.model = load_model(model_path)
        self.queue = PersistentQueue("sync_queue.db")  # Survives reboots
        self.device_id = device_id
        self.sync_interval = sync_interval  # seconds

    def process(self, frame):
        result = self.model.predict(frame)
        self.trigger_action(result)
        # Queue metadata for cloud sync (not raw data)
        self.queue.push({
            "timestamp": time.time(),
            "prediction": result.label,
            "confidence": result.confidence,
            "device_id": self.device_id,
        })
        return result

    async def sync_to_cloud(self):
        """Called every sync_interval seconds when online."""
        batch = self.queue.pop_batch(max_size=1000)
        if batch:
            await cloud_api.upload(batch)
```
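The pattern above leaves open how `sync_to_cloud` gets called on its interval. One minimal way to drive it is a generic periodic runner; this is a sketch (`run_periodically` and its parameters are names introduced here, not part of any framework), and it swallows sync failures deliberately because a persistent queue keeps the data until the cloud is reachable again:

```python
import asyncio
from typing import Awaitable, Callable, Optional

async def run_periodically(
    sync_fn: Callable[[], Awaitable[None]],
    interval_s: float,
    max_runs: Optional[int] = None,
) -> None:
    """Await an async task (e.g. arch.sync_to_cloud) every interval_s seconds."""
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            await sync_fn()
        except Exception:
            # Cloud unreachable: skip this round; the persistent queue keeps the data
            pass
        runs += 1
        await asyncio.sleep(interval_s)

# Usage sketch: asyncio.run(run_periodically(arch.sync_to_cloud, arch.sync_interval))
```

In production you would typically add jitter and exponential backoff so a fleet of devices does not retry in lockstep after an outage.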
```python
# Pattern 3: Edge Preprocessing + Cloud Inference
# Use when: model too large for edge, but you want to reduce bandwidth
#
# [Camera] -> [Edge Device] -> [Cloud API] -> [Result]
#                  |
#           [Preprocessing]
#        (resize, crop, filter)

class EdgePreprocessArchitecture:
    def __init__(self):
        self.preprocessor = EdgePreprocessor()  # Runs on device

    def process(self, frame):
        # Edge: reduce a 4K frame (~12MB) to a 224x224 crop (~50KB)
        processed = self.preprocessor.run(frame)  # ~2ms
        # Cloud: run the large model on the small payload
        result = cloud_api.predict(processed)     # 100-200ms
        return result
```
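The bandwidth win from edge preprocessing is easy to verify with raw pixel counts. This sketch uses uncompressed RGB sizes (the 12MB/50KB figures in the pattern's comments assume compressed frames, so the exact ratio differs, but the order of magnitude holds either way):

```python
def raw_frame_bytes(width: int, height: int, bytes_per_pixel: int = 3) -> int:
    """Uncompressed size of an RGB frame (3 bytes per pixel)."""
    return width * height * bytes_per_pixel

full_4k = raw_frame_bytes(3840, 2160)  # 24,883,200 bytes (~23.7 MB)
crop = raw_frame_bytes(224, 224)       # 150,528 bytes (~147 KB)
reduction = full_4k / crop             # ~165x before any JPEG compression
```

JPEG-compressing the crop before upload pushes the reduction well past two orders of magnitude, which is what makes cloud inference on a metered or cellular link affordable.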
## Real-World Use Cases
Edge AI is already in production across these industries. Each use case maps to a specific architecture pattern:
| Industry | Use Case | Pattern | Hardware | Why Edge |
|---|---|---|---|---|
| Manufacturing | Defect detection on assembly line | Full Edge | Jetson Orin | 20ms latency at line speed, no cloud dependency |
| Retail | Shelf inventory monitoring | Hybrid | Coral Dev Board | Real-time alerts + nightly cloud sync for analytics |
| Healthcare | Patient fall detection | Full Edge | Qualcomm SoC | HIPAA: video never leaves facility |
| Agriculture | Crop disease classification | Hybrid | RPi 5 + Hailo | No WiFi in fields, sync when in range |
| Automotive | Driver drowsiness detection | Full Edge | Qualcomm SoC | Safety-critical: cannot depend on connectivity |
| Smart Home | Person/pet detection on doorbell | Edge Preprocess | Custom ASIC | Privacy: video processed locally, only events sent |
## Key Takeaways
- Edge AI eliminates cloud round trips, providing sub-50ms inference latency, full data privacy, zero bandwidth costs, and offline operation.
- Use the decision framework: edge for latency-critical, privacy-sensitive, or bandwidth-heavy workloads; cloud for large models or simple deployments; hybrid when you need both.
- Hardware ranges from $3 microcontrollers (keyword detection) to $199 Jetson Orin (40 TOPS multi-camera vision). Start with RPi 5 + Hailo for prototyping.
- Three architecture patterns: Full Edge (no cloud), Hybrid (edge real-time + cloud analytics), and Edge Preprocessing (reduce data before cloud inference).
- The deployment pipeline (updates, monitoring, rollback) is harder than the model itself — plan for it from day one.
## What Is Next
In the next lesson, we will cover model optimization for edge — how to take a cloud-sized model and make it 4-10x smaller using quantization, pruning, and knowledge distillation while keeping accuracy within 1-2% of the original.
Lilly Tech Systems