# Edge AI Architecture
Running AI models directly on edge devices — phones, cameras, sensors, drones — eliminates the round trip to the cloud. The result is inference latency measured in milliseconds rather than hundreds of milliseconds, data that never leaves the device, zero bandwidth costs, and systems that work with no internet connection. This lesson covers when and why to deploy at the edge, how to choose hardware, and the architecture patterns that make it work in production.
## Why Edge AI
Cloud inference has served AI well, but five forces are pushing inference to the edge:
| Factor | Cloud Inference | Edge Inference |
|---|---|---|
| Latency | 50-500ms round trip (network + inference) | 1-50ms local inference, no network hop |
| Privacy | Data leaves the device, regulatory risk | Data stays on device, GDPR/HIPAA friendly |
| Bandwidth | Streaming video/audio to cloud is expensive | Only send results, not raw data (100x savings) |
| Cost | Per-request API costs scale linearly | Fixed hardware cost, zero per-inference cost |
| Availability | No internet = no AI | Works offline, 100% uptime |
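To make the cost row concrete, here is a back-of-envelope breakeven calculation. All prices are illustrative assumptions, not quotes from any specific provider:

```python
def breakeven_requests(cloud_cost_per_1k: float, hardware_cost: float) -> int:
    """Number of inferences at which a fixed-cost edge device beats
    per-request cloud pricing (ignoring power and maintenance)."""
    return int(hardware_cost * 1000 / cloud_cost_per_1k)

# Illustrative numbers: $0.50 per 1K cloud inferences vs a $199 edge board
n = breakeven_requests(cloud_cost_per_1k=0.50, hardware_cost=199.0)  # 398,000 inferences

# A single camera running 10 inferences/sec crosses breakeven in ~11 hours
seconds_to_breakeven = n / 10  # 39,800 seconds
```

At continuous video rates, per-request cloud pricing is overtaken by edge hardware within the first day of operation, which is why the table calls the edge cost "fixed".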
## Edge vs Cloud vs Hybrid: Decision Framework
Not everything belongs at the edge. Use this decision framework to choose the right deployment model:
```python
# Edge AI deployment decision framework
from dataclasses import dataclass

@dataclass
class UseCase:
    name: str
    max_latency_ms: float
    data_type: str
    data_rate_mbps: float
    model_params: int
    connectivity: str

def choose_deployment(use_case: UseCase) -> str:
    """
    Returns: 'edge', 'cloud', or 'hybrid' with rationale.
    """
    # Factor 1: Latency requirement
    if use_case.max_latency_ms < 50:
        deployment = "edge"    # Cloud can't meet this consistently
    # Factor 2: Data sensitivity
    elif use_case.data_type in ["medical_images", "biometrics", "financial"]:
        deployment = "edge"    # Regulatory: data must not leave device
    # Factor 3: Bandwidth constraints
    elif use_case.data_rate_mbps > 10:
        deployment = "edge"    # Too expensive to stream to cloud
    # Factor 4: Model complexity
    elif use_case.model_params > 1_000_000_000:  # >1B parameters
        deployment = "cloud"   # Too large for edge hardware
    # Factor 5: Connectivity
    elif use_case.connectivity == "intermittent":
        deployment = "hybrid"  # Edge primary, cloud when available
    else:
        deployment = "cloud"   # Default: simpler to manage
    return deployment

# Real examples:
print(choose_deployment(UseCase(
    name="Factory defect detection",
    max_latency_ms=20,         # Assembly line speed
    data_type="camera_feed",
    data_rate_mbps=25,         # 4K camera stream
    model_params=5_000_000,    # MobileNet-v3
    connectivity="reliable",
)))  # -> "edge" (latency + bandwidth)

print(choose_deployment(UseCase(
    name="Document summarization",
    max_latency_ms=5000,             # User can wait
    data_type="text",
    data_rate_mbps=0.01,             # Small text payloads
    model_params=7_000_000_000,      # 7B LLM
    connectivity="reliable",
)))  # -> "cloud" (model too large for most edge devices)
```
## Edge Hardware Landscape
Choosing the right hardware is the first architecture decision. Here is the current landscape with representative specifications:
| Device | AI Performance | Power | Price | Best For |
|---|---|---|---|---|
| NVIDIA Jetson Orin Nano | 40 TOPS (INT8) | 7-15W | $199 | Multi-camera vision, robotics, complex models |
| Google Coral Dev Board | 4 TOPS (INT8) | 2-4W | $129 | Single-model classification, low-power deployments |
| Raspberry Pi 5 + Hailo-8L | 13 TOPS (INT8) | 5-12W | $100 | Prototyping, hobbyist, cost-sensitive production |
| iPhone 15 (A16 Neural Engine) | 17 TOPS | 1-3W (AI) | N/A | On-device mobile AI, CoreML models |
| Qualcomm QCS6490 | 12 TOPS | 5-10W | $50-80 | Smart cameras, always-on AI, industrial IoT |
| ESP32-S3 (MCU) | ~0.01 TOPS | 0.1-0.5W | $3 | Keyword detection, simple sensor classification |
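A quick way to sanity-check a device against the table is to estimate the compute your workload actually needs. This is a rough sizing sketch, not a benchmark: the 30% utilization factor is a common rule of thumb (accelerators rarely sustain their peak TOPS), and the MobileNetV3 figure is an approximate published number:

```python
def required_tops(model_gflops_per_inference: float, fps: float,
                  utilization: float = 0.3) -> float:
    """Estimate the accelerator TOPS a workload needs:
    (GFLOPs per inference x inferences/sec), derated by realistic utilization."""
    ops_per_sec = model_gflops_per_inference * 1e9 * fps
    return ops_per_sec / utilization / 1e12

# MobileNetV3-Large is roughly 0.22 GFLOPs per 224x224 image
tops_needed = required_tops(0.22, fps=30)  # ~0.022 TOPS for one 30 FPS stream
```

Even with the derating, a single MobileNet-class stream needs a tiny fraction of a 4 TOPS Coral board, which is why the table reserves the 40 TOPS Jetson class for multi-camera or larger-model workloads.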
## Edge AI Architecture Patterns
Production edge AI systems follow one of three architecture patterns. Each handles the relationship between the edge device and the cloud differently:
```python
# Pattern 1: Full Edge (no cloud dependency)
# Use when: privacy-critical, no connectivity, latency < 10ms
#
# [Camera] -> [Edge Device] -> [Local Action]
#                  |
#           [Model + Logic]
#           [Local Storage]

class FullEdgeArchitecture:
    def __init__(self, model_path: str):
        self.model = load_model(model_path)        # TFLite or ONNX runtime wrapper
        self.local_db = SQLite("edge_results.db")  # On-device storage

    def process(self, frame):
        result = self.model.predict(frame)   # 5-20ms
        self.local_db.store(result)          # Local logging
        if result.confidence > 0.9:
            self.trigger_action(result)      # GPIO, alert, etc.
        return result
```
```python
# Pattern 2: Edge + Cloud Sync (hybrid)
# Use when: edge for real-time, cloud for analytics/retraining
#
# [Camera] -> [Edge Device] -> [Local Action]
#                  |
#           [Model + Logic] -> [Data Queue]
#                                   |
#                             [Cloud Sync] (periodic)
#                                   |
#                            [Cloud Analytics]

import time

class HybridEdgeArchitecture:
    def __init__(self, model_path: str, device_id: str, sync_interval: int = 300):
        self.model = load_model(model_path)
        self.queue = PersistentQueue("sync_queue.db")  # Survives reboots
        self.device_id = device_id
        self.sync_interval = sync_interval  # seconds

    def process(self, frame):
        result = self.model.predict(frame)
        self.trigger_action(result)
        # Queue metadata for cloud sync (not raw data)
        self.queue.push({
            "timestamp": time.time(),
            "prediction": result.label,
            "confidence": result.confidence,
            "device_id": self.device_id,
        })
        return result

    async def sync_to_cloud(self):
        """Called every sync_interval seconds when online."""
        batch = self.queue.pop_batch(max_size=1000)
        if batch:
            await cloud_api.upload(batch)
```
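The pattern above leaves open how `sync_to_cloud` gets called on its interval. One minimal way to drive it is a generic periodic runner; this is a sketch (`run_periodically` and its parameters are names introduced here, not part of any framework), and it swallows sync failures deliberately because a persistent queue keeps the data until the cloud is reachable again:

```python
import asyncio
from typing import Awaitable, Callable, Optional

async def run_periodically(
    sync_fn: Callable[[], Awaitable[None]],
    interval_s: float,
    max_runs: Optional[int] = None,
) -> None:
    """Await an async task (e.g. arch.sync_to_cloud) every interval_s seconds."""
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            await sync_fn()
        except Exception:
            # Cloud unreachable: skip this round; the persistent queue keeps the data
            pass
        runs += 1
        await asyncio.sleep(interval_s)

# Usage sketch: asyncio.run(run_periodically(arch.sync_to_cloud, arch.sync_interval))
```

In production you would typically add jitter and exponential backoff so a fleet of devices does not retry in lockstep after an outage.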
```python
# Pattern 3: Edge Preprocessing + Cloud Inference
# Use when: model too large for edge, but you want to reduce bandwidth
#
# [Camera] -> [Edge Device] -> [Cloud API] -> [Result]
#                  |
#           [Preprocessing]
#        (resize, crop, filter)

class EdgePreprocessArchitecture:
    def __init__(self):
        self.preprocessor = EdgePreprocessor()  # Runs on device

    def process(self, frame):
        # Edge: reduce a 4K frame (~12MB) to a 224x224 crop (~50KB)
        processed = self.preprocessor.run(frame)  # ~2ms
        # Cloud: run the large model on the small payload
        result = cloud_api.predict(processed)     # 100-200ms
        return result
```
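The bandwidth win from edge preprocessing is easy to verify with raw pixel counts. This sketch uses uncompressed RGB sizes (the 12MB/50KB figures in the pattern's comments assume compressed frames, so the exact ratio differs, but the order of magnitude holds either way):

```python
def raw_frame_bytes(width: int, height: int, bytes_per_pixel: int = 3) -> int:
    """Uncompressed size of an RGB frame (3 bytes per pixel)."""
    return width * height * bytes_per_pixel

full_4k = raw_frame_bytes(3840, 2160)  # 24,883,200 bytes (~23.7 MB)
crop = raw_frame_bytes(224, 224)       # 150,528 bytes (~147 KB)
reduction = full_4k / crop             # ~165x before any JPEG compression
```

JPEG-compressing the crop before upload pushes the reduction well past two orders of magnitude, which is what makes cloud inference on a metered or cellular link affordable.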
## Real-World Use Cases
Edge AI is already in production across these industries. Each use case maps to a specific architecture pattern:
| Industry | Use Case | Pattern | Hardware | Why Edge |
|---|---|---|---|---|
| Manufacturing | Defect detection on assembly line | Full Edge | Jetson Orin | 20ms latency at line speed, no cloud dependency |
| Retail | Shelf inventory monitoring | Hybrid | Coral Dev Board | Real-time alerts + nightly cloud sync for analytics |
| Healthcare | Patient fall detection | Full Edge | Qualcomm SoC | HIPAA: video never leaves facility |
| Agriculture | Crop disease classification | Hybrid | RPi 5 + Hailo | No WiFi in fields, sync when in range |
| Automotive | Driver drowsiness detection | Full Edge | Qualcomm SoC | Safety-critical: cannot depend on connectivity |
| Smart Home | Person/pet detection on doorbell | Edge Preprocess | Custom ASIC | Privacy: video processed locally, only events sent |
## Key Takeaways
- Edge AI eliminates cloud round trips, providing sub-50ms inference latency, full data privacy, zero bandwidth costs, and offline operation.
- Use the decision framework: edge for latency-critical, privacy-sensitive, or bandwidth-heavy workloads; cloud for large models or simple deployments; hybrid when you need both.
- Hardware ranges from $3 microcontrollers (keyword detection) to $199 Jetson Orin (40 TOPS multi-camera vision). Start with RPi 5 + Hailo for prototyping.
- Three architecture patterns: Full Edge (no cloud), Hybrid (edge real-time + cloud analytics), and Edge Preprocessing (reduce data before cloud inference).
- The deployment pipeline (updates, monitoring, rollback) is harder than the model itself — plan for it from day one.
## What Is Next
In the next lesson, we will cover model optimization for edge — how to take a cloud-sized model and make it 4-10x smaller using quantization, pruning, and knowledge distillation while keeping accuracy within 1-2% of the original.
Lilly Tech Systems