# On-Device Deployment (Advanced)
Running language models directly on user devices (smartphones, laptops, and edge servers) eliminates cloud inference costs, reduces network latency, keeps data on the device, and enables offline operation. This lesson covers the frameworks, optimization techniques, and deployment patterns for running SLMs on device.
## Deployment Frameworks
| Framework | Platform | Best For |
|---|---|---|
| llama.cpp | Desktop, server, mobile (C++) | CPU inference, GGUF models, widest hardware support |
| MLC-LLM | Mobile, desktop, browser | GPU-accelerated mobile inference, cross-platform |
| MediaPipe LLM | Android, iOS, web | Google models (Gemma), easy integration, production-ready |
| WebLLM | Browser (WebGPU) | In-browser inference, no server needed, web applications |
| Ollama | Desktop (Mac, Linux, Windows) | Easy local deployment, model management, API compatibility |
## Deployment Patterns
- **Fully On-Device**: The model runs entirely on the user's device with no server communication. Best for privacy-sensitive applications, offline scenarios, and reducing cloud costs to zero.
- **Hybrid (On-Device + Cloud)**: Use a small on-device model for simple tasks and latency-sensitive operations, falling back to a cloud-hosted large model for complex reasoning. This balances cost, speed, and capability.
- **Edge Server**: Deploy models on local network servers (in retail stores, factories, or offices) to serve multiple devices while keeping data within organizational boundaries.
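The hybrid pattern can be sketched as a small router that keeps short, simple prompts on device and escalates the rest to a hosted model. This is an illustrative sketch, not a production policy: the model names, the word-count threshold, and the keyword heuristic are all assumptions.

```python
# Hypothetical hybrid router; model names and heuristics are illustrative.
LOCAL_MODEL = "phi3:mini"   # small model served on-device (e.g. via Ollama)
CLOUD_MODEL = "cloud-llm"   # placeholder name for a hosted large model

# Keywords that often signal multi-step reasoning (an assumption, tune per app).
COMPLEX_HINTS = ("prove", "analyze", "step by step", "compare", "plan")

def route(prompt: str, max_local_words: int = 40) -> str:
    """Return the model that should handle this prompt."""
    text = prompt.lower()
    if any(hint in text for hint in COMPLEX_HINTS):
        return CLOUD_MODEL                  # escalate complex reasoning
    if len(prompt.split()) > max_local_words:
        return CLOUD_MODEL                  # long context: escalate
    return LOCAL_MODEL                      # fast, private, zero marginal cost

print(route("What's the capital of France?"))   # stays on device
print(route("Compare these two contract drafts"))  # escalates to cloud
```

In practice the routing signal can also come from the local model itself (for example, a low-confidence score triggers escalation), but a static heuristic like this is a common starting point.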
## Quick Start: Ollama
```bash
# Install and run a model locally
ollama pull phi3:mini
ollama run phi3:mini "Explain recursion briefly"

# Use via API (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi3:mini",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
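Because the endpoint is OpenAI-compatible, the same request can be made from Python with only the standard library. The sketch below assumes Ollama's default address (`localhost:11434`) and omits error handling; `build_chat_request` is a helper name introduced here for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, content: str) -> dict:
    """Build an OpenAI-compatible chat-completions payload."""
    return {"model": model, "messages": [{"role": "user", "content": content}]}

def chat(model: str, content: str) -> str:
    """POST the payload to a local Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(model, content)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# chat("phi3:mini", "Hello!")  # requires a running Ollama server
```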
## Mobile Considerations
- **Model size:** Keep models under 2 GB for a good user experience. A 3B model quantized to Q4 is approximately 1.5 GB.
- **First-token latency:** Model loading can take 2-5 seconds on mobile. Pre-load models when the app starts, not when the user sends a message.
- **Battery impact:** Sustained inference drains batteries quickly. Implement token limits, caching, and consider showing a "thinking" indicator.
- **Memory pressure:** Mobile OSes aggressively reclaim memory. Handle model eviction gracefully and reload when needed.
- **Thermal throttling:** Extended inference causes the device to heat up and throttle CPU/GPU speeds. Monitor thermal state and adjust generation parameters.
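Two of the mitigations above, token limits and response caching, can be sketched in a platform-neutral way. In a real app this logic would live in Kotlin or Swift next to the inference runtime; here `generate_fn` is a stub standing in for the actual on-device call, and the class and parameter names are illustrative.

```python
from collections import OrderedDict

class CachedGenerator:
    """Cap output length and cache repeated prompts to reduce battery drain.

    `generate_fn` is a stand-in for the real on-device inference call.
    """

    def __init__(self, generate_fn, max_tokens: int = 128, cache_size: int = 32):
        self.generate_fn = generate_fn
        self.max_tokens = max_tokens          # hard cap on generated tokens
        self.cache: OrderedDict[str, str] = OrderedDict()
        self.cache_size = cache_size

    def __call__(self, prompt: str) -> str:
        if prompt in self.cache:              # cache hit: no inference cost
            self.cache.move_to_end(prompt)
            return self.cache[prompt]
        reply = self.generate_fn(prompt, self.max_tokens)
        self.cache[prompt] = reply
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)    # evict least-recently-used
        return reply

# Stub inference function for illustration; records how often it runs.
calls = []
def fake_generate(prompt, max_tokens):
    calls.append(prompt)
    return f"reply[{max_tokens}]: {prompt}"

gen = CachedGenerator(fake_generate, max_tokens=64)
gen("hi")   # runs inference
gen("hi")   # served from cache; no second inference call
```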
## Next: Best Practices
In the final lesson, you will learn model selection frameworks, fine-tuning strategies, and production deployment patterns for SLMs.
Lilly Tech Systems