On-Device Deployment (Advanced)

Running language models directly on user devices (smartphones, laptops, and edge servers) eliminates per-request cloud costs, removes network round-trip latency, keeps user data on the device, and enables offline operation. This lesson covers the frameworks, optimization techniques, and deployment patterns for running SLMs on-device.

Deployment Frameworks

Framework     | Platform                      | Best For
--------------|-------------------------------|-----------------------------------------------------------
llama.cpp     | Desktop, server, mobile (C++) | CPU inference, GGUF models, widest hardware support
MLC-LLM       | Mobile, desktop, browser      | GPU-accelerated mobile inference, cross-platform
MediaPipe LLM | Android, iOS, web             | Google models (Gemma), easy integration, production-ready
WebLLM        | Browser (WebGPU)              | In-browser inference, no server needed, web applications
Ollama        | Desktop (Mac, Linux, Windows) | Easy local deployment, model management, API compatibility

Deployment Patterns

  1. Fully On-Device

    The model runs entirely on the user's device with no server communication. Best for privacy-sensitive applications, offline scenarios, and reducing cloud costs to zero.

  2. Hybrid (On-Device + Cloud)

    Use a small on-device model for simple tasks and latency-sensitive operations, falling back to a cloud-hosted large model for complex reasoning. This balances cost, speed, and capability.

  3. Edge Server

    Deploy models on local network servers (in retail stores, factories, or offices) to serve multiple devices while keeping data within organizational boundaries.
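The hybrid pattern above hinges on a routing decision: which requests stay on-device and which go to the cloud. A minimal sketch in shell, using prompt length as a stand-in for task complexity; the 50-word threshold and the endpoints named in the comments are illustrative assumptions, not part of any framework.

```shell
# Route a prompt: short/simple requests stay on-device, long ones go to cloud.
# Threshold and endpoints are illustrative placeholders, not real defaults.
route() {
  words=$(printf '%s' "$1" | wc -w)
  if [ "$words" -le 50 ]; then
    echo "local"   # would POST to http://localhost:11434/v1/chat/completions
  else
    echo "cloud"   # would POST to a hosted large-model endpoint instead
  fi
}

route "Summarize this sentence."            # → local
route "$(printf 'word %.0s' $(seq 1 60))"   # → cloud
```

Real routers often combine several signals (prompt length, task type, on-device model confidence); the shape of the decision stays the same.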

Quick Start: Ollama

# Install and run a model locally
ollama pull phi3:mini
ollama run phi3:mini "Explain recursion briefly"

# Use via API (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi3:mini",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
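The completion text comes back inside choices[0].message.content of the JSON response. A quick way to pull it out in shell, shown here against an inlined sample payload so it runs offline; the sed pattern is a rough sketch that only handles simple single-line replies, and a real client should use a proper JSON parser such as jq.

```shell
# Extract the assistant reply from an OpenAI-compatible response body.
# Sample payload inlined for illustration; in practice, pipe curl output in.
response='{"choices":[{"message":{"role":"assistant","content":"Hello there!"}}]}'
printf '%s' "$response" | sed -n 's/.*"content":"\([^"]*\)".*/\1/p'
# prints: Hello there!
```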

Mobile Considerations

  • Model size: Keep models under 2 GB for a good user experience. A 3B model quantized to Q4 is approximately 1.5 GB.
  • First-token latency: Model loading can take 2-5 seconds on mobile. Pre-load models when the app starts, not when the user sends a message.
  • Battery impact: Sustained inference drains batteries quickly. Implement token limits, caching, and consider showing a "thinking" indicator.
  • Memory pressure: Mobile OSes aggressively reclaim memory. Handle model eviction gracefully and reload when needed.
  • Thermal throttling: Extended inference causes the device to heat up and throttle CPU/GPU speeds. Monitor thermal state and adjust generation parameters.
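The 1.5 GB figure for a Q4-quantized 3B model falls out of simple arithmetic: parameters × bits per weight ÷ 8 bytes. A quick sanity check in shell; 4 bits/weight is a rough average, and real Q4 GGUF files land slightly higher because they also store quantization scales.

```shell
# Estimate quantized model file size in GB: params * bits_per_weight / 8 bytes.
estimate_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 / 1e9 }'
}

estimate_gb 3000000000 4   # prints: 1.5  (3B model at Q4 fits the 2 GB budget)
estimate_gb 7000000000 4   # prints: 3.5  (7B model at Q4 blows past it)
```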

Testing Strategy: Always test on the lowest-spec device you intend to support. A model that runs smoothly on the latest iPhone may be unusably slow on a 3-year-old Android device. Define your minimum hardware requirements early.

Next: Best Practices

In the final lesson, you will learn model selection frameworks, fine-tuning strategies, and production deployment patterns for SLMs.
