On-Device Deployment (Advanced)

Running language models directly on user devices (smartphones, laptops, and edge servers) eliminates per-request cloud costs, removes network round-trip latency, keeps user data on the device, and enables offline operation. This lesson covers the frameworks, optimization techniques, and deployment patterns for running SLMs on-device.

Deployment Frameworks

Framework     | Platform                      | Best For
--------------|-------------------------------|-----------------------------------------------------------
llama.cpp     | Desktop, server, mobile (C++) | CPU inference, GGUF models, widest hardware support
MLC-LLM       | Mobile, desktop, browser      | GPU-accelerated mobile inference, cross-platform
MediaPipe LLM | Android, iOS, web             | Google models (Gemma), easy integration, production-ready
WebLLM        | Browser (WebGPU)              | In-browser inference, no server needed, web applications
Ollama        | Desktop (Mac, Linux, Windows) | Easy local deployment, model management, API compatibility

Deployment Patterns

  1. Fully On-Device

    The model runs entirely on the user's device with no server communication. Best for privacy-sensitive applications, offline scenarios, and reducing cloud costs to zero.

  2. Hybrid (On-Device + Cloud)

    Use a small on-device model for simple tasks and latency-sensitive operations, falling back to a cloud-hosted large model for complex reasoning. This balances cost, speed, and capability.

  3. Edge Server

    Deploy models on local network servers (in retail stores, factories, or offices) to serve multiple devices while keeping data within organizational boundaries.
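The hybrid pattern above hinges on a routing decision: which requests stay on-device and which go to the cloud. A minimal sketch in shell, using prompt length as a stand-in for task complexity; the 50-word threshold and the endpoints named in the comments are illustrative assumptions, not part of any framework.

```shell
# Route a prompt: short/simple requests stay on-device, long ones go to cloud.
# Threshold and endpoints are illustrative placeholders, not real defaults.
route() {
  words=$(printf '%s' "$1" | wc -w)
  if [ "$words" -le 50 ]; then
    echo "local"   # would POST to http://localhost:11434/v1/chat/completions
  else
    echo "cloud"   # would POST to a hosted large-model endpoint instead
  fi
}

route "Summarize this sentence."            # → local
route "$(printf 'word %.0s' $(seq 1 60))"   # → cloud
```

Real routers often combine several signals (prompt length, task type, on-device model confidence); the shape of the decision stays the same.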

Quick Start: Ollama

# Install and run a model locally
ollama pull phi3:mini
ollama run phi3:mini "Explain recursion briefly"

# Use via API (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi3:mini",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
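The completion text comes back inside choices[0].message.content of the JSON response. A quick way to pull it out in shell, shown here against an inlined sample payload so it runs offline; the sed pattern is a rough sketch that only handles simple single-line replies, and a real client should use a proper JSON parser such as jq.

```shell
# Extract the assistant reply from an OpenAI-compatible response body.
# Sample payload inlined for illustration; in practice, pipe curl output in.
response='{"choices":[{"message":{"role":"assistant","content":"Hello there!"}}]}'
printf '%s' "$response" | sed -n 's/.*"content":"\([^"]*\)".*/\1/p'
# prints: Hello there!
```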

Mobile Considerations

  • Model size: Keep models under 2 GB for a good user experience. A 3B model quantized to Q4 is approximately 1.5 GB.
  • First-token latency: Model loading can take 2-5 seconds on mobile. Pre-load models when the app starts, not when the user sends a message.
  • Battery impact: Sustained inference drains batteries quickly. Implement token limits, caching, and consider showing a "thinking" indicator.
  • Memory pressure: Mobile OSes aggressively reclaim memory. Handle model eviction gracefully and reload when needed.
  • Thermal throttling: Extended inference causes the device to heat up and throttle CPU/GPU speeds. Monitor thermal state and adjust generation parameters.
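The 1.5 GB figure for a Q4-quantized 3B model falls out of simple arithmetic: parameters × bits per weight ÷ 8 bytes. A quick sanity check in shell; 4 bits/weight is a rough average, and real Q4 GGUF files land slightly higher because they also store quantization scales.

```shell
# Estimate quantized model file size in GB: params * bits_per_weight / 8 bytes.
estimate_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 / 1e9 }'
}

estimate_gb 3000000000 4   # prints: 1.5  (3B model at Q4 fits the 2 GB budget)
estimate_gb 7000000000 4   # prints: 3.5  (7B model at Q4 blows past it)
```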

Testing Strategy: Always test on the lowest-spec device you intend to support. A model that runs smoothly on the latest iPhone may be unusably slow on a 3-year-old Android device. Define your minimum hardware requirements early.

Next: Best Practices

In the final lesson, you will learn model selection frameworks, fine-tuning strategies, and production deployment patterns for SLMs.
