# Running Local LLMs

Set up and run LLMs on your own hardware — from hardware requirements and quantization to practical tools like Ollama and LM Studio.
## Hardware Requirements
| Model Size | FP16 VRAM | Q4 VRAM | Recommended GPU |
|---|---|---|---|
| 3B | 6 GB | 2 GB | Any modern GPU / CPU-only |
| 7-8B | 16 GB | 4-6 GB | RTX 3060 12GB, RTX 4060 |
| 13B | 26 GB | 8-10 GB | RTX 3090, RTX 4070 Ti |
| 34B | 68 GB | 20 GB | RTX 4090, A6000 |
| 70B | 140 GB | 40 GB | 2x RTX 4090, A100 80GB |
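The VRAM figures above follow from a simple rule of thumb: weight memory is parameter count × bits per weight ÷ 8, plus some headroom for the KV cache and activations. A minimal sketch (the 20% overhead factor is a ballpark assumption, not a measured figure):

```python
# Rough VRAM estimate: weights plus ~20% overhead for KV cache/activations.
# The overhead factor is an assumption; real usage depends on context length.

def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 0.20) -> float:
    # 1B params at 8 bits = 1 GB, so weights_gb = params * bits / 8
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * (1 + overhead)

# An 8B model at FP16 needs ~16 GB for the weights alone:
print(round(estimate_vram_gb(8, 16, overhead=0.0)))  # 16
# The same model at Q4 fits comfortably in a 6 GB card:
print(round(estimate_vram_gb(8, 4), 1))  # 4.8
```

Running the numbers for the other rows reproduces the table to within rounding, which is why a 12 GB consumer card comfortably handles Q4 13B models but not FP16 7B ones.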
## Quantization

Quantization reduces the numeric precision of model weights to cut memory use, usually with minimal quality loss:
| Format | Bits | Size Reduction | Quality Impact | Tool |
|---|---|---|---|---|
| FP16 | 16-bit | Baseline | None | Native |
| Q8 | 8-bit | ~50% | Negligible | GGUF, bitsandbytes |
| Q5 | 5-bit | ~69% | Very small | GGUF |
| Q4 | 4-bit | ~75% | Small | GGUF, GPTQ, AWQ |
| Q3 | 3-bit | ~81% | Noticeable | GGUF |
| Q2 | 2-bit | ~87% | Significant | GGUF |
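The size-reduction column follows directly from the bit width relative to the FP16 baseline; GGUF formats add a little per-block metadata (scales, minimums), so real files run slightly larger than this idealized figure:

```python
# Idealized size reduction vs FP16: 1 - bits/16.
# GGUF K-quants store extra per-block scale metadata, so actual files
# are a bit larger than this formula suggests.

def size_reduction_pct(bits: int, baseline_bits: int = 16) -> float:
    return (1 - bits / baseline_bits) * 100

for bits in (8, 5, 4, 3, 2):
    print(f"Q{bits}: ~{size_reduction_pct(bits):.0f}% smaller")
# Matches the table's ~50/69/75/81/87% figures (Q2 rounds to 88 here).
```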
## Ollama

The simplest way to run LLMs locally. One command to install and run models.
```bash
# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (downloads automatically on first run)
ollama run llama3.1       # LLaMA 3.1 8B
ollama run mistral        # Mistral 7B
ollama run codellama      # Code Llama 7B
ollama run phi3           # Phi-3 Mini 3.8B
ollama run qwen2.5:14b    # Qwen 2.5 14B

# List installed models
ollama list

# Use the API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain quantum computing in simple terms"
}'

# OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Hello!"}]
}'
```
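The same endpoint can be called from code with nothing but the standard library. A minimal sketch, assuming an Ollama server is running on its default port 11434 (the request simply fails with a message otherwise); `"stream": False` asks Ollama for a single JSON response instead of a token stream:

```python
import json
import urllib.request

# Non-streaming request to Ollama's native generate endpoint.
payload = {
    "model": "llama3.1",
    "prompt": "Explain quantum computing in simple terms",
    "stream": False,  # one JSON object back, not a stream of chunks
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
        print(body["response"])  # the generated text
except OSError as err:
    print(f"Request failed — is Ollama running? ({err})")
```

For chat-style use, any OpenAI-compatible client library can instead be pointed at `http://localhost:11434/v1`.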
## LM Studio

A desktop application with a GUI for downloading, running, and chatting with local LLMs. Supports GGUF models from Hugging Face.
- Download from lmstudio.ai
- Browse and download models from the built-in model catalog
- Chat interface with conversation history
- Local OpenAI-compatible API server
- Supports Windows, macOS, and Linux
## llama.cpp

A high-performance C/C++ inference engine for running LLMs. It powers many other tools, including Ollama.
```bash
# Clone and build (llama.cpp now uses CMake; binaries land in build/bin)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Run inference
./build/bin/llama-cli -m models/llama-3-8b-Q4_K_M.gguf \
  -p "Explain machine learning:" \
  -n 256 \
  --temp 0.7

# Start a server (OpenAI-compatible API)
./build/bin/llama-server -m models/llama-3-8b-Q4_K_M.gguf \
  --port 8080 \
  --ctx-size 4096
```
## vLLM (Production Serving)

High-throughput serving engine optimized for production workloads with PagedAttention:
```bash
# Install
pip install vllm

# Start OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
```
## Performance Benchmarks
| Model | Quantization | GPU | Tokens/sec |
|---|---|---|---|
| LLaMA 3 8B | Q4_K_M | RTX 4090 | ~80-100 |
| LLaMA 3 8B | Q4_K_M | RTX 3060 12GB | ~30-40 |
| LLaMA 3 8B | Q4_K_M | M2 Pro (CPU) | ~15-25 |
| LLaMA 3 70B | Q4_K_M | 2x RTX 4090 | ~15-25 |
| Mistral 7B | Q4_K_M | RTX 4090 | ~90-110 |
| Phi-3 Mini 3.8B | Q4_K_M | RTX 3060 12GB | ~60-80 |
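These numbers are explainable with a back-of-the-envelope model: single-stream decode is usually memory-bandwidth bound, because generating each token reads every weight once. Tokens/sec is therefore capped at roughly bandwidth ÷ model size. A sketch (the bandwidth figure and the ~50% real-world efficiency factor are rough assumptions):

```python
# Memory-bandwidth ceiling for single-stream decode:
# each token reads all weights once, so tok/s <= bandwidth / model_bytes.

def decode_ceiling(bandwidth_gb_s: float, model_size_gb: float,
                   efficiency: float = 0.5) -> float:
    # efficiency ~0.5 is an assumed fudge factor for kernel/launch overhead
    return bandwidth_gb_s / model_size_gb * efficiency

# RTX 4090: ~1008 GB/s memory bandwidth; LLaMA 3 8B at Q4_K_M is ~4.9 GB.
print(round(decode_ceiling(1008, 4.9)))  # ~103 tok/s, in line with the table
```

The same arithmetic explains why the 70B row is so much slower: ten times the bytes per token, split across two cards.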
## Privacy and Cost Benefits

### Complete Privacy

No data leaves your machine. Critical for healthcare, finance, legal, and any sensitive data processing.

### Zero API Costs

After the hardware investment, inference is free. High-volume use cases save significantly compared with per-token API pricing.

### No Rate Limits

Run as many requests as your hardware supports. No throttling, no quotas, no waiting.

### Offline Operation

Works without internet connectivity. Useful for air-gapped environments, field deployments, or unreliable networks.