Intermediate

Running Local LLMs

Set up and run LLMs on your own hardware — from hardware requirements and quantization to practical tools like Ollama and LM Studio.

Hardware Requirements

Model Size   FP16 VRAM   Q4 VRAM   Recommended GPU
3B           6 GB        2 GB      Any modern GPU / CPU-only
7-8B         16 GB       4-6 GB    RTX 3060 12GB, RTX 4060
13B          26 GB       8-10 GB   RTX 3090, RTX 4070 Ti
34B          68 GB       20 GB     RTX 4090, A6000
70B          140 GB      40 GB     2x RTX 4090, A100 80GB
💡 CPU inference: You can run quantized models on CPU using system RAM instead of VRAM. It's 10-50x slower than GPU inference, but it works. A 7B Q4 model needs about 6 GB of RAM. llama.cpp and Ollama both support CPU inference.

Quantization

Quantization reduces model precision to use less memory with minimal quality loss:

Format   Bits     Size Reduction   Quality Impact   Tool
FP16     16-bit   Baseline         None             Native
Q8       8-bit    ~50%             Negligible       GGUF, bitsandbytes
Q5       5-bit    ~69%             Very small       GGUF
Q4       4-bit    ~75%             Small            GGUF, GPTQ, AWQ
Q3       3-bit    ~81%             Noticeable       GGUF
Q2       2-bit    ~87%             Significant      GGUF
Sweet spot: Q4 (4-bit) quantization offers the best balance of size reduction and quality. A 7B model goes from 14GB (FP16) to ~4GB (Q4) with minimal quality loss. Q5 is slightly better quality for a small size increase.
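The sizes in the tables above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. A quick sketch of that rule of thumb (weights only — the runtime also needs extra memory for the KV cache and activations, which is why real requirements run higher):

```python
def estimate_model_size_gb(params_billion: float, bits: int) -> float:
    """Approximate size of the weights alone: params * bits / 8, in GB."""
    return params_billion * bits / 8

# 7B model: FP16 vs Q4
fp16 = estimate_model_size_gb(7, 16)   # 14.0 GB
q4 = estimate_model_size_gb(7, 4)      # 3.5 GB
print(f"FP16: {fp16:.1f} GB, Q4: {q4:.1f} GB, reduction: {1 - q4 / fp16:.0%}")
```

That 75% reduction is exactly the Q4 row in the table; the quantized GGUF files you download are slightly larger because some tensors are kept at higher precision.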

Ollama

The simplest way to run LLMs locally. One command to install and run models.

Shell — Getting started with Ollama
# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (downloads automatically on first run)
ollama run llama3.1          # LLaMA 3.1 8B
ollama run mistral           # Mistral 7B
ollama run codellama         # Code LLaMA 7B
ollama run phi3              # Phi-3 Mini 3.8B
ollama run qwen2.5:14b       # Qwen 2.5 14B

# List installed models
ollama list

# Use the API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain quantum computing in simple terms"
}'

# OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Hello!"}]
}'
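Ollama also supports customizing a model with a Modelfile — a small config that sets a base model, sampling parameters, and a system prompt. A minimal sketch (the model variant name and prompt text here are illustrative):

```
# Modelfile — a custom variant of llama3.1
FROM llama3.1
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
SYSTEM "You are a concise technical assistant."
```

Build and run it with `ollama create my-assistant -f Modelfile` followed by `ollama run my-assistant`.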

LM Studio

A desktop application with a GUI for downloading, running, and chatting with local LLMs. It supports GGUF models from Hugging Face.

  • Download from lmstudio.ai
  • Browse and download models from the built-in model catalog
  • Chat interface with conversation history
  • Local OpenAI-compatible API server
  • Supports Windows, macOS, and Linux

llama.cpp

High-performance C/C++ implementation for running LLMs. Powers many other tools including Ollama.

Shell — Using llama.cpp
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j

# Run inference
./llama-cli -m models/llama-3-8b-Q4_K_M.gguf \
  -p "Explain machine learning:" \
  -n 256 \
  --temp 0.7

# Start a server (OpenAI-compatible API)
./llama-server -m models/llama-3-8b-Q4_K_M.gguf \
  --port 8080 \
  --ctx-size 4096
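Note that `--ctx-size` costs memory too: the KV cache grows linearly with context length. A back-of-the-envelope sketch, assuming LLaMA 3 8B's published dimensions (32 layers, 8 KV heads with GQA, head dimension 128) and FP16 cache entries:

```python
def kv_cache_bytes(ctx_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_val: int = 2) -> int:
    """Key + value cache: 2 tensors per layer, each ctx * kv_heads * head_dim."""
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_val

print(kv_cache_bytes(4096) / 2**20, "MiB")  # 512.0 MiB for a 4096-token context
```

So a 4096-token context adds roughly half a gigabyte on top of the model weights; doubling the context doubles that cost.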

vLLM (Production Serving)

High-throughput serving engine optimized for production workloads with PagedAttention:

Shell — vLLM setup
# Install
pip install vllm

# Start OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
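`--gpu-memory-utilization` caps how much VRAM vLLM claims; whatever remains after the weights goes mostly to the PagedAttention KV cache. A rough sketch of the budget, assuming a hypothetical 24 GB GPU and an 8B model in bfloat16 (2 bytes per parameter):

```python
gpu_gb = 24
budget = gpu_gb * 0.9           # vLLM claims 21.6 GB of the card
weights = 8 * 2                 # 8B params * 2 bytes = 16 GB
kv_budget = budget - weights    # ~5.6 GB left for KV cache and activations
print(f"{kv_budget:.1f} GB for KV cache")
```

This is why a lower `--gpu-memory-utilization` or a longer `--max-model-len` can make an otherwise-fitting model fail to start: the KV cache budget runs out before the weights do.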

Performance Benchmarks

Model             Quantization   GPU             Tokens/sec
LLaMA 3 8B        Q4_K_M         RTX 4090        ~80-100
LLaMA 3 8B        Q4_K_M         RTX 3060 12GB   ~30-40
LLaMA 3 8B        Q4_K_M         M2 Pro (CPU)    ~15-25
LLaMA 3 70B       Q4_K_M         2x RTX 4090     ~15-25
Mistral 7B        Q4_K_M         RTX 4090        ~90-110
Phi-3 Mini 3.8B   Q4_K_M         RTX 3060 12GB   ~60-80
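Throughput translates directly into perceived latency. A quick sketch using mid-range figures from the table (these ignore prompt-processing time, which adds to the total):

```python
def response_time_s(tokens: int, tokens_per_sec: float) -> float:
    """Time to generate a response, ignoring prompt processing."""
    return tokens / tokens_per_sec

# A 256-token reply at roughly the table's RTX 4090 and RTX 3060 rates
print(round(response_time_s(256, 90), 1))   # ~2.8 s
print(round(response_time_s(256, 35), 1))   # ~7.3 s
```

Anything above ~20 tokens/sec feels responsive for chat, since it outpaces typical reading speed.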

Privacy and Cost Benefits

Complete Privacy

No data leaves your machine. Critical for healthcare, finance, legal, and any sensitive data processing.

Zero API Costs

After the hardware investment, inference is free. High-volume use cases can save significantly compared with per-token API pricing.

No Rate Limits

Run as many requests as your hardware supports. No throttling, no quotas, no waiting.

Offline Operation

Works without internet connectivity. Useful for air-gapped environments, field deployments, or unreliable networks.