Intermediate

Running Local LLMs

Set up and run LLMs on your own hardware — from hardware requirements and quantization to practical tools like Ollama and LM Studio.

Hardware Requirements

Model Size   FP16 VRAM   Q4 VRAM   Recommended GPU
3B           6 GB        2 GB      Any modern GPU / CPU-only
7-8B         16 GB       4-6 GB    RTX 3060 12GB, RTX 4060
13B          26 GB       8-10 GB   RTX 3090, RTX 4070 Ti
34B          68 GB       20 GB     RTX 4090, A6000
70B          140 GB      40 GB     2x RTX 4090, A100 80GB
💡 CPU inference: You can run quantized models on CPU using system RAM instead of VRAM. It's 10-50x slower than GPU inference, but it works. A 7B Q4 model needs about 6 GB of RAM. llama.cpp and Ollama both support CPU inference.

Quantization

Quantization reduces model precision to use less memory with minimal quality loss:

Format   Bits     Size Reduction   Quality Impact   Tool
FP16     16-bit   Baseline         None             Native
Q8       8-bit    ~50%             Negligible       GGUF, bitsandbytes
Q5       5-bit    ~69%             Very small       GGUF
Q4       4-bit    ~75%             Small            GGUF, GPTQ, AWQ
Q3       3-bit    ~81%             Noticeable       GGUF
Q2       2-bit    ~87%             Significant      GGUF
Sweet spot: Q4 (4-bit) quantization offers the best balance of size reduction and quality. A 7B model goes from 14GB (FP16) to ~4GB (Q4) with minimal quality loss. Q5 is slightly better quality for a small size increase.
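The sizes in the tables above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. A quick sketch of that rule of thumb (weights only — the runtime also needs extra memory for the KV cache and activations, which is why real requirements run higher):

```python
def estimate_model_size_gb(params_billion: float, bits: int) -> float:
    """Approximate size of the weights alone: params * bits / 8, in GB."""
    return params_billion * bits / 8

# 7B model: FP16 vs Q4
fp16 = estimate_model_size_gb(7, 16)   # 14.0 GB
q4 = estimate_model_size_gb(7, 4)      # 3.5 GB
print(f"FP16: {fp16:.1f} GB, Q4: {q4:.1f} GB, reduction: {1 - q4 / fp16:.0%}")
```

That 75% reduction is exactly the Q4 row in the table; the quantized GGUF files you download are slightly larger because some tensors are kept at higher precision.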

Ollama

The simplest way to run LLMs locally. One command to install and run models.

Shell — Getting started with Ollama
# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (downloads automatically on first run)
ollama run llama3.1          # LLaMA 3.1 8B
ollama run mistral           # Mistral 7B
ollama run codellama         # Code LLaMA 7B
ollama run phi3              # Phi-3 Mini 3.8B
ollama run qwen2.5:14b       # Qwen 2.5 14B

# List installed models
ollama list

# Use the API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain quantum computing in simple terms"
}'

# OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Hello!"}]
}'
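Ollama also supports customizing a model with a Modelfile — a small config that sets a base model, sampling parameters, and a system prompt. A minimal sketch (the model variant name and prompt text here are illustrative):

```
# Modelfile — a custom variant of llama3.1
FROM llama3.1
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
SYSTEM "You are a concise technical assistant."
```

Build and run it with `ollama create my-assistant -f Modelfile` followed by `ollama run my-assistant`.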

LM Studio

A desktop application with a GUI for downloading, running, and chatting with local LLMs. It supports GGUF models from Hugging Face.

  • Download from lmstudio.ai
  • Browse and download models from the built-in model catalog
  • Chat interface with conversation history
  • Local OpenAI-compatible API server
  • Supports Windows, macOS, and Linux

llama.cpp

High-performance C/C++ implementation for running LLMs. Powers many other tools including Ollama.

Shell — Using llama.cpp
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j

# Run inference
./llama-cli -m models/llama-3-8b-Q4_K_M.gguf \
  -p "Explain machine learning:" \
  -n 256 \
  --temp 0.7

# Start a server (OpenAI-compatible API)
./llama-server -m models/llama-3-8b-Q4_K_M.gguf \
  --port 8080 \
  --ctx-size 4096
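Note that `--ctx-size` costs memory too: the KV cache grows linearly with context length. A back-of-the-envelope sketch, assuming LLaMA 3 8B's published dimensions (32 layers, 8 KV heads with GQA, head dimension 128) and FP16 cache entries:

```python
def kv_cache_bytes(ctx_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_val: int = 2) -> int:
    """Key + value cache: 2 tensors per layer, each ctx * kv_heads * head_dim."""
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_val

print(kv_cache_bytes(4096) / 2**20, "MiB")  # 512.0 MiB for a 4096-token context
```

So a 4096-token context adds roughly half a gigabyte on top of the model weights; doubling the context doubles that cost.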

vLLM (Production Serving)

High-throughput serving engine optimized for production workloads with PagedAttention:

Shell — vLLM setup
# Install
pip install vllm

# Start OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
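`--gpu-memory-utilization` caps how much VRAM vLLM claims; whatever remains after the weights goes mostly to the PagedAttention KV cache. A rough sketch of the budget, assuming a hypothetical 24 GB GPU and an 8B model in bfloat16 (2 bytes per parameter):

```python
gpu_gb = 24
budget = gpu_gb * 0.9           # vLLM claims 21.6 GB of the card
weights = 8 * 2                 # 8B params * 2 bytes = 16 GB
kv_budget = budget - weights    # ~5.6 GB left for KV cache and activations
print(f"{kv_budget:.1f} GB for KV cache")
```

This is why a lower `--gpu-memory-utilization` or a longer `--max-model-len` can make an otherwise-fitting model fail to start: the KV cache budget runs out before the weights do.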

Performance Benchmarks

Model             Quantization   GPU             Tokens/sec
LLaMA 3 8B        Q4_K_M         RTX 4090        ~80-100
LLaMA 3 8B        Q4_K_M         RTX 3060 12GB   ~30-40
LLaMA 3 8B        Q4_K_M         M2 Pro (CPU)    ~15-25
LLaMA 3 70B       Q4_K_M         2x RTX 4090     ~15-25
Mistral 7B        Q4_K_M         RTX 4090        ~90-110
Phi-3 Mini 3.8B   Q4_K_M         RTX 3060 12GB   ~60-80
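Throughput translates directly into perceived latency. A quick sketch using mid-range figures from the table (these ignore prompt-processing time, which adds to the total):

```python
def response_time_s(tokens: int, tokens_per_sec: float) -> float:
    """Time to generate a response, ignoring prompt processing."""
    return tokens / tokens_per_sec

# A 256-token reply at roughly the table's RTX 4090 and RTX 3060 rates
print(round(response_time_s(256, 90), 1))   # ~2.8 s
print(round(response_time_s(256, 35), 1))   # ~7.3 s
```

Anything above ~20 tokens/sec feels responsive for chat, since it outpaces typical reading speed.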

Privacy and Cost Benefits

Complete Privacy

No data leaves your machine. Critical for healthcare, finance, legal, and any sensitive data processing.

Zero API Costs

After the hardware investment, inference is free. High-volume use cases can save significantly compared with per-token API pricing.

No Rate Limits

Run as many requests as your hardware supports. No throttling, no quotas, no waiting.

Offline Operation

Works without internet connectivity. Useful for air-gapped environments, field deployments, or unreliable networks.