Model Quantization (Intermediate)

Quantization is the process of reducing the numerical precision of model weights — from 16-bit floating point to 8-bit, 4-bit, or even lower. This dramatically reduces memory usage and speeds up inference, often with minimal impact on quality. It is the single most impactful technique for making language models run on consumer hardware.

Why Quantize?

A 7B parameter model in 16-bit precision requires approximately 14 GB of memory just for the weights. Quantizing to 4-bit reduces this to about 3.5 GB, making it possible to run on a laptop GPU or even a smartphone.

| Precision | Memory (7B model) | Quality impact | Speed |
|---|---|---|---|
| FP16 (16-bit) | ~14 GB | Baseline (full quality) | Baseline |
| INT8 (8-bit) | ~7 GB | Negligible loss (<1%) | 1.5-2x faster |
| INT4 (4-bit) | ~3.5 GB | Small loss (1-3%) | 2-3x faster |
| INT2 (2-bit) | ~1.75 GB | Moderate loss (5-10%) | 3-4x faster |
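
The memory figures in the table follow from simple arithmetic: parameters × bits ÷ 8 bytes, ignoring the small overhead that real quantized files add for scales and metadata. A quick sketch:

```python
def weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate weight memory in GB: params * bits / 8 bytes."""
    return num_params * bits / 8 / 1e9

# A 7B-parameter model at each precision from the table
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.2f} GB")
# 16-bit: 14.00 GB, 8-bit: 7.00 GB, 4-bit: 3.50 GB, 2-bit: 1.75 GB
```

Actual files run slightly larger than these estimates because quantization constants (per-group scales and zero-points) are stored alongside the weights.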

Quantization Methods

  1. GPTQ (Post-Training Quantization)

    A one-shot weight quantization method that uses a small calibration dataset to minimize quantization error. Produces GPU-optimized models with excellent quality at 4-bit and 8-bit levels.

  2. AWQ (Activation-Aware Weight Quantization)

    Identifies the most important weights by analyzing activation patterns and preserves them at higher precision. Often produces better quality than GPTQ at the same bit-width.

  3. GGUF (llama.cpp Format)

    A file format and quantization approach designed for CPU inference with llama.cpp. Supports mixed-precision quantization (e.g., Q4_K_M) that keeps critical layers at higher precision.

  4. bitsandbytes

    A library that enables on-the-fly quantization during model loading. Simple to use with Hugging Face Transformers — just add a flag to load in 4-bit or 8-bit.
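
To make the methods above concrete, here is a minimal sketch of the baseline they all improve on: symmetric round-to-nearest quantization with a single per-tensor scale. The function names are illustrative, not any library's API:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Map floats to signed integers using one per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for INT8, 7 for INT4
    scale = np.abs(w).max() / qmax      # naive choice: scale to the max weight
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int32), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # toy weight tensor

for bits in (8, 4):
    q, s = quantize_symmetric(w, bits)
    err = np.abs(dequantize(q, s) - w).mean()
    print(f"INT{bits}: mean abs reconstruction error = {err:.6f}")
```

Running this shows the error growing as bits shrink. GPTQ and AWQ improve on this naive baseline by using calibration data: GPTQ adjusts remaining weights to compensate for rounding error, and AWQ protects the weights whose activations matter most.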

Quick Start: 4-bit Loading with bitsandbytes

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                   # quantize weights to 4-bit on load
    bnb_4bit_compute_dtype="float16",    # run matmuls in FP16 for speed
    bnb_4bit_quant_type="nf4",           # NormalFloat4, tuned for normally distributed weights
    bnb_4bit_use_double_quant=True       # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=quantization_config,
    device_map="auto"
)
# Model now uses ~2 GB instead of ~8 GB

Rule of Thumb: For most applications, 4-bit quantization (Q4_K_M in GGUF or NF4 in bitsandbytes) offers the best balance of quality and efficiency. Go to 8-bit if you need near-perfect quality, or 2-bit only for extremely constrained environments where some quality loss is acceptable.
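
Why does NF4 tend to beat plain INT4? Model weights are roughly normally distributed, so placing the 16 available code levels at quantiles of a normal distribution wastes fewer codes on rare extreme values. The toy comparison below illustrates the idea with a simplified quantile codebook; it is not the exact NF4 codebook that bitsandbytes ships:

```python
import random
from statistics import NormalDist

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # toy Gaussian weights

# Uniform 16-level grid over [-max, max], like naive INT4
m = max(abs(w) for w in weights)
uniform_levels = [-m + i * (2 * m / 15) for i in range(16)]

# 16 levels at normal quantiles (simplified NF4-style codebook)
nd = NormalDist(0.0, 1.0)
quantile_levels = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]

def quantize(w, levels):
    return min(levels, key=lambda level: abs(level - w))

def mean_abs_error(levels):
    return sum(abs(quantize(w, levels) - w) for w in weights) / len(weights)

print(f"uniform grid : {mean_abs_error(uniform_levels):.4f}")
print(f"quantile grid: {mean_abs_error(quantile_levels):.4f}")
```

The quantile-based codebook concentrates levels where the weight mass actually is, so its average error on Gaussian data is lower than the uniform grid's at the same 4-bit budget.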

Next: On-Device Deployment

In the next lesson, you will learn how to deploy quantized models on mobile devices, browsers, and edge hardware.
