Quantization is the single most effective lever you have for shrinking a model's memory footprint and speeding up inference without retraining it from scratch. A model that ships at 16-bit floating point can often run at 4-bit integers with a few percentage points of quality loss and a roughly four-times reduction in size. That trade is what lets a 70-billion-parameter model, about 140 GB of weights at 16-bit, shrink to roughly 35 GB and fit on one or two GPUs instead of a rack of them.
This guide explains the mechanics from the ground up: what numeric precision actually means, the difference between post-training quantization and quantization-aware training, where accuracy leaks out, and how to read the alphabet soup of formats like INT8, NF4, GPTQ, and AWQ. The goal is to make you fluent enough to choose a quantization strategy on purpose rather than copying whatever a forum thread recommended.
If you understand nothing else, understand this: quantization is lossy compression applied to the numbers inside a neural network. The skill is in losing the right bits.
What Quantization Actually Does
A trained model is a giant collection of numbers (weights and activations), usually stored as 32-bit or 16-bit floating point values. Each weight in FP16 takes two bytes. A model with seven billion weights therefore needs roughly 14 GB just to hold the weights in memory.
Quantization maps those high-precision numbers onto a smaller set of low-precision ones. The most common scheme is linear quantization: you find the range of values, divide it into evenly spaced buckets (256 buckets for 8-bit, 16 for 4-bit), and snap every weight to the nearest bucket. You store the bucket index plus a scale factor that lets you reconstruct an approximate original value at runtime.
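The bucket-snapping procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the function names `quantize_linear` and `dequantize_linear` and the sample weights are hypothetical, and the sketch stores the range minimum alongside the scale so the buckets can start at the smallest value.

```python
def quantize_linear(values, bits):
    """Map floats to integer bucket indices plus a scale and range minimum."""
    levels = 2 ** bits                        # 256 buckets for 8-bit, 16 for 4-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (levels - 1) or 1.0   # fall back to 1.0 for a constant tensor
    indices = [round((v - lo) / scale) for v in values]
    return indices, scale, lo


def dequantize_linear(indices, scale, lo):
    """Reconstruct approximate floats from the stored bucket indices."""
    return [lo + i * scale for i in indices]


# Hypothetical sample weights, chosen only for illustration.
weights = [-0.82, -0.31, -0.05, 0.0, 0.12, 0.47, 0.90]

idx4, scale4, lo4 = quantize_linear(weights, bits=4)
approx4 = dequantize_linear(idx4, scale4, lo4)
max_err4 = max(abs(a - b) for a, b in zip(weights, approx4))

idx8, scale8, lo8 = quantize_linear(weights, bits=8)
approx8 = dequantize_linear(idx8, scale8, lo8)
max_err8 = max(abs(a - b) for a, b in zip(weights, approx8))

print(f"4-bit: scale={scale4:.4f}, max error={max_err4:.4f}")
print(f"8-bit: scale={scale8:.4f}, max error={max_err8:.4f}")
```

Because every value snaps to its nearest bucket, the reconstruction error is bounded by half the bucket width, which is why fewer bits (wider buckets) cost more accuracy.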
Where The Savings Come From
The savings are mechanical. Going from 16-bit to 8-bit halves memory. Going to 4-bit quarters it. Smaller weights mean less data moving between memory and compute units, and memory bandwidth, not raw math, is usually the bottleneck during inference. That is why quantized models often run faster, not just smaller.
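The arithmetic is simple enough to check by hand, but a small helper makes it easy to try other model sizes. The function name `weight_memory_gb` is hypothetical, and the estimate deliberately covers the weights only: it ignores stored scale factors, activations, and any runtime caches.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"7B params at {bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

For a seven-billion-parameter model this gives 14.0 GB at 16-bit, 7.0 GB at 8-bit, and 3.5 GB at 4-bit.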
Precision Formats You Need To Know
The format determines both how small the model gets and how much quality you keep.
- FP32: Full precision. The training baseline. Rarely used for deployment.
- FP16 / BF16: Half precision. The standard serving format. BF16 trades mantissa bits for a wider exponent range, which makes it less prone to overflow than FP16.
- INT8: Eight-bit integers. The workhorse of production quantization, with mature hardware support.
- INT4 / NF4: Four-bit. NF4 (normal float 4) is tuned to the bell-curve distribution of neural network weights and typically outperforms naive INT4.
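The FP16 versus BF16 trade-off in the list above can be seen directly with the standard library. This is a sketch under stated assumptions: `to_fp16` and `to_bf16` are hypothetical helper names, and the BF16 conversion simply truncates a float32 to its top 16 bits, whereas real hardware typically rounds.

```python
import struct


def to_fp16(x):
    """Round-trip through IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return float("inf")  # value is beyond FP16's largest finite number (65504)


def to_bf16(x):
    """Round-trip through bfloat16 (8 exponent bits, 7 mantissa bits).

    Simplification: keeps the top 16 bits of the float32 encoding by
    truncation instead of rounding.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


for value in (1.001, 100000.0):
    print(f"{value}: fp16={to_fp16(value)}, bf16={to_bf16(value)}")
```

Near 1.0, FP16 keeps more digits because it has more mantissa bits. At 100000.0, FP16 runs out of range while BF16 still returns a coarse but finite value.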
If you are new to the precision concepts behind these formats, "AI Model Quantization Explained: A Beginner's Guide" builds them up from first principles.
Post-Training Quantization vs. Quantization-Aware Training
There are two fundamentally different times you can quantize.
Post-Training Quantization (PTQ)
PTQ takes a finished model and compresses it after the fact. It is fast, often minutes to hours, and requires no labeled training data, just a small calibration set to measure activation ranges. GPTQ and AWQ are PTQ methods. This is what most people mean when they say "quantize a model."
Quantization-Aware Training (QAT)
QAT simulates low precision during training or fine-tuning, so the model learns weights that survive quantization gracefully. It costs far more compute but recovers accuracy that PTQ leaves on the table, especially at aggressive bit widths. Use QAT when you need 4-bit or lower and PTQ quality is unacceptable.
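To make "simulates low precision during training" concrete, here is a toy sketch that fits a single weight with gradient descent. It assumes a common approach, fake quantization in the forward pass with a straight-through gradient in the backward pass; the source does not specify a particular QAT recipe. The names `fake_quant` and `train_qat`, the data, and the hyperparameters are all hypothetical, and real QAT would run inside a training framework with automatic differentiation.

```python
def fake_quant(w, scale):
    """Quantize then dequantize, so the forward pass sees the rounded value."""
    q = max(-8, min(7, round(w / scale)))   # clamp to the signed 4-bit range
    return q * scale


def train_qat(xs, ys, scale, steps=200, lr=0.05):
    """Fit y = w * x while the forward pass only ever uses the quantized w."""
    w = 0.0  # full-precision shadow weight that the optimizer updates
    for _ in range(steps):
        wq = fake_quant(w, scale)
        # Gradient of mean squared error with respect to wq. The
        # straight-through estimator treats d(wq)/d(w) as 1, so the same
        # gradient is applied to the full-precision weight.
        grad = sum(2 * (wq * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w, fake_quant(w, scale)


xs = [1.0, 2.0, 3.0]
ys = [0.63 * x for x in xs]   # the "true" weight is 0.63, which is off the grid
w, wq = train_qat(xs, ys, scale=0.25)
print(f"shadow weight={w:.3f}, quantized weight={wq}")
```

The shadow weight hovers near the rounding boundary between the two grid points that bracket 0.63. With a single weight there is little accuracy to recover; the benefit in a real network comes from many weights jointly adapting to each other's rounding.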
How Accuracy Degrades
Quantization error is not uniform. A handful of "outlier" weights and activations carry disproportionate signal, and crushing them into coarse buckets is where most quality loss happens. Modern methods specifically protect these.
- AWQ identifies the most salient weight channels and scales them to preserve precision where it matters.
- GPTQ quantizes layer by layer while compensating for accumulated error using second-order information.
- SmoothQuant shifts the difficulty of quantizing activations onto the weights, which are easier to handle.
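The damage a single outlier does is easy to reproduce. The sketch below is an illustration with made-up numbers, not a model of any specific method: `roundtrip_error` is a hypothetical helper that applies 4-bit symmetric quantization with one shared scale and reports the mean absolute error.

```python
def roundtrip_error(values, bits=4):
    """Mean absolute error after symmetric quantize/dequantize with one scale."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for signed 4-bit
    scale = max(abs(v) for v in values) / qmax    # assumes at least one nonzero value
    approx = [round(v / scale) * scale for v in values]
    return sum(abs(a - b) for a, b in zip(values, approx)) / len(values)


typical = [0.02 * i for i in range(-10, 11)]   # values spread over [-0.2, 0.2]
with_outlier = typical + [6.0]                 # the same values plus one outlier

err_typical = roundtrip_error(typical)
err_outlier = roundtrip_error(with_outlier)
print(f"without outlier: {err_typical:.4f}")
print(f"with outlier:    {err_outlier:.4f}")
```

The outlier stretches the scale so far that every ordinary value collapses into the zero bucket, which is the situation the methods listed above are designed to avoid.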
The failure mode to watch for is silent degradation: the model still produces fluent text but reasons worse, hallucinates more on edge cases, or loses instruction-following. Perplexity alone will not catch this. You need task-level evaluation.
Choosing A Quantization Strategy
Match the method to the constraint.
- You need maximum speed on a server GPU: INT8 with established kernels.
- You need to fit a large model on a small GPU: 4-bit GPTQ or AWQ.
- You are running on CPU or edge hardware: GGUF formats (used by llama.cpp) with k-quants.
- Quality is non-negotiable at low bit width: invest in QAT.
The step-by-step approach walks through executing one of these end to end, and the tooling guide compares the libraries that implement them.
Symmetric vs. Asymmetric And Per-Channel Quantization
Two implementation choices shape how well a scheme preserves quality, and they are worth understanding even at a conceptual level.
Symmetric vs. Asymmetric
Symmetric quantization assumes values are centered on zero and uses a single scale factor with no offset. It is simpler and faster but wastes range when the actual distribution is lopsided. Asymmetric quantization adds a zero-point offset so the buckets line up with the real minimum and maximum, capturing skewed distributions more faithfully at a small cost in compute.
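A small comparison on lopsided data shows the wasted range. This is a sketch with hypothetical function names and made-up, all-positive values (the kind of distribution you might see after a ReLU); it uses 4 bits so the difference is easy to see.

```python
def quantize_symmetric(values, bits=4):
    """One scale, no offset: the grid is centered on zero."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for signed 4-bit
    scale = max(abs(v) for v in values) / qmax
    approx = [round(v / scale) * scale for v in values]
    return approx, scale


def quantize_asymmetric(values, bits=4):
    """Scale plus zero-point: the grid spans the actual min and max."""
    qmax = 2 ** bits - 1                          # 15 for unsigned 4-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)               # integer code that represents 0.0
    codes = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    approx = [(c - zero_point) * scale for c in codes]
    return approx, scale


def max_error(values, approx):
    return max(abs(a - b) for a, b in zip(values, approx))


skewed = [0.0, 0.1, 0.5, 1.3, 2.7, 4.2, 6.0]   # all non-negative

sym_approx, sym_scale = quantize_symmetric(skewed)
asym_approx, asym_scale = quantize_asymmetric(skewed)
sym_err = max_error(skewed, sym_approx)
asym_err = max_error(skewed, asym_approx)
print(f"symmetric:  scale={sym_scale:.3f}, max error={sym_err:.3f}")
print(f"asymmetric: scale={asym_scale:.3f}, max error={asym_err:.3f}")
```

The symmetric grid spends half its buckets on negative values that never occur, so its buckets are about twice as wide as the asymmetric ones on this data.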
Per-Tensor vs. Per-Channel
Per-tensor quantization uses one scale for an entire weight matrix, which is cheap but crude when different channels have very different ranges. Per-channel quantization gives each output channel its own scale, dramatically reducing error for little extra storage. Most quality-conscious methods quantize weights per channel, and group-wise quantization (a scale per small group of weights) takes this even further. The group size is a real knob: smaller groups mean better quality and slightly larger files.
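The effect of scale granularity, and the storage cost of the group-size knob, can both be sketched briefly. Everything here is illustrative: the 2x4 matrix is made up, `quant_error` and `bits_per_weight` are hypothetical helpers, and the 16-bit scale size is an assumption rather than a property of any particular format.

```python
def quant_error(values, bits=4):
    """Sum of absolute round-trip errors using one symmetric scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    return sum(abs(v - round(v / scale) * scale) for v in values)


# Hypothetical weight matrix: each row is one output channel.
W = [[0.01, -0.02, 0.03, -0.04],    # small-range channel
     [1.00, -2.00, 3.00, -4.00]]    # large-range channel

per_tensor = quant_error([w for row in W for w in row])   # one scale for everything
per_channel = sum(quant_error(row) for row in W)          # one scale per row
print(f"per-tensor error:  {per_tensor:.4f}")
print(f"per-channel error: {per_channel:.4f}")


def bits_per_weight(bits, group_size, scale_bits=16):
    """Effective storage per weight once each group's scale is counted.

    Assumes one scale of `scale_bits` bits per group and no zero-point.
    """
    return bits + scale_bits / group_size


for group_size in (128, 64, 32):
    print(f"group size {group_size}: {bits_per_weight(4, group_size)} bits per weight")
```

With a shared scale the small-range channel is rounded entirely to zero. Giving it its own scale removes most of that error, while shrinking the group size raises the effective bits per weight only slightly.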
Why Activations Are Harder Than Weights
Weights are fixed after training, so you can analyze and quantize them carefully offline. Activations are computed at runtime from the input and vary wildly, with large outliers that appear in specific channels. Crushing those outliers into coarse buckets is what wrecks accuracy when you quantize activations naively.
This asymmetry is why weight-only quantization is the gentler, more popular choice and why methods that do quantize activations, like SmoothQuant, work by migrating the difficulty from activations onto weights, which are easier to handle. Understanding this single fact explains most of the design decisions in modern quantization research.
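The migration idea can be sketched with a per-channel rescaling in the style of SmoothQuant: divide each activation channel by a factor and multiply the matching weight row by the same factor, so the product is unchanged. The helper names, the tiny matrices, and the choice of `alpha=0.5` are assumptions for illustration, and the sketch assumes every channel has a nonzero peak.

```python
def smooth(X, W, alpha=0.5):
    """Rescale so X @ W is unchanged while activation outliers shrink.

    X: rows of activations (tokens x channels). W: channels x outputs.
    """
    n_ch = len(W)
    x_max = [max(abs(row[j]) for row in X) for j in range(n_ch)]
    w_max = [max(abs(w) for w in W[j]) for j in range(n_ch)]
    s = [x_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_ch)]
    X_s = [[row[j] / s[j] for j in range(n_ch)] for row in X]
    W_s = [[w * s[j] for w in W[j]] for j in range(n_ch)]
    return X_s, W_s


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical activations with an outlier channel (index 1).
X = [[0.1, 20.0, -0.2],
     [-0.3, -35.0, 0.1]]
W = [[0.5, -0.4],
     [0.3, 0.2],
     [-0.6, 0.1]]

X_s, W_s = smooth(X, W)
peak_before = max(abs(v) for row in X for v in row)
peak_after = max(abs(v) for row in X_s for v in row)
out_before = matmul(X, W)
out_after = matmul(X_s, W_s)
print(f"activation peak: {peak_before} -> {peak_after:.2f}")
```

The layer's output is mathematically the same, but the activation range is much narrower, at the cost of a somewhat wider weight range that can be handled offline.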
Frequently Asked Questions
Does quantization always make models faster?
Not always. It reliably reduces memory, but speed gains depend on whether your hardware has optimized low-precision kernels. A 4-bit model with poor kernel support can sometimes run slower than FP16 because of dequantization overhead. Always benchmark on your actual target hardware.
How much quality do I lose with 4-bit quantization?
With a good method like AWQ or GPTQ, a strong model typically loses one to three points on most benchmarks at 4-bit, often imperceptible in practice. Below 4-bit, degradation accelerates sharply, and you usually need quantization-aware training to stay usable.
Can I quantize a model myself or do I need pre-quantized weights?
You can do it yourself with a calibration dataset and a library like AutoGPTQ or llama.cpp, usually in under an hour for a mid-sized model. Pre-quantized weights from model hubs save that step, but quantizing yourself lets you tune the calibration data to your domain.
What is the difference between weight-only and full quantization?
Weight-only quantization compresses just the stored weights and dequantizes them during compute, which is simpler and preserves more quality. Full quantization also quantizes activations, enabling integer math end to end for more speed but more accuracy risk.
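The two paths can be contrasted on a single dot product. This is a sketch with made-up numbers and a hypothetical `quantize` helper using 8-bit symmetric scales; production kernels fuse these steps and operate on whole tensors.

```python
def quantize(values, bits=8):
    """Symmetric quantization: integer codes plus one scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    return [round(v / scale) for v in values], scale


weights = [0.42, -0.17, 0.88, -0.65]
activations = [1.5, -0.3, 0.7, 2.1]
exact = sum(w * x for w, x in zip(weights, activations))

# Weight-only: weights are stored as integers, dequantized to floats,
# then multiplied with full-precision activations.
qw, sw = quantize(weights)
weight_only = sum((q * sw) * x for q, x in zip(qw, activations))

# Full quantization: both operands are integers, the dot product
# accumulates in integer arithmetic, and one rescale happens at the end.
qx, sx = quantize(activations)
int_acc = sum(a * b for a, b in zip(qw, qx))
full = int_acc * sw * sx

print(f"exact={exact:.4f}, weight-only={weight_only:.4f}, full={full:.4f}")
```

The integer accumulation in the second path is what enables faster hardware paths, and the extra rounding of the activations is where the additional accuracy risk comes from.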
Is quantization the same as distillation or pruning?
No. These are distinct compression techniques. Distillation trains a smaller model to mimic a larger one, pruning removes weights entirely, and quantization reduces the precision of the weights that remain. They are often combined.
Key Takeaways
- Quantization is lossy compression of the numbers inside a model; 16-bit to 4-bit gives roughly a four-times size reduction.
- Memory bandwidth is usually the bottleneck, so smaller weights often mean faster inference.
- PTQ is fast and data-light; QAT costs more but recovers accuracy at aggressive bit widths.
- Most quality loss comes from a few outlier weights; modern methods like AWQ and GPTQ protect them.
- Always evaluate at the task level, not just perplexity, and benchmark speed on your real target hardware.