Quantization shrinks a model by storing its weights, and sometimes its activations, in fewer bits than the 16- or 32-bit floats it was trained with. Drop from 16 bits to 4 and a 13-billion-parameter model whose weights needed roughly 26 GB of memory now fits in under 8 GB. That is the entire appeal: the same model runs on cheaper hardware, serves more requests per GPU, and responds faster.
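The arithmetic behind those figures is easy to check. The sketch below counts weight storage only and uses decimal gigabytes; the helper name `weight_memory_gb` is illustrative, and activations, KV cache, and quantization metadata are deliberately left out.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 13e9  # the 13-billion-parameter model from the text

fp16_gb = weight_memory_gb(params, 16)
int4_gb = weight_memory_gb(params, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")  # 16-bit weights: 26.0 GB
print(f" 4-bit weights: {int4_gb:.1f} GB")  #  4-bit weights: 6.5 GB
```

The raw 4-bit figure is 6.5 GB. Real quantized checkpoints also store scale factors and often keep a few layers in higher precision, which is presumably why the practical number is quoted as "under 8 GB" rather than 6.5.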
The hard part is not the idea. It is choosing among a dozen methods that all promise "near-lossless" results and rarely agree on what that means. The right choice depends on whether you control training, how much accuracy you can spend, and what hardware you are targeting. This article lays out the competing approaches, the axes that actually separate them, and a decision rule you can apply in an afternoon.
The two families: PTQ versus QAT
Every quantization method falls into one of two camps, and picking the camp is the first real decision.
Post-training quantization (PTQ) takes a finished model and converts its weights without retraining. It is fast, needs no labeled data, only a small unlabeled calibration set, and runs in minutes to hours. The cost is accuracy: aggressive PTQ can degrade quality, especially below 8 bits, because the model never learned to tolerate the rounding error.
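To make the rounding error concrete, here is a minimal sketch of the simplest PTQ baseline: symmetric round-to-nearest with a single scale factor. It is not GPTQ or AWQ, the weight values are made up for illustration, and the function names are assumptions of this example.

```python
def quantize_dequantize(weights, bits):
    """Round each weight to a signed integer grid, then map it back to float."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code
        restored.append(q * scale)                   # value the model actually sees
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Illustrative weights, not taken from any real model.
weights = [0.013, -0.42, 0.27, 0.9, -0.051, 0.33, -0.76, 0.08]

err_8bit = mean_abs_error(weights, quantize_dequantize(weights, bits=8))
err_4bit = mean_abs_error(weights, quantize_dequantize(weights, bits=4))

print(f"8-bit mean error: {err_8bit:.5f}")
print(f"4-bit mean error: {err_4bit:.5f}")
```

The 4-bit grid has only 15 levels against 255 at 8-bit, so each weight lands further from its original value. That gap is the error the model "never learned to tolerate".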
Quantization-aware training (QAT) simulates the rounding during training or fine-tuning, so the model adapts its weights to the lower precision. It recovers most of the accuracy PTQ loses, but it requires a training pipeline, gradient compute, and time measured in GPU-days. You also need the original training data or a good proxy.
The rule of thumb: reach for PTQ first. It is cheaper and often good enough at 8-bit. Only escalate to QAT when PTQ measurably fails your accuracy bar and the model is important enough to justify a training run. The step-by-step approach walks through running PTQ end to end.
The axes that actually matter
Marketing copy compares methods on a single accuracy number. That hides the trade-offs you will live with. Five axes separate real options.
- Bit width. 8-bit is the safe default and usually loses almost nothing. 4-bit roughly halves memory again but introduces visible quality risk on hard tasks. 2- and 3-bit are research territory for most teams.
- What gets quantized. Weight-only quantization shrinks storage but keeps activations in full precision, so the speedup is modest. Weight-and-activation quantization unlocks faster integer math but is far more sensitive to outliers.
- Granularity. Per-tensor scaling uses one scale factor for a whole weight tensor and is fast but blunt. Per-channel or per-group scaling uses many factors, preserves more accuracy, and costs a little overhead.
- Calibration cost. Some methods need only a few hundred sample sequences; others want thousands and careful curation.
- Hardware support. A format is useless if your runtime cannot execute it. INT8 is broadly supported; some 4-bit formats only accelerate on specific GPUs or kernels.
Ignore any one of these and you can pick a method that benchmarks well and then runs slowly or breaks on your hardware.
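The granularity axis is the easiest one to see in a few lines. The sketch below, assuming simple symmetric round-to-nearest at 4 bits and made-up weights, compares one scale for the whole tensor against one scale per group of four values. A single outlier is enough to separate the two.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric round-to-nearest using one scale for all of `values`."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quantize_per_group(values, group_size, bits=4):
    """Give each consecutive group of `group_size` values its own scale."""
    restored = []
    for start in range(0, len(values), group_size):
        restored.extend(quantize_dequantize(values[start:start + group_size], bits))
    return restored


def total_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored))


# Illustrative weights with one outlier (2.0) that stretches the scale.
weights = [0.02, -0.03, 0.01, 0.04, -0.02, 0.03, 2.0, -0.01]

per_tensor_err = total_abs_error(weights, quantize_dequantize(weights))
per_group_err = total_abs_error(weights, quantize_per_group(weights, group_size=4))

print(f"per-tensor error: {per_tensor_err:.4f}")
print(f"per-group error:  {per_group_err:.4f}")
```

With one shared scale, the outlier forces every small weight to round to zero. With per-group scales, only the group containing the outlier pays that price, at the cost of storing one extra scale factor per group.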
The leading methods and where they fit
A handful of named methods cover most production needs.
GPTQ
A one-shot, weight-only PTQ method that quantizes layer by layer while minimizing reconstruction error. It is the practical default for 4-bit large language models when you want strong accuracy without retraining. It needs a calibration set and a few minutes to hours of compute.
AWQ
Activation-aware weight quantization protects the small fraction of weights that matter most by scaling around important activation channels. It often edges out GPTQ on instruction-following tasks at 4-bit and has good kernel support.
bitsandbytes (NF4 / INT8)
The lowest-friction option. Load a model in 8-bit or 4-bit with a flag, no calibration step. Accuracy is solid at 8-bit and acceptable at 4-bit for many uses. It is the right first try for experimentation and for QLoRA-style fine-tuning.
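In practice the "flag" is a small config object. The following is a sketch of loading through Hugging Face Transformers with `BitsAndBytesConfig`; the wrapper name `load_quantized` and the bfloat16 compute dtype are choices of this example, and the model ID you pass is up to you. Calling it requires torch, transformers, bitsandbytes, and accelerate installed, plus a supported accelerator (typically a CUDA GPU).

```python
def load_quantized(model_id, four_bit=False):
    """Load a causal LM with bitsandbytes 8-bit or 4-bit (NF4) weights.

    `model_id` is whatever Hub ID or local path you want to load.
    """
    # Imported lazily so this definition can be read or imported on a machine
    # without the libraries; actually calling the function needs all of them.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    if four_bit:
        config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",              # the NF4 format named above
            bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bf16 matmuls
        )
    else:
        config = BitsAndBytesConfig(load_in_8bit=True)

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",  # let accelerate place layers on available devices
    )
```

Note that no calibration data appears anywhere in the call, which is exactly what makes this the low-friction path.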
GGUF / llama.cpp quantization
The CPU and consumer-hardware path. A range of quantization levels (Q4, Q5, Q8 and k-quant variants) let you trade size against quality on machines without a big GPU. This is how most local and on-device deployments ship.
For a deeper tour of the ecosystem, see the best tools guide; for honest comparisons grounded in real deployments, see the real-world examples and use cases.
A decision rule you can apply today
Walk the questions in order and stop at the first that fits.
- Do you just want it smaller for experiments? Use bitsandbytes 8-bit. Zero calibration, near-zero accuracy loss.
- Are you serving an LLM and need 4-bit on a GPU? Start with GPTQ or AWQ. Try both on your eval set; the winner varies by model.
- Are you deploying to CPU, a laptop, or the edge? Use GGUF and pick a k-quant level that meets your quality bar.
- Did 8-bit PTQ still miss your accuracy target on a model you care about? Now QAT earns its cost. Budget the training run.
- Is the model tiny and latency-critical? Consider INT8 weight-and-activation quantization for the integer-math speedup, and verify hardware support first.
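The questions above translate directly into a first-match function. This is only a sketch: the parameter names are invented for illustration, and the final fallback is an assumption based on the article's "start conservative" advice rather than one of the listed rules.

```python
def choose_method(
    experimenting=False,
    serving_llm_4bit_gpu=False,
    cpu_or_edge=False,
    ptq_8bit_missed_target=False,
    tiny_and_latency_critical=False,
):
    """Walk the questions in order and return at the first one that fits."""
    if experimenting:
        return "bitsandbytes 8-bit"
    if serving_llm_4bit_gpu:
        return "GPTQ or AWQ (evaluate both)"
    if cpu_or_edge:
        return "GGUF k-quant"
    if ptq_8bit_missed_target:
        return "QAT"
    if tiny_and_latency_critical:
        return "INT8 weight-and-activation (verify hardware support)"
    # Assumption: nothing matched, so fall back to the conservative default.
    return "8-bit PTQ, then measure"


# Order matters: an experiment on a laptop still stops at the first question.
print(choose_method(experimenting=True, cpu_or_edge=True))  # bitsandbytes 8-bit
print(choose_method(cpu_or_edge=True))                      # GGUF k-quant
```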
The mistake teams make is skipping straight to the most aggressive method because it sounds impressive, then spending a week debugging accuracy regressions. Start conservative, measure, and only push the bit width down when the savings justify the risk. The common mistakes guide catalogs the rest of these traps.
The trade-offs you only see in production
Benchmarks compare methods on a clean test set. Production exposes trade-offs that never show up there, and they often dominate the decision.
Setup cost versus run cost
bitsandbytes has near-zero setup cost but a more modest accuracy ceiling at 4-bit. GPTQ and AWQ cost you a calibration run and some configuration, but deliver better 4-bit quality and faster serving kernels. If you quantize a model once and serve it billions of times, paying the higher setup cost is obviously right. If you are iterating rapidly on many models, the low-friction option wins. The mistake is using the heavyweight method for throwaway experiments and the lightweight one for a flagship production model.
Calibration-set sensitivity
Methods that calibrate, GPTQ and AWQ, are only as good as the data you calibrate on. A calibration set that does not represent your real traffic produces a model that benchmarks fine and degrades on production inputs. This is a real, recurring failure mode: teams calibrate on generic text, deploy on domain-specific traffic, and wonder why quality slipped. Treat the calibration set as a first-class input, drawn from the same distribution you will actually serve.
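A toy version of this failure is easy to reproduce. The sketch below uses static range calibration as a simplified stand-in: it is not how GPTQ or AWQ use their calibration data, but it shows the same dependency on the calibration distribution. All values are made up.

```python
def calibrate_scale(samples, bits=8):
    """Pick one scale from the largest magnitude seen during calibration."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(x) for x in samples) / qmax


def quantize_with_scale(values, scale, bits=8):
    """Quantize with a fixed scale; anything beyond the calibrated range clips."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Illustrative values: generic text produces a narrow range, domain traffic a wide one.
generic_calibration = [0.1, -0.4, 0.3, 0.5, -0.2]
domain_calibration = [0.3, -2.8, 1.2, 4.0, -0.5]
production_traffic = [0.2, -3.0, 1.5, 3.9, -0.3]

err_generic = mean_abs_error(
    production_traffic,
    quantize_with_scale(production_traffic, calibrate_scale(generic_calibration)),
)
err_matched = mean_abs_error(
    production_traffic,
    quantize_with_scale(production_traffic, calibrate_scale(domain_calibration)),
)

print(f"calibrated on generic data: {err_generic:.4f}")
print(f"calibrated on domain data:  {err_matched:.4f}")
```

The generic calibration set never saw values beyond 0.5, so production values of 3.0 or more are clipped. A benchmark drawn from the same generic distribution would not reveal the problem.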
Portability across hardware
A method that runs beautifully on your development GPU may have no optimized kernel on your deployment hardware. Choosing a method without confirming kernel support on the target is how teams end up with a quantized model that is slower than full precision. Always confirm the format runs accelerated on the hardware you will actually deploy to, not just the one you developed on.
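Confirming acceleration means timing both versions on the target machine. Below is a minimal harness, assuming each model is wrapped in a zero-argument callable. The two `fake_*` functions are hypothetical stand-ins so the sketch runs anywhere; replace them with real inference calls.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, runs=20):
    """Median wall-clock latency of `fn` in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization before measuring
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Hypothetical stand-ins for "run one request through the model".
def fake_full_precision():
    return sum(i * i for i in range(20000))


def fake_quantized():
    return sum(i * i for i in range(10000))


baseline_ms = median_latency_ms(fake_full_precision)
quantized_ms = median_latency_ms(fake_quantized)

print(f"full precision: {baseline_ms:.3f} ms")
print(f"quantized:      {quantized_ms:.3f} ms")
if quantized_ms > baseline_ms:
    print("Quantized path is slower here: check kernel support on this hardware.")
```

On a GPU, work is usually launched asynchronously, so the callable must wait for the device to finish before returning or the timings will be meaningless.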
Frequently Asked Questions
Is 4-bit always worse than 8-bit?
In accuracy, usually, though often by a margin too small to notice on real tasks. Modern methods like GPTQ and AWQ recover most of the gap at 4-bit. The real question is whether your evaluation set can detect the difference. If it cannot, take the extra memory savings.
Does quantization make inference faster, or just smaller?
Both, but the speedup depends on what you quantize. Weight-only methods mostly save memory and bandwidth, with modest latency gains. Weight-and-activation integer quantization unlocks faster arithmetic, but only if your hardware and kernels support it.
Do I need the original training data?
Not for PTQ. You need a small calibration set, often a few hundred representative sequences, which can come from any in-domain text. You only need real training data for QAT, where the model is actually retrained against the rounding error.
Can I quantize any model, or only LLMs?
Any neural network can be quantized. The named methods in this article target transformers, but vision models, speech models, and small classifiers all benefit. Smaller models often tolerate aggressive quantization worse than large ones because there is less redundancy to absorb the rounding error, so always measure rather than assume.
Which method should a beginner start with?
bitsandbytes 8-bit. It is one flag, needs no calibration, and rarely hurts accuracy. It teaches you the workflow before you invest in calibration sets or training runs. The beginner's guide starts there.
Key Takeaways
- The first decision is PTQ versus QAT: start with the cheaper PTQ and escalate to QAT only when accuracy demands it.
- Compare methods across five axes, not one number: bit width, what gets quantized, granularity, calibration cost, and hardware support.
- GPTQ and AWQ cover 4-bit GPU serving; bitsandbytes is the low-friction default; GGUF owns CPU and edge.
- Apply the decision rule top-down and stop at the first match instead of jumping to the most aggressive option.
- Always validate against your own evaluation set, because "near-lossless" means nothing until your tasks confirm it.