Lost Three Points of Accuracy? Quantization Advice With Reasons

Agency Script Editorial

Editorial Team

September 8, 2025 · 7 min read

There is a lot of generic quantization advice floating around that amounts to "use a good method and test your results." Useful as far as it goes, and useless when you are staring at a model that lost three points of accuracy and you do not know why. This guide is the opposite: specific, opinionated practices with the reasoning attached, so you can adapt them rather than cargo-cult them.

These come from patterns that recur among teams who ship quantized models in production, and among those who get burned. Where a practice is genuinely contested, we say so and tell you which side we land on and why.

Treat the list as defaults to deviate from deliberately, not commandments. The value of a default is that it frees your attention for the decisions that genuinely vary by situation. When you do deviate, do it because your constraint demands it, not because a default felt arbitrary.

Default To 4-Bit, Not Lower

The best quality-per-byte for most workloads sits at 4-bit with a modern method. Going to 3-bit or 2-bit saves memory, but the quality cliff is steep without quantization-aware training (QAT), and most teams do not have the budget for QAT.

The reasoning: weight distributions in transformers are roughly bell-shaped, and 4-bit formats like NF4 are tuned to that shape. Below 4-bit, you run out of buckets to represent the meaningful spread of values, and outliers get crushed. Start at 4-bit, prove you need lower, and only then pay the QAT tax.
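
To see the "running out of buckets" effect in numbers, here is a minimal sketch. It assumes plain symmetric uniform quantization over synthetic bell-shaped weights, which is simpler than NF4 or any production method; the helper names `quantize_uniform` and `mse` are ours, for illustration only.

```python
import random

def quantize_uniform(values, bits):
    """Symmetric uniform quantization to 2**bits levels over the absmax range."""
    levels = 2 ** bits
    max_abs = max(abs(v) for v in values)
    step = 2 * max_abs / (levels - 1)
    out = []
    for v in values:
        idx = round((v + max_abs) / step)  # bucket index in 0..levels-1
        out.append(idx * step - max_abs)   # map the bucket back to a value
    return out

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Synthetic stand-in for a weight tensor: roughly bell-shaped values.
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

errors = {bits: mse(weights, quantize_uniform(weights, bits)) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: {2 ** bits:3d} buckets, mse={err:.5f}")
```

Each bit you drop halves the bucket count, so the error grows fastest at the low end. Real methods soften this with better bucket placement and per-group scales, but the shape of the curve is why 4-bit is the default.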

If you are still building intuition for why bit width matters, The Complete Guide explains the precision formats.

Invest More In Calibration Than In Method Selection

People agonize over GPTQ versus AWQ and then calibrate on whatever dataset shipped with the tool. That is backward. The method matters, but in-domain calibration data usually moves quality more than swapping methods.

  • Use 128 to 512 samples that mirror real production inputs.
  • Cover the variety of your traffic: different lengths, topics, and formats.
  • Refresh calibration data when your usage patterns shift meaningfully.
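
One way to act on the list above is to sample calibration prompts by stratum instead of uniformly at random. The sketch below stratifies by prompt length only; the thresholds, the function names, and the synthetic `production_prompts` are all assumptions for illustration, and real traffic would also need topic and format coverage.

```python
import random
from collections import defaultdict

def length_bucket(text, edges=(200, 1000)):
    """Label a prompt short/medium/long by character count (thresholds are assumptions)."""
    n = len(text)
    if n < edges[0]:
        return "short"
    if n < edges[1]:
        return "medium"
    return "long"

def build_calibration_set(prompts, n_samples=256, seed=0):
    """Sample roughly evenly across length buckets so no bucket dominates."""
    if not prompts:
        return []
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for p in prompts:
        buckets[length_bucket(p)].append(p)
    per_bucket = n_samples // len(buckets)
    picked = []
    for items in buckets.values():
        rng.shuffle(items)
        picked.extend(items[:per_bucket])
    # Top up from unused prompts when a bucket had fewer items than its share.
    chosen = set(picked)
    leftovers = [p for p in prompts if p not in chosen]
    rng.shuffle(leftovers)
    picked.extend(leftovers[: n_samples - len(picked)])
    return picked

# Synthetic stand-in for production logs: mostly short prompts, few long ones.
production_prompts = (
    [f"short-{i} " + "x" * 50 for i in range(900)]
    + [f"medium-{i} " + "x" * 500 for i in range(80)]
    + [f"long-{i} " + "x" * 2000 for i in range(20)]
)
calibration = build_calibration_set(production_prompts, n_samples=128)
counts = defaultdict(int)
for p in calibration:
    counts[length_bucket(p)] += 1
print(dict(counts))
```

A uniform sample of this traffic would contain only a handful of long prompts. Stratifying keeps every long prompt available, which matters because long inputs are often where quantized models slip first.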

The common mistakes guide shows how badly generic calibration hurts.

Protect The Outliers Deliberately

A small fraction of weights and activations carry outsized importance, and naive quantization destroys them. Prefer methods that handle outliers on purpose.

  • AWQ scales salient channels to preserve their precision.
  • SmoothQuant moves activation difficulty onto weights.
  • GPTQ quantizes layer by layer, adjusting the weights it has not yet quantized to compensate for the error introduced so far.

Picking a method is partly picking an outlier strategy. Do not treat them as interchangeable. The practical implication is that when a quantized model degrades unexpectedly, the outlier handling is one of the first things to suspect: a method that crushes salient channels will fail on exactly the inputs that depend on them, which is often the most important traffic you have.
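
A toy example makes the failure mode concrete. This sketch is not AWQ, SmoothQuant, or GPTQ; it assumes plain round-to-nearest quantization with one shared scale, and "protects" the outlier in the crudest possible way, by leaving it out of the quantized group. The values are made up.

```python
def absmax_quantize(values, bits=4):
    """Symmetric round-to-nearest quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

small = [0.01 * i for i in range(-20, 21)]  # 41 ordinary weights in [-0.2, 0.2]
outlier = 4.0                               # one large weight in the same group

# Naive: the outlier sets the scale, so the ordinary weights share coarse buckets.
naive = absmax_quantize(small + [outlier])
naive_err = mean_abs_error(small, naive[:-1])

# Protected: hold the outlier out at full precision, quantize the rest alone.
protected = absmax_quantize(small)
protected_err = mean_abs_error(small, protected)

print(f"naive error on ordinary weights:     {naive_err:.4f}")
print(f"protected error on ordinary weights: {protected_err:.4f}")
```

In the naive case every ordinary weight rounds to zero, because the single outlier stretches the scale until the whole group falls inside one bucket. The methods listed above avoid this in smarter ways, but they are all answers to this same problem.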

Evaluate On Tasks, Then On Hard Tasks

Run your real downstream evaluation, and then specifically stress the capabilities quantization damages first.

What To Stress

  • Multi-step reasoning and chained logic.
  • Precise instruction-following and format adherence.
  • Long-context retrieval and consistency.
  • Rare or edge-case inputs.

These degrade before fluency does, so a model can sound perfect while reasoning worse. The examples article shows where this bites in practice.
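
A per-capability comparison catches what an aggregate score hides. The sketch below assumes you already have accuracy per category for both models; the category names, the scores, and the two-point threshold are placeholders, not measurements.

```python
def find_regressions(baseline, quantized, max_drop=0.02):
    """Return categories whose accuracy dropped by more than max_drop."""
    flagged = {}
    for category, base_acc in baseline.items():
        drop = base_acc - quantized[category]
        if drop > max_drop:
            flagged[category] = round(drop, 4)
    return flagged

# Placeholder scores for illustration only.
baseline = {
    "fluency": 0.97,
    "multi_step_reasoning": 0.81,
    "format_adherence": 0.92,
    "long_context": 0.78,
}
quantized = {
    "fluency": 0.97,
    "multi_step_reasoning": 0.74,
    "format_adherence": 0.91,
    "long_context": 0.72,
}

regressions = find_regressions(baseline, quantized)
print(regressions)
```

With these placeholder numbers, fluency is unchanged while reasoning and long-context both fall past the threshold, which is the pattern to watch for.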

Match Group Size To Your Quality Budget

Group size controls how many weights share a scale factor. Smaller groups track local weight ranges more closely, which improves quality, but each group stores its own scale, so files grow slightly.

  • Group size 128 is the sensible default for most 4-bit work.
  • Smaller groups (64) when quality matters more than the last bit of size.
  • Larger groups only when storage is genuinely constrained and you have verified the quality holds.

This is a real, tunable knob most people leave at default without thinking. Think about it.
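
The size cost is simple arithmetic. This sketch assumes one 16-bit scale per group and no stored zero-point, which will not match every format exactly; the 7-billion-parameter model is a hypothetical.

```python
def effective_bits(weight_bits=4, group_size=128, scale_bits=16):
    """Bits per weight once the per-group scale is amortized over the group."""
    return weight_bits + scale_bits / group_size

def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7_000_000_000  # hypothetical 7B-parameter model
for group_size in (128, 64, 32):
    bits = effective_bits(4, group_size)
    print(f"group size {group_size:3d}: {bits:.3f} bits/weight, "
          f"{model_size_gb(n_params, bits):.2f} GB")
```

Under these assumptions, group size 128 works out to 4.125 bits per weight and 64 to 4.25, so halving the group size costs about 3% more storage. Whether the quality gain justifies that is what your evaluation is for.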

Quantize Per Deployment, Not Once Globally

A common anti-pattern is producing one quantized artifact and forcing it onto every runtime. Different backends want different formats.

  • GGUF k-quants for CPU and llama.cpp deployments.
  • GPTQ or AWQ for GPU serving with the matching kernels.
  • INT8 where hardware integer support is strong and quality is paramount.

The conversion is cheap relative to the cost of a format that runs poorly. The tooling guide maps formats to runtimes.
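
In a build pipeline, "per deployment" can be as simple as a lookup that refuses unknown targets. The target keys and the function name below are hypothetical; the format choices just restate the list above.

```python
# Hypothetical target names mapped to the formats recommended above.
FORMAT_BY_TARGET = {
    "cpu_llamacpp": "GGUF k-quant",
    "gpu_serving": "GPTQ or AWQ",
    "int8_hardware": "INT8",
}

def plan_artifacts(targets):
    """Return one format per deployment target, failing loudly on unknown targets."""
    unknown = [t for t in targets if t not in FORMAT_BY_TARGET]
    if unknown:
        raise ValueError(f"no quantization format chosen for: {unknown}")
    return {t: FORMAT_BY_TARGET[t] for t in targets}

print(plan_artifacts(["cpu_llamacpp", "gpu_serving"]))
```

Failing on an unknown target is deliberate: it forces a format decision for each new runtime instead of silently reusing whichever artifact already exists.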

Keep The Original And Ship Behind A Flag

Never quantize destructively. Archive the full-precision weights, deploy the quantized model behind a flag, and run it alongside the original on a fraction of traffic before full cutover.

The reasoning is risk asymmetry: the cost of keeping a few gigabytes of weights is trivial, while the cost of an unrecoverable production regression is severe. Make rollback a one-line operation, not a re-quantization scramble.
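
A minimal sketch of the flag, assuming requests carry a stable string ID. The class and method names are ours, and a real system would read the fraction from a config service instead of holding it in memory.

```python
import hashlib

class QuantizedRollout:
    """Routes a fixed fraction of requests to the quantized model."""

    def __init__(self, fraction=0.05):
        self.fraction = fraction  # share of traffic served by the quantized model

    def use_quantized(self, request_id):
        # Hash the ID so the same request always lands on the same side.
        digest = hashlib.sha256(request_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
        return bucket < self.fraction

    def rollback(self):
        # The one-line operation: send all traffic back to the original weights.
        self.fraction = 0.0

rollout = QuantizedRollout(fraction=0.10)
ids = [f"req-{i}" for i in range(10_000)]
share = sum(rollout.use_quantized(r) for r in ids) / len(ids)
print(f"quantized share: {share:.3f}")

rollout.rollback()
print(any(rollout.use_quantized(r) for r in ids))
```

Hashing instead of random sampling keeps routing deterministic, so a regression report can be traced back to which model actually served the request.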

Consider Mixed Precision Before Going Lower

When 4-bit damages one specific capability but you cannot afford full precision everywhere, do not jump to a global lower bit width. Reach for mixed precision instead.

Most of a model's memory sits in the bulk of its layers, so keeping a small number of sensitive layers at higher precision costs little space while preserving the fragile behavior. The reasoning is that quantization damage is not evenly distributed: certain layers and certain capabilities break first. Spending your quality budget precisely on those is far more efficient than uniformly degrading everything.

The practice: identify the layers or capabilities that regress under uniform quantization, protect just those, and quantize the rest aggressively. This often beats both uniform 4-bit and uniform higher precision on the quality-per-byte curve.
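
One way to sketch that practice is a greedy plan: rank layers by sensitivity per extra byte and protect them until the budget runs out. The layer names, parameter counts, and sensitivity scores below are hypothetical; in practice sensitivity might come from measuring the accuracy drop when only that layer is quantized.

```python
def plan_mixed_precision(layers, extra_bytes_budget, low_bits=4, high_bits=8):
    """layers: list of (name, n_params, sensitivity). Returns {name: bits}."""
    def extra_cost(n_params):
        # Additional bytes needed to hold a layer at high_bits instead of low_bits.
        return n_params * (high_bits - low_bits) / 8

    plan = {name: low_bits for name, _, _ in layers}
    remaining = extra_bytes_budget
    # Protect the layers that buy the most sensitivity per extra byte first.
    ranked = sorted(layers, key=lambda l: l[2] / extra_cost(l[1]), reverse=True)
    for name, n_params, sensitivity in ranked:
        cost = extra_cost(n_params)
        if sensitivity > 0 and cost <= remaining:
            plan[name] = high_bits
            remaining -= cost
    return plan

def average_bits(layers, plan):
    """Parameter-weighted average bit width across all layers."""
    total = sum(n for _, n, _ in layers)
    return sum(n * plan[name] for name, n, _ in layers) / total

# Hypothetical layers: (name, parameter count, sensitivity score).
layers = [
    ("attn.0", 1_000_000, 0.90),
    ("mlp.0", 4_000_000, 0.10),
    ("attn.1", 1_000_000, 0.05),
    ("mlp.1", 4_000_000, 0.60),
    ("lm_head", 2_000_000, 0.80),
]
plan = plan_mixed_precision(layers, extra_bytes_budget=1_500_000)
print(plan)
print(f"average bits per weight: {average_bits(layers, plan):.2f}")
```

With these made-up numbers, the plan keeps `attn.0` and `lm_head` at 8-bit and everything else at 4-bit, for an average of 5.0 bits per weight. Greedy selection is not optimal in general, but it is easy to audit and a reasonable first pass.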

Treat Quantization As A Repeatable Process, Not An Event

The teams that quantize well do not treat each model as a one-off science experiment. They have a process they run every time, which is why their results are consistent.

  • Record the exact settings (bit width, group size, method, calibration set) so every result is reproducible.
  • Re-run the full process on triggers: a base-model update, a traffic shift, or a hardware change.
  • Document why a given configuration won, so the next person does not relitigate the decision.

Codifying the process is itself a best practice. The framework and checklist exist precisely to make this repeatable rather than improvised.
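
Recording the settings can be as light as a small manifest written next to each artifact. This is a sketch with hypothetical field names and placeholder values, not the schema of any particular tool.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class QuantizationRecord:
    base_model: str
    method: str
    bits: int
    group_size: int
    calibration_sha256: str
    rationale: str

def fingerprint_calibration(samples):
    """Hash the calibration samples so the exact set can be verified later."""
    h = hashlib.sha256()
    for s in samples:
        h.update(s.encode("utf-8"))
        h.update(b"\x00")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return h.hexdigest()

record = QuantizationRecord(
    base_model="example-base-v1",  # hypothetical model name
    method="awq",
    bits=4,
    group_size=128,
    calibration_sha256=fingerprint_calibration(["sample prompt one", "sample prompt two"]),
    rationale="Placeholder: note here why this configuration won.",
)
manifest = json.dumps(asdict(record), indent=2, sort_keys=True)
print(manifest)
```

The hash does not store the calibration data itself; it only proves which set was used. Archive the samples separately if you need to reproduce the run.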

Frequently Asked Questions

Is AWQ or GPTQ the better default?

Both are excellent at 4-bit, and the gap is usually smaller than the gap from good versus bad calibration data. AWQ tends to be slightly more robust on diverse inputs, while GPTQ has very broad tooling support. Pick based on your runtime support and calibration quality rather than method reputation.

How small can I safely go without quantization-aware training?

4-bit is the practical floor for post-training quantization on most strong models with acceptable quality loss. Below that, you generally need QAT to stay usable. Treat 4-bit as the default and prove you need lower before going there.

Does a smaller group size always improve quality?

It generally improves quality at the cost of slightly larger files and a small speed overhead. The gains diminish, so going below 64 rarely justifies the cost. Group size 128 is a strong default and 64 is the usual upgrade when quality matters.

How often should I re-quantize?

Re-quantize when you update the base model, when your traffic patterns shift enough that your calibration data no longer represents production, or when you change hardware in a way that favors a different format. Otherwise a quantized artifact is stable.
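
Those triggers are easy to check automatically. The sketch below assumes you track the base model, the hardware, and a rough category mix for both calibration data and live traffic; the dictionary keys, the drift threshold of 0.2, and the example values are all assumptions.

```python
def traffic_drift(calibration_mix, production_mix):
    """Total variation distance between two category-share dicts (0 = same, 1 = disjoint)."""
    keys = set(calibration_mix) | set(production_mix)
    return 0.5 * sum(
        abs(calibration_mix.get(k, 0.0) - production_mix.get(k, 0.0)) for k in keys
    )

def requantize_reasons(deployed, current, drift_score, drift_threshold=0.2):
    """List which re-quantization triggers have fired, if any."""
    reasons = []
    if deployed["base_model"] != current["base_model"]:
        reasons.append("base model updated")
    if deployed["hardware"] != current["hardware"]:
        reasons.append("hardware changed")
    if drift_score > drift_threshold:
        reasons.append("traffic drifted from calibration data")
    return reasons

# Hypothetical state: same model and hardware, but a new traffic category appeared.
deployed = {"base_model": "example-base-v1", "hardware": "gpu-a"}
current = {"base_model": "example-base-v1", "hardware": "gpu-a"}
drift = traffic_drift(
    {"support": 0.7, "code": 0.3},
    {"support": 0.4, "code": 0.3, "legal": 0.3},
)
print(f"drift: {drift:.2f}")
print(requantize_reasons(deployed, current, drift))
```

An empty list means the artifact is still good. Anything else is a prompt to re-run the full process, not just the quantization step.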

Can I combine quantization with other compression?

Yes. Quantization stacks with distillation and pruning, since they target different things: precision, knowledge transfer, and weight removal respectively. Combine carefully and evaluate at each step, because the quality losses can compound.

Key Takeaways

  • Default to 4-bit; only go lower with a quality budget and QAT.
  • In-domain calibration data usually beats agonizing over method choice.
  • Pick methods for their outlier strategy, not interchangeably.
  • Tune group size as a real knob, with 128 as the default.
  • Quantize per deployment target, keep the original, and ship behind a flag with easy rollback.
