Run This List Top to Bottom Before You Quantize

This is a working checklist, not a reading exercise. Run it top to bottom every time you quantize a model you intend to deploy, and you will avoid the failures that catch teams who improvise. Each item includes a one-line justification so you know why it earns its place and can adapt it when your situation genuinely differs.

The checklist is organized by phase: before you quantize, calibration data, the conversion itself, before you ship, and after deployment. Skipping the "before you ship" section is where most production regressions are born, so do not treat it as optional.

If any item is unfamiliar, the linked guides explain the underlying concept. Otherwise, just work the list.

A word on why a checklist beats memory here: quantization failures are silent. A skipped step does not throw an error — it ships a model that quietly performs worse, and you discover it from user complaints weeks later. A checklist converts that delayed, expensive feedback into an immediate, cheap one. That is the entire value proposition, so resist the urge to "just remember" the steps.

Before You Quantize

Get these decided before touching any code.

[ ] Define a quality budget. Set a concrete threshold, like "no more than two points lost on our benchmark." Without a number, you cannot judge success. See the step-by-step how-to.
[ ] Identify the real bottleneck. Memory, throughput, or cost — quantize for the constraint you actually have, not a generic one.
[ ] Choose the target precision. Default to 4-bit; justify any move lower with the budget and a plan for quantization-aware training.
[ ] Pick the format from the deployment target backward. GGUF for CPU/llama.cpp, GPTQ or AWQ for GPU serving, INT8 where integer kernels are strong.
[ ] Confirm kernel support. Verify your runtime has optimized kernels for the chosen format, or you risk a slower-than-FP16 result.
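
When memory is the bottleneck, a back-of-the-envelope estimate helps make the precision choice concrete. The sketch below is illustrative only: the function name and the 7-billion-parameter figure are assumptions, and it counts weight storage alone, ignoring the KV cache, activations, and per-group scale overhead.

```python
def weight_memory_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7_000_000_000  # hypothetical 7B-parameter model
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(params, bits):.1f} GB")
```

If the 4-bit figure already fits your memory target with headroom, going lower buys little and spends quality budget you may need elsewhere.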

Prepare Calibration Data

The single highest-leverage prep step for post-training quantization.

[ ] Collect 128 to 512 in-domain samples. They should mirror real production inputs, not generic web text.
[ ] Cover real variety. Different lengths, topics, and formats so the value distribution reflects actual traffic.
[ ] Exclude duplicates and giant single documents. Variety beats volume for measuring distributions.

The common mistakes guide explains why generic calibration quietly costs you accuracy.
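
The three calibration items above can be sketched as a small selection routine. This is a minimal sketch, not a prescribed pipeline: `build_calibration_set`, the `(category, text)` input shape, and the `max_chars` cutoff are all hypothetical choices, and real deduplication may need fuzzier matching than normalized exact text.

```python
import random
from collections import defaultdict


def build_calibration_set(samples, target_size=256, max_chars=8000, seed=0):
    """Pick a deduplicated, variety-balanced calibration set.

    samples: iterable of (category, text) pairs drawn from production inputs.
    """
    rng = random.Random(seed)
    seen = set()
    by_category = defaultdict(list)
    for category, text in samples:
        key = " ".join(text.lower().split())  # normalize case and whitespace
        if not key or key in seen or len(text) > max_chars:
            continue  # skip empties, duplicates, and giant single documents
        seen.add(key)
        by_category[category].append(text)

    for texts in by_category.values():
        rng.shuffle(texts)

    # Round-robin across categories so no single topic dominates the set.
    picked = []
    while len(picked) < target_size and any(by_category.values()):
        for category in sorted(by_category):
            if by_category[category] and len(picked) < target_size:
                picked.append(by_category[category].pop())
    return picked


logs = [("billing", "Why was I charged twice?"),
        ("billing", "why was I charged  twice?"),  # near-duplicate, dropped
        ("shipping", "Where is my order?")]
print(build_calibration_set(logs, target_size=2))
```

Fixing the seed keeps the selection repeatable, which matters when you later need to reproduce a quantized artifact.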

During The Conversion

Execute the quantization deliberately.

[ ] Set group size on purpose. Use 128 as the default, 64 when quality matters more than file size.
[ ] Use an outlier-aware method. AWQ, GPTQ, or SmoothQuant protect the salient weights naive rounding destroys.
[ ] Save in the target serving format. Avoid a second conversion that can introduce errors.
[ ] Record the exact settings. Bit width, group size, method, and calibration set, so the result is reproducible.
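
One way to record the settings, and to see what the group-size choice costs, is a small settings record. This is a sketch under stated assumptions: `QuantSettings` and `hash_calibration_set` are hypothetical names, and the overhead arithmetic assumes one 16-bit scale per group with no zero points, which real formats may not match.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class QuantSettings:
    method: str              # e.g. "awq" or "gptq"
    bits: int
    group_size: int
    calibration_sha256: str
    scale_bits: int = 16     # assumption: one FP16 scale stored per group

    def effective_bits_per_weight(self) -> float:
        """Weight bits plus the per-group scale amortized over the group."""
        return self.bits + self.scale_bits / self.group_size


def hash_calibration_set(texts) -> str:
    """Order-sensitive fingerprint of the calibration samples."""
    digest = hashlib.sha256()
    for text in texts:
        digest.update(text.encode("utf-8"))
        digest.update(b"\x00")  # separator so sample boundaries matter
    return digest.hexdigest()


calib = ["example prompt one", "example prompt two"]  # placeholder samples
for group_size in (128, 64):
    settings = QuantSettings("awq", 4, group_size, hash_calibration_set(calib))
    print(json.dumps(asdict(settings), sort_keys=True))
    print(f"group {group_size}: {settings.effective_bits_per_weight():.3f} bits/weight")
```

Under these assumptions, halving the group size from 128 to 64 raises storage from 4.125 to 4.25 bits per weight, which is the file-size cost the group-size item refers to.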

Before You Ship

The section that prevents production regressions.

[ ] Run task-level evaluation. Score your real downstream tasks against the full-precision baseline, not just perplexity.
[ ] Stress the fragile capabilities. Multi-step reasoning, precise instruction-following, long-context retrieval, and edge cases degrade first.
[ ] Benchmark on production-identical hardware. Memory and throughput differ across chips; dev-box numbers are not predictive.
[ ] Confirm you hit the bottleneck target. Verify that the memory or speed win you quantized for actually materialized.
[ ] Archive the full-precision weights. Storage is cheap; an unrecoverable regression is not.
[ ] Deploy behind a flag. Route a slice of traffic and compare live before full cutover.
[ ] Keep rollback to one operation. Switching back should not require re-quantizing under pressure.
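
The task-level evaluation item becomes a pass/fail gate once the quality budget is a number. The sketch below assumes per-task scores on a shared scale and a two-point budget; the function name, task names, and scores are invented for illustration.

```python
def check_quality_budget(baseline, quantized, max_drop=2.0):
    """Compare per-task scores and report every task that exceeds the budget.

    baseline, quantized: dicts mapping task name -> score on the same scale.
    Returns (passed, failures), where failures maps task -> measured drop.
    """
    missing = set(baseline) - set(quantized)
    if missing:
        raise ValueError(f"quantized run is missing tasks: {sorted(missing)}")
    failures = {}
    for task, base_score in baseline.items():
        drop = base_score - quantized[task]
        if drop > max_drop:
            failures[task] = round(drop, 2)
    return (not failures, failures)


# Hypothetical scores: the average looks fine, one fragile task does not.
baseline = {"support_triage": 91.0, "multi_step_reasoning": 74.5}
quantized = {"support_triage": 90.2, "multi_step_reasoning": 70.9}
passed, failures = check_quality_budget(baseline, quantized)
print(passed, failures)
```

Checking each task separately, rather than an average, keeps a strong task from masking a regression in a fragile one.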

After Deployment

Quantization is not done at cutover.

[ ] Monitor quality-sensitive metrics. Escalation rates, thumbs-down, and downstream errors reveal slow degradation.
[ ] Re-quantize on triggers. A base-model update, a traffic-pattern shift, or a hardware change warrants revisiting.
[ ] Document the decision. Record why this bit width and format won, so the next person does not relitigate it.
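
Monitoring for slow degradation can start as a simple comparison of a recent bad-outcome rate against the pre-cutover baseline. This is a rough sketch: the function name, the 20 percent relative margin, and the minimum sample count are assumptions to tune, and it is not a substitute for a proper statistical test.

```python
def quality_alert(baseline_rate, recent_bad, recent_total,
                  max_relative_increase=0.20, min_samples=500):
    """Return True when the recent bad-outcome rate exceeds the baseline
    rate by more than the allowed relative margin."""
    if recent_total < min_samples:
        return False  # too little traffic to judge either way
    recent_rate = recent_bad / recent_total
    return recent_rate > baseline_rate * (1 + max_relative_increase)


# Hypothetical numbers: 4.0% thumbs-down before cutover, 61 of 1200 after.
print(quality_alert(0.040, recent_bad=61, recent_total=1200))
```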

For the reasoning behind these defaults, the best practices guide explains each in depth, and the case study shows the checklist in action.

How To Use This Checklist In A Team

A checklist only works if it is actually run, so wire it into how your team operates rather than leaving it in a document nobody opens.

[ ] Make it a required step in your model-release process. Treat quantization the way you treat a deploy — gated, not improvised.
[ ] Assign an owner for each phase. Whoever prepares calibration data is accountable for it being in-domain; whoever ships owns the rollback plan.
[ ] Record the answers, not just the ticks. Capture the chosen bit width, group size, method, and the measured quality delta so the decision is auditable later.
[ ] Review failures against the list. When a quantized model underperforms, walk the list to find the skipped item; it is almost always there.

The Three Items You Can Never Skip

If you remember nothing else from this checklist, hold onto three items. They prevent the majority of real failures.

In-domain calibration data, because generic data quietly costs accuracy that never shows up in generic metrics such as perplexity.
Task-level evaluation against the baseline, because perplexity hides the degradation that actually hurts users.
A one-operation rollback path, because the cost of an unrecoverable regression dwarfs the cost of keeping the original weights.

Everything else on the list improves your odds, but these three are the difference between a controlled change and a gamble. The framework sequences them into a decision spine, and the common mistakes guide shows what happens when they are skipped.

Frequently Asked Questions

What is the most-skipped item on this list?

Defining a quality budget before quantizing. Teams jump to conversion and then have no objective way to decide whether the result is acceptable. A concrete threshold turns a subjective judgment into a pass/fail check.

Do I really need both perplexity and task evaluation?

Perplexity is a fast sanity check, but it misses degradation in reasoning and instruction-following. Task-level evaluation against your real workload is what actually protects users. Use perplexity to catch gross failures and task evaluation to catch subtle ones.
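
For reference, perplexity is the exponential of the mean negative log-likelihood per token, so the sanity check is cheap to compute once you have token log-probabilities from both models. The sketch below assumes natural-log probabilities and uses made-up values; how you extract log-probabilities depends on your runtime.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over tokens (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probabilities for the same text under each model.
baseline_lp = [-1.2, -0.8, -1.0]
quantized_lp = [-1.3, -0.9, -1.1]
ratio = perplexity(quantized_lp) / perplexity(baseline_lp)
print(f"perplexity ratio (quantized / baseline): {ratio:.3f}")
```

A ratio far above 1 signals a gross failure worth investigating before any task evaluation; a ratio near 1 does not prove the subtle capabilities survived.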

How important is the "deploy behind a flag" step?

Very. It lets you compare the quantized model against the original on live traffic and catch regressions the moment they appear, with an instant rollback. For any customer-facing model, treat it as mandatory rather than optional.
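
A flag of this kind can be as small as a deterministic percentage bucket. The sketch below is one possible shape, not a specific feature-flag product: `use_quantized` and the user-ID hashing scheme are assumptions. Setting the percentage to 0 sends everyone back to the original model, which keeps rollback to one operation.

```python
import hashlib


def use_quantized(user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into 0..99 and compare to the flag."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent


# Route roughly 10% of hypothetical users to the quantized model.
users = [f"user-{i}" for i in range(10_000)]
share = sum(use_quantized(u, 10) for u in users) / len(users)
print(f"share on quantized model: {share:.1%}")
```

Hashing the user ID rather than sampling randomly per request keeps each user on one model, so live comparisons are not blurred by users switching between variants.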

Can I shorten this checklist for low-stakes internal tools?

Yes. For throwaway or internal use you can skip the staged rollout and monitoring, but keep the quality budget, in-domain calibration, and a basic task evaluation. Those three deliver most of the value for the least effort.

When should I re-run the whole checklist?

Whenever you update the base model, change deployment hardware, or notice your traffic no longer matches your calibration data. A stable model on stable hardware does not need re-quantizing on a schedule.

Does this checklist apply to quantization-aware training too?

Mostly. The planning, evaluation, and deployment phases are identical. The calibration and conversion phases change because QAT bakes low precision into training rather than calibrating after the fact, so you prepare training data and a training run instead of a calibration set. The discipline of defining a budget and shipping reversibly stays exactly the same.

What is the cost of running the full checklist?

For most post-training quantization jobs, the checklist adds hours, not days — the bulk of the time is calibration data and evaluation setup, not the conversion itself. That cost is trivial against the alternative of shipping a silently degraded model to users and diagnosing it after the fact.

Key Takeaways

Define a quality budget and identify the real bottleneck before quantizing.
Default to 4-bit and pick the format from the deployment target backward.
In-domain calibration data is the highest-leverage prep step.
Verify with task evaluation and hard-case stress tests, not just perplexity.
Archive the original, ship behind a flag, and keep rollback to one operation.

Agency Script Editorial
