One Team's Real Decisions on Size and Quantization

This is the story of one project, told from situation to outcome, because the decisions around parameters and weights make the most sense as a narrative. The team is a composite drawn from common patterns, but the choices and trade-offs are the real ones teams face.

The goal is to show how the abstract ideas about model size, quantization, and fine-tuning translate into a sequence of concrete decisions under real constraints: a deadline, a budget, and hardware that was not a data center. Follow the arc and you will see how the principles connect.

By the end, the team shipped a working classifier that ran on modest hardware. How they got there is more instructive than the result.

The Situation

A mid-size operations team needed to automatically classify incoming documents into a dozen categories. The volume was high enough that manual sorting was a bottleneck, but the budget did not allow for an expensive hosted API at that scale, and the documents were sensitive enough that they preferred to run the model on their own hardware.

Their constraints were concrete:

A single mid-range GPU available, not a cluster.
A two-week window to ship something usable.
Sensitive data that should not leave their environment.

This ruled out the simplest path of calling a large hosted model and pushed them toward running an open model locally, which made parameter and weight decisions central.

The First Decision: Model Size

The instinct on the team was to grab the largest open model they could find, on the theory that classification accuracy would track parameter count. The lead pushed back and proposed starting small.

They built an evaluation set of 40 real documents with known correct categories before touching any model. Then they tested a 7B model against it. The result surprised them: with a clear prompt listing the categories and two examples, the small model classified correctly on the large majority of cases.
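A minimal sketch of such an evaluation harness, assuming the classifier can be wrapped as a function from document text to a category label. The `keyword_classifier` stand-in and the four sample cases are hypothetical and only illustrate the shape of the check; they are not the team's data or model.

```python
from typing import Callable, Sequence, Tuple


def accuracy(classify: Callable[[str], str],
             cases: Sequence[Tuple[str, str]]) -> float:
    """Return the fraction of (document, expected_label) cases classified correctly."""
    if not cases:
        raise ValueError("evaluation set is empty")
    correct = sum(1 for text, expected in cases if classify(text) == expected)
    return correct / len(cases)


def keyword_classifier(text: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "invoice" if "invoice" in text.lower() else "other"


# Hypothetical cases; a real set would hold actual documents with known labels.
EVAL_SET = [
    ("Invoice #1 for services", "invoice"),
    ("Meeting notes from Tuesday", "other"),
    ("Please pay this invoice", "invoice"),
    ("Contract renewal terms", "contract"),
]

score = accuracy(keyword_classifier, EVAL_SET)
print(f"accuracy: {score:.2f}")  # 3 of 4 correct -> 0.75
```

Because the harness only needs a callable, the same set can be re-run unchanged against every later variant (a different prompt, a quantized model, an adapter).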

The decision: Use the 7B model and improve it with prompting rather than reaching for size. This echoed the principle from the Best Practices guide of defaulting to the smallest model that works.

The Second Decision: Making It Fit

At native 16-bit precision, each of the 7B model's parameters takes 2 bytes, so the weights alone needed roughly 14 GB, plus memory for the context window. On their mid-range GPU, this was tight to the point of instability under load.
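The 14 GB figure follows from simple arithmetic: parameter count times bytes per parameter. A small sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting weights only; the function name is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes.

    Excludes the context window (KV cache), activations, and runtime overhead.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


fp16_gb = weight_memory_gb(7e9, 16)  # 7 billion parameters x 2 bytes
int8_gb = weight_memory_gb(7e9, 8)   # 7 billion parameters x 1 byte
print(f"16-bit: {fp16_gb:.1f} GB, 8-bit: {int8_gb:.1f} GB")
```

Real quantized formats usually store some extra metadata such as scaling factors, so actual file sizes tend to run slightly above this estimate.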

They quantized to 8-bit, which roughly halved the weight memory and gave them comfortable headroom. They re-ran the 40-case evaluation and confirmed the quality held; the classifier still landed the same categories it had at full precision.
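One way to express that re-check is to compare the predictions of the two variants document by document. This is a sketch under the assumption that predictions are stored as a mapping from a document ID to a label; the IDs and labels below are hypothetical.

```python
from typing import Dict, List


def disagreements(baseline: Dict[str, str], candidate: Dict[str, str]) -> List[str]:
    """Return the document IDs whose predicted category changed."""
    missing = baseline.keys() - candidate.keys()
    if missing:
        raise ValueError(f"candidate is missing predictions for: {sorted(missing)}")
    return sorted(
        doc_id for doc_id, label in baseline.items() if candidate[doc_id] != label
    )


# Hypothetical predictions from the 16-bit and 8-bit variants of the same model.
full_precision = {"doc-01": "invoice", "doc-02": "memo", "doc-03": "contract"}
eight_bit = {"doc-01": "invoice", "doc-02": "memo", "doc-03": "contract"}

changed = disagreements(full_precision, eight_bit)
print("changed documents:", changed)  # an empty list means quality held on this set
```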

They briefly considered 4-bit to free even more memory but decided against it. With 8-bit already fitting comfortably, dropping lower would only risk quality on the trickier documents for no real benefit. This is the deliberate-quantization principle from the How-To guide in action.
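The reasoning here amounts to picking the highest precision that fits with headroom, rather than the lowest available. A minimal sketch of that rule follows; the 16 GB GPU and 4 GB headroom are assumed values for illustration only, since the source does not state the team's actual hardware figures.

```python
from typing import Optional, Sequence


def pick_precision(num_params: float,
                   gpu_memory_gb: float,
                   headroom_gb: float,
                   candidates: Sequence[int] = (16, 8, 4)) -> Optional[int]:
    """Return the highest bit width whose weights fit while leaving headroom.

    Headroom stands in for the context window and runtime overhead.
    Returns None if no candidate fits.
    """
    for bits in sorted(candidates, reverse=True):
        weights_gb = num_params * bits / 8 / 1e9
        if weights_gb + headroom_gb <= gpu_memory_gb:
            return bits
    return None


# Assumed numbers, not the team's: a 16 GB GPU with 4 GB reserved as headroom.
chosen = pick_precision(num_params=7e9, gpu_memory_gb=16.0, headroom_gb=4.0)
print("chosen precision:", chosen, "bits")
```

With these assumed numbers, 16-bit does not fit (14 GB + 4 GB), 8-bit does (7 GB + 4 GB), and 4-bit is never considered because a higher precision already works.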

The Third Decision: To Fine-Tune or Not

A handful of document types were consistently misclassified, all involving niche internal terminology the base model had never seen. The team debated whether to fine-tune.

They first tried adding a short glossary of the niche terms and a few more examples to the prompt. This fixed most of the remaining errors. Only one stubborn category remained problematic.
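A sketch of how such a prompt might be assembled from a category list, a glossary, and worked examples. The layout, the category names, and the glossary entry are hypothetical; the source does not show the team's actual prompt.

```python
from typing import Dict, List, Sequence, Tuple


def build_prompt(categories: Sequence[str],
                 glossary: Dict[str, str],
                 examples: Sequence[Tuple[str, str]],
                 document: str) -> str:
    """Assemble a few-shot classification prompt as plain text."""
    lines: List[str] = ["Classify the document into exactly one of these categories:"]
    lines += [f"- {name}" for name in categories]
    if glossary:
        lines += ["", "Glossary of internal terms:"]
        lines += [f"- {term}: {meaning}" for term, meaning in glossary.items()]
    for text, label in examples:
        lines += ["", f"Document: {text}", f"Category: {label}"]
    # The prompt ends on an open "Category:" so the model completes the label.
    lines += ["", f"Document: {document}", "Category:"]
    return "\n".join(lines)


# Hypothetical inputs for illustration.
prompt = build_prompt(
    categories=["invoice", "memo", "contract"],
    glossary={"PO-7": "internal code for a purchase order amendment"},
    examples=[("Amount due within 30 days", "invoice"),
              ("Team offsite moved to Friday", "memo")],
    document="PO-7 attached for the signed agreement",
)
print(prompt)
```

Keeping the glossary as data rather than hard-coded text makes it easy to add a term, re-run the evaluation set, and see whether the change helped.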

The decision: Fine-tune with LoRA, but only narrowly. They assembled about 200 clean examples focused on the hard categories, froze the base weights, and trained a small adapter with a conservative learning rate. They measured against the original 40-case baseline before keeping it.

The adapter pushed accuracy past their threshold without harming the categories that already worked, which was evidence that it had not caused catastrophic forgetting. Had they fine-tuned before exhausting prompting, they would have spent far more effort on a problem that prompting mostly solved, which is exactly the pattern the Common Mistakes article warns about.
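Checking that an adapter did not harm working categories means comparing accuracy per category, not just overall. A minimal sketch, assuming predictions are stored per document ID; the cases and predictions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def per_category_accuracy(cases: Sequence[Tuple[str, str]],
                          predictions: Dict[str, str]) -> Dict[str, float]:
    """Accuracy for each expected category; cases are (doc_id, expected_label)."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for doc_id, expected in cases:
        totals[expected] += 1
        if predictions.get(doc_id) == expected:
            hits[expected] += 1
    return {category: hits[category] / totals[category] for category in totals}


def regressions(before: Dict[str, float], after: Dict[str, float],
                tolerance: float = 0.0) -> List[str]:
    """Categories whose accuracy dropped by more than the tolerance."""
    return sorted(c for c in before if after.get(c, 0.0) < before[c] - tolerance)


# Hypothetical baseline and adapter predictions on the same evaluation cases.
cases = [("d1", "invoice"), ("d2", "invoice"), ("d3", "memo"), ("d4", "memo")]
base_preds = {"d1": "invoice", "d2": "invoice", "d3": "memo", "d4": "invoice"}
adapter_preds = {"d1": "invoice", "d2": "invoice", "d3": "memo", "d4": "memo"}

before = per_category_accuracy(cases, base_preds)
after = per_category_accuracy(cases, adapter_preds)
dropped = regressions(before, after)
print("before:", before, "after:", after, "regressed:", dropped)
```

An overall accuracy number can rise while one category quietly gets worse, which is why the comparison is made category by category.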

The Execution

With the model chosen, quantized, and lightly fine-tuned, they wrapped it in a simple service. They documented the base model, the 8-bit precision, the LoRA dataset, and the learning rate, and they stored the checksum of the shipped weights so they could reproduce or roll back later.

They used safetensors throughout and verified every download against its published hash, treating the weights as supply-chain artifacts rather than inert files.
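The verification and record-keeping steps can be sketched with the Python standard library. This assumes the published hash is SHA-256, which the source does not specify; the file name, model name, dataset identifier, and learning rate below are placeholders, not the team's values.

```python
import hashlib
import hmac
import json
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, published_hex: str) -> bool:
    """True if the file's SHA-256 matches the published hex digest."""
    return hmac.compare_digest(sha256_of_file(path), published_hex.strip().lower())


def build_manifest(base_model: str, precision_bits: int, dataset_id: str,
                   learning_rate: float, weights_path: Path) -> dict:
    """Record what is needed to reproduce or roll back the shipped weights."""
    return {
        "base_model": base_model,
        "precision_bits": precision_bits,
        "lora_dataset": dataset_id,
        "learning_rate": learning_rate,
        "weights_sha256": sha256_of_file(weights_path),
    }


with tempfile.TemporaryDirectory() as tmp:
    payload = b"stand-in bytes, not real weights"
    weights = Path(tmp) / "adapter.safetensors"  # placeholder file name
    weights.write_bytes(payload)
    published = hashlib.sha256(payload).hexdigest()

    ok = verify_download(weights, published)
    tampered = verify_download(weights, "0" * 64)
    # All manifest values here are placeholders for illustration.
    manifest = build_manifest("example-7b", 8, "lora-dataset-v1", 1e-4, weights)
    print(json.dumps(manifest, indent=2))
```

Storing the manifest next to the weights, under version control, is what turns a rollback into a file swap plus a hash check.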

The Outcome

The classifier shipped inside the two-week window, ran entirely on their single GPU, and kept sensitive documents in their environment. Accuracy met their threshold across all categories, and inference was fast enough to clear the backlog.

The wins were concrete: no per-call API cost, data never left their hardware, and the small adapter could be swapped or reverted in minutes. The lessons generalized well beyond this one project.

Lessons Learned

Starting small and measuring first saved them from buying capacity they did not need.
Quantizing to 8-bit rather than 4-bit kept quality while solving the memory constraint.
Prompting and a glossary resolved most errors before any weights were touched.
Narrow LoRA fine-tuning fixed the last gap without breaking what already worked.
Documenting and versioning the weights made the result reproducible and reversible.

The Checklist captures this same sequence as a reusable working tool.

What They Would Do Differently

No project is perfect, and in the retrospective the team named two things they would change. First, they wished they had built a larger evaluation set with more deliberately chosen edge cases. The 40 documents were enough to make good decisions, but a handful of unusual formats slipped through and surfaced only after launch, requiring a small follow-up adjustment. A broader set would have caught them earlier.

Second, they initially spent half a day investigating gibberish output that turned out to be a tokenizer mismatch, not a problem with the weights at all. Had they run a single tiny test prompt immediately after loading, as the How-To guide recommends, they would have caught the configuration issue in minutes instead of hours. It was a reminder that many apparent "bad weight" problems are actually loading-configuration problems.
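A post-load smoke test can be as small as one prompt with a known answer. The sketch below assumes the loaded model is exposed as a `generate` function from prompt text to reply text; that wrapper, the fake models, and the category names are hypothetical.

```python
from typing import Callable, Iterable


def smoke_test(generate: Callable[[str], str],
               prompt: str,
               allowed: Iterable[str]) -> bool:
    """Return True if one tiny prompt yields a reply that is a known category.

    Gibberish from a tokenizer or configuration mismatch fails this check.
    """
    reply = generate(prompt).strip().lower()
    if not reply:
        return False
    return reply in {label.lower() for label in allowed}


# Hypothetical stand-ins for a correctly and an incorrectly configured model.
def healthy_model(prompt: str) -> str:
    return " Invoice\n"


def broken_model(prompt: str) -> str:
    return "\u2581\u2581##ing ]]>>"


categories = ["invoice", "memo", "contract"]
healthy_ok = smoke_test(healthy_model, "Document: Amount due.\nCategory:", categories)
broken_ok = smoke_test(broken_model, "Document: Amount due.\nCategory:", categories)
print("healthy:", healthy_ok, "broken:", broken_ok)
```

Running this immediately after loading separates configuration problems from model-quality problems before any longer evaluation starts.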

Neither issue changed the outcome, but both reinforced the same lesson the project taught throughout: measurement and verification early are cheaper than debugging late.

Frequently Asked Questions

Why not just use a large hosted model?

Cost and data sensitivity ruled it out. At their volume the per-call price was prohibitive, and the documents could not leave their environment. Running an open model locally addressed both, which is why parameter and weight decisions became central to the project.

How did they know the 7B model was enough?

They built an evaluation set of 40 real documents with known answers before choosing a model, then tested the small model against it. When it classified correctly on the large majority with good prompting, they had evidence rather than a guess. Measurement, not intuition, drove the decision.

Why stop at 8-bit instead of going to 4-bit?

Eight-bit already fit comfortably on their GPU with headroom, so dropping to 4-bit would only have risked quality on harder documents for no practical gain. The principle is to quantize to the highest precision that fits, not the lowest available.

What made their fine-tuning successful?

They exhausted prompting first, used a small clean dataset focused only on the hard categories, froze the base weights with LoRA, set a conservative learning rate, and measured against a baseline. That discipline fixed the remaining gap without causing catastrophic forgetting elsewhere.

Could another team reproduce this result?

Yes, because they documented everything: the base model, the 8-bit precision, the LoRA dataset, the learning rate, and the checksum of the shipped weights. That record is what makes a weight-based system reproducible and reversible rather than a one-off that no one can rebuild.

Key Takeaways

Constraints of budget, deadline, and data sensitivity pushed parameter and weight decisions to the center of the project.
An evaluation set built before model selection turned every choice into a measured experiment.
An 8-bit 7B model with good prompting handled most of the task on a single mid-range GPU.
Narrow LoRA fine-tuning closed the final gap without harming categories that already worked.
Documenting and versioning the weights made the shipped system reproducible and reversible.



Agency Script Editorial
