Run This Top to Bottom Before You Ship

This is a tool, not an essay. It is a checklist you can run top to bottom whenever you select, run, adapt, or ship a model, covering the decisions about parameters and weights that actually determine whether the project works. Each item includes a one-line justification so you know why it earns its place.

Work through the phases in order. Skipping items in the first four phases, especially building an evaluation set, causes the failures the later items are designed to catch. Copy this into your project notes and tick items off as you go.

The phases are: selection, verification, memory and precision, evaluation, fine-tuning, and deployment. Most projects touch all six.

Phase 1: Model Selection

[ ] Define the task and its difficulty. You cannot size a model for a task you have not described concretely.
[ ] Start with the smallest plausible model. Parameter count is a capacity ceiling, not a quality guarantee; oversizing wastes money and speed.
[ ] Check the model's training quality, not just its size. Data and compute decide how much of the parameter capacity is actually usable.
[ ] Confirm the model is appropriate for your domain. A general model may need adaptation a domain model would not.

The first two items prevent the single most expensive mistake in the field, covered in the Common Mistakes article.

Phase 2: Weight File Verification

[ ] Prefer safetensors format. It cannot execute code on load, unlike legacy pickle-based files.
[ ] Checksum the download against the published hash. Confirms the file is complete and untampered.
[ ] Only load pickle-based files from fully trusted sources. Those formats can run arbitrary code when loaded.
[ ] Record the source and exact version of the weights. You will need it for reproducibility and audits.

Treat every weight file as code from the internet. The Best Practices guide frames this as supply-chain hygiene.
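The checksum item above can be sketched with Python's standard `hashlib`. This is a minimal sketch, assuming the publisher lists a SHA-256 hash for the file; the function names are hypothetical and not part of any model library.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Weight files can be many gigabytes, so never read them in one call.
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the published one (assumed to be SHA-256)."""
    actual = sha256_of_file(path)
    # Normalize whitespace and case, since published hashes are often pasted by hand.
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

Run `verify_checksum` on the downloaded file before any loader touches it, and refuse to proceed when it returns `False`.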

Phase 3: Memory and Precision

[ ] Estimate weight memory. Parameters in billions times 2 gives approximate gigabytes at 16-bit; this is your starting budget.
[ ] Budget memory for the context window too. The context cache grows with input length and is easy to forget.
[ ] Run at native precision if it fits. The precision the weights were released in gives full quality with the simplest setup.
[ ] If it does not fit, quantize to 8-bit before 4-bit. Quantize to the largest precision that fits, not the smallest available.
[ ] Verify quality after quantizing, on hard cases. Quality loss often hides on easy cases and appears on edge cases.

The How-To guide walks through these memory and precision decisions in sequence.
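The Phase 3 items can be expressed as a small calculation. The sketch below assumes the native precision is 16-bit and uses the rule-of-thumb multipliers of 2, 1, and 0.5 bytes per parameter; the function names and the `context_headroom_gb` parameter are hypothetical, and real runtimes add overhead this ignores.

```python
from typing import Optional

# Approximate bytes per parameter at each precision (rule of thumb).
BYTES_PER_PARAM = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight memory in gigabytes for a model of the given size."""
    return params_billions * BYTES_PER_PARAM[precision]


def largest_precision_that_fits(
    params_billions: float,
    budget_gb: float,
    context_headroom_gb: float = 0.0,
) -> Optional[str]:
    """Return the largest precision whose weights plus context headroom fit the budget."""
    # Try from largest to smallest, so 8-bit is always considered before 4-bit.
    for precision in ("16-bit", "8-bit", "4-bit"):
        needed = weight_memory_gb(params_billions, precision) + context_headroom_gb
        if needed <= budget_gb:
            return precision
    return None  # Nothing fits: pick a smaller model or more memory.
```

For a 7-billion-parameter model with 2 GB reserved for context, a 24 GB budget returns `"16-bit"` and a 12 GB budget returns `"8-bit"`. Whatever it returns, the quality check on hard cases still applies.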

Phase 4: Evaluation

[ ] Build an evaluation set of 20 to 50 real cases. Without measurement, every later decision is a guess.
[ ] Include hard and edge cases. Those are where quantization and fine-tuning quietly break.
[ ] Record the base model's baseline performance. You cannot tell if a change helped without a before.
[ ] Re-run the eval after every change. Model, precision, or weight changes can all regress quality.

Your evaluation set is more valuable than any single model version. Maintain it like code.
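A baseline comparison can be as small as the sketch below. It assumes exact-match scoring and treats a model as any function from a prompt string to an output string; both are simplifying assumptions, and the names `run_eval` and `compare_to_baseline` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

EvalCase = Tuple[str, str]  # (input prompt, expected output)


def run_eval(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case and record pass or fail, keyed by prompt."""
    return {prompt: model(prompt).strip() == expected for prompt, expected in cases}


def compare_to_baseline(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> Dict[str, object]:
    """List cases the candidate broke or fixed relative to the baseline."""
    regressions = sorted(
        p for p, ok in baseline.items() if ok and not candidate.get(p, False)
    )
    fixes = sorted(
        p for p, ok in candidate.items() if ok and not baseline.get(p, False)
    )
    return {
        "regressions": regressions,
        "fixes": fixes,
        "baseline_pass": sum(baseline.values()),
        "candidate_pass": sum(candidate.values()),
    }
```

Reporting regressions separately from the pass count matters: a candidate can match the baseline's total while breaking a case that used to work.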

Phase 5: Fine-Tuning (Only If Needed)

[ ] Confirm prompting and retrieval cannot do the job first. Fine-tuning adds permanent cost; many fine-tunes should be prompts.
[ ] Assemble a small, clean, consistent dataset. A few hundred excellent examples beat thousands of noisy ones.
[ ] Use a parameter-efficient method like LoRA. It freezes the base weights, runs on modest hardware, and produces a swappable adapter.
[ ] Set a conservative learning rate. Aggressive rates can cause catastrophic forgetting of general ability.
[ ] Validate against your baseline before keeping it. Only ship a fine-tune that measurably wins without harming what worked.

This phase is a last resort, not a default, as illustrated in the Examples article.
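To see why a LoRA adapter runs on modest hardware, count what it trains. LoRA replaces the update to a frozen weight matrix with the product of two low-rank matrices, so each adapted matrix adds rank × (input width + output width) trainable parameters. The sketch below works that arithmetic; the 4096-wide layer and rank 8 in the usage note are illustrative values, not figures from this checklist.

```python
def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix (frozen under LoRA)."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of the matrix's parameter count that the adapter actually trains."""
    return lora_trainable_params(d_in, d_out, rank) / full_matrix_params(d_in, d_out)
```

For a hypothetical 4096 by 4096 projection at rank 8, the adapter trains 65,536 parameters against 16,777,216 frozen ones, which is under half a percent.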

Phase 6: Deployment and Maintenance

[ ] Version the shipped model with a clear name. Treat models like releases so you can roll back.
[ ] Store the checksum of what you shipped. Lets you verify later that you are running what you think you are.
[ ] Document base model, precision, dataset, and settings. Anything undocumented cannot be reproduced.
[ ] Re-evaluate when the base model updates. Upstream changes can regress your downstream behavior.
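The four deployment items fit naturally into one release record stored next to the model. The following is a minimal sketch using only the standard library; the `ModelRelease` fields and function names are hypothetical, so rename them to match whatever your project already tracks.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ModelRelease:
    """Everything needed to reproduce or roll back a shipped model."""
    name: str                 # clear release name, e.g. "support-bot-v3"
    base_model: str           # upstream model identifier
    base_model_version: str   # exact upstream version or revision
    precision: str            # e.g. "16-bit", "8-bit", "4-bit"
    weights_sha256: str       # checksum of the file actually shipped
    dataset: str = ""         # fine-tuning dataset, if any
    settings: dict = field(default_factory=dict)  # training or inference settings


def to_manifest(release: ModelRelease) -> str:
    """Serialize the release as stable, diff-friendly JSON."""
    return json.dumps(asdict(release), indent=2, sort_keys=True)


def matches_shipped(release: ModelRelease, running_sha256: str) -> bool:
    """Check that the weights currently running are the ones that were shipped."""
    return release.weights_sha256.lower() == running_sha256.strip().lower()
```

Committing the manifest alongside the evaluation set gives every release a name, a checksum, and a record of how it was produced.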

How to Use This Checklist

Run it linearly the first time through a project, then keep it open as a reference for every change afterward. The items are deliberately strict because the failure modes they prevent are common and expensive. If you only enforce three, make them: start small, build an evaluation set, and verify weight files before loading.

The checklist works best when you treat it as a gate rather than a suggestion. Do not advance to Phase 3 until Phase 2 is fully ticked, and do not enter Phase 5 until Phase 4 has produced a real baseline. Each unchecked item in an early phase tends to surface as a problem in a later one, usually at the worst possible time. A skipped checksum becomes a security incident; a skipped baseline becomes an argument about whether fine-tuning helped that no one can settle.
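If you track the checklist in code rather than in notes, the gate can be enforced mechanically. This is a sketch under the assumption that each phase is stored as a list of booleans, one per item; the function names are hypothetical.

```python
from typing import Dict, List, Optional


def first_incomplete_phase(checklist: Dict[int, List[bool]]) -> Optional[int]:
    """Return the lowest-numbered phase with an unticked item, or None if all are done."""
    for phase in sorted(checklist):
        if not all(checklist[phase]):
            return phase
    return None


def may_enter(phase: int, checklist: Dict[int, List[bool]]) -> bool:
    """A phase may start only when every item in every earlier phase is ticked."""
    return all(all(items) for p, items in checklist.items() if p < phase)
```

With Phase 2 missing one tick, `may_enter(3, checklist)` returns `False` until the outstanding item is done.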

Adapting the checklist to your project size

Not every project needs every phase at full depth, and the checklist scales down gracefully.

Quick experiments typically need Phases 1 through 3 and a lightweight version of Phase 4. You are confirming a model works on your task, not shipping it.
Internal tools add the full Phase 4 and a basic Phase 6, because you will maintain them and want reproducibility.
Production systems warrant every item, especially the deployment and re-evaluation steps in Phase 6, since upstream model updates can silently regress behavior you depend on.

The point is not bureaucracy. It is making sure that the small number of decisions that actually determine success (model size, precision, measurement, and verification) get made deliberately instead of by accident. The Framework article organizes these same items into named stages if you prefer reasoning in terms of a process rather than a list.

Frequently Asked Questions

Do I need every item on every project?

No. Phases 1 through 4 apply to virtually any project that runs a model. Phase 5 only applies if you fine-tune, and Phase 6 scales with how seriously you ship. Run the full list once, then keep the relevant phases handy for ongoing changes.

What is the single most important item?

Building an evaluation set in Phase 4. Without it, you cannot tell whether your model choice, your quantization level, or your fine-tuning helped or hurt. Every other decision on this checklist gets sharper once you can measure outcomes on real cases.

How do I estimate memory quickly?

Multiply parameters in billions by 2 for 16-bit precision, by 1 for 8-bit, and by about 0.5 for 4-bit. The result is the approximate weight memory in gigabytes; a 7-billion-parameter model needs roughly 14 GB at 16-bit. Then add headroom for the context window, which grows with input length and is the most commonly forgotten cost.
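The context headroom can be estimated too. The sketch below assumes a standard transformer that caches one key and one value vector per layer, per key-value head, per token; this checklist does not specify an architecture, so treat the formula and the parameter names as assumptions and read the real values from the model's configuration.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 2 bytes assumes a 16-bit cache
    batch_size: int = 1,
) -> int:
    """Approximate key-value cache size in bytes for a standard transformer."""
    # The leading 2 accounts for storing both keys and values.
    return (
        2 * num_layers * num_kv_heads * head_dim
        * context_tokens * bytes_per_value * batch_size
    )


def kv_cache_gb(*args, **kwargs) -> float:
    """Same estimate in gigabytes (10**9 bytes)."""
    return kv_cache_bytes(*args, **kwargs) / 1e9
```

The cost scales linearly with context length: for a hypothetical model with 32 layers, 32 key-value heads, and a head dimension of 128, 4,096 tokens of 16-bit cache come to about 2.1 GB, and doubling the context doubles it.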

When should I skip fine-tuning entirely?

Whenever good prompting and retrieval get you to your quality target. Fine-tuning adds data preparation, training, evaluation, and maintenance overhead. If the base model already does the job with the right instructions and context, skipping Phase 5 saves real effort with no downside.

Why checksum weight files?

A checksum confirms the file is complete and untampered, which matters because weight files are arbitrary binary data and legacy formats can execute code on load. Combined with using safetensors, checksumming turns weight handling into a safe, routine step instead of a security gamble.

Key Takeaways

Run the six phases in order: selection, verification, memory and precision, evaluation, fine-tuning, deployment.
Start with the smallest plausible model and verify every weight file before loading.
Quantize to the largest precision that fits and budget memory for the context window too.
Build an evaluation set before you quantize or fine-tune; it makes every other decision measurable.
Fine-tune only as a last resort, and version and document everything you ship.

Agency Script Editorial
