Every Unchecked Box Is a Risk You Just Accepted

This is a working checklist, not a reading list. Run it against any training data collection effort before you commit to training, and treat any unchecked item as a risk you are knowingly accepting. Each entry comes with a short justification so you understand why it earns its place, because a checklist you cannot reason about is just busywork.

It is organized by phase: before you collect, while you collect, after you collect, and before you train. Work through it in order. The items are deliberately concrete so you can mark them done or not done without arguing about interpretation.

Before You Collect

[ ] The target behavior is written in one sentence. If you cannot state what the model should do, you cannot know what data to gather. This is the anchor for every later decision.
[ ] Five to ten ideal input-output examples exist, written by hand. These become your specification and your first evaluation set. Skipping them means collecting blind.
[ ] Sources are mapped and ranked by quality and rights. First-party and licensed sources rank above scraped web data. Knowing your sources upfront prevents legal surprises.
[ ] A privacy and copyright review is done for each source. Checking rights before collecting is cheap; discovering a problem after training is expensive and sometimes irreversible.

If any of these is unchecked, stop. Collecting before this foundation is set is the most common cause of wasted effort, as our common mistakes article explains.

While You Collect

[ ] Provenance is logged for every batch at collection time. Record the source, date, and usage rights. Reconstructing this later is painful and often impossible.
[ ] Personal data is identified and handled with a lawful basis. Collecting personal data without one is a legal liability, not a quality issue you can fix later.
[ ] Collection is scoped to what the behavior needs. Resist gathering everything. Curated and conservative beats sprawling and noisy for almost every applied project.
[ ] For adversarial tasks, collection is continuous, not one-shot. Spam and fraud data go stale fast. A static dataset decays in changing environments.
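
As a rough illustration of the provenance item above, here is a minimal sketch that appends one record per batch to a JSON Lines file. The field names, the `BatchProvenance` class, and the `log_batch` helper are all assumptions made for this example, not a required schema; a spreadsheet with the same columns would serve equally well.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Optional


@dataclass
class BatchProvenance:
    """One record per collected batch. All field names are illustrative."""
    batch_id: str
    source: str                   # where the data came from
    collected_on: str             # ISO date of collection
    usage_rights: str             # license name or contract reference
    contains_personal_data: bool
    lawful_basis: Optional[str] = None  # expected when personal data is present


def log_batch(record: BatchProvenance, log_path: Path) -> None:
    """Append one provenance record to a JSON Lines file at collection time."""
    # Refuse to log a batch with personal data and no stated lawful basis,
    # so the gap is caught while collecting rather than after training.
    if record.contains_personal_data and not record.lawful_basis:
        raise ValueError(
            f"batch {record.batch_id}: personal data needs a lawful basis"
        )
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```

Appending at collection time, rather than reconstructing records later, is the point of the sketch; the storage format is incidental.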

After You Collect

[ ] Exact and near-duplicates are removed. Duplicates bias the model and waste compute. Near-duplicate detection matters as much as exact-match removal.
[ ] Low-quality and harmful content is filtered out. Junk in the dataset becomes junk in the model. Filtering raises the average signal.
[ ] Formatting and encoding are normalized. Inconsistent formatting introduces noise the model wastes capacity learning around.
[ ] Composition is audited for balance. Break the data down by relevant categories and look for thin or missing groups. Skew produces invisible failures on underrepresented cases.
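
To make the deduplication and normalization items concrete, the sketch below removes exact duplicates by hashing normalized text and near-duplicates by Jaccard similarity over word shingles. The function names, the three-word shingle size, and the 0.8 threshold are assumptions chosen for illustration. The pairwise comparison is quadratic, so treat this as a way to understand the check on a small dataset rather than something to run at scale, where approximate methods such as MinHash are the usual substitute.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Unicode-normalize, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles; texts shorter than k words yield one shingle."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs: list, threshold: float = 0.8) -> list:
    """Keep the first copy of each document; drop exact and near-duplicates."""
    kept, kept_shingles, seen_hashes = [], [], set()
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(norm)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept
```

Normalizing before hashing is what lets the exact-match pass catch copies that differ only in case, spacing, or Unicode form.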

Labeling Quality (If Your Task Needs Labels)

[ ] Labeling instructions include concrete edge-case examples. Vague instructions produce inconsistent labels that teach the model contradictions.
[ ] You labeled a sample yourself before delegating. This surfaces the ambiguity that annotators would otherwise hit blind.
[ ] Annotator agreement is measured on a shared subset. Low agreement usually signals unclear instructions, not bad annotators. Fix the instructions, then re-measure.
[ ] Disagreements feed back into sharper guidelines. Silently picking a winner buries the ambiguity instead of resolving it.

The best practices article expands on each of these labeling items.
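
One common way to measure agreement between two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python version under that assumption; the source does not prescribe a specific metric, and the function names are hypothetical. The second helper lists the items where the annotators disagreed, which are the cases worth turning into guideline examples.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(items: list, labels_a: list, labels_b: list) -> list:
    """Items where the two annotators chose different labels."""
    return [
        (item, a, b)
        for item, a, b in zip(items, labels_a, labels_b)
        if a != b
    ]
```

A kappa near 1 means strong agreement beyond chance and a value near 0 means agreement no better than chance. Where to draw the line for "low" is a judgment call for your task.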

Before You Train

[ ] Data is split into training, validation, and test sets. Without a clean split, you have no honest way to measure quality.
[ ] The test set is decontaminated against training data. Any overlap inflates your scores and hides real performance. This is the single most important check before you train.
[ ] The held-out test set is sealed and will not be touched. It is your only honest signal. Peeking at it during development quietly corrupts it.
[ ] Synthetic data, if used, has been verified to improve results. Synthetic data that does not measurably help adds the generator's quirks for no benefit. Cut it if it does not earn its place.

For how these checks fit into the full sequence, see the step-by-step guide.
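
A decontamination check can be sketched as a search for word n-grams shared between the test set and the training set. Everything below is an assumption for illustration: the function names, the eight-word window, and the choice of n-gram overlap itself, which is only one of several ways to detect contamination. It catches verbatim and lightly edited overlap but not paraphrases.

```python
import re


def ngrams(text: str, n: int = 8) -> set:
    """Lowercased word n-grams; shorter texts become a single gram."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def contaminated_indices(train: list, test: list, n: int = 8) -> list:
    """Indices of test examples sharing at least one n-gram with training data."""
    train_grams = set()
    for doc in train:
        train_grams |= ngrams(doc, n)
    return [i for i, doc in enumerate(test) if ngrams(doc, n) & train_grams]
```

Flagged examples are normally dropped from whichever side is cheaper to change. Removing them from training keeps the test set stable across experiments.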

Ongoing Maintenance (After You Ship)

Collection does not end at launch. A model in production faces a world that keeps changing, and the dataset needs upkeep.

[ ] A schedule exists for refreshing data. Behavior, language, and conditions drift over time. A dataset frozen at launch slowly goes stale, fastest in adversarial settings like spam and fraud.
[ ] Production failures feed back into collection. When the model gets something wrong in the real world, that case should become a new training example. This is the highest-signal data you will ever get.
[ ] Provenance and versioning survive into production. You should always be able to say which dataset version produced the model currently serving users.
[ ] Privacy obligations are re-checked as laws and usage evolve. A lawful basis that held at collection time can lapse as regulations or your product change.

Treating maintenance as part of collection is what separates a one-time demo from a system that stays good. The framework article frames this as the loop closing back on itself.
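
One lightweight way to be able to say which dataset version produced a model is to fingerprint the dataset directory and store that fingerprint alongside the model artifact. The sketch below assumes the dataset lives as files under one directory; `dataset_fingerprint` is a hypothetical helper, not part of any particular versioning tool.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(root: Path) -> str:
    """SHA-256 over every file's relative path and contents, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renames change the fingerprint too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        # Reads each file fully into memory; chunk the read for large files.
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Any change to a file's contents or name yields a different fingerprint, so two models trained on the same fingerprint were trained on the same files.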

How to Use This Checklist in Practice

A checklist only helps if it changes behavior. A few suggestions for making it stick:

Run it as a gate, not a suggestion. Do not start training until every item in the Before You Train section is either checked or written down as a documented, accepted risk.
Assign owners. Each section needs someone accountable, so provenance and decontamination do not fall through the cracks.
Revisit it every cycle. As you iterate on the dataset, re-run the relevant sections rather than assuming earlier checks still hold.

Used this way, the checklist becomes a lightweight quality system rather than a document you fill out once and forget.
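
If you want the gate enforced rather than remembered, a small script can do it. The sketch below is one possible shape, with hypothetical names throughout: an item blocks training when it is unchecked and has no written risk acceptance, or when nobody owns it.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """One checklist entry. Field names are illustrative."""
    name: str
    owner: str = ""
    checked: bool = False
    accepted_risk: str = ""  # written justification for leaving it unchecked


def blocking_items(items: list) -> list:
    """Names of items that should stop training from starting."""
    blocked = []
    for item in items:
        undocumented = not item.checked and not item.accepted_risk.strip()
        unowned = not item.owner.strip()
        if undocumented or unowned:
            blocked.append(item.name)
    return blocked


def assert_gate_open(items: list) -> None:
    """Raise if any item blocks training; otherwise return quietly."""
    blocked = blocking_items(items)
    if blocked:
        raise RuntimeError("training blocked by: " + ", ".join(blocked))
```

Calling `assert_gate_open` at the top of a training script turns the checklist into a precondition instead of a document.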

Frequently Asked Questions

Which checklist item should never be skipped?

Decontaminating the test set against training data. Every quality metric you produce depends on it. Skip it and your scores become fiction, and you will make shipping decisions based on numbers that do not reflect real performance. It is the cheapest insurance against the most expensive mistake.

Can I skip the handwritten examples if I have a lot of data?

No. Handwritten examples define what good output looks like and give you a first evaluation set. Having lots of data makes them more important, not less, because they tell you what to keep and how to judge the result. They take an afternoon and prevent collecting blind.

How formal does provenance logging need to be?

Not very. A metadata file or spreadsheet recording source, date, and usage rights for each batch is enough. The discipline of logging at collection time is what matters, not the sophistication of the tool. The goal is to never be unable to answer where a dataset came from.

What if I cannot check the bias-audit item confidently?

Then treat it as an open risk and test the model specifically on the cases you suspect are underrepresented. Bias rarely shows in aggregate metrics, so an unchecked audit item means you are likely shipping invisible failures. Targeted collection closes the gaps you find.
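
The point about aggregate metrics hiding bias is easy to see once results are broken down per group. The sketch below assumes each evaluation record is a `(group, expected, predicted)` tuple; the grouping key and the function name are placeholders for whatever categories matter in your task.

```python
from collections import defaultdict


def accuracy_by_group(records: list) -> dict:
    """Map each group to (accuracy, example count) from evaluation records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, expected, predicted in records:
        totals[group] += 1
        correct[group] += int(expected == predicted)
    return {g: (correct[g] / totals[g], totals[g]) for g in totals}


def overall_accuracy(records: list) -> float:
    """Aggregate accuracy across all records, ignoring groups."""
    if not records:
        raise ValueError("no records to evaluate")
    return sum(e == p for _, e, p in records) / len(records)
```

With 18 correct out of 18 in a large group and 0 correct out of 2 in a small one, the overall figure is 90 percent while the small group fails every time. The example count next to each accuracy also shows which groups are too thin to trust.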

Is this checklist different for fine-tuning versus pretraining?

The principles hold for both, but emphasis shifts. Pretraining tolerates more volume and noise and leans harder on filtering at scale. Fine-tuning rewards small, clean, carefully labeled datasets, so the labeling and curation items carry even more weight there.

Key Takeaways

Before collecting, define the behavior, write ideal examples, map sources, and review rights.
During collection, log provenance, handle personal data lawfully, and stay conservative on scope.
After collecting, deduplicate, filter, normalize, and audit composition for balance.
Invest in labeling quality with clear instructions and measured annotator agreement.
Before training, split the data, decontaminate the test set, and seal it as your only honest signal.

Agency Script Editorial
