AGENCYSCRIPT
Every Unchecked Box Is a Risk You Just Accepted

Agency Script Editorial, Editorial Team · August 25, 2025 · 7 min read

This is a working checklist, not a reading list. Run it against any training data collection effort before you commit to training, and treat any unchecked item as a risk you are knowingly accepting. Each entry comes with a short justification so you understand why it earns its place, because a checklist you cannot reason about is just busywork.

It is organized by phase: before you collect, while you collect, after you collect, labeling (if your task needs labels), before you train, and ongoing maintenance after you ship. Work through it in order. The items are deliberately concrete so you can mark them done or not done without arguing about interpretation.

Before You Collect

  • [ ] The target behavior is written in one sentence. If you cannot state what the model should do, you cannot know what data to gather. This is the anchor for every later decision.
  • [ ] Five to ten ideal input-output examples exist, written by hand. These become your specification and your first evaluation set. Skipping them means collecting blind.
  • [ ] Sources are mapped and ranked by quality and rights. First-party and licensed sources rank above scraped web data. Knowing your sources upfront prevents legal surprises.
  • [ ] A privacy and copyright review is done for each source. Checking rights before collecting is cheap; discovering a problem after training is expensive and sometimes irreversible.

If any of these is unchecked, stop. Collecting before this foundation is set is the most common cause of wasted effort, as our common mistakes article explains.
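The handwritten examples can double as a first evaluation set with very little tooling. The following is a minimal sketch, not a prescribed harness: the `GoldenExample` type, the `evaluate` helper, and exact-match scoring are all assumptions made for illustration, and most real tasks will need a softer comparison than string equality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenExample:
    """One handwritten input-output pair (hypothetical structure)."""
    prompt: str
    expected: str


def evaluate(model: Callable[[str], str], examples: list[GoldenExample]) -> float:
    """Return the fraction of examples whose output matches exactly after trimming."""
    if not examples:
        raise ValueError("write the handwritten examples before evaluating")
    hits = sum(model(ex.prompt).strip() == ex.expected.strip() for ex in examples)
    return hits / len(examples)


# Toy stand-in for a model: it simply upper-cases its input.
examples = [
    GoldenExample("refund", "REFUND"),
    GoldenExample("cancel", "CANCEL"),
    GoldenExample("hello", "Hi there"),
]
score = evaluate(str.upper, examples)  # two of the three examples match
```

Keeping the examples in a typed structure like this makes it easy to rerun the same check after every change to the dataset.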

While You Collect

  • [ ] Provenance is logged for every batch at collection time. Source, date, and usage rights. Reconstructing this later is painful and often impossible.
  • [ ] Personal data is identified and handled with a lawful basis. Collecting personal data without one is a legal liability, not a quality issue you can fix later.
  • [ ] Collection is scoped to what the behavior needs. Resist gathering everything. Curated and conservative beats sprawling and noisy for almost every applied project.
  • [ ] For adversarial tasks, collection is continuous, not one-shot. Spam and fraud data go stale fast. A static dataset decays in changing environments.
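Provenance logging at collection time can be as small as one row per batch. The sketch below assumes a CSV log with the three fields named above plus a batch identifier; the `BatchProvenance` type, the field names, and the sample batches are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class BatchProvenance:
    """Provenance for one collected batch (hypothetical field names)."""
    batch_id: str
    source: str
    collected_on: str  # ISO 8601 date
    usage_rights: str


FIELDNAMES = [f.name for f in fields(BatchProvenance)]


def append_provenance(log, record: BatchProvenance, write_header: bool = False) -> None:
    """Append one provenance row to an open text file or buffer."""
    writer = csv.DictWriter(log, fieldnames=FIELDNAMES)
    if write_header:
        writer.writeheader()
    writer.writerow(asdict(record))


# Illustrative batches; in practice `log` would be a file opened in append mode.
log = io.StringIO()
append_provenance(
    log,
    BatchProvenance("batch-001", "first-party support tickets", "2025-08-01",
                    "internal use, customer consent on file"),
    write_header=True,
)
append_provenance(
    log,
    BatchProvenance("batch-002", "licensed FAQ corpus", "2025-08-03",
                    "licensed for model training"),
)
rows = list(csv.DictReader(io.StringIO(log.getvalue())))
```

Writing the row in the same step that stores the batch is the point of the design: the log cannot drift from the data if the two are never written separately.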

After You Collect

  • [ ] Exact and near-duplicates are removed. Duplicates overweight whatever content is repeated and waste compute. Near-duplicate detection matters as much as exact-match removal.
  • [ ] Low-quality and harmful content is filtered out. Junk in the dataset becomes junk in the model. Filtering raises the average signal.
  • [ ] Formatting and encoding are normalized. Inconsistent formatting introduces noise the model wastes capacity learning around.
  • [ ] Composition is audited for balance. Break the data down by relevant categories and look for thin or missing groups. Skew produces invisible failures on underrepresented cases.
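The normalization and deduplication items above can be sketched in a few lines. This example assumes plain-text documents, NFKC normalization with lower-casing, word 3-gram shingles, and a Jaccard threshold of 0.8; all of those choices, and the function names, are illustrative rather than recommended settings. The pairwise comparison is quadratic, so large corpora typically use MinHash or a similar approximate method instead.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFKC, lower-case, and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams; short texts yield one shorter tuple."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first copy of each document; drop exact and near-duplicates."""
    kept: list[str] = []
    seen_hashes: set[str] = set()
    kept_shingles: list[set] = []
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(norm)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of a document already kept
        seen_hashes.add(digest)
        kept_shingles.append(sh)
        kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog near the river bank today",
    "the quick  brown fox jumps over the lazy dog near the river bank today",
    "The quick brown fox jumps over the lazy dog near the river bank today!",
    "Completely different sentence about annotation guidelines",
]
unique_docs = deduplicate(docs)  # keeps the first and the last document
```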

Labeling Quality (If Your Task Needs Labels)

  • [ ] Labeling instructions include concrete edge-case examples. Vague instructions produce inconsistent labels that teach the model contradictions.
  • [ ] You labeled a sample yourself before delegating. This surfaces the ambiguity that annotators would otherwise hit blind.
  • [ ] Annotator agreement is measured on a shared subset. Low agreement signals unclear instructions, not bad annotators. Fix the instructions, then re-measure.
  • [ ] Disagreements feed back into sharper guidelines. Silently picking a winner buries the ambiguity instead of resolving it.
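One common way to measure agreement between two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The checklist does not prescribe a specific statistic, so treat this as one reasonable option; the labels below are made-up sample data.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty subset")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_a = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham"]
annotator_b = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "ham"]

kappa = cohens_kappa(annotator_a, annotator_b)

# Indices where the annotators disagree: the cases to discuss and fold
# back into the labeling guidelines.
disagreements = [
    i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b
]
```

Here raw agreement is 0.75 but kappa is about 0.47, which shows why a chance-corrected measure gives a more sober picture than the raw percentage.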

The best practices article expands on each of these labeling items.

Before You Train

  • [ ] Data is split into training, validation, and test sets. Without a clean split, you have no honest way to measure quality.
  • [ ] The test set is decontaminated against training data. Any overlap inflates your scores and hides real performance. This is the single most important check before you train.
  • [ ] The held-out test set is sealed and will not be touched. It is your only honest signal. Peeking at it during development quietly corrupts it.
  • [ ] Synthetic data, if used, has been verified to improve results. Synthetic data that does not measurably help adds the generator's quirks for no benefit. Cut it if it does not earn its place.
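The split and decontamination checks can be sketched as follows. This assumes each example has a stable ID, that a hash-based assignment is acceptable for splitting, and that sharing any word 8-gram with training data counts as contamination; the function names, the percentages, and the 8-gram window are illustrative assumptions, not fixed rules.

```python
import hashlib


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial edits do not hide overlap."""
    return " ".join(text.lower().split())


def assign_split(example_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map an example ID to train, validation, or test."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Word n-grams; texts shorter than n words yield one shorter tuple."""
    words = text.split()
    if not words:
        return set()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contaminated(test_texts: list[str], train_texts: list[str], n: int = 8) -> list[int]:
    """Return indices of test items that share any n-gram with training data."""
    train_grams: set[tuple[str, ...]] = set()
    for text in train_texts:
        train_grams |= ngrams(normalize(text), n)
    return [
        i for i, text in enumerate(test_texts)
        if ngrams(normalize(text), n) & train_grams
    ]


train = [
    "How do I reset my password if I no longer have access to my email account",
    "Shipping takes five days",
]
test = [
    "how do i reset my password if i no longer have access to my phone",
    "What is your refund policy for annual plans",
]
leaks = contaminated(test, train)  # the first test item overlaps with training data
```

Hashing the ID instead of shuffling means an example stays in the same split across reruns, which helps keep the sealed test set sealed as the dataset grows.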

For how these checks fit into the full sequence, see the step-by-step guide.

Ongoing Maintenance (After You Ship)

Collection does not end at launch. A model in production faces a world that keeps changing, and the dataset needs upkeep.

  • [ ] A schedule exists for refreshing data. Behavior, language, and conditions drift over time. A dataset frozen at launch slowly goes stale, fastest in adversarial settings like spam and fraud.
  • [ ] Production failures feed back into collection. When the model gets something wrong in the real world, that case should become a new training example. This is the highest-signal data you will ever get.
  • [ ] Provenance and versioning survive into production. You should always be able to say which dataset version produced the model currently serving users.
  • [ ] Privacy obligations are re-checked as laws and usage evolve. A lawful basis that held at collection time can lapse as regulations or your product change.

Treating maintenance as part of collection is what separates a one-time demo from a system that stays good. The framework article frames this as the loop closing back on itself.
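One lightweight way to keep dataset versions traceable into production is to fingerprint the dataset and store that fingerprint alongside the model. The sketch below assumes records are JSON-serializable dictionaries; `dataset_fingerprint`, `model_manifest`, and the model name are hypothetical names for illustration.

```python
import hashlib
import json


def dataset_fingerprint(records: list[dict]) -> str:
    """Order-independent SHA-256 fingerprint over canonical JSON records."""
    digests = sorted(
        hashlib.sha256(
            json.dumps(record, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        for record in records
    )
    return hashlib.sha256("\n".join(digests).encode("ascii")).hexdigest()


def model_manifest(model_name: str, records: list[dict]) -> dict:
    """Metadata to store next to a trained model so its data can be traced."""
    return {
        "model": model_name,
        "dataset_sha256": dataset_fingerprint(records),
        "num_records": len(records),
    }


records = [
    {"id": 1, "text": "hello"},
    {"id": 2, "text": "world"},
]
manifest = model_manifest("support-classifier", records)
```

Because the fingerprint ignores record order but changes when any record is added, removed, or edited, two models with the same `dataset_sha256` were trained on the same data.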

How to Use This Checklist in Practice

A checklist only helps if it changes behavior. A few suggestions for making it stick:

  • Run it as a gate, not a suggestion. Do not start training until the Before You Train section is fully checked, and treat any unchecked item as a documented, accepted risk.
  • Assign owners. Each section needs someone accountable, so provenance and decontamination do not fall through the cracks.
  • Revisit it every cycle. As you iterate on the dataset, re-run the relevant sections rather than assuming earlier checks still hold.

Used this way, the checklist becomes a lightweight quality system rather than a document you fill out once and forget.
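Running the checklist as a gate can be automated if the checklist lives in a text file. This sketch assumes items are written as `[ ]` or `[x]` lines, as in this article, and that an accepted risk is recorded as an item paired with a written reason; `unchecked_items` and `training_gate` are hypothetical helpers, not part of any existing tool.

```python
import re

# Matches lines such as "  • [ ] Item text" or "[x] Item text".
ITEM = re.compile(r"^\s*(?:•\s*)?\[([ xX])\]\s*(.+?)\s*$")


def unchecked_items(checklist_text: str) -> list[str]:
    """Return the text of every item still marked '[ ]'."""
    items = []
    for line in checklist_text.splitlines():
        match = ITEM.match(line)
        if match and match.group(1) == " ":
            items.append(match.group(2))
    return items


def training_gate(checklist_text: str, accepted_risks: dict[str, str]) -> list[str]:
    """Raise unless every unchecked item has a documented reason for acceptance.

    Returns the list of unchecked items that were explicitly accepted.
    """
    unchecked = unchecked_items(checklist_text)
    blocking = [
        item for item in unchecked if not accepted_risks.get(item, "").strip()
    ]
    if blocking:
        raise RuntimeError("Training blocked; unchecked items: " + "; ".join(blocking))
    return unchecked


before_you_train = """
  • [x] Data is split into training, validation, and test sets.
  • [ ] The test set is decontaminated against training data.
  • [ ] Synthetic data, if used, has been verified to improve results.
"""
pending = unchecked_items(before_you_train)
```

Calling `training_gate(before_you_train, {})` at the start of a training script would raise until both pending items are either checked or given a written reason.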

Frequently Asked Questions

Which checklist item should never be skipped?

Decontaminating the test set against training data. Every quality metric you produce depends on it. Skip it and your scores become fiction, and you will make shipping decisions based on numbers that do not reflect real performance. It is the cheapest insurance against the most expensive mistake.

Can I skip the handwritten examples if I have a lot of data?

No. Handwritten examples define what good output looks like and give you a first evaluation set. Having lots of data makes them more important, not less, because they tell you what to keep and how to judge the result. They take an afternoon and prevent collecting blind.

How formal does provenance logging need to be?

Not very. A metadata file or spreadsheet recording source, date, and usage rights for each batch is enough. The discipline of logging at collection time is what matters, not the sophistication of the tool. The goal is to never be unable to answer where a dataset came from.

What if I cannot check the composition-audit (bias) item confidently?

Then treat it as an open risk and test the model specifically on the cases you suspect are underrepresented. Bias rarely shows in aggregate metrics, so an unchecked audit item means you are likely shipping invisible failures. Targeted collection closes the gaps you find.
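A per-group breakdown makes this concrete. The sketch below assumes each evaluation result carries a group label; the group names and counts are invented to show how a healthy aggregate score can hide a weak, thinly represented group.

```python
from collections import Counter, defaultdict


def group_shares(groups: list[str]) -> dict[str, float]:
    """Share of the dataset that falls into each group."""
    if not groups:
        raise ValueError("no records to audit")
    counts = Counter(groups)
    return {group: count / len(groups) for group, count in counts.items()}


def accuracy_by_group(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Accuracy per group, from (group, was_correct) pairs."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for group, correct in results:
        totals[group] += 1
        hits[group] += int(correct)
    return {group: hits[group] / totals[group] for group in totals}


# Invented results: 90 English cases (87 correct), 10 German cases (4 correct).
results = (
    [("en", True)] * 87 + [("en", False)] * 3
    + [("de", True)] * 4 + [("de", False)] * 6
)
overall = sum(correct for _, correct in results) / len(results)
per_group = accuracy_by_group(results)
shares = group_shares([group for group, _ in results])
```

In this invented data the overall accuracy is 0.91 while the German group, which makes up a tenth of the set, scores 0.4: the kind of gap that targeted collection is meant to close.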

Is this checklist different for fine-tuning versus pretraining?

The principles hold for both, but emphasis shifts. Pretraining tolerates more volume and noise and leans harder on filtering at scale. Fine-tuning rewards small, clean, carefully labeled datasets, so the labeling and curation items carry even more weight there.

Key Takeaways

  • Before collecting, define the behavior, write ideal examples, map sources, and review rights.
  • During collection, log provenance, handle personal data lawfully, and stay conservative on scope.
  • After collecting, deduplicate, filter, normalize, and audit composition for balance.
  • Invest in labeling quality with clear instructions and measured annotator agreement.
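  • Before training, split the data, decontaminate the test set, and seal it as your only honest signal.
  • After shipping, refresh data on a schedule, feed production failures back into collection, and keep dataset versions traceable.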
