Reading about training data is one thing. Actually assembling a dataset that produces a working model is another. This article is the operational version: a sequence of steps you can follow from a blank slate to a training-ready dataset, with the decisions you face at each point spelled out.
We will assume you are building or fine-tuning a focused model rather than pretraining a foundation model from scratch, because that is the situation most practitioners are in. The steps scale up, but the version below is what a small team can execute.
Step 1: Define the Behavior Before You Collect Anything
The most common mistake is collecting data first and asking what it is for later. Reverse that. Write a one-sentence description of exactly what you want the model to do, then write five to ten example inputs and the ideal outputs by hand.
Those handwritten examples are your specification. They tell you what kinds of data you need, what good output looks like, and how you will eventually measure success. If you cannot write good examples yourself, you are not ready to collect data yet.
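One way to keep the handwritten specification usable later is to store it as data rather than in a document. The sketch below assumes a three-label feedback classifier; the `SpecExample` class, the label names, the example messages, and the `keyword_baseline` stand-in are all hypothetical and only illustrate the shape.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SpecExample:
    """One handwritten input paired with its ideal output."""
    input_text: str
    ideal_output: str


# Hypothetical examples for a feedback classifier; replace with your own.
SPEC_EXAMPLES = [
    SpecExample("The export button crashes the app.", "bug_report"),
    SpecExample("Please add a dark mode.", "feature_request"),
    SpecExample("Love the new dashboard!", "praise"),
    # A deliberately tricky case that mixes praise with a bug.
    SpecExample("Great update, but search now returns nothing.", "bug_report"),
]


def check_against_spec(
    predict: Callable[[str], str], examples: list[SpecExample]
) -> list[SpecExample]:
    """Return the examples whose predicted output differs from the ideal one."""
    return [ex for ex in examples if predict(ex.input_text) != ex.ideal_output]


def keyword_baseline(text: str) -> str:
    """Stand-in for a real model, used only to show how the check is called."""
    lowered = text.lower()
    if "crash" in lowered or "returns nothing" in lowered:
        return "bug_report"
    if "please add" in lowered:
        return "feature_request"
    return "praise"


failures = check_against_spec(keyword_baseline, SPEC_EXAMPLES)
print(f"{len(failures)} of {len(SPEC_EXAMPLES)} spec examples failed")
```

Keeping the examples in this form means the same list can be rerun against every model you train later.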
Step 2: Map Your Sources
Now identify the smallest set of sources that could cover the behavior you defined. Rank them by quality and by how clear your right to use them is.
- First-party data you already own ranks highest: clear rights, high relevance.
- Licensed datasets come next when you can afford them.
- Public web data is abundant but carries copyright and terms-of-service risk.
- Synthetic data fills gaps but should be used deliberately, not as a crutch.
Write down where each source lives and how you will access it. This map becomes your provenance record, which you will be glad to have later.
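The source map can be as simple as a small table in code. The sketch below is one possible layout, not a standard: the `Source` fields, the 1 to 5 scores, the example entries, and the choice to sort by rights before quality are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    location: str        # where the data lives
    access_method: str   # how you will pull it
    quality: int         # 1 (low) to 5 (high), judged by hand
    rights_clarity: int  # 1 (murky) to 5 (clearly yours to use)


# Hypothetical entries; scores are hand-assigned judgments, not measurements.
SOURCES = [
    Source("public_forums", "public web", "crawler", 3, 1),
    Source("support_inbox", "internal ticketing system", "database export", 5, 5),
    Source("licensed_reviews", "vendor delivery", "file download", 4, 4),
]


def rank_sources(sources: list[Source]) -> list[Source]:
    """Order sources by rights clarity first, then quality (an assumed priority)."""
    return sorted(
        sources, key=lambda s: (s.rights_clarity, s.quality), reverse=True
    )


ranked = rank_sources(SOURCES)
for source in ranked:
    print(source.name, source.rights_clarity, source.quality)
```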
Step 3: Collect, and Log Provenance as You Go
Start pulling data. The mechanics depend on the source: an export from your own database, a crawler for web pages, or an annotation task for new examples.
Whatever the source, record three things for every batch: where it came from, when you collected it, and what rights you have to use it. Do this at collection time. Reconstructing provenance after the fact is painful and often impossible, and missing provenance is a frequent failure mode covered in our common mistakes breakdown.
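A minimal sketch of that record, assuming an append-only JSON Lines file is an acceptable log format for your team. The field names, the `provenance_record` and `append_record` helpers, and the example values are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def provenance_record(
    batch_id: str,
    source: str,
    rights: str,
    collected_at: Optional[datetime] = None,
) -> dict:
    """Build one record with the three facts: where, when, and what rights."""
    collected_at = collected_at or datetime.now(timezone.utc)
    return {
        "batch_id": batch_id,
        "source": source,
        "collected_at": collected_at.isoformat(),
        "rights": rights,
    }


def append_record(path: str, record: dict) -> None:
    """Append the record as one JSON line so earlier entries are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


# Hypothetical batch; in practice, build this at the moment the export runs.
record = provenance_record(
    batch_id="support-export-001",
    source="support inbox database export",
    rights="first-party data, internal use",
)
print(json.dumps(record, indent=2))
```

Calling `append_record("provenance.jsonl", record)` right after each collection job (the file name is only an example) keeps the log in step with the data.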
Step 4: Clean the Raw Data
Raw data is never ready. Run it through a cleaning pass in roughly this order:
- Deduplicate. Remove exact and near-duplicate documents.
- Filter. Drop low-quality, off-topic, and harmful content using heuristics and classifiers.
- Normalize. Fix encoding, standardize formatting, detect and tag languages.
- Decontaminate. Remove anything that overlaps with the test set you will use to evaluate.
Treat cleaning as the main event, not a chore. A clean small dataset beats a dirty large one almost every time.
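The deduplication and decontamination steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: near-duplicates are detected with Jaccard similarity over word trigrams, the 0.8 threshold is arbitrary, decontamination only catches test items that match exactly after normalization, and the quality filter is left out. The pairwise comparison is quadratic, so a large corpus would need an approximate method such as MinHash instead.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Unify Unicode forms, collapse whitespace, and lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def fingerprint(text: str) -> str:
    """Hash of the normalized text, used for exact-duplicate detection."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set:
    """Set of overlapping n-word sequences, ignoring punctuation."""
    tokens = re.findall(r"\w+", normalize(text))
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def clean(docs, test_docs, near_dup_threshold=0.8):
    """Drop exact duplicates, near-duplicates, and documents found in the test set."""
    test_prints = {fingerprint(t) for t in test_docs}
    seen, kept, kept_shingles = set(), [], []
    for doc in docs:
        fp = fingerprint(doc)
        if fp in seen or fp in test_prints:
            continue  # exact duplicate, or overlaps the evaluation set
        sh = shingles(doc)
        if any(jaccard(sh, other) >= near_dup_threshold for other in kept_shingles):
            continue  # near-duplicate of a document already kept
        seen.add(fp)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept


# Hypothetical messages for illustration.
docs = [
    "The app crashes when I click export.",
    "the app  crashes when I click export.",
    "The app crashes when I click export now.",
    "Please add dark mode.",
    "Search is broken on mobile.",
]
test_docs = ["Search is broken on mobile."]
cleaned = clean(docs, test_docs)
print(cleaned)
```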
Step 5: Label or Structure the Data
If your task needs labels, this is where you create them. Write unambiguous instructions, label a sample yourself first, then hand it to annotators with those examples attached.
Quality Control for Labels
- Have multiple people label a subset and measure how often they agree.
- Spot-check randomly, not just the first rows.
- Feed disagreements back into clearer instructions.
Inconsistent labels are a silent killer. If two annotators interpret the task differently, the model learns the contradiction. Our best practices article goes deeper on keeping label quality high at scale.
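Agreement between two annotators can be measured with raw percent agreement, or with Cohen's kappa, which corrects for the agreement expected by chance. The sketch below uses kappa as one reasonable choice, not as the only valid measure; the label values and the `disagreements` helper are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Probability that both pick the same label if each labeled at random
    # according to their own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(items, labels_a, labels_b):
    """Items the annotators labeled differently, to review when revising instructions."""
    return [(item, a, b) for item, a, b in zip(items, labels_a, labels_b) if a != b]


# Hypothetical labels from two annotators on six shared messages.
items = ["m1", "m2", "m3", "m4", "m5", "m6"]
annotator_a = ["bug", "bug", "praise", "feature", "praise", "bug"]
annotator_b = ["bug", "praise", "praise", "feature", "praise", "bug"]

kappa = cohens_kappa(annotator_a, annotator_b)
conflicts = disagreements(items, annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}, disagreements = {conflicts}")
```

The list of disagreements is usually more actionable than the score itself, because each entry points at a case the instructions failed to settle.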
Step 6: Split and Validate
Before training, split the data into training, validation, and test sets. The test set must stay untouched until final evaluation, and it must not overlap with training data. This is what makes your eventual quality numbers trustworthy.
Take a final pass to confirm balance. Are some categories overrepresented? Are rare but important cases present at all? Adjust by collecting more of what is missing rather than padding with what is easy.
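A minimal sketch of a reproducible split and a balance check, assuming each example is a dict with `text` and `label` keys. The 80/10/10 proportions are an assumption, and the function names are hypothetical. Hashing the text, rather than shuffling, means identical texts always land in the same split, so an exact copy of a test item cannot end up in training.

```python
import hashlib
from collections import Counter


def assign_split(text: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a text to a split deterministically from a hash of its content."""
    bucket = int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_dataset(examples):
    """Group examples into train, validation, and test lists."""
    splits = {"train": [], "validation": [], "test": []}
    for ex in examples:
        splits[assign_split(ex["text"])].append(ex)
    return splits


def label_balance(examples):
    """Share of each label, to spot overrepresented or missing categories."""
    counts = Counter(ex["label"] for ex in examples)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()} if total else {}


# Hypothetical data for illustration.
labels = ["bug_report", "feature_request", "praise"]
examples = [{"text": f"message {i}", "label": labels[i % 3]} for i in range(300)]

splits = split_dataset(examples)
for name, subset in splits.items():
    print(name, len(subset), label_balance(subset))
```

Because a hash-based split only approximates the target proportions, check the printed sizes and label shares before training.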
Step 7: Evaluate, Then Iterate on Data
Train, then test against your held-out set and your handwritten examples from Step 1. When results are weak, resist the urge to change the model first. In most focused projects, the highest-leverage fix is better data: clearer labels, more coverage of failure cases, less noise.
This loop (collect, clean, evaluate, improve the data) is the real work. For the conceptual model that ties these steps together, see the framework article.
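To see where better data would help most, it can be useful to break held-out results down by label instead of reading a single overall score. The sketch below assumes a classification task; the 0.8 threshold, the function names, and the sample labels are hypothetical.

```python
from collections import defaultdict


def accuracy_by_label(gold, predicted):
    """Fraction of correct predictions for each gold label."""
    correct, total = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        correct[g] += int(g == p)
    return {label: correct[label] / total[label] for label in total}


def weakest_labels(gold, predicted, threshold=0.8):
    """Labels scoring below the threshold, worst first: candidates for more data."""
    scores = accuracy_by_label(gold, predicted)
    return sorted((l for l, s in scores.items() if s < threshold), key=scores.get)


# Hypothetical held-out results.
gold = ["bug", "bug", "bug", "feature", "feature", "praise", "praise"]
predicted = ["bug", "praise", "praise", "feature", "feature", "praise", "praise"]

scores = accuracy_by_label(gold, predicted)
weak = weakest_labels(gold, predicted)
print(scores)
print("Collect more examples for:", weak)
```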
A Worked Mini-Example
To make the sequence concrete, imagine you are building a model that classifies incoming product feedback as a bug report, a feature request, or praise.
- Step 1, define: You write that the model should read a message and output one of three labels, and you hand-write a dozen examples, including a few tricky ones that mix praise with a bug.
- Step 2, map: Your support inbox is the obvious first-party source with clear rights. You decide not to scrape public forums because the language differs and the rights are murky.
- Step 3, collect: You export six months of messages and log the export date and source.
- Step 4, clean: You drop duplicates, remove automated messages, and scrub personal details.
- Step 5, label: You and a colleague each tag a shared sample, discover you disagree on mixed messages, and sharpen the rule before tagging the rest.
- Step 6, split: You hold out a recent slice as a test set and confirm none of it appears in training.
- Step 7, evaluate: The model struggles with mixed messages, so you collect and label more of those specifically.
That last move, fixing a weakness with targeted data rather than a bigger model, is the habit that separates teams that ship from teams that stall.
What to Do When Results Stall
When your evaluation numbers plateau, run through this short diagnostic before touching the model:
- Is the data clean? Re-check for duplicates and noise that crept in.
- Is coverage complete? Look for categories the model fails on and collect for them.
- Are labels consistent? Re-measure annotator agreement on the failing cases.
- Is the test set honest? Confirm no contamination is inflating or distorting your read.
In the large majority of applied projects, the answer to a stalled model lives in one of those four data questions, not in the architecture.
Frequently Asked Questions
How many examples do I need to start?
For fine-tuning a focused behavior, you can often see meaningful results with a few hundred to a few thousand high-quality examples. Start small, evaluate, and add more only where the model is weak. Building a massive dataset before you know the approach works wastes effort.
Should I collect data or generate it synthetically?
Start with real data because it grounds the model in reality. Use synthetic data to fill specific gaps, such as rare cases your real data lacks. Relying entirely on synthetic data risks teaching the model the quirks of whatever generated it rather than the real task.
What is the most important step people skip?
Two steps tie for first place: logging provenance at collection time and decontaminating against the test set. Both are easy to postpone and nearly impossible to fix later. Skipping provenance creates legal exposure, and skipping decontamination produces inflated scores that collapse in production.
How do I know my dataset is good enough?
Your dataset is good enough when the model passes your handwritten examples and your held-out test set, and when adding more data stops improving results. If quality plateaus before the model passes those checks, the issue is usually coverage or label consistency, not volume.
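A rough way to check whether adding data has stopped helping is to record the held-out score each time the dataset grows and look at the recent gains. The sketch below is one simple heuristic; the `min_gain` and `window` defaults and the sample scores are arbitrary assumptions.

```python
def has_plateaued(scores, min_gain=0.01, window=2):
    """True if each of the last `window` data additions improved the score by less than `min_gain`.

    `scores` holds one held-out score per dataset size, smallest dataset first.
    """
    if len(scores) < window + 1:
        return False  # not enough history to judge
    recent = scores[-(window + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(gain < min_gain for gain in gains)


# Hypothetical held-out accuracy after each round of added data.
history = [0.70, 0.80, 0.85, 0.855, 0.857]
print(has_plateaued(history))
```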
Can I reuse a dataset for a different task?
Sometimes, but be careful. A dataset built for one behavior may lack coverage for another and may carry labels that mean something specific to the original task. Always re-evaluate balance and relevance before reusing data rather than assuming it transfers.
Key Takeaways
- Define the target behavior and write example inputs and outputs before collecting anything.
- Map your sources by quality and by how clear your rights are, then log provenance as you collect.
- Cleaning (deduplicating, filtering, normalizing, and decontaminating) is the main event, not an afterthought.
- Write unambiguous labeling instructions and measure annotator agreement.
- Keep a held-out test set untouched and iterate on the data, not the model, when results are weak.