A checklist is only useful if you trust it enough to actually stop and run it. So this one earns its keep: every item comes with a short reason, because a checklist you follow blindly is busywork, while a checklist you understand is a quality system. Pilots use checklists not because they forget how to fly, but because the cost of forgetting one item is catastrophic. Labeling has the same property: one forgotten step can poison everything downstream.
Use this data labeling and annotation checklist as a living tool. Copy it, adapt the items to your task, and run the phases in order. Skip an item only when you can state out loud why it does not apply to you.
The list is organized into three phases: before you label, while you label, and after you label.
Phase 1: Before You Label
This phase is where projects are won. Most labeling disasters trace to something skipped here.
- Write the prediction question in one sentence. If you cannot, the schema will inherit the fuzziness. Everything downstream depends on a sharp objective.
- Define every label with a one-sentence rule. Vague categories produce inconsistent labels no matter how good the annotators are.
- Document at least three edge cases with explicit rulings. The hard cases cause most of the disagreement; resolve them in writing first.
- Confirm categories are mutually exclusive, or deliberately switch to multi-label. Overlapping categories are an accuracy ceiling waiting to happen.
- Sample representatively, including weird and rare examples. A clean sample gives false confidence.
The reasoning behind this sequencing is in our Step-by-Step Approach to Data Labeling and Annotation Basics.
The thread connecting these five items is that each one is cheap to do now and expensive to skip. Writing a sentence costs minutes; discovering a fuzzy objective after labeling ten thousand examples costs weeks. Defining labels costs an afternoon; merging or splitting categories mid-project costs a re-label pass. Treat this phase as the place where you buy insurance against the most expensive failures, paid in hours rather than weeks.
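One way to make the Phase 1 items hard to skip is to keep the schema in a small structure and check it before anyone starts labeling. The sketch below is illustrative only: the names `LabelDef`, `Schema`, and `phase1_problems` are hypothetical, and it covers only the items that can be checked mechanically (question present, every label has a rule, enough edge cases). Whether categories are truly mutually exclusive and whether the sample is representative still need human judgment.

```python
from dataclasses import dataclass, field


@dataclass
class LabelDef:
    name: str
    rule: str  # the one-sentence rule for this label


@dataclass
class Schema:
    question: str  # the one-sentence prediction question
    labels: list[LabelDef]
    # Maps a short description of an edge case to its written ruling.
    edge_cases: dict[str, str] = field(default_factory=dict)
    multi_label: bool = False  # set deliberately if categories may overlap


def phase1_problems(schema: Schema, min_edge_cases: int = 3) -> list[str]:
    """Return the unmet Phase 1 items; an empty list means nothing was flagged."""
    problems = []
    if not schema.question.strip():
        problems.append("prediction question is missing")
    names = [label.name for label in schema.labels]
    if len(set(names)) != len(names):
        problems.append("duplicate label names")
    for label in schema.labels:
        if not label.rule.strip():
            problems.append(f"label '{label.name}' has no rule")
    if len(schema.edge_cases) < min_edge_cases:
        problems.append(
            f"only {len(schema.edge_cases)} edge cases documented; need {min_edge_cases}"
        )
    return problems
```

Running a check like this at the start of a project turns "we meant to write that down" into a visible list of gaps.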
Phase 2: While You Label
The labeling itself is where consistency erodes if you are not watching.
- Run a multi-labeler pilot first. Disagreements map every weakness in your schema before you commit at scale.
- Measure inter-annotator agreement. Low agreement usually means the task or the guidelines are ambiguous, which no volume of labeling fixes.
- Seed invisible gold examples. They give a continuous accuracy signal and catch drift before it spreads.
- Provide a frictionless "flag for review" path. Forcing a guess on ambiguous examples hides confusion that should become a guideline.
- Oversample rare classes deliberately. A model that has seen six examples of a class is unlikely to learn it reliably.
Skipping the pilot is the most common and most expensive omission here, as detailed in our Seven Ways Teams Quietly Poison Their Training Data.
Each of these five items is a feedback loop, not a one-time gate. The pilot tells you whether the schema is ready; agreement tells you whether it stays ready; gold examples tell you whether individuals are holding the line; the flag path feeds new edge cases back into the guidelines; oversampling keeps the class balance honest as you go. Phase 2 is less a list to check off than a set of dials to keep watching while the work runs, and the moment you stop watching is the moment quality starts slipping unnoticed.
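The checklist does not name an agreement metric, so the sketch below picks one common choice, Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. It assumes both annotators labeled the same items in the same order; the function name and the sample labels are made up for illustration. With more than two annotators, measures such as Fleiss' kappa or Krippendorff's alpha are the usual alternatives.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pilot batch: the annotators agree on 4 of 6 items.
annotator_1 = ["spam", "spam", "ham", "ham", "spam", "ham"]
annotator_2 = ["spam", "ham", "ham", "ham", "spam", "spam"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this made-up batch the raw agreement is about 0.67, but kappa comes out near 0.33 because half of that agreement would be expected by chance. Tracking the number per batch, rather than once, is what makes it a dial rather than a gate.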
A note on throughput targets
If you set an examples-per-hour target without a quality gate beside it, annotators optimize for speed and quietly trade away accuracy. Always pair throughput with an accuracy metric.
People deliver what you measure. Measure only speed and you will get speed at the expense of everything else, not because annotators are cynical but because that is the signal you sent. The fix is to make the accuracy number as visible and as consequential as the speed number, so the two stay in balance. A dashboard that shows both side by side does more for quality than any amount of exhortation to "be careful."
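A minimal sketch of that side-by-side view, assuming accuracy is measured on the hidden gold examples from Phase 2. The `AnnotatorStats` fields, the `report` function, and both thresholds are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatorStats:
    name: str
    items_labeled: int
    hours_worked: float
    gold_seen: int     # hidden gold examples this annotator labeled
    gold_correct: int  # how many of those matched the gold label


def report(stats: AnnotatorStats, min_rate: float = 60.0, min_accuracy: float = 0.95) -> dict:
    """Return speed and accuracy together; thresholds are illustrative only."""
    rate = stats.items_labeled / stats.hours_worked if stats.hours_worked else 0.0
    accuracy: Optional[float] = (
        stats.gold_correct / stats.gold_seen if stats.gold_seen else None
    )
    return {
        "annotator": stats.name,
        "items_per_hour": round(rate, 1),
        "gold_accuracy": accuracy,
        "meets_speed": rate >= min_rate,
        # With no gold examples seen, accuracy is unknown, so the gate does not pass.
        "meets_accuracy": accuracy is not None and accuracy >= min_accuracy,
    }
```

The design choice worth copying is that the function never returns the speed figure alone: any consumer of the report sees both numbers or neither.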
Mid-project, re-pilot when the data shifts
If the nature of your incoming data changes partway through (a new source, a new language, a new format), run a fresh mini-pilot on the new slice before continuing. Guidelines calibrated on the old data may not cover the new cases, and assuming they do is how a clean project quietly degrades in its second half.
Phase 3: After You Label
The work is not done when the last example is tagged.
- Run a cold audit. Have an expert re-label a random sample blind and compare. This catches drift and schema rot.
- Document the audit accuracy. Your engineers deserve to know the quality they are training on, and you want the baseline for next time.
- Version the final guidelines. When a future retrain shifts performance, you need to know what changed.
- Check class balance in the final set. A lopsided dataset tends to train a model that ignores the minority class.
- Archive disputed cases with their resolutions. They are the seed of next round's guidelines.
- Record who labeled and reviewed what. Provenance lets you trace a quality problem back to its source instead of guessing.
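The audit and class-balance items reduce to two small calculations, sketched below under stated assumptions: the auditor's blind labels are treated as the reference, and the accuracy is reported with a Wilson score interval so that a small audit sample does not look more certain than it is. The interval is an addition for illustration, not something the checklist requires, and the function names are hypothetical.

```python
import math
from collections import Counter


def audit_accuracy(
    original: list[str], audited: list[str], z: float = 1.96
) -> tuple[float, float, float]:
    """Return (accuracy, lower, upper) with a Wilson score interval.

    `original` holds the project's labels and `audited` the expert's blind
    re-labels for the same randomly sampled items, in the same order.
    """
    if len(original) != len(audited) or not original:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(original)
    p = sum(o == a for o, a in zip(original, audited)) / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, centre - half, centre + half


def class_shares(labels: list[str]) -> dict[str, float]:
    """Fraction of the final set carried by each class, most common first."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.most_common()}
```

Recording the accuracy together with the sample size, the interval, and the class shares gives the next retrain a baseline to compare against.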
These post-labeling habits are expanded in Labeling Habits That Separate Good Datasets From Lucky Ones, and the foundational logic is in Why Your Model Is Only as Smart as Its Labels.
The temptation in Phase 3 is to declare victory the moment the last example is tagged and move on to training. Resist it. The audit and documentation steps cost a fraction of the labeling effort but determine whether the next person, including future you, can trust and reproduce the work. A dataset shipped without an audit number or versioned guidelines is a liability the moment anyone needs to retrain or debug.
Turn this list into your own tool
The checklist above is a starting template, not gospel. The most valuable version of it is the one you adapt to your specific task, adding the items that matter for your domain and pruning the ones that genuinely do not apply. A team labeling medical images will add regulatory and privacy items; a team tagging blog categories will not. Copy this into your project wiki, edit it, and revisit it after each project to fold in whatever you learned. A checklist that never changes is one nobody is actually using.
Frequently Asked Questions
Which items on this list should never be skipped?
The one-sentence prediction question, the documented edge cases, the pilot, and the cold audit. These four prevent the failures that cost the most to fix later. Everything else is adjustable; these are load-bearing.
Can a solo labeler use this checklist?
Yes. Replace the multi-labeler pilot with a self-consistency check: label a batch, set it aside, re-label it days later, and compare. The rest of the items apply unchanged to one person.
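The self-consistency check can be sketched as a comparison of two passes keyed by example ID. This is a minimal illustration with a hypothetical function name; it reports simple percent agreement and lists the examples whose label changed, which are the ones worth turning into edge-case rulings.

```python
def self_consistency(
    first_pass: dict[str, str], second_pass: dict[str, str]
) -> tuple[float, list[str]]:
    """Compare two labeling passes by the same person, keyed by example ID.

    Returns the share of shared examples labeled identically, plus the IDs
    whose label changed between passes.
    """
    shared = sorted(first_pass.keys() & second_pass.keys())
    if not shared:
        raise ValueError("the two passes share no example IDs")
    changed = [ex for ex in shared if first_pass[ex] != second_pass[ex]]
    return 1 - len(changed) / len(shared), changed
```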
How is this different from just following good practices?
A checklist forces the practices to happen at the right moment, in order, even under deadline pressure. Knowing the practices is not the same as executing them when you are rushed, which is exactly when items get dropped.
What does the cold audit actually catch?
Drift, schema rot, and creeping inconsistency that a person reviewing their own work cannot see. Because the auditor labels blind, their disagreements reveal real problems rather than confirming existing assumptions.
Should throughput targets ever be used at all?
Yes, but always paired with an accuracy gate. A speed target alone trains annotators to trade quality for volume. Together, the two keep pace and correctness honest.
Key Takeaways
- Run the checklist in three phases: before, during, and after labeling.
- The four never-skip items are the prediction question, documented edge cases, the pilot, and the cold audit.
- Pair any throughput target with an accuracy gate so speed is never silently bought with accuracy.
- Seed gold examples and offer a flag-for-review path to keep quality visible.
- Document and version the final guidelines and audit accuracy for the next retrain.