DRAFT: The Five Stages That Recur in Every Labeling Project

Checklists tell you what to do. Frameworks tell you how to think, which is more durable when your specific task does not match anyone's template. After enough labeling projects, the same five stages keep reappearing regardless of domain, whether you are boxing images, tagging text, or moderating content. Naming them turns scattered habits into a model you can teach, repeat, and loop through deliberately.

This piece introduces DRAFT: Define, Rule, Audit, Flag, Track. It is not a rigid pipeline you run once. It is a cycle you move through, looping back whenever a later stage reveals a weakness in an earlier one. Used well, the framework keeps a data labeling and annotation project honest from first sample to final retrain.

Let us walk through each stage and, just as importantly, when to apply it.

D: Define

Every project starts by defining the prediction question and the label schema. This is the stage that determines the ceiling for everything else.

Write the one-sentence question the model will answer, then enumerate the labels with crisp definitions. If two categories overlap, decide now whether to merge them or go multi-label. Defining badly here caps your accuracy no matter how well you execute the later stages.
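One way to make the Define stage concrete is to write the schema down as data and check it mechanically. The sketch below is only an illustration: the `Label` and `Schema` classes, the `problems` method, and the content-moderation labels are all hypothetical names invented for this example, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    name: str
    definition: str


@dataclass
class Schema:
    question: str               # the one-sentence prediction question
    labels: list[Label]
    multi_label: bool = False   # merge-or-multi-label is decided here, up front

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means the checks pass."""
        issues = []
        names = [label.name for label in self.labels]
        if len(set(names)) != len(names):
            issues.append("duplicate label names")
        if len(self.labels) < 2:
            issues.append("a schema needs at least two labels")
        for label in self.labels:
            if not label.definition.strip():
                issues.append(f"label '{label.name}' has no definition")
        return issues


# Hypothetical content-moderation schema, for illustration only.
schema = Schema(
    question="Does this comment violate the community guidelines?",
    labels=[
        Label("ok", "No guideline is violated."),
        Label("spam", "Unsolicited promotion or repeated off-topic links."),
        Label("abuse", ""),  # definition left blank on purpose
    ],
)
print(schema.problems())  # ["label 'abuse' has no definition"]
```

A check like this cannot tell you whether two categories overlap in meaning; that still takes a pilot. It only guarantees that every label ships with a written definition.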

When to apply: at the start, and again whenever the Flag stage reveals a structural hole in the schema. The deeper logic is in Why Your Model Is Only as Smart as Its Labels.

The signature of a Define problem is disagreement that no rule can resolve, because the categories themselves overlap or a needed category is missing. If your annotators keep flagging the same kind of example and every proposed rule feels arbitrary, the issue is not the rule; it is the schema. That is your signal to return to Define rather than patch Rule, and telling the two apart saves enormous wasted effort.

R: Rule

Definitions handle the easy examples. Rules handle the hard ones. In this stage you run a pilot, find the disagreements, and convert each into an explicit guideline.

Rules come from disagreement

You do not write good rules by imagining edge cases at a desk. You write them by piloting, watching where annotators clash, and ruling on each clash. This is why the pilot is non-negotiable, a point our Step-by-Step Approach to Data Labeling and Annotation Basics drives home.
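As a rough sketch of what "watching where annotators clash" can look like in practice, the snippet below groups pilot annotations by item and counts which label pairs collide most often. The data, the annotator names, and the helper functions `disagreements` and `clash_counts` are hypothetical, and the tuple layout is an assumption about how pilot results might be stored.

```python
from collections import Counter, defaultdict

# Hypothetical pilot data: (item_id, annotator, label)
pilot = [
    ("c1", "ann_a", "spam"), ("c1", "ann_b", "spam"),
    ("c2", "ann_a", "abuse"), ("c2", "ann_b", "ok"),
    ("c3", "ann_a", "spam"), ("c3", "ann_b", "abuse"),
    ("c4", "ann_a", "ok"), ("c4", "ann_b", "ok"),
    ("c5", "ann_a", "ok"), ("c5", "ann_b", "abuse"),
]


def disagreements(rows):
    """Map each item with more than one distinct label to its set of labels."""
    by_item = defaultdict(set)
    for item_id, _annotator, label in rows:
        by_item[item_id].add(label)
    return {item: labels for item, labels in by_item.items() if len(labels) > 1}


def clash_counts(rows):
    """Count how often each combination of labels collides on the same item."""
    counts = Counter()
    for labels in disagreements(rows).values():
        counts[tuple(sorted(labels))] += 1
    return counts


print(sorted(disagreements(pilot)))          # ['c2', 'c3', 'c5']
print(clash_counts(pilot).most_common(1))    # [(('abuse', 'ok'), 2)]
```

The most frequent clash is the natural place to write the first guideline: here, the boundary between "abuse" and "ok" would get a ruling before anything else.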

When to apply: after Define and before scaling, and whenever Track shows agreement slipping.

A: Audit

Auditing measures whether the labels are actually correct against a trusted standard. Pull a random sample, have an expert label it blind, and compare.

The cold, blind nature of the audit matters. An auditor who sees the existing labels confirms assumptions instead of testing them. A genuine cold audit is the single best catch for drift and schema rot.
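A minimal sketch of the audit mechanics follows, assuming production labels live in a dictionary keyed by item id. The function names, the sample size, and the `expert_answers` stand-in for a human expert are all hypothetical. The point it tries to show is that the expert is handed item ids only, never the existing labels.

```python
import random
from collections import Counter


def draw_audit_sample(item_ids, k, seed=0):
    """Pick k items uniformly at random; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(sorted(item_ids), k)


def audit_report(production, expert):
    """Compare production labels with blind expert labels on audited items."""
    errors = Counter(
        (production[item], truth)
        for item, truth in expert.items()
        if production[item] != truth
    )
    accuracy = 1 - sum(errors.values()) / len(expert)
    return accuracy, errors


# Hypothetical production labels, for illustration only.
production = {
    "c1": "spam", "c2": "ok", "c3": "spam",
    "c4": "ok", "c5": "abuse", "c6": "ok",
}

# Stand-in for a human expert; in a real audit these come from blind labeling.
expert_answers = {
    "c1": "spam", "c2": "abuse", "c3": "spam",
    "c4": "ok", "c5": "abuse", "c6": "ok",
}

# The audit packet contains item ids only, so the expert never sees production labels.
sample = draw_audit_sample(production, k=4)
expert = {item: expert_answers[item] for item in sample}

accuracy, errors = audit_report(production, expert)
print(f"audit accuracy: {accuracy:.2f}")
print("(production, expert) error pairs:", dict(errors))
```

The error pairs matter as much as the accuracy figure: a cluster of the same (production, expert) pair suggests a systematic problem rather than random noise.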

When to apply: before every retrain, without exception. The failures this prevents are catalogued in Seven Ways Teams Quietly Poison Their Training Data.

Audit is distinct from Track in a way worth being precise about. Track watches trends continuously and cheaply, like agreement and gold accuracy. Audit is a deeper, periodic, blind re-labeling that establishes real ground truth. Track tells you something might be wrong; Audit tells you what and how wrong. You need both, because continuous tracking without occasional deep audits drifts into measuring the wrong thing confidently.

F: Flag

Flagging is the channel through which annotator confusion becomes visible. Give labelers a frictionless way to mark an example as ambiguous instead of silently guessing.

Flagged examples are not failures; they are your guideline backlog. A project where nothing is ever flagged is not confident; it is hiding uncertainty that will surface later as inconsistency. Each flag, once resolved, feeds back into the Rule stage and sometimes the Define stage.

When to apply: continuously, throughout all labeling.

T: Track

Tracking is the measurement layer that runs under everything. You track inter-annotator agreement, gold accuracy over time, and class balance.

Without tracking, drift is invisible until the model misbehaves. With it, a dip in agreement or a slide in gold accuracy is an early warning that something needs attention, usually a loop back to Rule or Define.

The discipline of Track is to pick a small number of metrics and watch them over time rather than computing many metrics once. A single agreement number on day one is nearly useless; the same number plotted across weeks reveals drift the instant it begins. Trends carry the information, not snapshots, which is why Track is a continuous stage rather than a one-time measurement.
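To illustrate "trends, not snapshots", the sketch below computes one agreement number per week and flags sharp week-over-week drops. Cohen's kappa is used here as one common agreement statistic for two annotators; the article does not prescribe a particular metric, and the weekly data, the `drift_alerts` helper, and the 0.2 threshold are assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def drift_alerts(series, max_drop=0.2):
    """Return the indices of weeks whose value fell by more than max_drop."""
    return [
        week for week in range(1, len(series))
        if series[week - 1] - series[week] > max_drop
    ]


# Hypothetical weekly batches: (annotator A labels, annotator B labels)
weeks = [
    (["ok", "ok", "spam", "spam"], ["ok", "ok", "spam", "spam"]),
    (["ok", "ok", "spam", "spam"], ["ok", "spam", "spam", "spam"]),
    (["ok", "ok", "spam", "spam"], ["ok", "spam", "spam", "ok"]),
]

kappas = [cohen_kappa(a, b) for a, b in weeks]
print(kappas)                 # [1.0, 0.5, 0.0]
print(drift_alerts(kappas))   # [1, 2]
```

Any single value in that list says little on its own. The slide from 1.0 to 0.0 across three weeks is what signals a loop back to Rule or Define.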

When to apply: always on, in the background. The habits that operationalize tracking are in Labeling Habits That Separate Good Datasets From Lucky Ones.

F revisited: Why Flag Is the Engine

It is worth dwelling on Flag, because it is the stage that keeps the whole cycle alive. Define, Rule, and Audit are things you do to the data. Flag is the channel through which the data, and the people closest to it, talk back to you. Without it, the framework is a one-way pipeline that cannot learn from the examples it is failing on.

A practical sign of a healthy DRAFT loop is a steady trickle of flags that gradually slows as the schema matures. A loop with zero flags is not finished; it is deaf. A loop with a rising flood of flags is telling you the schema has a structural problem that belongs back in Define. The flag rate itself is a diagnostic, and reading it is part of running the framework well.
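Reading the flag rate can be sketched in a few lines. The version below is deliberately crude: it compares only the first and last weeks, and the `tolerance` value, the function names, and the weekly counts are hypothetical choices for illustration rather than recommended settings.

```python
def flag_rates(weekly):
    """Convert (flags, items_labeled) pairs into per-week flag rates."""
    return [flags / labeled for flags, labeled in weekly]


def read_flag_trend(rates, tolerance=0.005):
    """Give a rough reading of a flag-rate series by comparing first and last week."""
    if all(rate == 0 for rate in rates):
        return "silent: check that flagging is easy and encouraged"
    change = rates[-1] - rates[0]
    if change > tolerance:
        return "rising: suspect a structural schema problem, revisit Define"
    if change < -tolerance:
        return "slowing: the schema appears to be maturing"
    return "flat: keep watching"


# Hypothetical weekly counts: (flags raised, items labeled)
weekly = [(40, 1000), (30, 1000), (12, 1000)]

print(read_flag_trend(flag_rates(weekly)))
# slowing: the schema appears to be maturing
```

A real project would likely smooth over more weeks before reacting, but even this rough reading separates the three cases described above: silent, rising, and slowing.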

How the Stages Loop

DRAFT is a cycle, not a line. Tracking reveals slipping agreement, which sends you back to Rule. A flag reveals a category that does not exist yet, which sends you back to Define. An audit reveals systematic error, which sends you back to Rule or Define. The framework's value is that it tells you exactly where to return when something breaks, instead of leaving you to guess.

Contrast this with the usual failure, where a team notices the model misbehaving and reacts with a scattershot of changes (more data here, a guideline tweak there, a new annotator) with no theory of which lever matters. DRAFT replaces that flailing with a diagnosis. A drop in agreement points to Rule. A missing category points to Define. A systematic audit miss points to whichever stage produced the systematic error. Naming the stages is what makes the diagnosis possible, and the diagnosis is what saves the wasted effort.
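The diagnosis described above is small enough to write down as a lookup table. The sketch below is one hypothetical encoding: the symptom names and the `diagnose` function are invented for this example, and the mapping simply restates the loops described in this section.

```python
# Symptom names are hypothetical; the stages they map to come from the loops above.
SYMPTOM_TO_STAGES = {
    "agreement_drop": ["Rule"],
    "missing_category": ["Define"],
    "systematic_audit_error": ["Rule", "Define"],
}


def diagnose(symptom):
    """Return the candidate stages to revisit for a named symptom."""
    try:
        return SYMPTOM_TO_STAGES[symptom]
    except KeyError:
        raise ValueError(f"unknown symptom: {symptom!r}") from None


print(diagnose("agreement_drop"))           # ['Rule']
print(diagnose("systematic_audit_error"))   # ['Rule', 'Define']
```

For a systematic audit error, the table only narrows the search to two stages; deciding between them still depends on whether a rule could resolve the error or the categories themselves are at fault.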

Frequently Asked Questions

Is DRAFT a sequence or a loop?

A loop with a natural starting order. You begin at Define and move through to Track, but later stages routinely send you back to earlier ones. Treating it as a one-time sequence misses its main benefit, which is knowing where to return when a problem appears.

How is DRAFT different from a checklist?

A checklist is a flat list of actions; DRAFT is a model of relationships that tells you which stage a given problem belongs to. When agreement drops, the framework points you to Rule. A checklist just lists items without explaining their connections.

Which stage do teams most often neglect?

Flag. Teams build Define, Rule, Audit, and Track but forget to give annotators a way to surface confusion. Without Flag, the framework cannot learn from the exact examples that would improve it most.

Can I use DRAFT for annotation tasks, not just classification?

Yes. Annotation tasks simply spend more iterations in Rule, because span boundaries and box conventions generate more disagreement. The five stages and their loops apply identically; the time distribution shifts.

When am I "done" with DRAFT?

You are never fully done on an active project; Track and Flag run continuously. You reach a steady state when audits stay clean and flags slow to a trickle, signaling the schema has matured.

Key Takeaways

DRAFT names five recurring stages: Define, Rule, Audit, Flag, Track.
Define sets the ceiling; rules come from piloting and resolving real disagreements.
Audits must be cold and blind to genuinely catch drift and schema rot.
Flag turns silent confusion into a guideline backlog; do not neglect it.
DRAFT is a loop, and its value is telling you which stage to return to when something breaks.

Agency Script Editorial
