Ad-hoc data collection produces ad-hoc results. The teams that consistently ship strong models are not necessarily smarter; they run a repeatable process, while everyone else improvises and rediscovers the same mistakes. This article introduces a named, reusable framework you can apply to any collection effort, from a weekend fine-tune to a large pipeline.
We call it the SCALE framework: Specify, Collect, Audit, Label, Evaluate. The name is a mnemonic, not magic. What matters is that the five stages run in order, each one gates the next, and you can drop back a stage when a later step exposes a problem. The sections below cover what each stage does, why it gates the next, and when to apply it.
Stage 1: Specify
Everything starts with specification. Before touching data, write the target behavior in one sentence and draft ideal input-output examples by hand.
Why It Gates Everything
You cannot collect, audit, or label data well without knowing what the model should do. Specification turns a vague ambition into a concrete target. The handwritten examples double as your first evaluation set, so this stage also seeds the final one.
Apply this stage in full every single time, even on small projects. Skipping it is the root cause of most wasted collection effort, as detailed in our common mistakes article.
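One way to keep the specification honest is to store it as data rather than in someone's head. The sketch below is illustrative only: `TaskSpec`, `SpecExample`, and the `problems` check are hypothetical names, and the one-sentence test is a crude punctuation count, not a real sentence parser.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpecExample:
    """One handwritten ideal input-output pair (hypothetical structure)."""
    input_text: str
    ideal_output: str


@dataclass
class TaskSpec:
    """Hypothetical container for a Stage 1 specification."""
    target_behavior: str  # the one-sentence description of what the model should do
    examples: list[SpecExample] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return reasons the spec is not ready; an empty list means ready."""
        issues = []
        sentence = self.target_behavior.strip()
        if not sentence:
            issues.append("target behavior is empty")
        # Crude check: more than one terminator suggests several sentences.
        elif sum(sentence.count(mark) for mark in ".!?") > 1:
            issues.append("target behavior should be one sentence")
        if not self.examples:
            issues.append("no handwritten examples yet")
        return issues


# Illustrative spec; the task and example text are made up.
spec = TaskSpec(
    target_behavior="Summarize a support ticket in two neutral sentences.",
    examples=[
        SpecExample(
            input_text="App crashes every time I log in on my phone",
            ideal_output="The user reports a crash at login on mobile",
        )
    ],
)
print(spec.problems())
```

Because the examples live in the same object as the behavior statement, they can be reused unchanged as the first evaluation set in Stage 5.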
Stage 2: Collect
With the target defined, gather data from the smallest set of sources that could cover it. Rank sources by quality and by how clear your rights are: first-party highest, then licensed, then public web, with synthetic data reserved for gaps.
The Provenance Rule
Log source, date, and usage rights for every batch as you collect. This is part of the stage, not an optional add-on. Collection without provenance is collection you cannot defend or reproduce later.
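A provenance log can be as simple as one JSON line per batch. The following is a minimal sketch, assuming an append-only JSON Lines file is acceptable for your project; `ProvenanceRecord`, `log_batch`, and the field names are hypothetical, chosen to mirror the three items named above.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-batch record: source, date, and usage rights."""
    batch_id: str
    source: str
    collected_on: str  # ISO 8601 date string
    usage_rights: str


def log_batch(log_path, batch_id, source, usage_rights, collected_on=None):
    """Append one provenance record to a JSON Lines file and return it."""
    # Refuse to log a batch with missing provenance instead of guessing.
    if not source or not usage_rights:
        raise ValueError("source and usage_rights are required for every batch")
    when = (collected_on or date.today()).isoformat()
    record = ProvenanceRecord(batch_id, source, when, usage_rights)
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(record)) + "\n")
    return record
```

Calling `log_batch("provenance.jsonl", "batch-001", "first-party support export", "owned, internal use")` at the moment a batch lands keeps the log in step with collection, and raising on a missing field makes provenance a hard requirement rather than a habit.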
Apply judgment on scope here. For most applied projects, collect conservatively and curate hard. For foundation-model pretraining, scale matters more and the stage looks different. The examples article shows how collection varies across system types.
Stage 3: Audit
Before investing in labels, audit the raw collected data. This stage answers two questions: is the data clean, and is it balanced?
- Clean: deduplicate, filter out junk and harmful content, normalize formatting.
- Balanced: break the data down by relevant categories and find thin or missing groups.
Audit gates labeling because labeling skewed or dirty data is expensive waste. If the audit reveals gaps, drop back to Collect and gather specifically for them rather than padding with more of what is abundant. This back-and-forth between Collect and Audit is the framework working as intended.
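The two audit questions can be sketched in a few lines. This is a simplified illustration, not a production audit: it assumes each record is a dict with hypothetical `text` and `category` keys, it catches only exact duplicates after whitespace and case normalization, and the `min_share` threshold is an arbitrary placeholder you would set per project.

```python
import hashlib
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(records):
    """Keep the first record for each distinct normalized text."""
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(normalize(record["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


def thin_groups(records, expected_categories, min_share=0.10):
    """Return {category: share} for expected categories below min_share."""
    counts = Counter(record["category"] for record in records)
    total = len(records)
    thin = {}
    for category in expected_categories:
        share = counts[category] / total if total else 0.0
        if share < min_share:
            thin[category] = share
    return thin


# Made-up records for illustration.
records = [
    {"text": "Reset my password", "category": "account"},
    {"text": "reset  my   PASSWORD", "category": "account"},
    {"text": "Refund my order", "category": "billing"},
    {"text": "Cannot log in", "category": "account"},
]
unique = deduplicate(records)
gaps = thin_groups(unique, ["account", "billing", "shipping"], min_share=0.2)
print(len(unique), gaps)
```

Passing the list of expected categories explicitly matters: a group that is entirely missing never shows up in the counts, so it can only be detected against a list written during Specify.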
Stage 4: Label
If your task needs labels, this is where you create them, and the framework treats labeling as a first-class stage rather than a footnote because labels cap the model's ceiling.
Running the Stage Well
- Write instructions with concrete edge-case examples.
- Label a sample yourself first to find the ambiguity.
- Measure annotator agreement on a shared subset.
- Route disagreements back into sharper instructions.
Low agreement means you drop back to refine instructions before scaling. Labeling gates evaluation because contradictory labels make any later metric meaningless. The best practices article goes deep on running this stage.
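The framework does not prescribe an agreement metric. One common choice for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes two annotators labeled the same items in the same order; returning 1.0 when both annotators use a single identical label throughout is a convention chosen here, since kappa is undefined in that case.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items, in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere; kappa is undefined.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels on a shared subset of four items.
annotator_a = ["pos", "pos", "neg", "neg"]
annotator_b = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(annotator_a, annotator_b))  # 0.5
```

Here the annotators agree on three of four items (0.75 raw agreement), but half of that is expected by chance, so kappa is 0.5. What counts as "low" is a project decision; the point of the number is to trigger the drop-back to instructions before scaling.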
Stage 5: Evaluate
The final stage closes the loop. Split the data, decontaminate the test set against training, seal it, and measure the model against both the held-out set and your Stage 1 handwritten examples.
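Decontamination can take many forms. A minimal sketch, assuming word-level n-gram overlap is an adequate proxy for leakage in your data: drop any test item that matches a training item exactly after normalization or shares at least one n-gram with the training set. The function names and the default `n=8` are illustrative assumptions, not a recommended setting.

```python
def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def ngrams(text, n):
    """Set of word-level n-grams; empty if the text has fewer than n words."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(test_texts, train_texts, n=8):
    """Split test_texts into (kept, dropped) based on overlap with train_texts."""
    train_exact = {normalize(text) for text in train_texts}
    train_ngrams = set()
    for text in train_texts:
        train_ngrams |= ngrams(text, n)

    kept, dropped = [], []
    for text in test_texts:
        exact_match = normalize(text) in train_exact
        shared_ngram = bool(ngrams(text, n) & train_ngrams)
        if exact_match or shared_ngram:
            dropped.append(text)
        else:
            kept.append(text)
    return kept, dropped


# Made-up data; n=3 only because the example sentences are short.
train = ["the quick brown fox jumps over the lazy dog"]
test = [
    "The quick brown fox sleeps",
    "a completely different sentence here",
    "The  quick brown fox jumps over the lazy dog",
]
kept, dropped = decontaminate(test, train, n=3)
print(kept)
```

Items shorter than `n` words produce no n-grams, so only the exact-match check protects them. Paraphrased leakage passes both checks, which is one reason the handwritten Stage 1 examples remain a useful second yardstick.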
Why Evaluate Feeds Back
When results are weak, Evaluate tells you which earlier stage to revisit. Poor coverage sends you back to Collect. Inconsistent output sends you back to Label. Scores that look strong on the held-out set and then collapse on new inputs mean contamination slipped through. The framework is a loop, not a line: Evaluate is where you learn what to fix, and the highest-leverage fix is almost always better data, not a different model.
Apply all five stages on every serious project. On tiny experiments you can run them lightly, but never skip Specify or the decontamination step in Evaluate. Those two are the load-bearing walls. For the sequential, hands-on version of this loop, see the step-by-step guide.
How the Stages Gate Each Other
The reason SCALE is a framework and not just five steps is the gating between stages. Each stage produces something the next one depends on, and a failure downstream sends you back upstream.
- Specify gates Collect. Without a defined behavior, you cannot know what to gather.
- Collect gates Audit. You cannot audit composition without data and its provenance.
- Audit gates Label. Labeling dirty or skewed data is expensive waste, so you clean and balance first.
- Label gates Evaluate. Contradictory labels make any metric meaningless, so labels must be consistent before you measure.
- Evaluate feeds all of them. Weak results point you to the specific earlier stage to revisit.
This structure is what keeps a team from grinding forward on a broken foundation. When something is wrong, the gates tell you where to look.
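The gating logic itself is small enough to write down. The sketch below is one hypothetical way to track it, assuming each stage has a single pass/fail gate: `next_stage` finds the first gate that has not passed, and `drop_back` reopens a stage together with everything downstream of it.

```python
STAGES = ("Specify", "Collect", "Audit", "Label", "Evaluate")


def next_stage(passed):
    """Return the first stage whose gate has not passed, or None when all have."""
    for stage in STAGES:
        if stage not in passed:
            return stage
    return None


def drop_back(passed, to_stage):
    """Reopen to_stage and every stage after it; earlier stages stay passed."""
    cutoff = STAGES.index(to_stage)
    return {stage for stage in passed if STAGES.index(stage) < cutoff}


# The audit found a coverage gap, so the team drops back to Collect.
passed = {"Specify", "Collect", "Audit"}
print(next_stage(passed))            # Label
passed = drop_back(passed, "Collect")
print(next_stage(passed))            # Collect
```

Reopening the downstream stages is the design choice that matters: once Collect changes, the earlier Audit result no longer describes the data, so its gate has to be earned again.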
Applying the Framework at Different Scales
SCALE flexes to the size of the work, but the stages never disappear entirely.
For a small fine-tuning project, the stages are lightweight. Specify is an afternoon of writing examples, Collect is a data export, Audit is a manual review, Label is a small careful effort, and Evaluate is a held-out slice. The whole loop might run in days.
For a large pipeline, each stage becomes its own workstream with dedicated tooling. Collect operates at scale with crawlers, Audit uses automated quality classifiers, and Label may involve a managed annotation workforce. The tools article surveys what each stage needs at scale.
The point of a framework is that the same map guides both. You scale the effort inside each stage without ever skipping a stage, which is exactly why a reusable model beats improvising from scratch every time.
Frequently Asked Questions
Do I have to run all five stages every time?
Run all five on any serious project, scaling effort to the stakes. On small experiments you can run them lightly, but never skip Specify or the decontamination step inside Evaluate. Those two prevent the most common and most expensive failures, so they earn their place even on quick work.
What makes this a framework rather than just a list of steps?
The stages gate each other, and you loop back when a later stage exposes a problem. A weak audit sends you back to Collect; a weak evaluation tells you whether to revisit Collect or Label. That feedback structure, not the order alone, is what makes it a reusable model rather than a checklist.
Where do most teams break the framework?
At the Audit-to-Collect feedback loop. When the audit reveals a coverage gap, the disciplined move is to collect specifically for it, but teams often pad with whatever data is easy instead. That deepens the imbalance and produces a model that fails invisibly on underrepresented cases.
How does this framework handle synthetic data?
Synthetic data lives in the Collect stage as a gap-filler, reserved for cases real data cannot cover. Evaluate then verifies it actually helps. If a synthetic addition does not improve results against the held-out set, the framework says cut it, because it adds the generator's quirks for no benefit.
Can this framework apply to pretraining a foundation model?
The stages still apply, but the emphasis shifts. Collect dominates and operates at massive scale, Audit leans on automated filtering, and Label may be minimal. For the applied fine-tuning and product-building most practitioners do, all five stages carry real weight.
Key Takeaways
- The SCALE framework runs five gated stages: Specify, Collect, Audit, Label, Evaluate.
- Specify anchors everything; write the behavior and ideal examples before collecting.
- Collect conservatively with provenance logged for every batch.
- Audit gates Label, and Label gates Evaluate; loop back when a stage exposes a problem.
- Evaluate closes the loop and points you to the data fix, which beats a model fix almost every time.