Checklists tell you what to do. A framework tells you how to think, so you can adapt when your situation does not match the checklist. This article introduces a named, reusable model for distillation projects (the DISTILL framework), with seven stages, what each is for, and when it dominates your attention. Use it to structure a project from scratch or to diagnose one that is going sideways.
The framework is deliberately simple. Each letter is a stage, the stages run in order, and the discipline is to not skip ahead. Most failed projects fail because they jumped to training before nailing the earlier stages. For the underlying concept, see the complete guide; this article is about the reasoning model that sits on top of it.
The DISTILL Framework at a Glance
- D: Define the task and the bar
- I: Investigate cheaper alternatives
- S: Source representative data
- T: Teach from filtered teacher outputs
- I: Iterate on student size
- L: Look at slices, not averages
- L: Launch and lifecycle
Each stage has a clear exit condition. You move on only when it is met.
D: Define the Task and the Bar
Everything downstream depends on a precise task and a measurable bar. Define exactly what capability the student must reproduce, and set the metric, threshold, cost target, and latency target before you have a model.
Exit Condition
You can state the task in one sentence and name a number that means "ship it." If you cannot, you are not ready to proceed. A vague task produces a vague result.
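One way to make the exit condition concrete is to write the bar down as data before any student exists. The sketch below is illustrative only: the `ShipBar` class, its field names, and every number in it are assumptions made for this example, not values the framework prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShipBar:
    """Targets fixed in stage D, before any model is trained (hypothetical structure)."""
    task: str                  # the one-sentence task statement
    metric: str                # name of the quality metric
    min_quality: float         # the number that means "ship it"
    max_cost_per_1k: float     # cost target, dollars per 1,000 requests
    max_p95_latency_ms: float  # latency target

    def is_met(self, quality: float, cost_per_1k: float, p95_latency_ms: float) -> bool:
        # All three targets must hold at once; quality alone is not enough.
        return (
            quality >= self.min_quality
            and cost_per_1k <= self.max_cost_per_1k
            and p95_latency_ms <= self.max_p95_latency_ms
        )


# Illustrative values only.
bar = ShipBar(
    task="Classify support tickets into one of 12 routing categories.",
    metric="accuracy",
    min_quality=0.95,
    max_cost_per_1k=0.50,
    max_p95_latency_ms=200.0,
)
```

Freezing the dataclass is a small guard against moving the bar after results come in.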
I: Investigate Cheaper Alternatives
Distillation is not free. Before committing, investigate whether a smaller off-the-shelf model, prompt optimization, or teacher quantization already clears the bar from stage D.
Exit Condition
You have evidence that the cheaper options fall short of your targets. If one of them clears the bar, the framework tells you to stop and take it; that is a successful outcome, not a failure. The best practices article calls this the "do nothing" baseline for a reason.
S: Source Representative Data
This is the stage that determines the outcome, and the one teams shortchange most. Source prompts from real production traffic, match the distribution to live conditions, cover edge cases, and reserve a held-out test set.
Why This Stage Dominates
The student learns precisely what you show it. A wrong distribution here cannot be repaired by anything later. Budget the bulk of your project time here. The step-by-step how-to details the mechanics of building the prompt set.
Exit Condition
Your training set reflects production traffic, including rare critical categories, and a held-out set is set aside untouched.
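Reserving the held-out set can be sketched as a stratified split, so that rare critical categories appear in both the training and held-out portions. This is a minimal sketch under stated assumptions: each example is a `(prompt, category)` pair, and the function name, the 20 percent default, and the sample data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(examples, holdout_fraction=0.2, seed=0):
    """Split (prompt, category) pairs into (train, heldout), category by category."""
    rng = random.Random(seed)  # fixed seed keeps the held-out set stable across runs
    by_category = defaultdict(list)
    for example in examples:
        by_category[example[1]].append(example)

    train, heldout = [], []
    for category in sorted(by_category):
        items = list(by_category[category])
        rng.shuffle(items)
        n = len(items)
        # Hold out at least one example per category when there are two or more,
        # but always leave at least one example for training.
        n_holdout = 0 if n < 2 else min(n - 1, max(1, round(n * holdout_fraction)))
        heldout.extend(items[:n_holdout])
        train.extend(items[n_holdout:])
    return train, heldout


# Made-up data: one common category and one rare but critical category.
examples = [(f"common prompt {i}", "common") for i in range(100)]
examples += [(f"rare prompt {i}", "rare_critical") for i in range(5)]
train, heldout = stratified_split(examples)
```

Splitting per category rather than over the whole pool is the design choice that matters here: a plain random split can leave a five-example category entirely out of the held-out set.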
T: Teach From Filtered Teacher Outputs
Generate the teacher's outputs across your prompt set, then filter them. The quality of the outputs you keep sets the student's ceiling, so verify the checkable outputs and drop the wrong ones, and manually review a sample of the open-ended ones.
Exit Condition
You have a clean teacher-output dataset you trust, and you have confirmed the teacher's terms of service permit training on its outputs. Then you fine-tune the student.
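The filtering step might look like the sketch below. It assumes each record is a dict carrying a `checkable` flag and that you supply your own `verify` function for checkable outputs; those names, the `review_rate` parameter, and the sample records are hypothetical.

```python
import random


def filter_teacher_outputs(records, verify, review_rate=0.1, seed=0):
    """Return (kept, review_queue).

    Checkable records are kept only if verify(record) is True.
    Open-ended records are kept, and a random sample is queued for human review.
    """
    rng = random.Random(seed)
    kept, review_queue = [], []
    for record in records:
        if record["checkable"]:
            if verify(record):
                kept.append(record)
            # Wrong checkable outputs are dropped, not repaired.
        else:
            kept.append(record)
            if rng.random() < review_rate:
                review_queue.append(record)
    return kept, review_queue


# Made-up records; "expected" stands in for whatever ground truth makes an output checkable.
records = [
    {"prompt": "What is 17 + 25?", "output": "42", "expected": "42", "checkable": True},
    {"prompt": "What is 9 * 8?", "output": "71", "expected": "72", "checkable": True},
    {"prompt": "Summarize the ticket.", "output": "Customer wants a refund.", "checkable": False},
]
kept, review_queue = filter_teacher_outputs(
    records,
    verify=lambda r: r["output"] == r["expected"],
    review_rate=1.0,  # review every open-ended output in this tiny example
)
```

If the review sample turns up too many bad open-ended outputs, that is a signal to tighten the filter before training rather than after.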
I: Iterate on Student Size
Do not guess the student size; iterate. Train your smallest plausible candidate and one step larger on the same data, then compare against the bar from stage D.
The Trade-Off This Stage Manages
Smaller saves more but risks insufficient capacity; larger preserves quality but gives back savings. Because the expensive teacher data is reusable across sizes, testing multiple candidates is cheap relative to generating the data. The examples article shows how the optimal size swings widely by task.
Exit Condition
You have the smallest student that clears the quality bar while meeting the cost and latency targets.
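The exit condition amounts to a selection rule over the candidates you trained. The sketch below assumes each candidate has already been trained on the same data and measured; the candidate names, field names, and all numbers are made up for illustration.

```python
def smallest_passing(results, min_quality, max_cost_per_1k, max_p95_latency_ms):
    """Return the smallest candidate that meets every target, or None if none does."""
    for result in sorted(results, key=lambda r: r["params_millions"]):
        if (
            result["quality"] >= min_quality
            and result["cost_per_1k"] <= max_cost_per_1k
            and result["p95_latency_ms"] <= max_p95_latency_ms
        ):
            return result
    return None


# Hypothetical measurements for three student sizes trained on the same teacher data.
results = [
    {"name": "student-small", "params_millions": 125, "quality": 0.930,
     "cost_per_1k": 0.05, "p95_latency_ms": 40},
    {"name": "student-medium", "params_millions": 350, "quality": 0.955,
     "cost_per_1k": 0.12, "p95_latency_ms": 80},
    {"name": "student-large", "params_millions": 1300, "quality": 0.962,
     "cost_per_1k": 0.45, "p95_latency_ms": 190},
]
choice = smallest_passing(results, min_quality=0.95,
                          max_cost_per_1k=0.50, max_p95_latency_ms=200)
```

A `None` result means no candidate clears the bar, which points back to the data or to a larger size rather than to shipping the closest miss.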
L: Look at Slices, Not Averages
Evaluate the chosen student against the teacher on held-out data, broken down by the segments that matter. Set a bar per critical slice and inspect the disagreements between student and teacher directly.
Why This Stage Exists
Aggregate accuracy hides failures on rare but critical categories. A student at 96 percent overall can be failing entirely on a small high-value slice. This stage catches what averages bury, which makes it the corrective to one of the most common mistakes.
Exit Condition
The student meets its bar on every critical slice, not just overall.
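A per-slice report makes the gap between the average and the slices visible. The following is a minimal sketch that assumes each held-out row is a `(slice_name, student_correct)` pair; the slice names, the per-slice bars, and the data are invented to reproduce the 96 percent scenario described above.

```python
from collections import defaultdict


def slice_report(rows, slice_bars, default_bar):
    """Compute accuracy per slice and whether each slice meets its own bar."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in rows:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)

    report = {}
    for slice_name, n in totals.items():
        accuracy = correct[slice_name] / n
        bar = slice_bars.get(slice_name, default_bar)
        report[slice_name] = {"n": n, "accuracy": accuracy, "passes": accuracy >= bar}
    return report


# Made-up held-out results: 960 general rows all correct, 40 refund rows all wrong.
rows = [("general", True)] * 960 + [("refunds", False)] * 40
overall = sum(ok for _, ok in rows) / len(rows)   # 0.96 in aggregate
report = slice_report(rows, slice_bars={"refunds": 0.98}, default_bar=0.95)
ready = all(entry["passes"] for entry in report.values())
```

Here the aggregate clears a 95 percent bar while the refunds slice scores zero, so `ready` is false even though the headline number looks healthy.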
L: Launch and Lifecycle
The final stage spans shipping and maintaining. Launch gradually: shadow mode, then staged rollout, with a fallback to the teacher for low-confidence inputs. Then treat the student as living infrastructure: monitor for drift and re-distill when it slips or the teacher improves.
Exit Condition
There is no true exit. This stage is ongoing. A student is a snapshot of the teacher against a snapshot of your traffic, and both change. The framework loops back to S when drift demands a refresh.
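The two mechanisms in this stage, teacher fallback and drift monitoring, can be sketched as follows. This assumes the student exposes some confidence score and that you periodically measure student-teacher agreement on sampled live traffic; the function names, the 0.8 threshold, and the 0.02 tolerance are hypothetical placeholders to tune for your own system.

```python
def route(student_answer, student_confidence, ask_teacher, threshold=0.8):
    """Serve the student's answer when it is confident; otherwise fall back to the teacher.

    ask_teacher is a zero-argument callable so the expensive call happens only on fallback.
    """
    if student_confidence >= threshold:
        return student_answer, "student"
    return ask_teacher(), "teacher"


def needs_redistill(recent_agreement, baseline_agreement, tolerance=0.02):
    """Flag drift when live student-teacher agreement falls below the launch baseline."""
    return recent_agreement < baseline_agreement - tolerance


# Illustrative calls with made-up values.
served = route("billing", 0.93, ask_teacher=lambda: "refunds")
fallback = route("billing", 0.41, ask_teacher=lambda: "refunds")
drifted = needs_redistill(recent_agreement=0.90, baseline_agreement=0.95)
```

When `needs_redistill` fires, the loop described above applies: return to the Source stage with fresh traffic rather than retraining on the old data.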
When to Apply Each Stage
In a healthy project, you spend most of your time in S and T, a meaningful amount in I (iterate) and L (look at slices), and relatively little in actual training. If a project is failing, diagnose by stage: a student that is broadly weak usually has an S problem; a student that fails on specific cases usually has a T problem or a first-L (look at slices) problem; a student that decays over time has a second-L (launch and lifecycle) problem. The framework doubles as a debugging map.
Worked Example of Stage-Based Diagnosis
Suppose your student tests at 94 percent aggregate but customers complain about a specific category. Walk the stages backward. Is that category in your held-out evaluation as its own slice? If not, your first L stage was incomplete; fix the evaluation before anything else. If it is a slice and it scores poorly, ask whether the training data covered that category adequately; if not, the problem is in S, and you oversample and retrain. If coverage was fine but the teacher's outputs for that category were wrong, the problem is in T, and you tighten filtering. Notice how the framework converts a vague complaint into a specific stage to investigate, which is exactly what saves time on a struggling project.
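The backward walk in this example can be written down as a small decision function. This is only a sketch of the reasoning above: the four boolean inputs are answers you would have to establish by hand, the function name is hypothetical, and the final "Unresolved" branch is an addition for cases the example does not cover.

```python
def diagnose(has_own_slice, slice_meets_bar, coverage_adequate, teacher_outputs_correct):
    """Map answers about one complained-about category to the stage to investigate."""
    if not has_own_slice:
        # The evaluation cannot see the category yet, so fix that first.
        return "Look: add the category as its own held-out slice and re-evaluate"
    if not slice_meets_bar:
        if not coverage_adequate:
            return "Source: oversample the category and retrain"
        if not teacher_outputs_correct:
            return "Teach: tighten filtering of teacher outputs for the category"
    # Not covered by the worked example; needs investigation outside this sketch.
    return "Unresolved: the checks above do not locate the fault"


# The scenario from the example: the slice exists and scores poorly, coverage was thin.
stage = diagnose(
    has_own_slice=True,
    slice_meets_bar=False,
    coverage_adequate=False,
    teacher_outputs_correct=True,
)
```

The checks run in the same order as the prose: evaluation first, then data coverage, then teacher output quality.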
The discipline of naming stages also helps teams communicate. "We have a Source problem" is a precise, shared diagnosis that points everyone at the data. "The model is bad" is not actionable. A named framework gives a team a common language for where a project stands and what to fix next.
Frequently Asked Questions
How is this framework different from a checklist?
A checklist tells you what to verify; a framework tells you how to reason and where to spend your attention. DISTILL gives each stage an exit condition and a debugging role, so when a project goes wrong you can locate the stage at fault rather than re-running everything.
Can I skip the Investigate stage if I am sure?
The framework strongly discourages it. The Investigate stage either confirms distillation is warranted or saves you the entire project. Skipping it is how teams build expensive pipelines for problems a smaller model would have solved for free.
Why are there two L stages?
Because evaluation and lifecycle are distinct disciplines that both center on looking. The first L is looking at slices during evaluation; the second L is looking at live behavior over time. Separating them keeps teams from treating launch as the finish line.
Which stage causes the most project failures?
Source, the data stage. The student learns exactly what it is shown, and a wrong distribution cannot be fixed downstream. When a project produces a broadly weak student, the cause is almost always in the data, not the training.
Does the framework apply to non-language models?
Yes. The stages are model-agnostic: define, investigate, source, teach, iterate, look, and launch apply to image, audio, and ranking models as readily as to language models. Only the specific techniques inside Teach and Iterate vary by modality.
Key Takeaways
- DISTILL is a seven-stage framework: Define, Investigate, Source, Teach, Iterate, Look, Launch.
- Each stage has an exit condition; move on only when it is met, and never skip ahead to training.
- The Source and Teach stages (data and teacher filtering) deserve the bulk of your effort.
- The framework doubles as a debugging map: broad weakness points to Source, specific failures to Teach or evaluation.
- Launch is not an end; the final stage loops back to Source when drift demands re-distillation.