This is the hands-on version. If you understand what distillation is and want a sequence you can actually follow, this article gives you the steps in order, with the decision you make at each one. We focus on sequence-level distillation for language models, because that is what most teams build today: you generate teacher outputs through an API and fine-tune a smaller student on them.
A quick framing before we start. Distillation is a project, not a button. Expect to spend most of your time on data — building the prompt set and cleaning teacher outputs — not on training. Teams that rush the data and over-invest in training configuration usually end up redoing the data anyway.
If you are still fuzzy on the concept, read the beginner's guide first. Otherwise, here is the sequence.
Step 1: Define the Task and the Quality Bar
Before anything else, write down two things: exactly what the student must do, and how good it must be to ship.
Make the Task Narrow
"Summarize legal contracts into five-bullet risk summaries" is a good task. "Be a helpful legal assistant" is not. The narrower the task, the smaller the student can be and the higher the quality you can preserve. Distillation transfers a capability, so the capability has to be definable.
Set a Measurable Bar Now
Decide what "good enough" means before you have a model, so you cannot rationalize a weak result later. Pick a metric — exact match, an LLM-as-judge score, human ratings on a sample — and a threshold. Also set your cost and latency targets. If you cannot state the bar, you cannot tell when you are done.
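One way to keep the bar honest is to write it down as data before any training run. The sketch below is a minimal illustration, not a prescribed format: the `QualityBar` class, its field names, and the example thresholds are all assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityBar:
    """Hypothetical ship criteria, fixed before any student exists."""
    min_score: float                 # e.g. fraction of held-out examples judged acceptable
    max_p95_latency_ms: float        # latency target
    max_cost_per_1k_requests: float  # cost target, in whatever currency you budget in

    def is_met(self, score, p95_latency_ms, cost_per_1k_requests):
        """Return True only if quality, latency, and cost all clear the bar."""
        return (
            score >= self.min_score
            and p95_latency_ms <= self.max_p95_latency_ms
            and cost_per_1k_requests <= self.max_cost_per_1k_requests
        )


# Example thresholds are placeholders, not recommendations.
bar = QualityBar(min_score=0.90, max_p95_latency_ms=400.0, max_cost_per_1k_requests=0.50)
print(bar.is_met(score=0.92, p95_latency_ms=350.0, cost_per_1k_requests=0.40))
```

Freezing the dataclass is a small guard against quietly lowering the threshold after results come in.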
Step 2: Build a Representative Prompt Set
The single biggest driver of student quality is whether your training prompts match real production traffic.
- Pull real inputs from logs if you have them. Synthetic prompts are a fallback, not a first choice.
- Cover the edge cases. The student will be sharp exactly where your prompts are dense and dull where they are sparse.
- Match the production distribution. If 30 percent of real traffic is one category, roughly 30 percent of your prompts should be too.
A few thousand well-chosen prompts often beat tens of thousands of generic ones. Quality and coverage beat raw volume.
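Matching the production distribution can be sketched as stratified sampling from logs. The example below assumes each logged input already carries a category label; the function name `sample_matching_distribution` and the category names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_matching_distribution(logged, target_size, seed=0):
    """Sample (category, prompt) pairs so the category mix approximates the logs.

    `logged` is a list of (category, prompt) tuples. Every category gets at
    least one prompt so rare edge cases are not dropped entirely.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for category, prompt in logged:
        by_category[category].append(prompt)

    total = len(logged)
    sample = []
    for category, prompts in by_category.items():
        quota = max(1, round(target_size * len(prompts) / total))
        k = min(quota, len(prompts))
        sample.extend((category, p) for p in rng.sample(prompts, k))
    return sample


# Toy logs: 30 percent of traffic is one category, 70 percent another.
logged = [("refund", f"refund question {i}") for i in range(30)]
logged += [("billing", f"billing question {i}") for i in range(70)]

prompt_set = sample_matching_distribution(logged, target_size=10)
mix = Counter(category for category, _ in prompt_set)
print(mix)
```

Because of rounding and the one-per-category floor, the result can differ slightly from `target_size`; for a prompt set of a few thousand that is usually negligible.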
Step 3: Generate Teacher Outputs
Now run your prompt set through the teacher and capture its responses. This is the expensive step: you pay the teacher's full inference price for every prompt, but you pay it only once.
Capture Everything Useful
Save the full output, and if the API exposes them, the token-level probabilities or logprobs. For pure sequence-level distillation you can work from text alone, but logprobs let you use a richer loss if your training setup supports it.
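A capture loop along these lines writes one JSON record per prompt. This is a sketch under stated assumptions: `call_teacher` is a hypothetical callable wrapping whatever API client you use, and the `text` and `token_logprobs` keys in its return value are invented for this example rather than taken from any real API.

```python
import io
import json


def capture_teacher_outputs(prompts, call_teacher, sink):
    """Write one JSON line per prompt with the teacher's text and optional logprobs."""
    count = 0
    for i, prompt in enumerate(prompts):
        result = call_teacher(prompt)  # hypothetical: returns a dict
        record = {
            "id": i,
            "prompt": prompt,
            "output": result["text"],
            # Stays None when the API does not expose logprobs.
            "token_logprobs": result.get("token_logprobs"),
        }
        sink.write(json.dumps(record) + "\n")
        count += 1
    return count


def fake_teacher(prompt):
    """Stand-in for a real API client, used only to make the sketch runnable."""
    return {"text": prompt.upper(), "token_logprobs": [-0.1, -0.3]}


buffer = io.StringIO()  # in practice this would be a file opened for writing
written = capture_teacher_outputs(["summarize clause 4"], fake_teacher, buffer)
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(records[0]["output"])
```

Writing line by line means a failed run leaves every completed record on disk, so you do not pay the teacher twice for the same prompt.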
Filter the Teacher's Output
The teacher will make mistakes. Decide your policy now: drop low-confidence outputs, run a verification pass, or keep everything and accept some noise. For tasks with a checkable answer, filter aggressively — a smaller clean dataset beats a larger noisy one. Distilling unfiltered teacher errors is one of the most common mistakes and it quietly caps your student's ceiling.
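The two filtering policies mentioned above, dropping low-confidence outputs and running a verification pass, might look like the following. The record layout, the mean-logprob threshold, and the toy arithmetic verifier are assumptions for illustration; a real verifier depends entirely on your task.

```python
def mean_logprob(record):
    """Average token logprob, or None if the record has no logprobs."""
    logprobs = record.get("token_logprobs")
    if not logprobs:
        return None
    return sum(logprobs) / len(logprobs)


def filter_records(records, min_mean_logprob=-1.0, verifier=None):
    """Keep records that clear the confidence threshold and pass the verifier."""
    kept = []
    for record in records:
        score = mean_logprob(record)
        if score is not None and score < min_mean_logprob:
            continue  # teacher was not confident in this output
        if verifier is not None and not verifier(record):
            continue  # output failed the task-specific check
        kept.append(record)
    return kept


def arithmetic_verifier(record):
    """Toy checkable task: the prompt is 'a+b' and the output should be the sum."""
    a, b = record["prompt"].split("+")
    return int(a) + int(b) == int(record["output"])


records = [
    {"prompt": "2+2", "output": "4", "token_logprobs": [-0.05]},
    {"prompt": "3+3", "output": "7", "token_logprobs": [-0.10]},   # confident but wrong
    {"prompt": "5+5", "output": "10", "token_logprobs": [-2.50]},  # right but low confidence
]
clean = filter_records(records, verifier=arithmetic_verifier)
print([r["prompt"] for r in clean])
```

The second record shows why confidence alone is a weak filter: the teacher can be confidently wrong, which is the case a verifier catches.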
Step 4: Choose the Student Architecture
Pick the base model the student starts from. This is a balance between capacity and cost.
- Too small and the student lacks the capacity to hold the capability, no matter how good your data.
- Too large and you give back the cost savings that justified the project.
- Start with the smallest base you believe could plausibly work, train it, and only size up if it falls short of the bar.
A good default is to try two sizes — your best guess and one step larger — and compare. The data you built in steps 2 and 3 is reusable across both.
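The "smallest first, size up only if needed" rule can be sketched as a loop over candidates ordered by parameter count. Everything named here is hypothetical: the model names, the parameter counts, and `train_and_eval`, which stands in for a full train-and-evaluate run and is stubbed with fixed scores so the example executes.

```python
def pick_smallest_passing(candidates, train_and_eval, min_score):
    """Try candidates from smallest to largest; stop at the first that clears the bar.

    `candidates` is a list of (name, parameter_count) pairs.
    Returns (chosen_name_or_None, list of (name, score) actually tried).
    """
    tried = []
    for name, _param_count in sorted(candidates, key=lambda c: c[1]):
        score = train_and_eval(name)
        tried.append((name, score))
        if score >= min_score:
            return name, tried
    return None, tried


# Hypothetical candidates and stubbed held-out scores.
candidates = [("student-3b", 3_000_000_000), ("student-1b", 1_000_000_000)]
stub_scores = {"student-1b": 0.86, "student-3b": 0.92}

chosen, tried = pick_smallest_passing(candidates, stub_scores.get, min_score=0.90)
print(chosen, tried)
```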
Step 5: Train the Student
With clean data and a base model, training is the most mechanical part.
- Hold out a test set the student never trains on, drawn from the same distribution.
- Fine-tune the student on the teacher outputs. For sequence-level distillation this is standard supervised fine-tuning on the teacher's generations.
- Watch for overfitting. If train quality climbs while held-out quality stalls, stop early.
- Keep checkpoints so you can roll back to the best one.
Resist the urge to over-tune hyperparameters before your data is solid. A clean dataset with default settings beats a noisy dataset with a perfectly tuned schedule.
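The hold-out split and the early-stopping rule are independent of any training framework, so they can be sketched in plain Python. The function names, the 10 percent hold-out fraction, and the patience of two evaluations are assumptions for illustration; the list of held-out scores stands in for whatever your training loop reports per checkpoint.

```python
import random


def split_holdout(records, holdout_fraction=0.1, seed=0):
    """Shuffle once, then carve off a held-out set the student never trains on."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def pick_checkpoint(heldout_scores, patience=2):
    """Return (best_checkpoint_index, index_where_training_stopped).

    Training stops once held-out quality has failed to improve for
    `patience` consecutive evaluations.
    """
    best_index, stale = 0, 0
    for i, score in enumerate(heldout_scores):
        if score > heldout_scores[best_index]:
            best_index, stale = i, 0
        elif i > 0:
            stale += 1
            if stale >= patience:
                return best_index, i
    return best_index, len(heldout_scores) - 1


train, heldout = split_holdout(list(range(100)))
best, stopped = pick_checkpoint([0.70, 0.78, 0.81, 0.80, 0.79, 0.85], patience=2)
print(len(train), len(heldout), best, stopped)
```

In the example, training stops at the fifth evaluation and rolls back to the third, which is why keeping checkpoints matters. The later 0.85 is never seen, which is the usual trade-off with a short patience.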
Step 6: Evaluate Against the Teacher
Aggregate accuracy is not enough. You need to know where the student and teacher disagree.
Measure the Gap, Not Just the Score
Run both models on the held-out set and compute the student's quality relative to the teacher. Then inspect the disagreements directly. A student at 95 percent overall might be failing entirely on one critical category while acing the easy ones. That hidden failure is invisible in the aggregate number.
Check the Production-Critical Slices
Break results down by the segments that matter for your application. If one customer tier or input type is business-critical, evaluate it separately. Our best practices article goes deeper on slice-based evaluation.
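A per-slice comparison can be as simple as the sketch below. It assumes each held-out example has already been scored as correct or incorrect for both models; the row keys (`slice`, `teacher_ok`, `student_ok`) and slice names are hypothetical.

```python
from collections import defaultdict


def per_slice_report(rows):
    """Compute teacher and student accuracy per slice, plus student relative to teacher."""
    counts = defaultdict(lambda: {"n": 0, "teacher": 0, "student": 0})
    for row in rows:
        c = counts[row["slice"]]
        c["n"] += 1
        c["teacher"] += int(row["teacher_ok"])
        c["student"] += int(row["student_ok"])

    report = {}
    for name, c in counts.items():
        teacher_acc = c["teacher"] / c["n"]
        student_acc = c["student"] / c["n"]
        report[name] = {
            "n": c["n"],
            "teacher": teacher_acc,
            "student": student_acc,
            # Undefined when the teacher itself gets nothing right on the slice.
            "relative": student_acc / teacher_acc if teacher_acc else None,
        }
    return report


# Toy data: the student looks fine overall (18 of 20) but fails one slice entirely.
rows = [{"slice": "routine", "teacher_ok": True, "student_ok": True}] * 18
rows += [{"slice": "critical", "teacher_ok": True, "student_ok": False}] * 2

report = per_slice_report(rows)
for name, stats in report.items():
    print(name, stats)
```

Reporting `n` alongside each rate is deliberate: a slice with two examples tells you where to look, not how bad the problem is.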
Step 7: Ship Behind a Safety Net
Do not flip the whole system to the student on day one.
- Run the student in shadow mode first — serve teacher results to users while logging student results — and compare on live traffic.
- Roll out gradually: 5 percent, then 25 percent, then full.
- Keep a fallback to the teacher for inputs where the student's confidence is low.
- Monitor for distribution drift. When real traffic shifts away from your training set, student quality erodes and it is time to re-distill.
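The gradual rollout and the teacher fallback can be combined in one routing decision. The sketch below is one possible approach, not the only one: it hashes a request ID into a stable bucket so a given request stays in or out of the rollout as the percentage grows. The function names, the 0.8 confidence threshold, and the idea that the student reports a single confidence number are all assumptions.

```python
import hashlib


def in_rollout(request_id, percent):
    """Deterministically place a request in one of 100 buckets; serve the first `percent`."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def route(request_id, student_confidence, percent, min_confidence=0.8):
    """Return which model's answer to serve: 'student' or 'teacher'."""
    if not in_rollout(request_id, percent):
        return "teacher"  # request is outside the current rollout stage
    if student_confidence < min_confidence:
        return "teacher"  # fallback: the student is not confident enough
    return "student"


print(route("req-123", student_confidence=0.95, percent=100))
print(route("req-123", student_confidence=0.40, percent=100))
print(route("req-123", student_confidence=0.95, percent=0))
```

Shadow mode corresponds to `percent=0` with the student still run and logged on every request. Because buckets are stable, moving from 5 to 25 percent only adds requests to the student's share and never removes any.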
Instrument From Day One
Set up the monitoring before the rollout, not after. Log the student's confidence on every request, the rate at which inputs fall back to the teacher, and the latency at the tail. Track per-slice quality on live traffic where you have a ground-truth signal or a sampling process. These metrics are how you know the rollout is healthy and how you catch the slow decay that comes with drift. Ship a student without instrumentation and you are flying blind: the first sign of a problem will be a user complaint rather than a dashboard.
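The three request-level metrics named above can be computed from a plain list of log entries. This is a minimal sketch: the log field names (`confidence`, `fell_back`, `latency_ms`) are hypothetical, and the tail latency uses a simple nearest-rank percentile rather than any particular monitoring system's definition.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer `pct` between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


def summarize(request_logs):
    """Aggregate fallback rate, mean student confidence, and p99 latency."""
    n = len(request_logs)
    return {
        "fallback_rate": sum(1 for r in request_logs if r["fell_back"]) / n,
        "mean_confidence": sum(r["confidence"] for r in request_logs) / n,
        "p99_latency_ms": percentile([r["latency_ms"] for r in request_logs], 99),
    }


# Toy logs: eight confident student answers and two slow fallbacks to the teacher.
logs = [{"confidence": 0.9, "fell_back": False, "latency_ms": 100 + i} for i in range(8)]
logs += [{"confidence": 0.4, "fell_back": True, "latency_ms": 900}] * 2

summary = summarize(logs)
print(summary)
```

A rising fallback rate or a falling mean confidence over successive windows is one of the earlier signals that traffic is drifting away from the training set.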
Frequently Asked Questions
How long does a distillation project take?
For a narrow, well-scoped task with good logs, a first working student can come together in days. The variable is data quality. Teams with messy or missing input data spend most of their time there, and that is time well spent.
Do I need the teacher's internal weights?
No. Sequence-level distillation needs only the teacher's outputs, which you can collect through an API. Access to weights or full probability distributions enables richer methods but is not required for a solid result.
What if my student misses the quality bar?
Diagnose before you give up. Usually it is a data problem — missing coverage, noisy teacher outputs, or distribution mismatch — not a model-size problem. Fix the data first, then consider a larger student base.
Should I re-run distillation over time?
Yes, periodically. As production traffic drifts and the teacher improves, your student goes stale. Treat distillation as a maintained pipeline, not a one-time event, especially for tasks where inputs change.
Can I distill from multiple teachers?
You can, and it sometimes helps. Combining outputs from several strong teachers can produce a more robust student, though it adds complexity and cost to the data-generation step. Start with one teacher and add others only if a single one falls short.
Key Takeaways
- Define a narrow task and a measurable quality bar before you touch any model.
- Build a prompt set that matches production traffic — coverage and distribution matter more than volume.
- Generate teacher outputs, then filter them; clean data beats large noisy data.
- Start with the smallest plausible student and size up only if it misses the bar.
- Evaluate the student against the teacher on slices, not just aggregate, and ship gradually behind a fallback.