There is a lot of generic advice about distillation that amounts to "use good data and evaluate carefully." True, but useless. This article is the opinionated version: the practices that actually move outcomes, the reasoning behind each, and the trade-offs you only learn by getting burned. Some of these will be slightly contrarian. That is the point — platitudes do not help you ship.
The through-line is this: distillation succeeds or fails on data and evaluation, not on training cleverness. Almost every team over-invests in the training loop and under-invests in the prompt set. Reverse that priority and your results improve immediately.
Spend 80 Percent of Your Effort on Data
The most important practice is also the most boring. Your prompt set and your teacher-output filtering matter more than your architecture, your loss function, and your hyperparameters combined.
Why Data Dominates
The student learns exactly what you show it. If the data reflects production and the teacher outputs are clean, a mediocre training setup still produces a good student. If the data is wrong, a perfect training setup produces a confident, fast, wrong student. The leverage is overwhelmingly in the data, so put your best people there.
The Practical Implication
Build your prompt set from real production logs, matched to the live distribution. Filter teacher outputs before training — verify the ones you can check, sample-review the ones you cannot. The step-by-step how-to walks through this; treat it as the core of the project, not a prerequisite to the "real" work of training.
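As a rough illustration of "matched to the live distribution", here is a minimal sketch of proportional sampling from logs. It assumes each log entry already carries a category label (for example, an intent tag); the function name and the `(category, prompt)` record shape are hypothetical, not part of any particular logging system.

```python
import random
from collections import defaultdict


def sample_matching_distribution(logs, n, seed=0):
    """Draw about n prompts so each category keeps its share of production traffic.

    `logs` is a list of (category, prompt) pairs. The category label is an
    assumption of this sketch; use whatever segmentation your logs support.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for category, prompt in logs:
        by_category[category].append(prompt)

    total = len(logs)
    sample = []
    for category, prompts in by_category.items():
        # Quota proportional to the category's share of live traffic,
        # with a floor of one so rare categories are not dropped entirely.
        quota = max(1, round(n * len(prompts) / total))
        chosen = rng.sample(prompts, min(quota, len(prompts)))
        sample.extend((category, p) for p in chosen)
    return sample
```

Because of rounding and the floor of one, the result can be slightly larger or smaller than `n`. The point is only that a category carrying 90 percent of traffic should carry roughly 90 percent of the prompt set.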
Always Run a "Do Nothing" Baseline First
Before you commit to a distillation project, prove it is the best option. Distillation has real engineering cost, and cheaper alternatives often clear the bar.
- A smaller off-the-shelf model may already be good enough with no project at all.
- Prompt optimization on the existing small model — better instructions, few-shot examples — sometimes closes the gap for free.
- Quantization of the teacher can cut cost without training a new model.
Run these baselines first. If a non-distillation option meets your quality, cost, and latency targets, take it. Distillation is justified when the simpler options genuinely fall short, not by default.
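The baseline check above can be sketched as a simple gate. The `Result` fields and the target names are assumptions chosen for illustration; substitute whatever quality, cost, and latency measures your team already tracks.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    quality: float         # fraction correct on the held-out set
    cost_per_1k: float     # dollars per 1,000 requests
    p95_latency_ms: float  # 95th-percentile latency in milliseconds


def options_meeting_targets(results, min_quality, max_cost, max_latency_ms):
    """Return the baselines that clear every target, cheapest first."""
    passing = [
        r for r in results
        if r.quality >= min_quality
        and r.cost_per_1k <= max_cost
        and r.p95_latency_ms <= max_latency_ms
    ]
    return sorted(passing, key=lambda r: r.cost_per_1k)
```

If this returns a non-empty list, the first entry is a candidate for shipping without any distillation project. An empty list is the evidence that distillation is worth its engineering cost.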
Evaluate by Slice, Not by Average
A single aggregate score hides the failures that matter. The practice that separates reliable students from fragile ones is slicing your evaluation.
What This Means in Practice
Break your held-out evaluation into the segments that matter for your application: categories, customer tiers, input types, languages. Set a quality bar per critical slice, not just overall. A student at 96 percent average can be failing entirely on a 3 percent slice that happens to be your highest-value customers. Averages reward the easy majority and bury the dangerous minority. Slicing is the corrective for one of the most common distillation mistakes: signing off on a strong aggregate score.
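A minimal sketch of per-slice evaluation follows. It assumes each evaluation record is a `(slice_name, is_correct)` pair and that bars are given as a dictionary; both shapes are illustrative choices, not a prescribed format.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """records: iterable of (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        correct[slice_name] += bool(is_correct)
    return {s: correct[s] / totals[s] for s in totals}


def failing_slices(records, bars):
    """Return {slice: accuracy} for every slice that falls below its own bar.

    Slices named in `bars` but absent from the evaluation data are skipped
    here; in practice a missing critical slice deserves its own alarm.
    """
    accuracy = accuracy_by_slice(records)
    return {s: accuracy[s] for s, bar in bars.items() if s in accuracy and accuracy[s] < bar}
```

With 97 "standard" records of which 96 are correct, and 3 "enterprise" records that are all wrong, the overall accuracy is 96 percent while `failing_slices` reports the enterprise slice at zero.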
Filter the Teacher Like It Is a Junior Employee
Treat the teacher's outputs as a strong but fallible first draft, never as ground truth. The student cannot exceed the quality of the outputs you train it on.
- For checkable tasks, verify every output and drop the wrong ones automatically.
- For open-ended tasks, sample-review and remove the obviously poor responses.
- When you can, generate multiple teacher samples per prompt and keep the best, or use a verification model as a second pass.
The instinct to trust the big model is strong and wrong. Filtering shrinks your dataset, and that is fine — clean and smaller beats noisy and larger.
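For checkable tasks, the filtering step can be sketched as below. The `verify` callable is the part you supply; `verify_sum` is a toy checker invented for this example and stands in for whatever automatic check your task allows (unit tests, schema validation, exact-answer comparison).

```python
def filter_teacher_outputs(examples, verify):
    """Split (prompt, output) pairs into those that pass `verify` and those that fail."""
    kept, dropped = [], []
    for prompt, output in examples:
        if verify(prompt, output):
            kept.append((prompt, output))
        else:
            dropped.append((prompt, output))
    return kept, dropped


def verify_sum(prompt, output):
    """Toy checker for prompts like '2+3': the output must be the integer sum."""
    try:
        left, right = prompt.split("+")
        return int(output) == int(left) + int(right)
    except ValueError:
        # Malformed prompt or non-numeric output counts as a failed check.
        return False
```

Keeping the dropped list is a deliberate choice in this sketch: the failure rate per category is itself a useful signal about where the teacher is weak.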
Size the Student Empirically, Not by Vibes
Do not guess the student size. Test it. Pick two candidate sizes — your best guess and one step larger — train both on the same data, and compare quality against the bar.
The trade-off is direct: smaller saves more money but risks insufficient capacity; larger preserves quality but gives back savings. There is no general right answer, only the right answer for your task and your data. Since the expensive part is generating teacher outputs, and that data is reusable across student sizes, testing multiple sizes is cheaper than it sounds. The examples article shows how the optimal size varies wildly by task.
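The selection rule implied above, the smallest candidate that still clears the bar, can be written down in a few lines. This is a sketch under the assumption that every candidate was trained on the same data and scored on the same held-out set; the `(parameter_count, quality)` tuple shape is hypothetical.

```python
def smallest_passing_student(candidates, quality_bar):
    """candidates: list of (parameter_count, quality) measured on the same eval set.

    Returns the smallest candidate that meets the bar, or None if none does.
    """
    passing = [c for c in candidates if c[1] >= quality_bar]
    return min(passing, key=lambda c: c[0]) if passing else None
```

A `None` result means neither candidate has enough capacity for the task as measured, which is a signal to test a larger size rather than to lower the bar.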
Ship Behind a Shadow and a Fallback
Never cut over to the student blind. The practice that prevents production regressions is staged rollout with a safety net.
- Run the student in shadow mode — log its outputs on live traffic while users still get the teacher.
- Compare shadow results to the teacher on real inputs, not just your offline set.
- Roll out gradually, and route low-confidence inputs back to the teacher as a fallback.
This costs a little engineering and buys you the ability to catch problems before users do. For anything user-facing, it is non-negotiable.
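The shadow and fallback steps above can be sketched as two small functions. Both assume a `student` callable that returns `(answer, confidence)` and a `teacher` callable that returns an answer; these interfaces, and the use of exact-match agreement, are assumptions for illustration. Open-ended tasks would need a softer comparison than `==`.

```python
def serve(prompt, student, teacher, threshold):
    """Return (answer, source), falling back to the teacher on low confidence."""
    answer, confidence = student(prompt)
    if confidence >= threshold:
        return answer, "student"
    return teacher(prompt), "teacher"


def shadow_compare(prompts, student, teacher):
    """Shadow mode: users would receive the teacher's answer; the student is only logged."""
    log = []
    for prompt in prompts:
        served = teacher(prompt)
        shadow_answer, confidence = student(prompt)
        log.append({
            "prompt": prompt,
            "agrees": shadow_answer == served,
            "confidence": confidence,
        })
    agreement = sum(entry["agrees"] for entry in log) / len(log) if log else 0.0
    return agreement, log
```

The shadow log pairs each disagreement with the student's confidence, which is the raw material for choosing the fallback threshold later.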
Calibrate the Confidence Threshold
The fallback only works if the student's confidence signal is meaningful. A student that is confidently wrong defeats the whole pattern. Spend time calibrating the threshold at which you route an input back to the teacher: measure, on your held-out set, how student accuracy varies with its own confidence. If accuracy is high only above a certain confidence level, route everything below that level to the teacher. Set the threshold too high and you send too much traffic to the expensive teacher and lose your savings; set it too low and bad outputs reach users. This calibration is unglamorous, and it is what makes the hybrid pattern actually safe.
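One way to run that measurement, sketched under the assumption that the held-out set yields `(confidence, is_correct)` pairs for the student: sweep candidate thresholds from high to low and keep the lowest one at which the student-served traffic still meets the accuracy target. The function name and record shape are hypothetical.

```python
def pick_threshold(records, target_accuracy):
    """records: list of (confidence, is_correct) pairs from the held-out set.

    Returns (threshold, coverage) for the lowest threshold at which accuracy
    on inputs with confidence >= threshold meets the target, or None if no
    threshold does. Coverage is the share of traffic the student would serve.
    """
    ordered = sorted(records, key=lambda r: r[0], reverse=True)
    best = None
    correct = 0
    for i, (confidence, is_correct) in enumerate(ordered, start=1):
        correct += bool(is_correct)
        # Only evaluate at the end of a run of equal confidences, since a
        # threshold cannot separate inputs that share the same score.
        if i < len(ordered) and ordered[i][0] == confidence:
            continue
        if correct / i >= target_accuracy:
            best = (confidence, i / len(ordered))
    return best
```

The returned coverage makes the cost side of the trade-off explicit: one minus coverage is the fraction of requests that still go to the teacher.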
Generate Multiple Teacher Samples for Hard Examples
A practice that pays off on difficult tasks: for prompts where the teacher's answer is uncertain or high-stakes, generate several teacher samples and keep the best, or use a verification pass to pick the strongest one. A single teacher sample on a hard prompt can be noisy; the teacher itself has variance. Sampling several times and selecting raises the quality of the signal the student learns from, precisely on the examples where quality matters most. This costs more teacher inference, so apply it selectively — to the hard or critical slices, not the whole dataset. The marginal data quality on those examples often determines whether the student clears its per-slice bar.
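A minimal sketch of selective best-of-n follows. The callables `sample_teacher`, `score`, and `is_hard` are placeholders you would supply: a call to the teacher, a verifier or scoring model, and a rule for flagging hard or critical prompts. None of them refers to a real library API.

```python
def best_of_n(prompt, sample_teacher, score, n=4):
    """Draw n teacher samples for one prompt and keep the highest-scoring one."""
    candidates = [sample_teacher(prompt) for _ in range(n)]
    return max(candidates, key=lambda candidate: score(prompt, candidate))


def build_targets(prompts, is_hard, sample_teacher, score, n=4):
    """Spend extra teacher samples only on prompts flagged as hard or critical."""
    targets = {}
    for prompt in prompts:
        if is_hard(prompt):
            targets[prompt] = best_of_n(prompt, sample_teacher, score, n)
        else:
            targets[prompt] = sample_teacher(prompt)
    return targets
```

With `n=4`, each hard prompt costs four teacher calls instead of one, so the share of prompts flagged by `is_hard` directly controls the extra inference bill.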
Treat It as a Pipeline, Not a Project
The last practice is mindset. A student is a snapshot of the teacher against a snapshot of your traffic. Both change. Build the distillation as a repeatable pipeline you can re-run cheaply, monitor the student's live quality, and refresh it when traffic drifts or the teacher improves. Teams that frame distillation as "done" watch their students decay invisibly. Teams that frame it as maintained infrastructure keep the savings without the rot.
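As one possible way to make "refresh it when traffic drifts" concrete, the sketch below compares the category mix of the training prompts with recent live traffic using total variation distance, then combines that with the other refresh triggers. The choice of metric, the category labels, and the `max_drift` parameter are assumptions of this example, not a recommendation from the practices above.

```python
from collections import Counter


def total_variation(train_categories, live_categories):
    """Half the sum of absolute differences between two category distributions (0 to 1)."""
    p, q = Counter(train_categories), Counter(live_categories)
    n_p, n_q = sum(p.values()), sum(q.values())
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in keys)


def should_redistill(live_quality, quality_bar, drift, max_drift, teacher_upgraded=False):
    """Refresh when quality slips, traffic drifts, or the teacher improves."""
    return live_quality < quality_bar or drift > max_drift or teacher_upgraded
```

Running a check like this on a schedule turns the decision to re-distill into a monitored condition instead of something someone has to remember.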
Frequently Asked Questions
What is the single highest-leverage practice?
Building your prompt set from real production traffic at the correct distribution. Nothing else comes close. It determines whether the student is optimized for the inputs you actually serve or for a fantasy distribution that never occurs.
Is it ever wrong to distill?
Yes, often. If a smaller off-the-shelf model, prompt optimization, or quantization meets your targets, distillation is wasted engineering. Always run those baselines first and only distill when they genuinely fall short.
How clean does the teacher data need to be?
As clean as you can make it for the cost. The student's quality ceiling is set by the outputs you train on. For checkable tasks, automate verification and filter hard. For open-ended ones, sample-review and cut the worst.
Should I tune hyperparameters extensively?
Not until your data is solid. Hyperparameters are second-order. A clean, well-distributed dataset with default settings beats a noisy dataset with a perfectly tuned schedule. Fix data first, tune later, and only if it moves the metric.
How do I know when to re-distill?
Monitor the student's quality on live traffic against your bar. Re-distill when it slips, when production traffic visibly drifts from your training set, or when the teacher gets a meaningful upgrade. Make re-running cheap so this is painless.
Key Takeaways
- Put roughly 80 percent of your effort into data — prompt distribution and teacher-output filtering — not the training loop.
- Run "do nothing" baselines first; distill only when smaller models, prompt tuning, or quantization fall short.
- Evaluate by slice, not average, and set a bar for each business-critical segment.
- Filter teacher outputs hard and size the student empirically by testing candidates.
- Ship behind a shadow deployment with a fallback, and maintain distillation as a re-runnable pipeline.