Search the topic and you find a thousand articles that define model distillation in the first sentence and then say nothing useful for the next two thousand words. This is not that. Below are the questions people actually type, ask in Slack, and raise in architecture reviews, answered in the order they tend to come up. No throat-clearing.
Model distillation, for the record, is a method where a smaller "student" model is trained to reproduce the behavior of a larger "teacher" model. Hold that one sentence in your head and every question below becomes easier to reason about.
If you are brand new to the subject, What Is Model Distillation: A Beginner's Guide builds the intuition slowly. This piece assumes you want answers fast.
What Problem Does Distillation Actually Solve?
The core problem is that the most capable models are large, slow, and expensive to serve. Distillation lets you capture most of a big model's useful behavior in a small model you can afford to run at scale.
The three jobs it does well
- Cut inference cost by replacing a large served model with a small one for high-volume, narrow tasks.
- Reduce latency so user-facing features respond in a fraction of the time the large model needs, often moving from seconds to well under one.
- Own a stable artifact instead of depending on a third-party API that changes without notice.
If none of those three pains apply to you, distillation is probably a solution looking for a problem. Be honest about which one you are solving before you start.
How Is It Different From Just Using a Smaller Model?
A smaller off-the-shelf model is trained on generic data and gives you generic quality. A distilled student is trained on your teacher's outputs, often on your task distribution, so it punches above its weight on the specific thing you care about.
The difference is targeting. Picking a small base model is like hiring a generalist. Distilling is like having your best senior person sit with that generalist for a month and teach them exactly how the work gets done. Same headcount, very different output on the job that matters.
What Does the Process Look Like End to End?
People underestimate how much of distillation is data work and how little is exotic machine learning. The skeleton is consistent.
The standard sequence
- Pick the teacher that already does the task well enough to imitate.
- Choose the student architecture and size based on your latency and cost budget.
- Generate distillation data by running representative inputs through the teacher and capturing outputs.
- Train the student on those teacher outputs, with held-out data set aside.
- Evaluate against a test set the student never saw, comparing it to both the teacher and your quality bar.
- Iterate on data coverage, because the first pass almost always has gaps.
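The sequence above can be sketched in a few dozen lines. This is a toy illustration, not a production recipe: `teacher` is a hypothetical rule-based stand-in for a large model, `KeywordStudent` is a deliberately tiny stand-in for a small model, and the ticket-routing task is invented for the example. Note that the score it reports is agreement with the teacher on held-out inputs, which still has to be compared against your own quality bar.

```python
import random
from collections import Counter, defaultdict

def teacher(text):
    """Hypothetical stand-in for a large model: routes support tickets."""
    return "billing" if ("invoice" in text or "refund" in text) else "technical"

class KeywordStudent:
    """Toy student: learns per-word label counts from teacher-labeled data."""

    def __init__(self):
        self.word_votes = defaultdict(Counter)
        self.default_label = "technical"

    def fit(self, examples):
        label_counts = Counter()
        for text, label in examples:
            label_counts[label] += 1
            for word in text.split():
                self.word_votes[word][label] += 1
        if label_counts:
            self.default_label = label_counts.most_common(1)[0][0]

    def predict(self, text):
        votes = Counter()
        for word in text.split():
            if word in self.word_votes:
                votes.update(self.word_votes[word])
        return votes.most_common(1)[0][0] if votes else self.default_label

def distill(inputs, holdout_fraction=0.2, seed=0):
    """Label inputs with the teacher, train the student, score it on held-out data."""
    shuffled = list(inputs)
    random.Random(seed).shuffle(shuffled)
    labeled = [(x, teacher(x)) for x in shuffled]      # generate distillation data
    split = int(len(labeled) * (1 - holdout_fraction))
    train, test = labeled[:split], labeled[split:]     # set held-out data aside
    if not test:
        raise ValueError("need at least one held-out example")
    student = KeywordStudent()
    student.fit(train)                                 # train on teacher outputs
    agreement = sum(student.predict(x) == y for x, y in test) / len(test)
    return student, agreement

# Representative inputs for the (invented) task.
issues = ("refund request", "invoice missing", "login error", "app crash")
inputs = [f"{issue} for order {n}" for n in range(25) for issue in issues]
student, agreement = distill(inputs)
print(f"held-out agreement with teacher: {agreement:.2f}")
```

In a real project the last step, iterating on coverage, means looking at which held-out inputs disagree and generating more data like them.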
For a fuller treatment that turns this into something you can hand to a teammate, see Building a Repeatable Workflow for What Is Model Distillation.
How Much Quality Do You Lose?
There is no universal number, and anyone who quotes one without context is guessing. The honest answer is that loss depends almost entirely on three things: how narrow your task is, how good your data coverage is, and how large the capability gap is between teacher and student.
Narrow task, dense coverage, modest size reduction: you can land very close to the teacher. Broad task, thin coverage, aggressive size reduction: expect a real drop. The biggest losses are rarely uniform; they concentrate in edge cases and rare inputs that did not appear in your distillation data. That is why per-segment evaluation beats a single aggregate score every time.
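To make the per-segment point concrete, here is a minimal sketch with made-up evaluation records. The segment names and counts are assumptions chosen only to show how a healthy aggregate score can hide a failing segment.

```python
from collections import defaultdict

def accuracy_by_segment(records):
    """records: iterable of (segment, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [correct, total]
    for segment, is_correct in records:
        totals[segment][0] += int(is_correct)
        totals[segment][1] += 1
    return {seg: correct / total for seg, (correct, total) in totals.items()}

# Hypothetical results: 95 common inputs all correct, 5 rare inputs mostly wrong.
records = [("common", True)] * 95 + [("rare", True)] * 1 + [("rare", False)] * 4

overall = sum(ok for _, ok in records) / len(records)
per_segment = accuracy_by_segment(records)

print(f"overall: {overall:.2f}")          # 0.96
for segment, acc in per_segment.items():
    print(f"{segment}: {acc:.2f}")        # common: 1.00, rare: 0.20
```

The aggregate looks shippable while the rare segment is wrong four times out of five, which is exactly the pattern an edge-case coverage gap produces.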
Can You Distill From a Model You Do Not Own?
Technically, yes, through black-box distillation, where you need only the teacher's text outputs, not its weights or internal probabilities. You query the teacher, collect completions, and train on them.
Whether you are allowed to is a different question. The terms of service of many frontier providers explicitly prohibit using model outputs to train competing models. The mechanism works regardless of permission, so the binding constraint is almost always the license, not the technique. Read the terms before you collect a single token, and document that you did.
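Assuming the teacher's terms permit it, the collection step is mostly plumbing. The sketch below is one possible shape: `query_teacher` is a hypothetical callable you would supply (for example, a thin wrapper around whatever client your provider offers), and the JSONL layout with `prompt` and `completion` keys is an assumption, not a required format.

```python
import json
import os
import tempfile
from typing import Callable, Iterable

def collect_completions(
    prompts: Iterable[str],
    query_teacher: Callable[[str], str],  # hypothetical: prompt in, completion text out
    out_path: str,
) -> int:
    """Query the teacher once per unique prompt and write JSONL training rows."""
    seen = set()
    written = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            if prompt in seen:
                continue  # skip duplicates so they do not skew the training data
            seen.add(prompt)
            row = {"prompt": prompt, "completion": query_teacher(prompt)}
            f.write(json.dumps(row) + "\n")
            written += 1
    return written

# Usage with a fake teacher so the sketch runs without any network access.
def fake_teacher(prompt: str) -> str:
    return prompt.upper()

out_path = os.path.join(tempfile.mkdtemp(), "distill.jsonl")
count = collect_completions(["hello", "world", "hello"], fake_teacher, out_path)
print(count, "rows written to", out_path)
```

A real version would also need retries, rate limiting, and a record of which teacher version produced each row, all of which are omitted here.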
What Are the Most Common Ways It Goes Wrong?
The failures cluster into a short list, and almost all of them are data or evaluation problems rather than training problems.
The usual suspects
- Coverage gaps: the distillation data missed a whole category of input, so the student fails on it in production.
- Teacher errors baked in: the student faithfully reproduces the teacher's mistakes and can even amplify them.
- Evaluation theater: the team tests on data too similar to the training set and ships an overconfident model.
- Wrong success metric: optimizing for average quality while the business actually cares about worst-case behavior.
Each of these has a known countermeasure. We catalog them with fixes in 7 Common Mistakes with What Is Model Distillation.
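As one illustration of a countermeasure, evaluation theater can be partly caught by checking whether test items are near-duplicates of training items. The sketch below uses word-set Jaccard similarity, which is a crude measure chosen for brevity; the 0.8 threshold and the sample sentences are assumptions for the example.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two strings, from 0.0 (disjoint) to 1.0 (identical sets)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

def leaked_test_items(train, test, threshold=0.8):
    """Return test items that look too similar to at least one training item."""
    return [
        item for item in test
        if any(jaccard(item, seen) >= threshold for seen in train)
    ]

train = ["reset my password please", "cancel my subscription"]
test = ["please reset my password", "where is my invoice"]

leaked = leaked_test_items(train, test)
print(leaked)  # ['please reset my password']
```

This pairwise check is quadratic in dataset size, so it suits a spot check on a sample rather than a full corpus scan.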
When Should You Not Distill?
Distillation is overhead. If your traffic is low, your latency budget is generous, and the API cost is trivial, just call the big model and move on. Building a distillation pipeline to save a few dollars a month is a classic engineering vanity project.
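A quick break-even calculation makes this decision less subjective. Every number below is hypothetical and only there to show the arithmetic; substitute your own traffic, per-request costs, and build estimate.

```python
def months_to_break_even(
    monthly_requests,
    teacher_cost_per_request,
    student_cost_per_request,
    one_time_build_cost,
    monthly_upkeep_cost=0.0,
):
    """Months until serving savings repay the build cost, or None if they never do."""
    monthly_saving = (
        monthly_requests * (teacher_cost_per_request - student_cost_per_request)
        - monthly_upkeep_cost
    )
    if monthly_saving <= 0:
        return None
    return one_time_build_cost / monthly_saving

# Hypothetical low-traffic case: savings of about 18 per month against a 20,000 build.
low = months_to_break_even(10_000, 0.002, 0.0002, 20_000)

# Same hypothetical costs at 50 million requests per month.
high = months_to_break_even(50_000_000, 0.002, 0.0002, 20_000)

print(f"low traffic:  {low:,.0f} months")   # roughly 1,111 months
print(f"high traffic: {high:.2f} months")   # roughly 0.22 months
```

With identical per-request costs, traffic alone moves the payback period from decades to days, which is why volume is the first thing to check.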
You should also avoid it when your task is genuinely open-ended and unpredictable. Distillation shines on bounded, repetitive tasks. The more your inputs sprawl across every possible domain, the harder it is to cover them, and the more a small student will disappoint. Match the technique to the shape of the work, a theme developed in A Framework for What Is Model Distillation.
What Tools Do You Need?
Less than you would think. You need a way to query the teacher at volume, a training framework for the student, a data store for the generated examples, and an evaluation harness. The training stack is standard fine-tuning tooling. The harder, less glamorous parts are data pipelines and evaluation. For a survey of what fits where, see The Best Tools for What Is Model Distillation.
Frequently Asked Questions
How long does a distillation project take?
A focused proof of concept on a narrow task can be done in one to two weeks, most of which is data generation and evaluation rather than training. A production-grade distillation with broad coverage, monitoring, and a retraining pipeline is a multi-month effort. The training itself is rarely the bottleneck.
Do I need a GPU cluster?
For training a small student on a moderate dataset, a single capable GPU or a modest cloud instance is often enough. Generating distillation data from a large teacher can be more demanding if the teacher is self-hosted. If your teacher is an API, the provider carries the compute and you pay for it per token instead.
Can the student ever beat the teacher?
On the specific narrow task, sometimes yes, because focused training and curated data can sharpen behavior the generalist teacher spreads thin. Across the teacher's full range of capabilities, no. The student only knows what you showed it.
Is distillation reversible or updatable?
You cannot un-distill a model, but you can keep improving a student by generating new data, especially for the cases where it fails, and retraining. Treating distillation as an ongoing loop rather than a one-time event is what separates durable systems from brittle ones.
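One round of that loop might look like the sketch below. The function name, the `(input, target)` tuple layout, and the toy predictors are all assumptions for illustration. It uses the teacher as the reference for what counts as a failure, which means teacher mistakes still flow through; a human-reviewed label would be a stricter reference.

```python
def next_round_data(current_data, new_inputs, student_predict, teacher_predict):
    """Add teacher-labeled examples for new inputs where the student disagrees."""
    known = {x for x, _ in current_data}
    additions = []
    for x in new_inputs:
        if x in known:
            continue  # already in the training set
        target = teacher_predict(x)
        if student_predict(x) != target:
            additions.append((x, target))  # a failure case worth training on
            known.add(x)
    return list(current_data) + additions

# Toy predictors: the teacher labels by length, the student always says "short".
def teacher_predict(x):
    return "long" if len(x) > 5 else "short"

def student_predict(x):
    return "short"

data = [("hi", "short")]
new_inputs = ["hi", "hello there", "hey", "hello there"]
updated = next_round_data(data, new_inputs, student_predict, teacher_predict)
print(updated)  # [('hi', 'short'), ('hello there', 'long')]
```

Retraining on `updated` and re-running the same evaluation closes the loop for that round.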
Does distillation help with hallucinations?
Not inherently. A student inherits the teacher's tendency to hallucinate and can make it worse if the generated data contains confident errors. Reducing hallucination takes the same disciplines as with any other model: grounding, retrieval, and tight evaluation.
Key Takeaways
- Distillation exists to cut cost, cut latency, or give you a stable owned artifact. Pick your reason first.
- The biggest lever is data coverage, not training cleverness.
- Black-box distillation works on closed teachers, but licenses, not mechanics, are the real constraint.
- Most failures are coverage gaps, inherited teacher errors, or weak evaluation.
- Skip distillation when traffic is low, latency budgets are generous, and API cost is trivial, or when the task is genuinely open-ended.
- Treat it as an ongoing loop, retraining on failure cases, not a one-shot project.