Model distillation is the process of training a smaller "student" model to reproduce the behavior of a larger "teacher" model. You run the teacher on a body of inputs, capture its outputs, and use those outputs as the training signal for the student. The student ends up smaller, cheaper, and faster, while keeping most of the teacher's useful behavior on the tasks you care about.
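To make the data flow concrete, here is a minimal sketch of the first half of that loop: running a teacher over inputs and recording its outputs as training targets. The names `build_distillation_set` and `toy_teacher` are hypothetical, and the toy teacher is a stand-in for a call to a real large model. The student training step is left out because it depends on your framework.

```python
from typing import Callable, Iterable


def build_distillation_set(
    teacher: Callable[[str], str],
    inputs: Iterable[str],
) -> list[dict[str, str]]:
    """Run the teacher on each input and keep (input, teacher output) pairs."""
    return [{"input": text, "target": teacher(text)} for text in inputs]


def toy_teacher(text: str) -> str:
    """Hypothetical stand-in; a real teacher would call a large model."""
    return "positive" if "good" in text.lower() else "negative"


pairs = build_distillation_set(toy_teacher, ["Good service", "Slow and buggy"])
for pair in pairs:
    print(pair)
```

The resulting pairs play the role that human-labeled examples play in ordinary supervised training, except that the labels come from the teacher.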
The technique itself is not where teams get stuck. The hard decision is comparative: distillation is one of several ways to make a model cheaper or more specialized, and it competes directly with quantization, fine-tuning, pruning, and plain prompt engineering. Picking wrong wastes weeks. This article lays out the competing approaches, the axes that actually matter, and a decision rule you can apply before you spend any compute.
If you are completely new to the concept, start with What Is Model Distillation: A Beginner's Guide and come back here once the mechanics are clear.
The Approaches You Are Actually Choosing Between
Distillation rarely competes in a vacuum. When someone says "the model is too expensive," they have several real options. This section compares distillation with three of them.
Distillation versus quantization
Quantization shrinks a model by reducing numeric precision, for example from 16-bit weights to 8-bit or 4-bit. It is fast, requires no training run (post-training methods need at most a small calibration set), and often cuts weight memory by 2x to 4x, with serving cost falling along with it, for a small accuracy hit. Distillation requires you to generate a training set and run a training job, but it can produce a genuinely smaller architecture, not just a compressed version of the original.
Rule of thumb: try quantization first. It is cheaper to attempt and you keep it even if you later distill. Reach for distillation when quantization alone does not get you small enough, or when you want a different, smaller architecture entirely.
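The 2x to 4x figure follows from simple arithmetic on bits per weight. The sketch below estimates weight storage only; it ignores activations, caches, and any per-format overhead, and the 7-billion-parameter model is a hypothetical example rather than a reference to a specific model.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7e9  # hypothetical 7-billion-parameter model
fp16 = weight_memory_gb(params, 16)
int8 = weight_memory_gb(params, 8)
int4 = weight_memory_gb(params, 4)

print(f"16-bit: {fp16:.1f} GB")
print(f" 8-bit: {int8:.1f} GB ({fp16 / int8:.0f}x smaller)")
print(f" 4-bit: {int4:.1f} GB ({fp16 / int4:.0f}x smaller)")
```

If the 4-bit figure still exceeds your memory or latency budget, that is the signal that a smaller architecture, and therefore distillation, is worth considering.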
Distillation versus fine-tuning
Fine-tuning adapts a model to your task using labeled examples. Distillation adapts a model to mimic another model. The two overlap because distillation is, mechanically, fine-tuning on teacher-generated labels. The practical difference is the source of truth: fine-tuning needs human-labeled data, distillation needs a strong teacher and unlabeled inputs.
If you have a great teacher but little labeled data, distillation wins. If you have abundant high-quality labels and no teacher worth copying, fine-tune.
Distillation versus prompting a small model directly
The cheapest option is to skip all training and just prompt a small off-the-shelf model with a good system prompt and a few examples. This costs nothing to build. The trade-off is per-request cost and consistency: distillation bakes the behavior into the weights, so you do not pay for long prompts on every call.
The Axes That Matter
Reduce the decision to these five variables and most of the noise disappears.
- Latency budget. If you need sub-100ms responses on a device, distillation to a small student is often the only path. A prompted large model typically cannot hit that.
- Volume. Distillation has high fixed cost (data generation, training) and low marginal cost. At low volume it never pays back. At millions of calls a month it dominates.
- Task breadth. Distillation shines for narrow, well-defined tasks. The broader the task surface, the more capability you lose in the student and the worse the trade.
- Teacher quality. Your student can only be as good as the labels your teacher produces. A mediocre teacher caps the whole effort.
- Tolerance for degradation. Distillation almost always loses something, even when the loss is hard to measure on a narrow task. If you cannot accept a few points of quality loss on edge cases, budget for heavier evaluation or pick a different approach.
For a deeper treatment of how these factors fit together, see A Framework for What Is Model Distillation.
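The volume axis can be turned into a break-even estimate: divide the fixed cost of distilling by the per-call saving. The function below is a rough sketch, and every number in the example is a made-up assumption for illustration, not a benchmark.

```python
def break_even_calls(
    fixed_cost: float,
    cost_per_call_before: float,
    cost_per_call_after: float,
) -> float:
    """Number of calls needed before the per-call saving repays the fixed cost."""
    saving = cost_per_call_before - cost_per_call_after
    if saving <= 0:
        return float("inf")  # the student is not cheaper, so it never pays back
    return fixed_cost / saving


# Hypothetical figures: $20,000 to build, $0.004 per call now, $0.0005 after.
calls_needed = break_even_calls(20_000, 0.004, 0.0005)
monthly_volume = 2_000_000  # assumed traffic
print(f"Break-even after about {calls_needed:,.0f} calls")
print(f"That is about {calls_needed / monthly_volume:.1f} months at current volume")
```

Running the same calculation with a monthly volume in the thousands shows why low-volume workloads never recover the fixed cost.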
A Decision Rule You Can Apply Today
Here is a sequence that works in practice. Run it top to bottom and stop at the first step that settles the question.
- Is per-call cost or latency the actual problem? If not, distillation is the wrong tool. Fix the real problem first.
- Can prompt engineering plus a smaller off-the-shelf model meet your quality bar? If yes, do that. It is days, not weeks.
- Does 8-bit or 4-bit quantization of your current model hit the cost and latency target? If yes, ship it.
- Do you have, or can you afford, a strong teacher and a representative set of inputs? If no, distillation is premature.
- Is your task narrow and high-volume enough that fixed training cost pays back? If yes, distill.
This ordering reflects effort. Each step up costs more time and risk, so you only climb when the cheaper rung fails.
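The sequence can be written as a short function, which makes the stopping conditions explicit. This is a sketch: the `Situation` fields and the returned strings are hypothetical names chosen for illustration, and the final fallback (when every check passes except the last) is an assumption, since the rule above does not spell out that case.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    cost_or_latency_is_problem: bool
    small_model_prompting_meets_bar: bool
    quantization_hits_target: bool
    has_teacher_and_inputs: bool
    narrow_and_high_volume: bool


def recommend(s: Situation) -> str:
    """Walk the checks in order and return at the first one that decides."""
    if not s.cost_or_latency_is_problem:
        return "fix the real problem first"
    if s.small_model_prompting_meets_bar:
        return "prompt a smaller off-the-shelf model"
    if s.quantization_hits_target:
        return "quantize the current model"
    if not s.has_teacher_and_inputs:
        return "distillation is premature"
    if s.narrow_and_high_volume:
        return "distill"
    # Assumed fallback: not covered by the rule as written.
    return "no clear win; revisit the cheaper options"


example = Situation(
    cost_or_latency_is_problem=True,
    small_model_prompting_meets_bar=False,
    quantization_hits_target=False,
    has_teacher_and_inputs=True,
    narrow_and_high_volume=True,
)
print(recommend(example))
```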
Combining Approaches Instead of Choosing One
The framing of "distillation versus X" is useful for clarity, but the best production systems usually stack techniques rather than pick a single winner. Once you understand the axes, the more sophisticated move is composition.
- Distill, then quantize. Produce a smaller student architecture, then compress its weights numerically. The two operate on different dimensions, so the savings multiply rather than overlap.
- Fine-tune, then distill. Fine-tune a large model on your task to make it an excellent teacher, then distill that tuned teacher into a small student. Your student inherits task-specific quality the base model never had.
- Prompt the teacher, distill the pattern. Use careful prompting to coax the best behavior out of the teacher, capture those outputs, and distill the resulting behavior into the student so you no longer pay for the long prompt on every call.
The lesson is that the decision rule tells you where to start, not where to stop. Begin with the cheapest approach that clears your bar, then layer additional techniques only where measurement shows further headroom. Composition is how mature teams squeeze out cost without sacrificing the quality that the single-technique framing would force them to trade away.
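The claim that architectural and numeric savings multiply can be checked with a few lines of arithmetic. The sketch below covers weight storage only and says nothing about speed or quality; the parameter counts and bit widths are hypothetical.

```python
def combined_weight_reduction(
    teacher_params: float,
    student_params: float,
    teacher_bits: int,
    student_bits: int,
) -> float:
    """Overall shrink factor for weight storage after distilling, then quantizing."""
    architectural = teacher_params / student_params  # fewer weights
    numeric = teacher_bits / student_bits            # fewer bits per weight
    return architectural * numeric


# Hypothetical: a 7B-parameter 16-bit teacher, a 1B-parameter 4-bit student.
factor = combined_weight_reduction(7e9, 1e9, 16, 4)
print(f"Weights are about {factor:.0f}x smaller")  # 7x from distilling times 4x from quantizing
```

Whether the stacked model still clears your quality bar is a separate question that only your evaluation set can answer.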
Common Failure Modes
The trade-off analysis breaks down when teams ignore these traps.
Distilling a moving target
If your teacher model changes every quarter, your student is stale the moment you ship it. Distillation assumes a stable teacher. For fast-moving capabilities, keep prompting the live model.
Underspecified evaluation
People distill, see a good aggregate accuracy number, and ship. Then the student fails on the 5% of inputs that mattered most. Build the evaluation set before you train, and weight it toward the cases that carry business risk. The common mistakes guide covers this trap in detail.
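One way to make the high-risk cases count is to weight them in the metric itself. The following sketch compares plain accuracy with a risk-weighted version on made-up data: 95 routine cases the student gets right, and 5 high-risk cases of which it gets 1 right. The weights are assumptions you would set from your own business risk.

```python
def weighted_accuracy(results: list[tuple[bool, float]]) -> float:
    """results holds (is_correct, weight) pairs; heavier weights count more."""
    total = sum(weight for _, weight in results)
    if total == 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weight for ok, weight in results if ok) / total


routine = [(True, 1.0)] * 95                      # low-risk cases, all correct
high_risk = [(True, 10.0)] + [(False, 10.0)] * 4  # high-risk cases, mostly wrong
results = routine + high_risk

plain = sum(ok for ok, _ in results) / len(results)
weighted = weighted_accuracy(results)
print(f"plain accuracy:    {plain:.2f}")
print(f"weighted accuracy: {weighted:.2f}")
```

The plain number looks shippable while the weighted number does not, which is the gap an aggregate metric hides.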
Counting only training cost
The real cost includes data generation (teacher inference over your whole training corpus, which can be substantial), evaluation, and ongoing maintenance. Teams that budget only for the training run get surprised.
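A budget that includes every line item might be sketched as follows. All prices, token counts, and durations here are hypothetical placeholders; substitute your own figures.

```python
def teacher_inference_cost(
    num_examples: int,
    avg_tokens_per_example: int,
    price_per_million_tokens: float,
) -> float:
    """Cost of running the teacher over the whole training corpus."""
    return num_examples * avg_tokens_per_example * price_per_million_tokens / 1e6


def total_distillation_cost(
    data_generation: float,
    training: float,
    evaluation: float,
    monthly_maintenance: float,
    months: int,
) -> float:
    """Full project cost over a planning horizon, not just the training run."""
    return data_generation + training + evaluation + monthly_maintenance * months


# Hypothetical: 500k examples, 800 tokens each, $10 per million teacher tokens.
data_gen = teacher_inference_cost(500_000, 800, 10.0)
total = total_distillation_cost(
    data_generation=data_gen,
    training=1_500.0,
    evaluation=500.0,
    monthly_maintenance=300.0,
    months=12,
)
print(f"data generation: ${data_gen:,.0f}")
print(f"total over 12 months: ${total:,.0f}")
```

In this made-up example the training run is well under a fifth of the total, which is the kind of surprise the budget should surface up front.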
Frequently Asked Questions
Is distillation always cheaper than fine-tuning?
No. Distillation has the extra cost of running the teacher to generate labels, which can be larger than the cost of collecting human labels if your corpus is big. It is cheaper when you already have a strong teacher and lots of unlabeled inputs, and more expensive when teacher inference is itself costly.
Can I combine distillation with quantization?
Yes, and you usually should. Distill to a smaller architecture first, then quantize the student. The two techniques stack because they compress along different dimensions, one architectural and one numeric.
How much quality do I lose with distillation?
It depends entirely on task breadth and how representative your training inputs are. Narrow tasks can lose almost nothing measurable. Broad, open-ended tasks can lose a meaningful chunk. There is no universal number, which is why you must measure on your own evaluation set.
When should I not distill at all?
When volume is low, when your teacher is unstable, when the task is broad and open-ended, or when prompting a small model already meets your bar. Distillation only pays back under specific conditions, and forcing it elsewhere wastes effort.
Key Takeaways
- Distillation trains a small student to mimic a large teacher; the real decision is choosing it over quantization, fine-tuning, or prompting.
- Try the cheaper options first: prompt engineering, then quantization, then distillation only if those fail your latency or cost target.
- Five axes decide the call: latency budget, volume, task breadth, teacher quality, and tolerance for degradation.
- Budget for the full cost, including teacher inference for data generation, evaluation, and maintenance, not just the training run.
- Build your evaluation set before training and weight it toward high-risk cases, or you will ship a student that fails where it matters most.