This is a single project, told as a story. The setup is a composite based on common patterns rather than one company's confidential figures, but the arc β situation, decision, execution, outcome, lessons β mirrors how these projects actually unfold. The goal is to show the decision points in sequence, so you can recognize them in your own work.
The team runs a high-traffic product that classifies incoming documents into a fixed taxonomy. They started with a large language model that did the job beautifully and cost a fortune at their volume. This case study follows how they moved to a distilled student, what nearly went wrong, and how it landed.
The Situation
The product ingests documents continuously and assigns each one to a category. At launch, volume was modest and the large model's cost was a rounding error. Eighteen months later, volume had grown by more than an order of magnitude, and the inference bill had become one of the largest line items in the product's infrastructure budget.
The Constraint
The category accuracy was excellent and the business depended on it. Any cost-cutting move that degraded accuracy meaningfully was off the table. The team needed the same quality at a much lower cost, and they needed it without a long research detour.
There was a second, quieter constraint: organizational patience. The team had a window of a few weeks to show progress before leadership would look for other ways to control the inference bill. That deadline shaped every decision that followed. They could not afford a six-month research project; they needed a method that was well understood, low-risk, and fast to validate. Distillation fit because the technique is mature and the data they needed already existed in their logs.
The Decision
Before committing to distillation, the team did the responsible thing and ran the cheaper alternatives first β the best practices call this the "do nothing" baseline.
- They tried a smaller off-the-shelf model. It was cheaper but lost too much accuracy on the harder categories.
- They tried prompt optimization on the smaller model. It closed some of the gap but not enough.
- They estimated the engineering cost of a distillation pipeline against the projected savings.
The math was decisive. At their volume, even a modest per-document cost reduction would pay back the pipeline within weeks. The task was narrow, the volume was huge, and they had years of correctly categorized documents to train on. Distillation was the right call. They committed.
The Execution
The team followed roughly the sequence in our step-by-step how-to, and the project lived or died on the data.
Building the Prompt Set
They pulled real documents from production logs rather than inventing examples, and matched the training distribution to live traffic by category. This took longer than expected because the live distribution was skewed β a few categories dominated, and several critical categories were rare.
Generating and Filtering Teacher Outputs
They ran the documents through the large model and captured its classifications. Crucially, because every document had a known correct category from historical labels, they could verify the teacher's outputs and drop the ones the teacher got wrong. This filtering step removed a meaningful chunk of the data and raised the quality ceiling of the student.
The First Student Missed
Their first student looked great in aggregate β high overall accuracy β and they almost shipped it. Then they evaluated by slice and found it was failing badly on the rare critical categories, exactly the mistake of judging only the aggregate. The rare categories were underrepresented in training, so the student never learned them well.
The Fix
They oversampled the rare categories in the training set, retrained, and added a fallback: any document the student classified with low confidence was routed back to the teacher. The second student cleared the per-slice bar.
The oversampling was a judgment call worth dwelling on. Naively duplicating rare examples risks the student memorizing them rather than generalizing. The team mitigated this by generating additional teacher outputs specifically for the underrepresented categories rather than just repeating the few they had, which gave the student genuine variety in those slices instead of the same handful of documents seen over and over. The confidence threshold for the teacher fallback was tuned empirically: too aggressive and too many documents went to the expensive teacher, erasing the savings; too lax and low-quality classifications slipped through. They landed on a threshold that sent only a small minority of documents to the teacher while catching the bulk of the student's genuine uncertainty.
The Outcome
The numbers that mattered, framed as ranges rather than invented precision:
- Cost. Per-document inference cost dropped several-fold, turning a major budget line into a minor one.
- Quality. The student preserved nearly all of the teacher's accuracy overall, and met the bar on every critical slice after the oversampling fix.
- Latency. Classification got noticeably faster, an unplanned bonus that improved the ingestion pipeline.
- Fallback rate. A small fraction of low-confidence documents still went to the teacher, which kept quality high without giving back most of the savings.
The payback period on the engineering investment was short, exactly as the projection suggested.
The Lessons
Three lessons generalize beyond this project.
First, the "do nothing" and cheaper-alternative baselines were not wasted effort β they justified the project and would have saved months if they had succeeded. Always run them.
Second, the project was won on data, not training. The distribution matching and teacher filtering determined the outcome. The training itself was unremarkable.
Third, aggregate metrics nearly caused a bad ship. Slice-based evaluation caught the failure on rare categories, and the hybrid fallback to the teacher turned a hard quality problem into a manageable one. If you take one thing from this case study, make it the habit of evaluating by slice and keeping a teacher fallback for the cases your student handles poorly.
A fourth lesson is softer but real. The team's deadline pressure could have pushed them to ship the first student on its strong aggregate number. The discipline that saved them was procedural, not heroic: they had committed, before training, to a per-slice bar as part of their definition of success. Because the bar was written down in advance, the failing slice was a clear stop signal rather than a judgment call they could rationalize away under deadline pressure. Defining success rigorously up front is not bureaucracy; it is what makes good decisions survive contact with a deadline.
Frequently Asked Questions
Why didn't the team just use the smaller off-the-shelf model?
They tried, and it lost too much accuracy on the harder categories. Prompt optimization narrowed the gap but still fell short of the quality bar. Distillation was the only option that hit both the cost and quality targets, so it justified the extra engineering.
What was the turning point in the project?
Discovering, through slice-based evaluation, that the first student was failing on rare critical categories despite strong aggregate accuracy. That finding drove the oversampling fix and the teacher fallback, which together made the student shippable.
How important was having historical labels?
Very. Known-correct categories let the team filter the teacher's wrong outputs and evaluate the student rigorously. Without a ground-truth signal, both filtering and evaluation would have been far harder and the result less reliable.
Did the teacher fallback undermine the cost savings?
No, because only a small fraction of low-confidence documents went to the teacher. The vast majority were handled by the cheap student, so the blended cost stayed low while the fallback protected quality on the hard cases.
Could this have failed?
Easily. Had they shipped the first student on aggregate metrics alone, the rare-category failure would have hit production silently. The discipline of slice evaluation and a gradual rollout is what kept the project from becoming a cautionary tale.
Key Takeaways
- High volume plus a narrow, measurable task made distillation an easy financial decision once cheaper baselines fell short.
- The project was won on data work β distribution matching and teacher filtering β not on training cleverness.
- Aggregate accuracy nearly caused a bad ship; slice-based evaluation caught the failure on rare critical categories.
- Oversampling rare categories plus a low-confidence fallback to the teacher delivered both cost savings and quality.
- The engineering investment paid back quickly because the cost reduction scaled with the product's large volume.