That One Accuracy Number Is Hiding a Broken Student

Model distillation transfers behavior from a large teacher model into a smaller student. The promise is a cheaper, faster model that behaves almost like the expensive one. Whether you actually got that is a measurement question, and it is the part most teams handle badly. They train, glance at one accuracy number, and call it done.

The trouble is that distillation can pass a vanity metric while quietly breaking the behavior you depend on. A student that matches the teacher 92% of the time can still fail on the exact 8% that drives revenue or safety. To know what you built, you need a deliberate, small set of metrics that cover fidelity to the teacher, end-task quality, cost, latency, and the high-stakes slices of your data.

This article defines those metrics, shows how to instrument them, and explains how to read the signal so you ship with confidence instead of hope. If you have not run a distillation end to end yet, A Step-by-Step Approach to What Is Model Distillation walks through the pipeline first.

Two Families of Metrics

Distillation metrics split into two groups, and you need both.

Fidelity metrics: does the student match the teacher?

Fidelity measures how closely the student reproduces the teacher's outputs. This is distillation-specific and easy to overlook.

Agreement rate. On a held-out set, how often does the student produce the same answer as the teacher? For classification this is exact-match agreement; for generation it is a similarity or judged-equivalence rate.
Output distribution distance. If you distilled from the teacher's probability distribution (soft labels), measure the divergence between teacher and student distributions, for example with KL divergence. Lower is better, and it catches shifts in confidence that hard agreement misses.
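Both fidelity metrics can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, and it assumes you have teacher and student labels plus per-class probabilities for the same held-out examples. KL divergence is used here as one common choice of distribution distance.

```python
import math

def agreement_rate(teacher_labels, student_labels):
    """Fraction of examples where the student gives the same label as the teacher."""
    if len(teacher_labels) != len(student_labels):
        raise ValueError("label lists must have the same length")
    if not teacher_labels:
        raise ValueError("need at least one example")
    matches = sum(t == s for t, s in zip(teacher_labels, student_labels))
    return matches / len(teacher_labels)

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) in nats for one example's class distributions."""
    return sum(
        p * math.log(p / max(q, eps))  # eps guards against log of zero
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )

def mean_kl(teacher_dists, student_dists):
    """Average KL divergence over a set of examples."""
    pairs = list(zip(teacher_dists, student_dists))
    return sum(kl_divergence(t, s) for t, s in pairs) / len(pairs)

# Same top label, very different confidence: agreement is perfect,
# but the divergence is clearly above zero.
teacher = [[0.9, 0.1]]
student = [[0.55, 0.45]]
print(agreement_rate(["a"], ["a"]), mean_kl(teacher, student))
```

The last lines show the case the text describes: hard agreement reports a perfect match while the distribution distance flags that the student is far less certain than the teacher.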

Task metrics: is the student actually good?

Fidelity to a teacher is worthless if the teacher was wrong. So you also measure the student against ground truth directly.

Task accuracy, F1, or whatever your domain uses. Measured on labeled data, independent of the teacher.
Calibration. Does the student's confidence track its correctness? Distillation often hurts calibration, and miscalibrated confidence breaks any downstream thresholding.
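One common way to put a number on calibration is expected calibration error (ECE), which bins predictions by confidence and compares average confidence with accuracy in each bin. The sketch below assumes you have, for each example, the student's confidence in its chosen answer and whether that answer was correct; the function name and the choice of ten equal-width bins are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one example")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece
```

A student that reports 95% confidence but is right only half the time in that bin contributes a gap of about 0.45, which is the kind of mismatch that breaks a confidence threshold downstream.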

The gap between fidelity and task metrics is itself diagnostic. High fidelity plus low task accuracy means your teacher is the bottleneck, not your distillation.

The Operational Metrics That Justify the Project

You distilled to save money or time. Measure that explicitly or you cannot defend the work.

Inference cost per 1,000 calls. Compare teacher and student on identical hardware and batch settings.
P50 and P95 latency. Always report the tail, not just the median. A smaller student often improves P95 as well as P50, but confirm it under realistic load, because the tail is what users feel.
Model size and memory footprint. Relevant if you deploy on constrained devices or want to fit more replicas per GPU.
Throughput. Requests per second at a fixed latency budget, which is what actually sets your serving cost.

When you assemble these for a decision-maker, frame them against the teacher baseline as percentages saved. The ROI article shows how to turn these numbers into a business case.
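As a rough sketch of the arithmetic, the snippet below computes latency percentiles from raw timings and expresses any operational metric as a percentage saved against the teacher. The helper names are hypothetical, and the nearest-rank percentile is just one of several reasonable definitions; it assumes integer percentiles and timings collected under the same hardware and batch settings.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one measurement")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def percent_saved(teacher_value, student_value):
    """Reduction relative to the teacher baseline, as a percentage."""
    if teacher_value <= 0:
        raise ValueError("teacher baseline must be positive")
    return 100.0 * (teacher_value - student_value) / teacher_value

def latency_report(teacher_ms, student_ms):
    """Summarize P50 and P95 for both models and the saving on each."""
    report = {}
    for pct in (50, 95):
        t = percentile(teacher_ms, pct)
        s = percentile(student_ms, pct)
        report[f"p{pct}"] = {"teacher": t, "student": s, "saved_pct": percent_saved(t, s)}
    return report
```

The same `percent_saved` helper applies to cost per 1,000 calls and memory footprint; for throughput, where higher is better, the sign of the result flips, so report it as a gain rather than a saving.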

Slice Your Metrics or Be Misled

Aggregate numbers hide the failures that matter. Always break every metric down by slice.

The slices to always include

Input difficulty. Easy inputs inflate aggregate scores. Report hard cases separately.
Business-critical categories. The intents, document types, or customer segments where errors are expensive.
Rare classes. Distillation tends to flatten the tail of the distribution, so rare-but-important cases degrade first.

A student that scores 94% overall but 71% on your highest-value segment is not ready, and only slicing reveals that.
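Slicing is mostly bookkeeping once you have per-example results. A minimal sketch, assuming each record is a dict with hypothetical `student`, `truth`, and slice fields:

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Student accuracy against ground truth, grouped by the value of slice_key."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        slice_name = record[slice_key]
        totals[slice_name] += 1
        hits[slice_name] += int(record["student"] == record["truth"])
    return {name: hits[name] / totals[name] for name in totals}

records = [
    {"segment": "vip", "student": "refund", "truth": "refund"},
    {"segment": "vip", "student": "cancel", "truth": "refund"},
    {"segment": "standard", "student": "cancel", "truth": "cancel"},
    {"segment": "standard", "student": "refund", "truth": "refund"},
]
print(accuracy_by_slice(records, "segment"))
```

Running the same function with a different `slice_key` (difficulty, document type, class label) gives the other breakdowns without touching the evaluation data.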

How to Instrument the Measurement

Metrics are only as good as the harness that produces them.

Freeze an evaluation set before training. Pull a representative sample, label it (or have the teacher label it for fidelity), and never train on it. This is your fixed yardstick across every student version.
Run teacher and student through the same harness. Identical inputs, identical post-processing. Any difference in plumbing contaminates the comparison.
Log per-example results, not just aggregates. Store the input, teacher output, student output, and ground truth for every evaluation case. This is what lets you slice later and diagnose regressions.
Automate the run. Make evaluation a single command that produces a report. Manual evaluation gets skipped under deadline pressure, which is exactly when you most need it.
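The steps above can be sketched as a single function that pushes both models through identical post-processing and writes one JSON line per example. Everything here is an assumption for illustration: `teacher_fn` and `student_fn` stand in for however you call your models, and the record fields are hypothetical names rather than a required schema.

```python
import json

def run_eval(examples, teacher_fn, student_fn, out, postprocess=str.strip):
    """Run teacher and student on the same inputs and log per-example results.

    examples: iterable of dicts with "id" and "input", plus optional "slice" and "truth".
    out: any writable text stream (an open file, for instance).
    Returns the number of examples written.
    """
    written = 0
    for example in examples:
        record = {
            "id": example["id"],
            "input": example["input"],
            "slice": example.get("slice"),
            "truth": example.get("truth"),
            # Same post-processing for both models, so plumbing cannot skew the comparison.
            "teacher": postprocess(teacher_fn(example["input"])),
            "student": postprocess(student_fn(example["input"])),
        }
        out.write(json.dumps(record) + "\n")
        written += 1
    return written
```

Wrapping this in a small script that opens the frozen evaluation file, calls `run_eval`, and prints the aggregate and per-slice numbers is one way to get the single-command run described above.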

For a reusable structure, see The Best Tools for What Is Model Distillation, which covers evaluation frameworks alongside training ones.

Metrics for Generative Tasks Are Harder

Everything above is cleanest for classification, where agreement and accuracy are exact. Generative tasks such as summarization, extraction into prose, and open-ended responses need more care because there is no single correct string.

Approaches that work

Judged equivalence. Use a strong model (often the teacher itself, or a separate judge) to score whether the student's output is equivalent in meaning to the teacher's. Validate the judge against human ratings on a sample before trusting it.
Reference-based similarity. Compare student and teacher outputs with a semantic similarity measure rather than exact match. This catches paraphrase but can miss subtle errors, so pair it with spot human review.
Task-specific structural checks. If the output must contain certain fields or follow a format, measure format adherence separately from content quality. A student that drifts on structure fails downstream parsing even when the content is fine.
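Structural checks are the easiest of the three to automate. As a sketch, assuming the student is supposed to emit a JSON object with certain required fields (the field names below are made up for the example), format adherence can be measured independently of content:

```python
import json

def is_well_formed(output_text, required_fields):
    """True if the output parses as a JSON object containing every required field."""
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(field in parsed for field in required_fields)

def format_adherence_rate(outputs, required_fields):
    """Fraction of outputs that pass the structural check."""
    if not outputs:
        raise ValueError("need at least one output")
    passed = sum(is_well_formed(text, required_fields) for text in outputs)
    return passed / len(outputs)

outputs = [
    '{"name": "Acme", "total": 120}',   # valid
    '{"name": "Acme"}',                 # missing a field
    'Sure! Here is the JSON you asked for...',  # not JSON at all
    '[1, 2]',                           # JSON, but not an object
]
print(format_adherence_rate(outputs, ["name", "total"]))
```

Tracking this rate for both teacher and student shows whether structural drift was introduced by distillation or was already present in the teacher.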

The trap with generative evaluation is leaning entirely on automated judges without ever validating them against humans. Always anchor your automated metric to a human-rated sample, or you are measuring the judge's quirks rather than the student's quality.
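One way to anchor a judge to humans is to have both rate the same sample and compute chance-corrected agreement. The sketch below uses Cohen's kappa, which is a standard statistic for this but only one possible choice; the function name and the assumption of simple categorical verdicts (such as "equivalent" or "not equivalent") are illustrative.

```python
from collections import Counter

def cohen_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two raters over the same items."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("both raters must label the same items")
    n = len(judge_labels)
    if n == 0:
        raise ValueError("need at least one rated item")

    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Agreement expected if both raters labeled at random with their own label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n)
        for label in set(judge_counts) | set(human_counts)
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 suggests the judge tracks human verdicts; a value near 0 means the judge agrees with humans no more often than chance would, even if raw agreement looks respectable.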

Tracking Metrics Over Time

A single evaluation is a snapshot; the value compounds when you track the same metrics across versions and into production.

Version-over-version comparison. Every redistillation should be measured against the previous student on the identical frozen set, so you can see whether a change helped or quietly regressed a slice.
Production monitoring. Sample real production traffic, label it (with the teacher or humans), and track accuracy on the same slices you used in offline evaluation. This is how you catch drift before it becomes a customer problem.
A regression gate. Wire a minimum per-slice threshold into your release process so no student ships if it regresses a critical slice, regardless of how good the aggregate looks.
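A regression gate can be as small as a function that returns a list of reasons to block the release. This is a sketch under assumptions: the slice names, floors, and the optional comparison against the previous student are hypothetical choices you would set for your own release process.

```python
def regression_gate(slice_scores, min_scores, previous_scores=None, max_drop=0.0):
    """Return a list of failure messages; an empty list means the student may ship.

    slice_scores: per-slice scores for the candidate student.
    min_scores: minimum acceptable score for each critical slice.
    previous_scores: optional per-slice scores for the currently shipped student.
    max_drop: largest tolerated drop versus the previous student.
    """
    failures = []
    previous_scores = previous_scores or {}
    for name, floor in min_scores.items():
        if name not in slice_scores:
            failures.append(f"{name}: no score reported")
            continue
        score = slice_scores[name]
        if score < floor:
            failures.append(f"{name}: {score:.3f} is below the floor {floor:.3f}")
        previous = previous_scores.get(name)
        if previous is not None and previous - score > max_drop:
            failures.append(f"{name}: dropped {previous - score:.3f} from {previous:.3f}")
    return failures
```

In a release script, exiting with a non-zero status whenever the returned list is non-empty makes the gate enforceable rather than advisory. Note that the aggregate score never appears in the check, which is the point.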

This turns metrics from a one-time check into an ongoing safety net, which is what keeps a distilled model trustworthy over its whole life.

Reading the Signal

Numbers without interpretation are noise. A few patterns to recognize:

High agreement, low task accuracy: the teacher is the ceiling. Improve the teacher or accept the limit. Do not blame distillation.
High aggregate, low on critical slices: you have a coverage problem in your training inputs. Add representative data for the weak slices and redistill.
Good accuracy, bad calibration: apply temperature scaling or another recalibration method to the student's outputs before relying on confidence thresholds.
Latency improved but cost did not: check batching and hardware utilization; a small model on idle hardware can still be expensive per call.
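For the calibration pattern, temperature scaling divides the student's logits by a single scalar before the softmax and picks the scalar that minimizes negative log-likelihood on labeled data. The sketch below is a simplified version that assumes you can access the student's logits; it uses a plain grid search instead of the gradient-based fit usually used in practice, and the candidate range of 0.5 to 5.0 is an arbitrary assumption.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [
        -math.log(max(softmax(row, temperature)[label], 1e-12))
        for row, label in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the candidate temperature with the lowest NLL (simple grid search)."""
    if candidates is None:
        candidates = [0.5 + 0.05 * i for i in range(91)]  # 0.5 through 5.0
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))
```

A fitted temperature above 1 softens overconfident predictions; below 1 sharpens underconfident ones. Because a single scalar does not change which class has the highest score, accuracy is unaffected. Fit it on labeled examples you do not also use to report final calibration numbers.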

Frequently Asked Questions

What is the single most important distillation metric?

There is not one. You need at least agreement rate (fidelity), task accuracy on critical slices, and cost-per-call. Reporting any one alone is how teams ship students that fail in production. The combination is the point.

How big should my evaluation set be?

Large enough that your critical slices each have enough examples to produce a stable number, often a few hundred per slice. The aggregate set can be smaller than you think; the slices are what drive the required size.
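To see why slice size matters, it helps to look at the uncertainty around a measured accuracy. The sketch below uses the Wilson score interval, one standard choice for a binomial proportion; the function name and the 95% level (z = 1.96) are assumptions for illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# The same 90% observed accuracy is far less certain on 50 examples than on 300.
for correct, total in [(45, 50), (270, 300)]:
    low, high = wilson_interval(correct, total)
    print(total, round(low, 3), round(high, 3))
```

With a few dozen examples in a slice, the interval is wide enough that a real regression and ordinary noise look the same, which is why each critical slice needs its own adequate sample.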

Should I measure against the teacher or against ground truth?

Both, and the gap between them is informative. Fidelity to the teacher tells you whether distillation worked. Accuracy against ground truth tells you whether the result is actually good. They answer different questions.

Why does my distilled model's confidence seem off?

Distillation frequently degrades calibration, so the student's stated confidence no longer matches its real accuracy. Recalibrate with a method like temperature scaling, fitted on a held-out labeled split rather than on the examples you report final metrics on, before using confidence for any decision threshold.

Key Takeaways

Measure two families: fidelity to the teacher and task quality against ground truth; the gap between them tells you where the bottleneck is.
Always report operational metrics (cost per call, P95 latency, throughput) against the teacher baseline to justify the project.
Slice every metric by difficulty, business-critical category, and rare class, because aggregates hide the failures that matter most.
Freeze an evaluation set before training, run teacher and student through one harness, and log per-example results so you can diagnose regressions.
Watch calibration; distillation often breaks it, and miscalibrated confidence silently corrupts any downstream thresholding.

Agency Script Editorial
