Most teams evaluate text-to-speech by listening to a demo, nodding, and shipping. Then real users hit edge cases the demo never covered, and the voice that sounded great on a marketing sentence mangles a phone number, drops a pause, or mispronounces the product name on every call. Understanding how AI text to speech works tells you what to build; measuring it tells you whether what you built actually works.
This is a practical guide to the metrics that matter for synthetic speech, how to instrument them without a research lab, and how to read the signal once it starts flowing. The throughline: pair a small set of subjective human judgments with a few objective system metrics, and you will catch problems long before your users do.
The Two Families of Metrics
Speech quality splits into what humans perceive and what machines can measure. You need both.
Subjective metrics
These come from human listeners and capture what objective math misses, like whether a voice sounds warm or trustworthy.
- Mean Opinion Score (MOS). Listeners rate samples on a 1-to-5 scale. It remains the industry reference for perceived naturalness. The catch is that MOS is relative to the test set and the raters, so it is only meaningful when you compare candidates under identical conditions.
- Preference (A/B) testing. Instead of an absolute score, you ask which of two clips sounds better. This is often more reliable than MOS because humans are better at comparison than absolute rating.
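Because MOS is only meaningful as a comparison, it helps to report each score with an interval rather than a bare mean. The sketch below is a minimal illustration, not a standard procedure: the ratings are made-up sample data, and the normal approximation for the 95% interval is rough for small rater pools.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-to-5 listener ratings.

    The interval uses a normal approximation, which is only a rough
    guide when there are few ratings.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    stderr = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * stderr

# Hypothetical ratings for two candidate voices, same clips and same raters.
voice_a = [4, 4, 5, 3, 4, 4, 5, 4]
voice_b = [3, 4, 4, 3, 3, 4, 4, 3]

for name, ratings in (("A", voice_a), ("B", voice_b)):
    mean, half_width = mos_with_ci(ratings)
    print(f"voice {name}: MOS {mean:.2f} +/- {half_width:.2f}")
```

If the two intervals overlap heavily, treat the difference as unproven and collect more ratings or switch to a preference test.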
Objective metrics
These are computed automatically and are cheap to run on every build.
- Word Error Rate (WER) via round-trip ASR. Run your synthesized audio back through a speech recognizer and compare the transcript to the input text. A spike in WER flags pronunciation and intelligibility regressions.
- Real-Time Factor (RTF). Synthesis time divided by audio duration. An RTF below 1.0 means you generate audio faster than it plays back, which streaming requires to avoid gaps.
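Both objective metrics reduce to a few lines of arithmetic. Here is a minimal sketch, assuming you already have the recognizer's transcript as a string; it computes WER as word-level edit distance divided by the reference length. The lowercase-and-split tokenization is a simplification, and a real pipeline would usually normalize punctuation and numerals first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)

def real_time_factor(synthesis_seconds, audio_seconds):
    """Synthesis time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# One substituted word out of six.
print(word_error_rate("call me at five five five", "call me at five nine five"))
# Half a second of compute for two seconds of audio.
print(real_time_factor(0.5, 2.0))
```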
Latency: The Metric Users Feel First
For anything interactive, latency is not a nice-to-have. It is the experience.
Time to first audio byte
The single most important number for streaming voice is how long until the user hears the first sound. Total render time barely matters in a conversation; the silence before audio starts is what feels broken. Instrument this separately from end-to-end duration.
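One way to instrument this is to wrap whatever streaming iterator your client returns. The sketch below assumes `chunks` is a lazy iterator of audio byte chunks, so the request is effectively issued when iteration begins; `measure_stream` and its `clock` parameter are illustrative names, not part of any particular SDK.

```python
import time

def measure_stream(chunks, clock=time.perf_counter):
    """Consume an iterator of audio byte chunks and time it.

    Returns (time_to_first_audio, total_time, total_bytes). The first
    value is None if no non-empty chunk ever arrived.
    """
    start = clock()
    first_audio = None
    total_bytes = 0
    for chunk in chunks:
        if chunk and first_audio is None:
            first_audio = clock() - start
        total_bytes += len(chunk)
    total = clock() - start
    return first_audio, total, total_bytes

def fake_stream():
    """Stand-in for a real streaming response, for demonstration only."""
    time.sleep(0.05)
    yield b"\x00" * 320
    time.sleep(0.05)
    yield b"\x00" * 320

ttfa, total, size = measure_stream(fake_stream())
print(f"first audio after {ttfa:.3f}s, finished after {total:.3f}s, {size} bytes")
```

Recording the two durations as separate fields keeps a slow tail render from hiding a fast start, and the reverse.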
Latency distribution, not averages
A median time-to-first-audio of 200ms means nothing if your p95 is two seconds. Voice failures cluster in the tail. Always track p95 and p99, because that is where users abandon. This is the same discipline we push in our step-by-step guide to how AI text to speech works.
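A short worked example shows how a tail hides behind typical values. This sketch uses the nearest-rank percentile definition and invented latency samples; production monitoring systems often use interpolated or histogram-based estimates instead.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(samples_ms):
    """Summarize time-to-first-audio samples by median and tail percentiles."""
    return {name: percentile(samples_ms, pct)
            for name, pct in (("p50", 50), ("p95", 95), ("p99", 99))}

# Invented data: 18 fast requests and 2 very slow ones.
samples_ms = [200] * 18 + [2000] * 2
print(sum(samples_ms) / len(samples_ms))
print(latency_summary(samples_ms))
```

Here the median is 200ms and the mean is 380ms, while p95 and p99 both land on 2000ms: one request in ten is an order of magnitude slower than the typical one.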
Correctness Metrics You Cannot Skip
Naturalness gets the attention, but correctness is what breaks trust.
Pronunciation accuracy
Build a fixed regression suite of the hard cases for your domain: brand names, acronyms, units, currencies, dates, and any homographs that matter. "Lead" the metal versus "lead" the verb will bite you. Run this suite on every voice or model change and treat a regression as a release blocker.
Number and entity handling
Phone numbers, prices, and addresses are where generic voices fall apart. Measure these explicitly. Reading "$1,250.00" as "one two five zero" instead of "one thousand two hundred fifty dollars" is a correctness bug, not a style preference, and it is one of the common AI text-to-speech mistakes that slip past demos.
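A regression suite for these cases can be a list of written inputs paired with the spoken form you expect. The sketch below assumes you can obtain what the system actually says as text, whether from the engine's normalized output or from a round-trip transcript; the `spoken_form` callable is a hypothetical hook you would supply, and the suite entries are examples only.

```python
import re

def canonical(text):
    """Lowercase and drop punctuation so formatting differences are ignored."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))

def failing_cases(suite, spoken_form):
    """Return (written, expected, actual) for every case that does not match."""
    failures = []
    for written, expected in suite:
        actual = spoken_form(written)
        if canonical(actual) != canonical(expected):
            failures.append((written, expected, actual))
    return failures

# Example suite; extend it with your own brand names, units, and dates.
SUITE = [
    ("$1,250.00", "one thousand two hundred fifty dollars"),
    ("Dr. Lee", "doctor lee"),
]

# Stand-in for a real system: it reads the price digit by digit.
fake_outputs = {"$1,250.00": "one two five zero", "Dr. Lee": "Doctor Lee."}
for written, expected, actual in failing_cases(SUITE, fake_outputs.get):
    print(f"{written!r}: expected {expected!r}, got {actual!r}")
```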
How to Instrument Without a Research Team
You do not need a perception lab. You need a pipeline.
- Build a golden test set. Fifty to two hundred representative utterances covering your real traffic, including the ugly edge cases. Version it.
- Automate the objective metrics. Run round-trip WER and RTF on the golden set in CI on every model or config change. Fail the build on regression thresholds you set.
- Schedule periodic human eval. Once per release candidate, run an A/B preference test with a handful of raters. You do not need hundreds; even five to ten careful listeners catch most regressions.
- Sample production audio. Log a small random sample of real synthesized output and review it weekly. Production traffic always surprises you.
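The CI step in that pipeline can be a small comparison against a stored baseline. The following sketch assumes you have already aggregated WER and RTF over the golden set into plain dictionaries; the threshold values are placeholders to replace with limits that fit your product.

```python
def regression_failures(baseline, current, max_wer_increase=0.02, max_rtf=1.0):
    """Compare aggregated golden-set metrics; an empty list means the build passes."""
    failures = []
    wer_increase = current["wer"] - baseline["wer"]
    if wer_increase > max_wer_increase:
        failures.append(
            f"WER rose by {wer_increase:.3f} (limit {max_wer_increase:.3f})")
    if current["rtf"] >= max_rtf:
        failures.append(
            f"RTF {current['rtf']:.2f} is not below {max_rtf:.2f}")
    return failures

# Placeholder numbers for illustration.
baseline = {"wer": 0.03, "rtf": 0.40}
candidate = {"wer": 0.08, "rtf": 0.45}

for message in regression_failures(baseline, candidate):
    print("FAIL:", message)
```

In a CI job you would exit with a non-zero status whenever the returned list is non-empty, and store the baseline alongside the versioned golden set so both change together.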
Reading the Signal
Metrics only help if you know how to interpret them together.
Triangulate, never trust one number
A clean WER with a falling MOS usually means intelligible but flat or robotic delivery. A high MOS with rising WER means a pleasant voice that is starting to mangle words. The combination tells the story that either number alone hides.
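The reading above can be written down as a small decision rule so that dashboards and release notes use the same vocabulary. This is only a sketch: the tolerances are arbitrary placeholders, and the labels are shorthand for the interpretations described in the text.

```python
def diagnose(wer_delta, mos_delta, wer_tolerance=0.01, mos_tolerance=0.1):
    """Classify a change from its WER delta and MOS delta (new minus old)."""
    wer_worse = wer_delta > wer_tolerance
    mos_worse = mos_delta < -mos_tolerance
    if wer_worse and mos_worse:
        return "broad_regression"
    if mos_worse:
        return "intelligible_but_flat"
    if wer_worse:
        return "pleasant_but_mangling_words"
    return "no_regression"

print(diagnose(wer_delta=0.00, mos_delta=-0.3))
print(diagnose(wer_delta=0.05, mos_delta=0.0))
```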
Set thresholds, then watch slopes
Absolute thresholds catch hard failures. Trend lines catch slow decay, like a vendor quietly changing a model behind an API. Alert on both a hard floor and a sustained downward slope. For turning these numbers into a decision, our framework for how AI text to speech works shows how to weight each metric for your use case.
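A floor check and a slope check fit in a few lines. The sketch below fits a least-squares line through the most recent values of a higher-is-better metric, such as a weekly preference score; the floor and the allowed decline per step are placeholder values.

```python
def slope(values):
    """Least-squares slope of values against their index (units per step)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator

def check_metric(history, floor, max_decline_per_step):
    """Return alert names for a higher-is-better metric, oldest value first."""
    alerts = []
    if history[-1] < floor:
        alerts.append("below_floor")
    if slope(history) < -max_decline_per_step:
        alerts.append("declining")
    return alerts

# Still above the floor, but sliding by about 0.1 per step.
print(check_metric([4.3, 4.2, 4.1, 4.0], floor=3.8, max_decline_per_step=0.05))
```

For metrics where lower is better, such as WER, negate the values or flip both comparisons.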
Cost and Reliability Metrics
Quality is not the only thing worth measuring. At scale, two operational metrics protect your budget and your uptime.
Cost per unit of output
Track cost per character, per second of audio, or per request, whichever maps to your billing, and watch it as volume grows. A quiet rise in cost per unit can mean inefficient SSML, redundant re-synthesis of identical text, or a pricing tier you have outgrown. Pairing this metric with a cache hit rate tells you how much of your spend is avoidable repetition versus genuinely new synthesis.
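To make the cache hit rate concrete, here is a minimal in-memory sketch that counts hits, misses, and billed characters. The `synthesize(text, voice)` callable is a hypothetical stand-in for your real client, and per-character billing is an assumption; adapt the counter to whatever unit your provider charges for.

```python
import hashlib

class SynthesisCache:
    """In-memory cache that also tracks how much synthesis was avoidable."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # hypothetical callable: (text, voice) -> bytes
        self._store = {}
        self.hits = 0
        self.misses = 0
        self.billed_chars = 0

    def get(self, text, voice):
        key = hashlib.sha256(f"{voice}\x00{text}".encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self.billed_chars += len(text)  # assumes per-character billing
        audio = self._synthesize(text, voice)
        self._store[key] = audio
        return audio

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Stand-in synthesizer for demonstration only.
cache = SynthesisCache(lambda text, voice: text.encode("utf-8"))
for prompt in ("hello", "hello", "bye"):
    cache.get(prompt, "voice-1")
print(cache.hits, cache.misses, cache.billed_chars, round(cache.hit_rate, 2))
```

A real deployment would also put the model version and any SSML settings into the cache key, so a model change does not serve stale audio.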
Error and availability rates
Synthesis services fail intermittently: timeouts, rate limits, and degraded responses. Instrument the rate of failed and retried requests, and the availability of your synthesis path end to end. A rising error rate is often the first sign of a vendor problem or a capacity limit, and catching it on a dashboard beats catching it through user complaints. Tie these to alerts so reliability regressions surface as fast as quality ones.
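These rates fall out of a simple log of request outcomes. The sketch assumes each request is recorded as a `(status, attempts)` pair, with `"ok"` marking success; the status names are illustrative and not tied to any vendor's error codes.

```python
from collections import Counter

def reliability_summary(outcomes):
    """Summarize (status, attempts) records into availability and retry rates."""
    counts = Counter()
    retried = 0
    total = 0
    for status, attempts in outcomes:
        total += 1
        counts[status] += 1
        if attempts > 1:
            retried += 1
    if total == 0:
        raise ValueError("no outcomes recorded")
    return {
        "availability": counts["ok"] / total,
        "retry_rate": retried / total,
        "failures": {s: c for s, c in counts.items() if s != "ok"},
    }

# Illustrative log: one request succeeded only after a retry, one timed out.
log = [("ok", 1), ("ok", 2), ("timeout", 3), ("ok", 1)]
print(reliability_summary(log))
```

Tracking the retry rate separately matters because successful retries keep availability looking healthy while latency and cost quietly climb.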
Frequently Asked Questions
What is a good MOS score for production?
There is no universal number because MOS depends on your raters and test set. What matters is the gap to a human-recorded reference under identical conditions. Aim for a score within a small margin of the human-recorded reference, and always report MOS alongside the reference it was measured against.
Why use ASR to measure my own TTS?
Round-trip ASR gives you a cheap, automatable proxy for intelligibility. If a recognizer can transcribe your synthesized speech accurately, a human listener usually can too. A sudden rise in word error rate is an early warning that pronunciation or clarity has regressed, often before humans notice. The recognizer makes its own mistakes, so watch for changes against a baseline rather than reading the absolute number.
How many human raters do I actually need?
For catching regressions, far fewer than research papers use. Five to ten careful listeners running A/B preference tests will surface most meaningful quality drops. Reserve large rater pools for high-stakes launches where small quality differences carry real revenue impact.
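With a small pool, it helps to check whether a preference split could plausibly be chance. One simple option, shown here as a sketch rather than a prescribed method, is an exact two-sided sign test on the A-versus-B counts, with ties dropped before counting. The counts are invented for illustration.

```python
from math import comb

def preference_p_value(wins_a, wins_b):
    """Two-sided exact sign test for an A/B preference split (ties excluded)."""
    n = wins_a + wins_b
    if n == 0:
        raise ValueError("no non-tied judgments")
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided if listeners had no preference.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(preference_p_value(9, 1))  # strongly lopsided split
print(preference_p_value(6, 4))  # could easily be chance
```

Each rater typically judges several clip pairs, so the counts here are judgments rather than people; judgments from the same rater are not fully independent, which makes the result optimistic.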
Should I optimize for average latency or tail latency?
Tail latency. Users remember the worst moments, not the typical ones. A great median with an ugly p99 produces a voice that feels unreliable. Track p95 and p99 time-to-first-audio and treat the tail as your real latency budget.
How often should I re-run evaluation?
Run objective metrics automatically on every model or configuration change. Run human evaluation on every release candidate. Sample and review production audio weekly. Vendors change models behind APIs without notice, so continuous measurement is what protects you.
Key Takeaways
- Pair subjective metrics (MOS, A/B preference) with objective ones (round-trip WER, RTF) because neither tells the whole story alone.
- For interactive use, time to first audio byte is the metric users feel; track its p95 and p99, not its average.
- Maintain a versioned golden test set heavy on edge cases, and run objective metrics in CI on every change.
- Pronunciation and number handling are correctness bugs, not style choices; gate releases on a pronunciation regression suite.
- Triangulate metrics and watch both hard thresholds and trend slopes to catch slow decay from changing vendor models.