The fastest way to waste a quarter on synthetic data is to generate a million records, eyeball a few, declare them "realistic," and ship. Realistic-looking and useful-for-training are different properties, and the gap between them only surfaces in production, where it is expensive to fix.
Measuring synthetic data well means answering three separate questions, not one. Does it match the real distribution? Does it leak private information? And does a model trained on it actually work? Each question has its own metrics, its own instrumentation, and its own way of lying to you if you read it wrong. This article defines the KPIs that matter, shows how to wire them up, and explains how to read the signal.
The Three-Question Framework
Every synthetic data evaluation collapses to three independent axes. Score them separately — a dataset can be statistically faithful, privacy-safe, and still useless for your task.
Fidelity: does it match reality?
Fidelity asks whether the synthetic distribution resembles the real one. The cheap version compares marginals: does each column's histogram match? The honest version checks joint structure, because preserving every column individually while destroying the correlations between them produces data that is statistically plausible and semantic nonsense.
Utility: does it train good models?
Utility is the only metric that pays your bills. It measures whether a model trained on synthetic data performs on real data. Everything else is a proxy for this.
Privacy: does it leak?
Privacy asks whether someone could reconstruct a real individual from the synthetic set. A perfect copy of your real data has perfect fidelity and zero privacy. The tension between these two is the whole game.
Utility: The TSTR Standard
The single most important measurement is Train on Synthetic, Test on Real (TSTR). Train your model on the synthetic dataset, evaluate it on a held-out set of real, hand-labeled data, and record the score on your real task metric — accuracy, F1, AUC, whatever you ship on.
Then run the baseline: Train on Real, Test on Real (TRTR), using the same model class, the same hyperparameters, and the same real test set. The ratio of TSTR to TRTR is your utility score. A ratio of 0.95 means a model trained on your synthetic data reaches 95 percent of the score of the same model trained on real data. As a rule of thumb, anything above 0.9 is strong; below 0.7, you are training on a distorted picture of the world.
The non-negotiable rule: the test set is always real and never touches your generator. If you test on synthetic data, your numbers are theater. This is the most common instrumentation failure we see, and the common mistakes guide treats it as mistake number one.
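As a minimal sketch of the TSTR and TRTR comparison, the snippet below assumes scikit-learn, uses logistic regression and F1 as stand-ins for your own model and task metric, and fabricates toy arrays in place of real and synthetic tables. The names `utility_ratio` and `make_data` are hypothetical helpers, not part of any library.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


def utility_ratio(X_syn, y_syn, X_real_train, y_real_train, X_real_test, y_real_test):
    """Return (tstr, trtr, ratio), all scored with F1 on the same real test set."""
    # TSTR: fit on synthetic records only, score on the real hold-out.
    syn_model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
    tstr = f1_score(y_real_test, syn_model.predict(X_real_test))

    # TRTR baseline: same model class and settings, fit on real training data.
    real_model = LogisticRegression(max_iter=1000).fit(X_real_train, y_real_train)
    trtr = f1_score(y_real_test, real_model.predict(X_real_test))

    return tstr, trtr, tstr / trtr


# Toy data standing in for real and synthetic tables (illustrative only).
rng = np.random.default_rng(0)


def make_data(n, noise):
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=noise, size=n) > 0).astype(int)
    return X, y


X_train, y_train = make_data(2000, noise=0.3)  # real training split
X_test, y_test = make_data(1000, noise=0.3)    # real hold-out, never shown to the generator
X_syn, y_syn = make_data(2000, noise=0.6)      # pretend generator output with noisier labels

tstr, trtr, ratio = utility_ratio(X_syn, y_syn, X_train, y_train, X_test, y_test)
print(f"TSTR={tstr:.3f}  TRTR={trtr:.3f}  ratio={ratio:.3f}")
```

Both models are scored on the identical real hold-out, so the ratio isolates the effect of the training data rather than differences in evaluation.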
Fidelity Metrics Worth Tracking
Fidelity metrics are leading indicators — cheaper to compute than full TSTR, useful for catching problems early in a generation pipeline.
- Marginal distribution distance. For each column, compute a Jensen-Shannon divergence or Wasserstein distance between real and synthetic. Catches obvious shifts but misses correlation breakage.
- Correlation matrix difference. Compute the pairwise correlation matrix for real and synthetic, then take the elementwise difference. Large gaps mean your generator preserved columns but lost their relationships.
- Discriminator score. Train a classifier to tell real from synthetic on a balanced mix of both, and score it on held-out rows. Accuracy near 50 percent means the classifier cannot tell them apart, which is the ideal. Accuracy near 99 percent means your synthetic data has an obvious tell.
- Coverage and density. Coverage checks whether synthetic samples span the full real distribution; density checks whether they cluster in plausible regions. Low coverage with high density is the signature of mode collapse: plausible-looking output drawn from a narrow slice of the real distribution.
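The first two checks in the list above can be sketched with NumPy alone. This is an illustrative version, assuming numeric columns and a histogram-based Jensen-Shannon divergence with 20 shared bins. The helper names `js_divergence` and `max_corr_gap` are hypothetical. The toy "generator" shuffles each column independently, which keeps every marginal intact while destroying the correlation.

```python
import numpy as np


def js_divergence(real_col, syn_col, bins=20):
    """Jensen-Shannon divergence (log base 2, range 0 to 1) between two 1-D samples."""
    # Shared bin edges so the two histograms are directly comparable.
    edges = np.histogram_bin_edges(np.concatenate([real_col, syn_col]), bins=bins)
    p = np.histogram(real_col, bins=edges)[0].astype(float)
    q = np.histogram(syn_col, bins=edges)[0].astype(float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def max_corr_gap(real, syn):
    """Largest absolute elementwise gap between the two Pearson correlation matrices."""
    gap = np.corrcoef(real, rowvar=False) - np.corrcoef(syn, rowvar=False)
    return float(np.abs(gap).max())


rng = np.random.default_rng(0)
real = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
# Toy generator: keeps each column's values but shuffles the columns independently.
syn = np.column_stack([rng.permutation(real[:, 0]), rng.permutation(real[:, 1])])

marginal_js = [js_divergence(real[:, j], syn[:, j]) for j in range(real.shape[1])]
corr_gap = max_corr_gap(real, syn)
print("per-column JS divergence:", np.round(marginal_js, 4))
print("max correlation gap:", round(corr_gap, 3))
```

The marginal check reports no shift at all while the correlation gap is large, which is exactly the failure that marginal-only fidelity scoring misses.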
The tools roundup covers libraries that compute most of these out of the box.
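If you want to see what a discriminator score involves before reaching for a library, here is a minimal sketch. It assumes scikit-learn, numeric features, and a random forest as the classifier; a different classifier may find different tells. The name `discriminator_accuracy` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def discriminator_accuracy(real, syn, seed=0):
    """Cross-validated accuracy of a real-vs-synthetic classifier (0.5 is ideal)."""
    n = min(len(real), len(syn))  # balance the classes so chance is 50 percent
    X = np.vstack([real[:n], syn[:n]])
    y = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    # 5-fold cross-validation scores the classifier on rows it was not trained on.
    return float(cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean())


rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))
syn_good = rng.normal(size=(500, 3))          # drawn from the same distribution
syn_bad = rng.normal(loc=3.0, size=(500, 3))  # shifted, so it has an obvious tell

acc_good = discriminator_accuracy(real, syn_good)
acc_bad = discriminator_accuracy(real, syn_bad)
print(f"matching generator: {acc_good:.2f}   shifted generator: {acc_bad:.2f}")
```

Scoring on held-out folds matters here: a flexible classifier can memorize its training rows and report near-perfect accuracy even when the two sets are indistinguishable.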
Privacy Metrics: Measuring Leakage
If your synthetic data exists to satisfy a privacy constraint, you must prove it does not memorize.
Distance to closest record
For each synthetic record, find its nearest real record. If many synthetic records sit nearly on top of real ones, your generator is copying. Track the distribution of these distances and flag the dangerous tail — a handful of near-duplicates can leak a real person.
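A minimal sketch of this check, assuming scikit-learn, purely numeric columns, and Euclidean distance on standardized features. The helper name `distance_to_closest_record` and the `1e-3` flagging threshold are illustrative assumptions; a sensible threshold depends on your data and is often calibrated against distances between real records.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def distance_to_closest_record(real, syn):
    """Euclidean distance from each synthetic row to its nearest real row."""
    scaler = StandardScaler().fit(real)  # scale so no single column dominates
    nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(real))
    dist, _ = nn.kneighbors(scaler.transform(syn))
    return dist.ravel()


rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 3))
fresh = rng.normal(size=(990, 3))                           # genuinely new samples
copied = real[:10] + rng.normal(scale=1e-6, size=(10, 3))   # near-duplicates of real rows
syn = np.vstack([fresh, copied])

dcr = distance_to_closest_record(real, syn)
threshold = 1e-3  # hypothetical cutoff for "sitting on top of a real record"
n_flagged = int((dcr < threshold).sum())
print("median DCR:", round(float(np.median(dcr)), 3), " flagged:", n_flagged)
```

The median looks healthy even though ten rows are effectively copies, which is why the tail of the distribution deserves its own gate.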
Membership inference resistance
Run a membership inference attack: train an attacker to guess whether a given real record was in the generator's training set, and score the attacker against real records the generator never saw. If the attacker does no better than chance, that is strong evidence the generator is not memorizing. This is the metric regulators and security reviewers respect, and our risks article explains why it matters more than vibes-based privacy claims.
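One simple baseline attack, sketched below, scores each candidate record by its distance to the nearest synthetic record and guesses that closer records were training members. This is an assumption-laden stand-in for a real attack suite: it uses scikit-learn, numeric features, and a hypothetical helper named `mia_auc`. It needs real hold-out records the generator never saw to act as non-members. An AUC near 0.5 means this particular attacker is at chance; stronger attacks may still succeed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors


def mia_auc(train_real, holdout_real, syn):
    """AUC of a distance-based membership attack; 0.5 means chance level."""
    nn = NearestNeighbors(n_neighbors=1).fit(syn)
    d_members, _ = nn.kneighbors(train_real)
    d_nonmembers, _ = nn.kneighbors(holdout_real)
    # Attacker's rule: the closer a record is to synthetic data, the likelier a member.
    scores = -np.concatenate([d_members.ravel(), d_nonmembers.ravel()])
    labels = np.concatenate([np.ones(len(train_real)), np.zeros(len(holdout_real))])
    return float(roc_auc_score(labels, scores))


rng = np.random.default_rng(0)
train_real = rng.normal(size=(500, 4))    # records the generator was trained on
holdout_real = rng.normal(size=(500, 4))  # real records it never saw

syn_safe = rng.normal(size=(1000, 4))                               # fresh samples
syn_leaky = train_real + rng.normal(scale=0.01, size=(500, 4))      # memorized copies

auc_safe = mia_auc(train_real, holdout_real, syn_safe)
auc_leaky = mia_auc(train_real, holdout_real, syn_leaky)
print(f"safe generator AUC: {auc_safe:.2f}   leaky generator AUC: {auc_leaky:.2f}")
```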
How to Instrument the Pipeline
Metrics only help if they run automatically. Bolt them into the generation pipeline as gates, not as an afterthought report.
- Hold out real data first. Before any generation, split off a real test set and a real validation set. These never touch the generator.
- Compute fidelity on every batch. After each generation run, automatically score marginals, correlations, and the discriminator. Fail the run if any breaches a threshold.
- Run TSTR nightly. Utility is expensive, so run it on a schedule rather than per-batch. Plot the utility ratio over time to catch drift as your generator updates.
- Gate releases on privacy. No synthetic dataset ships to a downstream team until distance-to-closest-record and membership inference pass.
- Version everything you track. Tie every metric to a dataset version and generator version so you can answer "what changed" when a number moves.
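The steps above can be wired together as a small gate function. The sketch below is plain Python with hypothetical names (`Gate`, `GATES`, `evaluate_run`) and hypothetical thresholds; only the 0.9 utility target comes from this article, and the rest are placeholders to tune against your own baselines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str
    limit: float
    higher_is_better: bool

    def passes(self, value: float) -> bool:
        return value >= self.limit if self.higher_is_better else value <= self.limit


# Placeholder thresholds (assumptions), apart from the 0.9 utility target.
GATES = [
    Gate("max_marginal_js", 0.05, higher_is_better=False),
    Gate("max_corr_gap", 0.10, higher_is_better=False),
    Gate("discriminator_accuracy", 0.75, higher_is_better=False),
    Gate("utility_ratio", 0.90, higher_is_better=True),
    Gate("mia_auc", 0.60, higher_is_better=False),
]


def evaluate_run(metrics: dict, dataset_version: str, generator_version: str) -> dict:
    """Return a versioned report; a missing metric counts as a failed gate."""
    failures = [
        g.metric for g in GATES
        if g.metric not in metrics or not g.passes(metrics[g.metric])
    ]
    return {
        "dataset_version": dataset_version,
        "generator_version": generator_version,
        "metrics": dict(metrics),
        "failures": failures,
        "passed": not failures,
    }


report_ok = evaluate_run(
    {"max_marginal_js": 0.02, "max_corr_gap": 0.04, "discriminator_accuracy": 0.58,
     "utility_ratio": 0.94, "mia_auc": 0.52},
    dataset_version="ds-v3", generator_version="gen-v7",
)
report_bad = evaluate_run(
    {"max_marginal_js": 0.02, "max_corr_gap": 0.04, "discriminator_accuracy": 0.58,
     "utility_ratio": 0.82},  # utility too low, privacy metric never computed
    dataset_version="ds-v4", generator_version="gen-v8",
)
print(report_ok["passed"], report_bad["failures"])
```

Treating a missing metric as a failure keeps a skipped privacy check from silently passing a release.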
For where to slot this into a broader rollout, see the team rollout guide.
Reading the Signal Without Fooling Yourself
The hardest part is interpretation. A high fidelity score with low utility means your data looks right but lacks the discriminative features your model needs — common when you preserve easy columns and lose the hard ones. High utility with weak privacy means you are essentially leaking real data and got lucky on the task. And a utility ratio that looks great but degrades each time you retrain the generator is the early warning of recursive collapse.
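The last pattern, a utility ratio that slips with every generator retrain, is easy to miss when each individual value still clears the bar. One hedged way to surface it is to count consecutive drops at the end of the history. The helper name `trailing_decline`, the `min_drop` tolerance, and the sample values below are all illustrative assumptions.

```python
def trailing_decline(ratios, min_drop=0.01):
    """Count consecutive drops larger than min_drop at the end of the history."""
    streak = 0
    # Walk backwards from the newest value until a step is not a real drop.
    for i in range(len(ratios) - 1, 0, -1):
        if ratios[i - 1] - ratios[i] > min_drop:
            streak += 1
        else:
            break
    return streak


# Utility ratio per generator version, oldest first (made-up numbers).
collapsing = [0.96, 0.97, 0.95, 0.93, 0.90]
stable = [0.95, 0.96, 0.95, 0.96]

print(trailing_decline(collapsing), trailing_decline(stable))
```

Every value in the first history is at or above 0.9, yet three straight drops are a reason to investigate before the next retrain.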
Treat no single number as the verdict. The honest scorecard is three numbers — fidelity, utility, privacy — read together, against real benchmarks, over time.
Frequently Asked Questions
What is the single most important synthetic data metric?
The utility ratio from Train on Synthetic, Test on Real divided by Train on Real, Test on Real. It directly measures whether your synthetic data trains models that work on reality, which is the only outcome that matters in production.
Can I trust a high fidelity score on its own?
No. Fidelity measures statistical resemblance, not training value. Synthetic data can match every marginal distribution while losing the correlations and rare features your model needs, producing a high fidelity score and poor utility. Always confirm with TSTR.
How do I measure privacy in synthetic data?
Use distance-to-closest-record to catch near-duplicates and a membership inference attack to test whether an adversary can identify training members. Passing both gives defensible evidence that your synthetic data does not memorize real individuals.
Why must the test set be real, not synthetic?
Because testing on synthetic data only proves your model learned your generator's quirks, not the real world. A real held-out test set is the only honest measure of production performance, and it must never touch the generator that made your synthetic data.
How often should I run these metrics?
Run cheap fidelity checks on every generation batch, expensive TSTR utility runs nightly or per release, and privacy gates before any dataset ships downstream. Tie each result to dataset and generator versions so you can trace regressions.
Key Takeaways
- Score synthetic data on three separate axes — fidelity, utility, and privacy — never one.
- Utility via TSTR over TRTR is the metric that predicts production performance; target a ratio above 0.9.
- The test set is always real and never touches the generator.
- Fidelity metrics like correlation difference and discriminator score are cheap leading indicators.
- Prove privacy with distance-to-closest-record and membership inference resistance, not assertions.
- Instrument metrics as automated gates in the pipeline, versioned against generator and dataset.