A guide tells you what synthetic data is. A playbook tells you what to do on Tuesday when a model is underperforming and someone proposes generating more data. This is the second kind of document.
The structure here is deliberate: each play has a trigger (what makes you reach for it), an owner (who runs it), and a sequence (what has to be true before you start). Treat it as a menu you select from based on the problem in front of you — not a pipeline you run end to end every time. Most teams need two or three of these plays, not all of them.
Before any play: establish the baseline
Do not generate a single synthetic example until you have a real-only baseline model and a strictly real held-out validation set. This is non-negotiable. Without it, you can't tell whether synthetic data helped, hurt, or did nothing.
Trigger: You're considering synthetic data for the first time on this project.
Owner: The ML lead who owns the model's metrics. Sequence:
- Train a model on real data only and record its scores.
- Freeze a real-only validation set and never let synthetic data touch it.
- Write down the specific metric and threshold synthetic data needs to beat.
Everything downstream measures against this line. If you skip it, you'll be arguing about vibes for the rest of the project.
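One way to keep the baseline from drifting is to write it down as data rather than as a memory. The sketch below is a minimal illustration, assuming each real example has a stable string ID and that higher scores are better. The names `in_validation` and `Baseline`, the 20% split, and the scores are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


def in_validation(example_id: str, fraction: float = 0.20) -> bool:
    """Deterministically assign a real example to the frozen validation set.

    Hashing the ID (instead of shuffling) keeps the split stable when new
    real data arrives, so an example never moves between train and validation.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < fraction


@dataclass(frozen=True)
class Baseline:
    metric_name: str
    real_only_score: float  # real-only model, scored on the frozen real set
    min_gain: float         # margin synthetic data must add to count as a win

    def beaten_by(self, candidate_score: float) -> bool:
        # Assumes higher is better; flip the comparison for loss-style metrics.
        return candidate_score >= self.real_only_score + self.min_gain


# Hypothetical numbers, for illustration only.
baseline = Baseline("macro_f1", real_only_score=0.81, min_gain=0.01)
print(baseline.beaten_by(0.815))  # False: inside the required margin
print(baseline.beaten_by(0.84))   # True
```

Synthetic examples never get an ID that passes through `in_validation`, which is one simple way to guarantee they can't leak into the held-out set.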
Play 1: Close a coverage gap
Trigger: Your error analysis shows the model failing on a specific, identifiable slice — a rare class, an underrepresented segment, an edge case.
This is the highest-value play because the need is precise. You're not adding generic volume; you're adding examples of a known weakness.
Owner: The engineer who ran the error analysis. Sequence:
- Quantify the gap: how many real examples of the failing slice exist, and how many would balance it.
- Generate only for that slice, not the whole dataset.
- Mix the synthetic slice in and retrain.
- Check the slice metric improved without dragging down others.
If the slice metric moves and nothing else regresses, ship it. If overall performance drops, your synthetic slice is probably off-distribution, so fix the generator before you add more. The Real-World Examples and Use Cases piece shows what successful coverage-gap fixes look like in practice.
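The first and last steps of this play reduce to two small calculations. The following is a sketch under assumptions: the helper names, the 0.005 tolerance, and the example scores are hypothetical, and all metrics are assumed to be higher-is-better.

```python
def synthetic_needed(slice_count: int, target_count: int) -> int:
    """How many synthetic examples would bring the slice up to the target."""
    return max(0, target_count - slice_count)


def regressions(before: dict[str, float], after: dict[str, float],
                tolerance: float = 0.005) -> list[str]:
    """Names of metrics that dropped by more than `tolerance` after retraining.

    A metric missing from `after` is treated as a regression, so a check
    can't be skipped silently.
    """
    return [name for name, old in before.items()
            if after.get(name, float("-inf")) < old - tolerance]


# Hypothetical scores on the real-only validation set.
before = {"overall": 0.90, "rare_class": 0.55}
after = {"overall": 0.89, "rare_class": 0.70}

print(synthetic_needed(slice_count=40, target_count=500))  # 460
print(regressions(before, after))  # ['overall']
```

In this made-up run the slice improved but the overall metric fell past the tolerance, which is the "fix the generator" case rather than the "ship it" case.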
A common variant of this play: the gap isn't a class but a combination — say, a particular input type under a particular condition that real logs rarely capture together. Generate for the intersection, not the marginals. Filling each dimension separately won't help if the model fails specifically where they overlap.
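A quick count shows why the marginals can mislead. This toy example uses invented field names (`input_type`, `condition`) and invented records. Each dimension looks covered on its own while the intersection is empty.

```python
from collections import Counter

# Toy records standing in for real logs; values are illustrative only.
records = [
    {"input_type": "pdf", "condition": "normal"},
    {"input_type": "pdf", "condition": "normal"},
    {"input_type": "image", "condition": "low_light"},
    {"input_type": "image", "condition": "low_light"},
    {"input_type": "image", "condition": "normal"},
]

by_type = Counter(r["input_type"] for r in records)
by_condition = Counter(r["condition"] for r in records)
joint = Counter((r["input_type"], r["condition"]) for r in records)

print(by_type["pdf"])               # 2: the input type is represented
print(by_condition["low_light"])    # 2: the condition is represented
print(joint[("pdf", "low_light")])  # 0: but never together
```

Counting the joint cells before generating tells you which combinations to target.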
Play 2: Bootstrap a cold start
Trigger: A new product or feature has no usage data, but you need a working model now.
Owner: The product-facing ML engineer. Sequence:
- Generate a synthetic dataset from your best understanding of the domain.
- Train a v0 model — explicitly treat it as disposable.
- Ship v0 behind monitoring to start collecting real interactions.
- Replace synthetic data with real data as it accumulates, retraining on a schedule.
The mistake here is falling in love with v0. The synthetic data is scaffolding; tear it down as real data arrives. Set a calendar reminder to revisit the synthetic fraction every two weeks during cold start.
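One simple policy for tearing the scaffolding down, offered as an assumption rather than a rule, is to hold the training set at a fixed target size and let synthetic data fill only the shortfall. The function name and the sizes below are hypothetical.

```python
def cold_start_mix(real_count: int, target_size: int) -> tuple[int, float]:
    """Return (synthetic examples to keep, resulting synthetic fraction).

    Synthetic data only fills the gap between the real data on hand and the
    target training-set size, so it shrinks to zero as real data accumulates.
    """
    synthetic_count = max(0, target_size - real_count)
    total = real_count + synthetic_count
    fraction = synthetic_count / total if total else 0.0
    return synthetic_count, fraction


# Hypothetical checkpoints during a cold start.
print(cold_start_mix(real_count=0, target_size=10_000))       # (10000, 1.0)
print(cold_start_mix(real_count=2_500, target_size=10_000))   # (7500, 0.75)
print(cold_start_mix(real_count=12_000, target_size=10_000))  # (0, 0.0)
```

The fraction this produces is a starting point for the biweekly review, not a substitute for measuring it against real validation data.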
Play 3: Distill from a stronger model
Trigger: You have access to a strong, expensive model and want a smaller, cheaper one that approximates it.
Owner: The engineer optimizing for inference cost or latency. Sequence:
- Confirm the teacher model's license permits using its outputs this way.
- Generate labeled examples with the teacher across a diverse prompt set.
- Filter generated examples — drop low-confidence and obviously wrong ones.
- Train the student on the filtered set.
- Compare the student to the teacher on real test data, not synthetic.
The filtering step is where teams cut corners and pay later. Unfiltered teacher output bakes hallucinations into the student permanently. A cheap, effective filter: have the teacher generate each example more than once and keep only the cases where it agrees with itself. Disagreement is a strong signal of low-confidence output you don't want to train on.
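The self-agreement filter can be sketched in a few lines. This assumes the teacher is exposed as a plain function from prompt to label and that exact string match is a reasonable notion of agreement, which holds for classification labels and short answers but not for free-form text. `self_consistent_label` and the stub teacher are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_label(prompt: str,
                          teacher: Callable[[str], str],
                          samples: int = 3,
                          min_agreement: float = 1.0) -> Optional[str]:
    """Query the teacher several times and keep the example only if it agrees.

    Returns the majority output when its vote share reaches `min_agreement`,
    otherwise None (meaning: drop this example from the student's training set).
    """
    outputs = [teacher(prompt) for _ in range(samples)]
    label, votes = Counter(outputs).most_common(1)[0]
    return label if votes / samples >= min_agreement else None


# Stub teacher for illustration; a real one would call the model with sampling on.
def stable_teacher(prompt: str) -> str:
    return "positive"


print(self_consistent_label("great product", stable_teacher))  # positive
```

Lowering `min_agreement` below 1.0 trades a larger training set for noisier labels, so it is worth checking against the real test data like any other setting.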
Play 4: Unblock a privacy constraint
Trigger: Real data exists but can't be used directly due to privacy or regulatory limits.
Owner: The ML lead, working with legal or compliance. Sequence:
- Define the privacy requirement precisely — what must not be reconstructable.
- Generate synthetic data that preserves statistical structure without copying records.
- Run membership-inference tests to verify that the synthetic data doesn't reveal which real records it was built from.
- Document the privacy validation for audit before training.
Synthetic does not equal private by default. Without the membership-inference check, you may have built a dataset that leaks the exact information you were trying to protect.
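As a rough illustration of the idea behind such a test, the sketch below uses a simple distance-based heuristic. It compares how close the synthetic set sits to real records the generator saw versus real records it never saw. This is a swapped-in simplification, not a full membership-inference attack and not a formal privacy guarantee. It assumes numeric feature vectors, and `leakage_score` is a hypothetical name.

```python
import math
from statistics import median


def nearest_distance(record, synthetic):
    """Euclidean distance from one real record to its closest synthetic record."""
    return min(math.dist(record, s) for s in synthetic)


def leakage_score(train_real, holdout_real, synthetic):
    """Fraction of generator-training records that sit closer to the synthetic
    set than the median held-out record does.

    Around 0.5 is what you'd expect if the generator treats seen and unseen
    records alike; values near 1.0 suggest synthetic rows hug the training rows.
    """
    threshold = median([nearest_distance(h, synthetic) for h in holdout_real])
    closer = sum(nearest_distance(t, synthetic) < threshold for t in train_real)
    return closer / len(train_real)


# Toy data: the "synthetic" rows are verbatim copies of the training rows.
train = [(0.0, 0.0), (1.0, 1.0)]
holdout = [(5.0, 5.0), (6.0, 6.0)]
copied = [(0.0, 0.0), (1.0, 1.0)]
print(leakage_score(train, holdout, copied))  # 1.0
```

A high score here is a reason to stop and investigate; a low score is not proof of privacy, which is why the requirement defined in the first step should drive the choice of tests.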
Owners at a glance
- ML lead: baseline, privacy play, final ratio sign-off.
- Error-analysis engineer: coverage-gap play.
- Product ML engineer: cold-start play.
- Optimization engineer: distillation play.
Naming owners prevents the common failure where everyone assumes someone else validated the synthetic data.
The cross-cutting discipline: tune the ratio
Whatever play you run, the synthetic-to-real ratio is a hyperparameter, not a constant. Sweep it (0%, 25%, 50%, 75%), measure each point on the real-only validation set, and pick the peak. Re-sweep whenever you change the generator or the task.
This discipline guards against the slow drift into model collapse, where each generation trained on synthetic output loses more of the distribution's tails. For the framework behind these ratio decisions, see A Framework for Synthetic Data in AI Training.
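The sweep itself is a short loop. The sketch below assumes you can wrap your training run in a function that takes a synthetic fraction and returns a score on the real-only validation set, with higher being better. `sweep_ratio`, `train_and_score`, and the scores are hypothetical.

```python
from typing import Callable, Iterable


def sweep_ratio(train_and_score: Callable[[float], float],
                fractions: Iterable[float] = (0.0, 0.25, 0.50, 0.75),
                ) -> tuple[float, dict[float, float]]:
    """Train once per synthetic fraction and return (best fraction, all scores).

    The 0.0 point is the real-only baseline, so every other point is
    compared against it for free. Ties go to the smaller fraction.
    """
    scores = {f: train_and_score(f) for f in fractions}
    best = max(scores, key=scores.get)
    return best, scores


# Stand-in for a real training run; the scores are invented for illustration.
fake_results = {0.0: 0.80, 0.25: 0.83, 0.50: 0.82, 0.75: 0.78}
best, scores = sweep_ratio(fake_results.__getitem__)
print(best)  # 0.25
```

Keeping the full `scores` dictionary, not just the winner, makes it easy to see whether the peak is sharp or flat before committing to a ratio.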
Sequencing the plays together
When a project needs more than one play, order matters:
- Baseline first, always.
- Cold start if there's no real data, until real data exists.
- Coverage gap once you have real data and error analysis.
- Distillation or privacy as standalone needs, layered on top.
Running coverage-gap work before you have a baseline is the most common sequencing error. You end up tuning against a moving target. Lock the baseline, then choose your play.
Frequently Asked Questions
How many of these plays should a typical team run?
Most run two or three. The baseline is mandatory. After that, pick the plays that match your actual problem — a coverage gap, a cold start, a cost target, or a privacy block. Running plays you don't need adds risk without benefit.
Who should own the synthetic data decision?
The ML lead who owns the model's headline metric should sign off on the final synthetic-to-real ratio, but the engineer closest to the specific problem runs the play. Splitting it this way keeps accountability clear and prevents the "I thought you validated it" failure.
Can I run multiple plays on the same dataset?
Yes, but track each synthetic source separately so you can attribute regressions. If a cold-start set and a coverage-gap set are both in your training data and quality drops, you need to know which one to pull. Label the provenance of every example.
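Provenance labeling can be as light as one extra field per example. This is a minimal sketch with hypothetical names (`Example`, `drop_source`) and made-up source tags.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    label: str
    source: str  # e.g. "real", "cold_start", "coverage_gap"


def drop_source(dataset: list[Example], source: str) -> list[Example]:
    """Return the dataset without examples from one synthetic source."""
    return [ex for ex in dataset if ex.source != source]


dataset = [
    Example("order arrived late", "complaint", "real"),
    Example("where is my refund", "complaint", "cold_start"),
    Example("refund for a gift card order", "complaint", "coverage_gap"),
]

print(Counter(ex.source for ex in dataset))
remaining = drop_source(dataset, "cold_start")
print(len(remaining))  # 2
```

With the tag in place, attributing a regression is a matter of retraining with one source removed at a time.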
What if the synthetic data makes the baseline worse?
Pull it and investigate the generator before adding more. A regression against the real-only baseline means your synthetic data is off-distribution, low-fidelity, or in the wrong ratio. The baseline exists precisely so you can catch this immediately rather than after deployment.
How often should I re-run the ratio sweep?
Every time you change the generator, the task definition, or add a significant amount of real data. The optimal ratio is not stable across these changes, and stale ratios are a quiet source of degradation.
Key Takeaways
- Establish a real-only baseline and a real-only validation set before generating anything — it's the line every play measures against.
- Choose plays by problem: coverage gap, cold start, distillation, or privacy. Most teams need two or three, not all four.
- Each play needs a named owner; the ML lead signs off on the final ratio while the closest engineer runs the play.
- Treat the synthetic-to-real ratio as a tuned hyperparameter and re-sweep it whenever the generator or task changes.
- Sequence matters: baseline first, cold start until real data exists, then coverage and standalone plays on top.