Principles are easy to nod along to and hard to apply. To show how data collection actually unfolds, this article follows one representative team building a model from a messy start to a measurable outcome. The scenario is composite and illustrative, designed to surface the decisions and trade-offs a real team faces rather than to report on a specific named company.
The team's goal: build an assistant that drafts accurate replies to incoming customer emails for a mid-sized software company. They had data, ambition, and the usual temptation to skip straight to the model. Here is how it went.
The Situation
The company received thousands of support emails a week. Agents answered them by hand, and response quality varied with who was on shift. New agents took weeks to get up to speed, and during busy periods the backlog grew faster than the team could clear it. Leadership wanted an assistant to draft first replies that agents could approve or edit, cutting handle time without sacrificing accuracy.
The team's first instinct was to fine-tune a model on every email they had ever sent. That instinct, left unchecked, would have produced a mess. What they had was raw, inconsistent, and full of personal data. What they needed was a curated, labeled, privacy-safe dataset matched to a clearly defined behavior.
The Decision: Define Before Collecting
Before exporting anything, the team wrote down the exact behavior they wanted and drafted ten ideal email-and-reply pairs by hand. This took an afternoon and changed everything.
Those examples revealed that "draft a reply" was too vague. Replies fell into categories: troubleshooting, billing questions, feature requests, and escalations. Each needed a different tone and structure. The handwritten examples became the specification, exactly as our step-by-step guide recommends.
The Execution: Collecting and Cleaning
With the behavior defined, the team mapped sources. Their support transcripts were the obvious first-party goldmine, but they could not be used raw.
The Cleaning Pipeline
- Privacy scrubbing. Names, account numbers, and other personal data were redacted before anything else, because the legal team flagged it as a hard requirement.
- Deduplication. Templated and near-identical replies were collapsed so the model would not over-learn boilerplate.
- Filtering. Low-quality exchanges, including unresolved threads and agent venting, were dropped.
- Categorization. Each exchange was tagged into one of the four reply categories from the specification.
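The first three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the team's actual pipeline: the field names (`email`, `reply`, `resolved`), the `ACCT-` account-number format, and the regex-based redaction are all hypothetical. Regexes alone will not catch personal names, which typically need a dedicated entity-recognition tool, and exact matching after normalization collapses templated replies but not looser near-duplicates.

```python
import hashlib
import re

# Hypothetical patterns: real personal data takes many more forms than these.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ACCOUNT_RE = re.compile(r"\bACCT-\d{6,}\b")  # assumed account-number format


def scrub(text: str) -> str:
    """Replace email addresses and account numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return ACCOUNT_RE.sub("[ACCOUNT]", text)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def clean(exchanges: list[dict]) -> list[dict]:
    """Scrub first, then drop unresolved threads, then collapse duplicate replies."""
    seen_replies: set[str] = set()
    kept: list[dict] = []
    for ex in exchanges:
        # Privacy scrubbing runs before any other step touches the text.
        scrubbed = {**ex, "email": scrub(ex["email"]), "reply": scrub(ex["reply"])}
        # Filtering: skip threads that were never resolved.
        if not scrubbed["resolved"]:
            continue
        # Deduplication: keep only the first copy of each normalized reply.
        key = hashlib.sha256(normalize(scrubbed["reply"]).encode("utf-8")).hexdigest()
        if key in seen_replies:
            continue
        seen_replies.add(key)
        kept.append(scrubbed)
    return kept
```

Filtering runs before deduplication here so that an unresolved copy of a reply cannot shadow a resolved one; that ordering is a design choice of this sketch rather than something the scenario specifies.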
The team logged provenance for every batch: which export, which date range, what rights applied. This felt like overhead until an audit later asked exactly those questions and they had instant answers.
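A provenance record of that kind can be very small. The sketch below assumes one record per batch, serialized as a JSON line; the class name, field names, and example values are hypothetical and chosen only to mirror the three questions named above (which export, which date range, what rights).

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class BatchProvenance:
    """One record per batch of collected data (hypothetical schema)."""
    batch_id: str
    source_export: str
    date_start: date
    date_end: date
    usage_rights: str


def to_log_line(record: BatchProvenance) -> str:
    """Serialize a record as a single JSON line for an append-only log."""
    payload = asdict(record)
    # Dates are not JSON-serializable by default, so store ISO 8601 strings.
    payload["date_start"] = record.date_start.isoformat()
    payload["date_end"] = record.date_end.isoformat()
    return json.dumps(payload, sort_keys=True)


# Example values are invented for illustration.
example = BatchProvenance(
    batch_id="batch-001",
    source_export="support-transcripts-export",
    date_start=date(2024, 1, 1),
    date_end=date(2024, 3, 31),
    usage_rights="first-party support data, internal training use",
)
line = to_log_line(example)
```

Appending one such line per batch to a log file is enough to answer an auditor's questions without reconstructing history from memory.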
The Labeling Problem
Categorization was where things nearly went wrong. The first round of tagging was done by three people working from a one-paragraph instruction, and their agreement was poor. The billing-versus-escalation distinction in particular split annotators down the middle.
Rather than push forward, the team stopped and rewrote the instructions with concrete edge-case examples, then re-measured agreement. It climbed sharply. This pause cost two days and saved the project from training on contradictory labels, a failure mode detailed in our common mistakes article.
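Measuring agreement before and after an instruction rewrite needs only a small function. The scenario does not specify which statistic the team used; this sketch assumes mean pairwise Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The function names and the dictionary layout (annotator name mapped to a list of labels in item order) are illustrative choices.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(labels_by_annotator: dict[str, list[str]]) -> float:
    """Average Cohen's kappa over every pair of annotators (needs at least two)."""
    pairs = list(combinations(labels_by_annotator.values(), 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Running the same function on the same sample of emails before and after the rewrite gives a like-for-like comparison. Computing kappa per pair of categories can also show where the disagreement concentrates.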
The Subtle Catch: Contamination
When the team built their evaluation set, they pulled recent emails to test against. Someone noticed that several of those exact threads were also sitting in the training data, because both came from the same export.
They decontaminated, removing every training example that overlapped with the test set. Without that catch, their evaluation would have rewarded memorization, and the model would have looked far better in testing than in production.
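A decontamination pass of this kind can be sketched as a set lookup. The version below assumes each example carries a hypothetical `thread_id` and `email` field, and it removes a training example if either its thread ID or a normalized fingerprint of its email text appears in the test set. It catches exact and whitespace-or-case variants only; paraphrased overlaps would need fuzzier matching.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def decontaminate(train: list[dict], test: list[dict]) -> list[dict]:
    """Return the training examples that do not overlap with the test set."""
    test_ids = {ex["thread_id"] for ex in test}
    test_prints = {fingerprint(ex["email"]) for ex in test}
    return [
        ex
        for ex in train
        if ex["thread_id"] not in test_ids
        and fingerprint(ex["email"]) not in test_prints
    ]
```

Checking both the ID and the text matters because the same email can appear under different thread IDs when it was forwarded or re-exported.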
The Outcome
The final dataset was a fraction of the size of the team's original "use everything" plan. It was curated, categorized, privacy-safe, and decontaminated. The model trained on it produced drafts that agents approved with light edits far more often than drafts from an earlier, larger, dirtier attempt.
That earlier attempt came before the discipline kicked in: the team had trained on the full raw export. Its drafts were generic, occasionally leaked phrasing that read like another customer's thread, and confused the four reply categories. Comparing the two attempts side by side was the moment the lesson landed: the constraint was never the model, it was the data feeding it.
The lesson the team took away was counterintuitive: the smaller, slower, more disciplined dataset won decisively. The discipline that felt like friction during collection was exactly what produced the result. For the reusable framework behind their approach, see the framework article.
What the Team Would Do Differently
No project is run perfectly, and in the retrospective the team named a few things they would change next time.
First, they would build the privacy scrubbing into the export itself rather than running it as a separate cleaning pass. Doing it after the fact meant raw personal data briefly lived in their working files, an avoidable exposure.
Second, they would write the labeling instructions with edge cases from the start. The poor first-round agreement was entirely predictable; they simply had not invested in the instructions before handing the task off, and they paid for it with a two-day rework.
Third, they would separate the test set before doing any other processing, so contamination could not happen in the first place. Catching it late worked, but designing it out would have been safer.
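One way to design contamination out is to assign each thread to a split deterministically, from its ID alone, before any cleaning or labeling runs. The sketch below is an assumption about how that could look, not the team's method: `assign_split`, the `salt` value, and the 10% test fraction are all illustrative.

```python
import hashlib


def assign_split(thread_id: str, test_fraction: float = 0.1, salt: str = "split-v1") -> str:
    """Map a thread ID to 'train' or 'test' using a stable hash of the ID."""
    digest = hashlib.sha256(f"{salt}:{thread_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Compare against an integer threshold to avoid float rounding at the edges.
    threshold = int(test_fraction * 2**64)
    return "test" if bucket < threshold else "train"
```

Because the assignment depends only on the thread ID and the salt, a thread lands in the same split on every export and every rerun, so a test thread cannot drift into the training data later.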
The Transferable Lessons
The specifics of this team's situation matter less than the moves that generalize. Any team collecting training data can borrow these:
- Define the behavior with handwritten examples before exporting anything. It reframes vague goals into concrete tasks.
- Handle privacy first, as a hard gate, not a later cleanup.
- Treat poor annotator agreement as a signal to fix the instructions rather than push forward.
- Seal and decontaminate the test set so your metrics stay honest.
- Trust a small, disciplined dataset over a large, messy one for focused tasks.
These are the same moves our step-by-step guide and best practices recommend, validated here against a realistic project. The discipline is not academic; it is what produced a model agents actually trusted.
Frequently Asked Questions
What was the team's biggest early mistake?
Their instinct to fine-tune on every email they had ever sent. That would have fed the model raw, inconsistent, privacy-laden data. Catching that instinct and defining the behavior first reframed the entire project around curation rather than accumulation.
Why did defining the behavior matter so much?
Writing ideal examples by hand exposed that "draft a reply" was actually four distinct tasks with different tones and structures. Without that clarity, the dataset would have blurred the categories together and the model would have produced generic, often inappropriate drafts.
How did the labeling pause pay off?
Stopping to rewrite instructions raised annotator agreement sharply, which meant the model learned consistent categories instead of contradictions. The two-day delay was trivial compared to the cost of training on labels that three people interpreted three different ways.
What would have happened without decontamination?
The evaluation set overlapped with training data, so the model would have scored well by memorizing answers it had already seen. The team would have shipped based on inflated numbers and then watched performance drop on genuinely new emails in production.
Why did the smaller dataset win?
Because quality and consistency beat raw size for a focused task. The curated dataset had clean labels, no privacy landmines, no contamination, and good category coverage. The larger dataset had more examples but more noise, and the noise dragged the model down.
Key Takeaways
- Defining the target behavior and writing ideal examples first reframed a vague goal into four concrete tasks.
- Privacy scrubbing was a hard requirement, handled before any other cleaning step.
- Pausing to fix poor annotator agreement prevented training on contradictory labels.
- Catching test-set contamination kept the evaluation honest and avoided shipping on inflated scores.
- The smaller, disciplined dataset outperformed the larger, dirtier one decisively.