A consumer fintech company we worked with sent roughly forty thousand lifecycle emails a month: onboarding sequences, payment reminders, dunning notices, feature announcements, and win-back campaigns. A four-person content team wrote all of it by hand, and the queue was always a week behind. The obvious move was to draft with a language model. The non-obvious problem was that this brand had spent three years building a voice customers actually recognized (plain-spoken, slightly wry, never condescending about money), and that voice was the company's most defensible asset. If AI drafts eroded it, the efficiency gain would cost them the thing that made the emails work.
This is the story of how they introduced AI into that pipeline, the register failure that surfaced in week two, and the tone-control system they built to keep forty thousand monthly emails sounding like one consistent human. The numbers at the end are theirs, shared with permission and rounded.
The Situation: Speed Versus Voice
The content lead framed the tension precisely in the kickoff. Hand-written emails took an average of forty minutes each including review. The team could draft maybe eight a day at quality. Demand was for three times that. AI could close the gap, but only if the drafts came out in-voice enough that editors were polishing rather than rewriting. A draft that needed a full rewrite saved no time at all.
The first attempt and what broke
The initial prompt was a paragraph describing the brand as "approachable, smart, and human." For about a week it looked fine. Then a payment-failure email went out that opened with "Oops! Looks like your payment didn't go through!" followed by an emoji, far too breezy for a moment when a customer is anxious about money. The brand's actual voice was warm but never flippant about finances. The model had read "approachable" as jokey, and in a sensitive context that flippancy came across as the company not taking the customer's money seriously. The diagnosis had three parts:
- The vague adjective "approachable" gave the model latitude it used badly in emotionally loaded contexts.
- Register requirements were context-dependent: the right tone for a feature announcement was wrong for a dunning notice.
- A single global voice description could not encode that context sensitivity.
The Decision: Build a Register Spec, Not a Vibe
The team stopped trying to describe the voice in prose and instead decomposed it into named, testable rules. This mirrors the structured approach detailed in The Anatomy of a Reusable Brand Voice Prompt, which they adapted to their pipeline.
What went into the spec
They wrote explicit, checkable instructions rather than adjectives:
- Contractions always on (the brand never wrote "do not" when "don't" fit).
- A banned-word list: no "oops," no "uh-oh," no emoji in financial-status emails, no "super" as an intensifier, no exclamation points in dunning sequences.
- A per-context register table: announcements could be playful; payment and account-security emails had to be calm, direct, and reassuring.
- A hedging rule: state what the customer needs to do in one clear sentence before any explanation.
The key shift was treating context as a first-class input. The prompt now received the email type, and the type selected which register profile applied. A reminder email and a security alert no longer drew from the same tone.
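One way to picture "context as a first-class input" is a lookup from email type to register profile, done before any prompt is built. The sketch below is illustrative only: the profile names, the email-type keys, and the `select_profile` helper are hypothetical, and only the tone descriptions come from the register table described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterProfile:
    name: str
    tone: str


# Hypothetical profile table; the tone wording follows the register table above.
PROFILES = {
    "playful": RegisterProfile("playful", "playful, plain-spoken, slightly wry"),
    "calm": RegisterProfile("calm", "calm, direct, and reassuring"),
}

# Hypothetical email-type keys; the team's real type names are not given.
EMAIL_TYPE_TO_PROFILE = {
    "feature_announcement": "playful",
    "payment_failure": "calm",
    "dunning": "calm",
    "security_alert": "calm",
}


def select_profile(email_type: str) -> RegisterProfile:
    """Return the register profile for an email type, failing loudly if unmapped."""
    try:
        return PROFILES[EMAIL_TYPE_TO_PROFILE[email_type]]
    except KeyError:
        raise ValueError(f"no register profile mapped for {email_type!r}") from None
```

Raising on an unmapped type, rather than falling back to a default tone, is a design assumption here: it keeps a new email type from silently inheriting a register nobody chose for it.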
The Execution: Profiles, Examples, and a Review Gate
Anchoring each profile with examples
For each register profile they pasted two hand-written exemplars into the prompt (one short, one longer) so the model had a concrete target rather than an abstract description. The exemplars carried voice nuances the rules could not fully capture: a particular dry rhythm, the habit of leading with the customer's situation before the company's.
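As a rough sketch of how rules and exemplars might sit together in one prompt, the function below assembles a prompt from a profile's tone, its rules, and exactly two exemplars. The function name, argument names, and section labels are assumptions for illustration; the team's actual prompt layout is not described in that detail.

```python
def build_prompt(
    email_type: str,
    tone: str,
    rules: list[str],
    exemplars: list[str],
    brief: str,
) -> str:
    """Assemble a drafting prompt: context first, then rules, then exemplars."""
    if len(exemplars) != 2:
        raise ValueError("expected exactly two exemplars (one short, one longer)")
    rule_lines = "\n".join(f"- {rule}" for rule in rules)
    exemplar_blocks = "\n\n".join(
        f"Example {i}:\n{text.strip()}"
        for i, text in enumerate(exemplars, start=1)
    )
    return (
        f"Email type: {email_type}\n"
        f"Register: {tone}\n\n"
        f"Rules:\n{rule_lines}\n\n"
        f"Hand-written examples in this register:\n\n{exemplar_blocks}\n\n"
        f"Brief:\n{brief.strip()}\n"
    )
```

Putting the rules ahead of the exemplars is a layout choice in this sketch, not something the team reported testing.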
Instrumenting the output
They did not trust the drafts blindly. Editors scored each AI draft on a five-point in-voice scale before publishing, and those scores fed a weekly review. The scoring approach drew directly on Scoring Whether Generated Tone Actually Fits the Reader. When a profile's average score dropped, they knew which one to fix.
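The weekly roll-up can be sketched as a per-profile average compared against a threshold. This is a minimal illustration, assuming scores are whole numbers from 1 to 5 and a publish threshold of 4.0; the team's actual threshold is not stated, and `profiles_needing_repair` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean

PUBLISH_THRESHOLD = 4.0  # assumed value; the source does not give the real one


def profiles_needing_repair(scores, threshold=PUBLISH_THRESHOLD):
    """Average editor scores per profile and return the profiles below threshold.

    `scores` is an iterable of (profile_name, score) pairs for one week.
    """
    by_profile = defaultdict(list)
    for profile, score in scores:
        if not 1 <= score <= 5:
            raise ValueError(f"score {score!r} is outside the five-point scale")
        by_profile[profile].append(score)
    averages = {name: mean(values) for name, values in by_profile.items()}
    return {name: avg for name, avg in averages.items() if avg < threshold}
```

For example, a week with announcement drafts scored 5 and 4 and security drafts scored 3 and 4 would flag only the security profile, with an average of 3.5.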
Closing the loop
Low-scoring drafts were traced back to the prompt. Two patterns recurred: the announcement profile occasionally drifted toward marketing hype, and the security profile sometimes over-hedged. Each got one new rule. Within three weeks the in-voice scores stabilized above the team's publish threshold.
The Outcome: The Numbers
After two months on the new system the team reported the following, measured against their pre-AI baseline.
What the measurements showed
- Average draft-to-send time fell from forty minutes to roughly twelve, because editors were polishing in-voice drafts instead of rewriting off-voice ones.
- Throughput rose enough to clear the backlog and ship the threefold volume the business wanted.
- The in-voice score, once the spec stabilized, sat consistently above the publish threshold, and the flippant-tone incidents that had worried leadership did not recur.
- Engagement metrics on lifecycle emails held flat, which the team counted as a win: they had tripled volume without degrading the voice customers responded to.
The honest caveat: the first three weeks were slower than hand-writing, because building and tuning the register spec was real work. The payoff came only after the profiles stabilized.
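The throughput figures can be sanity-checked with simple arithmetic. The sketch below assumes, as a simplification the source does not state outright, that the team's daily editing time stays fixed at what eight forty-minute emails consumed; `capacity_check` is a hypothetical helper.

```python
def capacity_check(old_minutes, new_minutes, old_per_day, demand_multiplier):
    """Compare daily capacity at the new per-email time against demand."""
    editing_minutes = old_minutes * old_per_day      # fixed daily editing budget
    new_per_day = editing_minutes // new_minutes     # whole emails that fit
    required_per_day = old_per_day * demand_multiplier
    return {
        "speedup": round(old_minutes / new_minutes, 2),
        "new_per_day": new_per_day,
        "required_per_day": required_per_day,
        "meets_demand": new_per_day >= required_per_day,
    }


result = capacity_check(old_minutes=40, new_minutes=12, old_per_day=8, demand_multiplier=3)
print(result)
```

Under that assumption, 320 minutes of daily editing at twelve minutes each covers 26 emails against a requirement of 24, which is consistent with the reported threefold volume.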
The Lessons
Decompose voice into testable rules
The breakthrough was abandoning prose descriptions in favor of named, checkable rules: contraction policy, banned words, per-context profiles, hedging rule. A spec you can audit beats a vibe you can only feel.
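Part of what makes such a spec auditable is that several rules can be checked mechanically before an editor ever reads the draft. The sketch below covers a few of the rules named earlier; the email-type keys, the list of uncontracted forms, and the emoji character ranges are rough assumptions, and judgment calls such as "super" used as an intensifier are left to editors.

```python
import re

# Assumed type keys for "financial-status" emails; the real names are not given.
FINANCIAL_STATUS_TYPES = {"payment_failure", "payment_reminder", "dunning"}

BANNED_WORDS = re.compile(r"\b(oops|uh-oh)\b", re.IGNORECASE)
# Illustrative, non-exhaustive list of forms that should be contracted.
UNCONTRACTED = re.compile(
    r"\b(do not|does not|did not|cannot|will not|is not|are not)\b",
    re.IGNORECASE,
)
# Rough emoji ranges (pictographs, misc symbols, dingbats); not exhaustive.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def audit_draft(text: str, email_type: str) -> list[str]:
    """Return a list of rule violations found in a draft; empty means clean."""
    violations = []
    for match in BANNED_WORDS.finditer(text):
        violations.append(f"banned word: {match.group(0)!r}")
    for match in UNCONTRACTED.finditer(text):
        violations.append(f"uncontracted form: {match.group(0)!r}")
    if email_type == "dunning" and "!" in text:
        violations.append("exclamation point in a dunning email")
    if email_type in FINANCIAL_STATUS_TYPES and EMOJI.search(text):
        violations.append("emoji in a financial-status email")
    return violations
```

A check like this complements the editor scoring rather than replacing it: it catches hard-constraint violations, while rhythm and register still need a human read.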
Make context an input
Register is context-dependent. The same brand needs different tones for a celebration and a security alert. Feeding the email type into the prompt and selecting a profile from it was the change that prevented the flippant-payment-email failure.
Measure before you trust
The scoring gate caught drift early and told the team exactly which profile to repair. Without instrumentation, the slow erosion of voice would have been invisible until customers noticed. When the team weighed whether the effort was worth it, the framing in Putting Real Numbers Behind a Tone-Control Investment matched their own payback math. And for teams starting fresh, the same path is compressed in Your Fastest Route to a First Reliable Tone Spec.
Frequently Asked Questions
Why did a simple voice description fail at first?
Because a single prose description could not encode context sensitivity. "Approachable" was right for an announcement and disastrously flippant for a payment-failure notice. The model had no way to know the context mattered until the team made the email type an explicit input that selected a different register profile.
How long did it take to see results?
The first three weeks were actually slower than hand-writing, because building and tuning the register spec took real effort. The payback came in month two, when draft-to-send time fell from forty minutes to about twelve and the backlog cleared.
What made the per-context profiles work?
Each profile combined explicit rules (contractions, banned words, hedging) with two hand-written exemplars. The rules handled hard constraints; the exemplars carried the dry rhythm and structure that rules could not fully describe. Feeding the email type into the prompt selected the right profile automatically.
How did the team catch register drift?
Editors scored every draft on a five-point in-voice scale before publishing, and those scores rolled up into a weekly review. When a profile's average dipped, the team traced low-scoring drafts back to the prompt and added one corrective rule.
Did engagement metrics suffer?
No. Engagement held flat while volume tripled, which the team counted as the real win. They had scaled output threefold without degrading the voice that made the emails effective in the first place.
Could a smaller team replicate this?
Yes, at smaller scale. The core moves (decompose voice into rules, make context an input, score drafts before sending) work for any volume. A solo operator can run a lighter version with one register profile and spot-check scoring rather than a full review cadence.
Key Takeaways
- A single prose voice description failed because it could not encode that register requirements change with context.
- The fix was decomposing voice into named, testable rules: contraction policy, banned words, per-context profiles, and a hedging rule.
- Making the email type an explicit input let the prompt select the right register profile and prevented flippant tone in sensitive contexts.
- Two hand-written exemplars per profile captured voice nuances that rules alone could not express.
- A pre-send in-voice scoring gate caught register drift early and pointed to the exact profile that needed repair.
- Draft-to-send time fell from forty minutes to about twelve, tripling throughput without degrading engagement, after a three-week tuning investment.