How One Agency Tamed Its Runaway Prompts

This is the story of a mid-sized digital agency that built AI features for a roster of clients and nearly let unmanaged prompts sink one of its biggest accounts. It is composited from common patterns rather than a single named company, but every decision and consequence below mirrors what teams in this position actually face. The arc is familiar: things work, then they quietly stop working, then someone has to fix the underlying discipline.

What makes the story worth telling is not the crisis but the response. The agency did not buy an expensive platform or hire a specialist. It adopted a handful of versioning practices in sequence and measured the difference. By the end, prompt-related incidents had dropped sharply and the team could finally explain its own AI behavior.

Follow the situation, the decisions, the execution, and the outcome. Then borrow what fits.

The Situation

The agency ran AI-powered features for roughly fifteen clients: content drafting, support triage, lead classification. Each feature was driven by prompts that lived wherever the original engineer had put them, inline in code, in config files, in one case in a comment.

How the trouble started

A flagship client reported that their support assistant had started misrouting tickets. The agency investigated and hit a wall.

The prompt had been edited by three people over two months
None of the edits were documented
The original, working prompt no longer existed anywhere

The team spent two days reconstructing a prompt that approximately worked. The client noticed the downtime. Trust took a hit. Leadership decided the ad hoc approach had to end.

The Decision

Rather than over-engineer a solution, the team's lead set a deliberately modest goal: every prompt should have a recorded history, a reason for each change, and a way to roll back. Nothing more, at first.

Why they kept it small

The lead had seen ambitious tooling projects stall. A heavy system nobody adopted would be worse than a light one everybody used. The plan was to start with practices, prove the value, and only then consider tooling.

The sequence they chose mirrored the steps in A Step-by-Step Approach to Prompt Versioning: inventory first, consolidate storage, define a version format, then add evaluation and rollback.

The Execution

Over two weeks, the team worked through the plan one stage at a time, fitting it around client work rather than pausing everything.

Week one: inventory and consolidation

They searched the codebase and found 47 distinct prompts, several duplicated
Every prompt moved into the repository as a versioned file
Each got an initial version 1.0.0 and a named owner

Week two: format, evaluation, and rollback

They adopted a major-minor-patch scheme and made versions immutable
For the five highest-traffic prompts, they built small evaluation sets
They wired the application to select prompts by version number

That last change was the quiet hero. By referencing prompts by version rather than inline text, rollback went from an hour-long deploy to a one-line configuration switch, exactly the property that pays off in Prompt Versioning: Real-World Examples and Use Cases.

The Outcome

Three months after adoption, the team compared the period before and after.

What changed measurably

Prompt-related incidents dropped from roughly one per week to one in the entire quarter
The two incidents that did occur were resolved in minutes via rollback, not days
A model migration that would once have been terrifying went smoothly because the model was treated as part of the version

What changed culturally

Engineers stopped fearing prompt edits because they could always revert
Client conversations about AI behavior became factual, backed by version history
The team could finally answer "why did it say that" with a specific version

The biggest surprise was how cheap the win had been. No platform purchase, no new hire, just a fortnight of disciplined cleanup and a commitment to a few non-negotiable habits.

The First Real Test

The system was barely two weeks old when it faced a genuine emergency, and how it performed shaped the team's confidence in the new discipline.

A bad change reaches production

An engineer adjusted the content-drafting prompt to encourage a more formal tone for a particular client. The edit was reasonable, but it interacted badly with an existing instruction and started producing stilted, repetitive copy across every client using that prompt, not just the intended one.

The problem surfaced within an hour through a client's feedback
The engineer checked the version history and saw exactly what had changed and why
A single configuration switch reverted production to the previous version

What would once have been an afternoon of frantic reconstruction was instead a five-minute fix. The engineer then made the tone change correctly in a fresh version, evaluated it against the test inputs, and shipped it the next day without drama. The team later pointed to this incident as the moment the investment proved itself, because the contrast with the original misrouting crisis was so stark.

A second, quieter benefit

Beyond the fast recovery, the incident produced something subtler: a clear record that the team could share internally. New engineers could read the version history and understand not just what the prompts said but how they had evolved and why. Onboarding to a feature stopped requiring a conversation with whoever happened to remember the history, because the history now documented itself.

What They Would Do Differently

In a retrospective, the team named two regrets, both instructive.

Earlier evaluation

They built evaluation sets only for the top five prompts and wished they had covered more sooner. The prompts without evaluation gates were the ones that produced the quarter's two incidents. The case for measurement-gated promotion is laid out in Prompt Versioning: Best Practices That Actually Work.

Earlier ownership

Assigning owners during inventory, rather than after the first incident, would have prevented the original misrouting crisis entirely. The lesson echoes the warning in 7 Common Mistakes with Prompt Versioning (and How to Avoid Them) about ownerless prompts.

Frequently Asked Questions

Did the agency need special software to fix this?

No. They used their existing code repository for storage and versioning, built small evaluation sets by hand, and added a configuration value to select prompt versions at runtime. The transformation came from disciplined practices, not from purchasing a platform.

How long did the turnaround take?

The core system was in place within two weeks, worked around ongoing client commitments. The measurable payoff, a sharp drop in incidents, was visible within the following quarter. The early steps delivered most of the benefit before the system was fully built out.

What single change had the biggest impact?

Referencing prompts by version number instead of pasting text inline. That decoupling turned rollback from a slow deploy into an instant configuration switch, which is what converted potential outages into minor blips. It was also the prerequisite for clean model migrations.

Why start with practices instead of tooling?

The team had seen heavy tooling projects stall from low adoption. By proving the value of versioning with lightweight practices first, they earned the buy-in that would justify tooling later. A light system everyone uses beats a sophisticated one nobody adopts.

Were clients aware of the change?

Clients did not see the internal system, but they felt its effects: fewer incidents and faster recovery when problems did occur. The team also found that version history made client conversations about AI behavior concrete and credible rather than defensive.

Key Takeaways

Undocumented, ownerless prompts edited by multiple people quietly degraded a flagship feature until the original working version was unrecoverable.
The agency chose a deliberately modest goal, history plus change reasons plus rollback, and proved value before considering any tooling.
Inventory, consolidation, an immutable version format, and version-based references were built in two weeks around ongoing client work.
Referencing prompts by version turned rollback from an hour-long deploy into a one-line switch, the change with the largest impact.
Incidents fell from weekly to one per quarter, and the team's two regrets were not adding evaluation and ownership sooner.

Follow the situation, the decisions, the execution, and the outcome. Then borrow what fits.

The Situation

How the trouble started

A flagship client reported that their support assistant had started misrouting tickets. The agency investigated and hit a wall.

The prompt had been edited by three people over two months
None of the edits were documented
The original, working prompt no longer existed anywhere

The team spent two days reconstructing a prompt that approximately worked. The client noticed the downtime. Trust took a hit. Leadership decided the ad hoc approach had to end.

The Decision

Why they kept it small

The sequence they chose mirrored the steps in A Step-by-Step Approach to Prompt Versioning: inventory first, consolidate storage, define a version format, then add evaluation and rollback.

The Execution

Over two weeks, the team worked through the plan one stage at a time, fitting it around client work rather than pausing everything.

Week one: inventory and consolidation

They searched the codebase and found 47 distinct prompts, several duplicated
Every prompt moved into the repository as a versioned file
Each got an initial version 1.0.0 and a named owner

Week two: format, evaluation, and rollback

They adopted a major-minor-patch scheme and made versions immutable
For the five highest-traffic prompts, they built small evaluation sets
They wired the application to select prompts by version number

The Outcome

Three months after adoption, the team compared the period before and after.

What changed measurably

Prompt-related incidents dropped from roughly one per week to one in the entire quarter
The two incidents that did occur were resolved in minutes via rollback, not days
A model migration that would once have been terrifying went smoothly because the model was treated as part of the version

What changed culturally

Engineers stopped fearing prompt edits because they could always revert
Client conversations about AI behavior became factual, backed by version history
The team could finally answer "why did it say that" with a specific version

The biggest surprise was how cheap the win had been. No platform purchase, no new hire, just a fortnight of disciplined cleanup and a commitment to a few non-negotiable habits.

The First Real Test

The system was barely two weeks old when it faced a genuine emergency, and how it performed shaped the team's confidence in the new discipline.

A bad change reaches production

The problem surfaced within an hour through a client's feedback
The engineer checked the version history and saw exactly what had changed and why
A single configuration switch reverted production to the previous version

A second, quieter benefit

What They Would Do Differently

In a retrospective, the team named two regrets, both instructive.

Earlier evaluation

Earlier ownership

Frequently Asked Questions

Did the agency need special software to fix this?

How long did the turnaround take?

What single change had the biggest impact?

Why start with practices instead of tooling?

Were clients aware of the change?

Key Takeaways

Undocumented, ownerless prompts edited by multiple people quietly degraded a flagship feature until the original working version was unrecoverable.
The agency chose a deliberately modest goal, history plus change reasons plus rollback, and proved value before considering any tooling.
Inventory, consolidation, an immutable version format, and version-based references were built in two weeks around ongoing client work.
Referencing prompts by version turned rollback from an hour-long deploy into a one-line switch, the change with the largest impact.
Incidents fell from weekly to one per quarter, and the team's two regrets were not adding evaluation and ownership sooner.

How One Agency Tamed Its Runaway Prompts

The Situation

How the trouble started

The Decision

Why they kept it small

The Execution

Week one: inventory and consolidation

Week two: format, evaluation, and rollback

The Outcome

What changed measurably

What changed culturally

The First Real Test

A bad change reaches production

A second, quieter benefit

What They Would Do Differently

Earlier evaluation

Earlier ownership

Frequently Asked Questions

Did the agency need special software to fix this?

How long did the turnaround take?

What single change had the biggest impact?

Why start with practices instead of tooling?

Were clients aware of the change?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

Case Study: Large Language Models in Practice

Thirty-Second Wins Breed False Confidence With LLMs

Ready to certify your AI capability?

How One Agency Tamed Its Runaway Prompts

The Situation

How the trouble started

The Decision

Why they kept it small

The Execution

Week one: inventory and consolidation

Week two: format, evaluation, and rollback

The Outcome

What changed measurably

What changed culturally

The First Real Test

A bad change reaches production

A second, quieter benefit

What They Would Do Differently

Earlier evaluation

Earlier ownership

Frequently Asked Questions

Did the agency need special software to fix this?

How long did the turnaround take?

What single change had the biggest impact?

Why start with practices instead of tooling?

Were clients aware of the change?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

Case Study: Large Language Models in Practice

Thirty-Second Wins Breed False Confidence With LLMs

Ready to certify your AI capability?