This is the story of a single prompt and the journey it took from working on one model to working reliably across three. The names and numbers are illustrative, but the arc is faithful to how this kind of migration actually unfolds: a clear business reason, a confident start, an unpleasant surprise, a disciplined recovery, and a measurable improvement that justified the effort.
The team was a mid-sized agency running a document-extraction feature for a client. The feature pulled structured data from contracts, and it ran on a single capable but expensive model. The mandate from the client was simple: cut cost and improve latency without losing accuracy. That mandate set the whole migration in motion.
What follows is the situation they faced, the decision they made, how they executed it, what it produced, and the lessons that generalize. If you want the conceptual map behind their moves, The Complete Guide to Prompting Across Different Model Architectures covers it.
The Situation
A Single Expensive Model
The extraction feature lived on one large model. It was accurate but slow and costly at the client's volume. As usage grew, the per-document cost became the line item the client wanted reduced, and latency complaints from end users were mounting.
The Constraint
Accuracy could not drop. The extracted fields fed downstream automation, so an error propagated. Any cheaper or faster model had to match the incumbent's accuracy on the team's real documents, not on a benchmark.
Why a Single Model Was a Liability
Beyond cost, depending on one model meant one point of failure and zero negotiating leverage. The team recognized that a prompt able to run across architectures was strategically safer, an insight reinforced by the brittleness risks in Stress-Testing Prompts Before They Reach a Client.
The Decision
Route Across Three Models
The team decided to support three models: keep the expensive one for the hardest documents, add a cheaper fast model for routine ones, and evaluate a reasoning model for the ambiguous middle. Routing by document difficulty would cut average cost while protecting accuracy where it mattered.
Separate Intent From Implementation First
Before touching any new model, they refactored the existing prompt to separate the extraction intent from the model-specific scaffolding. This made the prompt adaptable instead of welded to its original model, the foundational move from Cross-Model Prompting Principles Worth Defending.
- They wrote the field-extraction intent in model-neutral language
- They isolated format and reasoning scaffolding into a swappable layer
- This refactor alone took a day and paid off repeatedly
The Execution
Building the Frozen Test Set
They assembled two hundred real contracts with hand-verified correct extractions as a frozen test set. Every model would face the identical set, making the comparison fair. This set became the spine of the whole migration.
First Contact With the Cheap Model
The cheap fast model, run with the original scaffolding, produced a nasty surprise: it dropped fields on ambiguous documents and occasionally wrapped output in explanatory prose. Accuracy on the frozen set came in well below the incumbent. The naive port had failed.
The Recovery
Rather than abandon the cheap model, the team diagnosed the specific gaps. They added an explicit output contract with null-on-missing, which stopped the dropped fields and the prose. They restricted the cheap model to the routine documents the routing layer identified as low-ambiguity. On that subset, its accuracy matched the incumbent.
- An explicit contract closed the formatting and dropped-field gaps
- Routing kept the cheap model on documents it handled well
- The reasoning model, tested on ambiguous documents, beat the cheap model there
Validating the Reasoning Model
For the ambiguous middle, the reasoning model shone, but only after the team removed the step-by-step cue carried over from the original prompt. With a clean problem statement, it matched the incumbent's accuracy on hard documents at lower latency. The over-instruction trap, caught and removed.
The Outcome
Measurable Results
After routing was live, the team measured a meaningful drop in average per-document cost and a reduction in average latency, while accuracy on the frozen test set held even with the incumbent across the full document mix. The client mandate was met without an accuracy regression.
A More Resilient System
Beyond the numbers, the team now had a prompt that ran across three architectures and a frozen test set that made adding a fourth straightforward. The single-model liability was gone. The recurring-validation habit they adopted is the discipline detailed in Building a Repeatable Workflow for Prompt Sensitivity and Robustness Testing.
The Lessons
Refactor Before You Migrate
Separating intent from implementation before touching new models was what made the rest tractable. Teams that skip this step rewrite the whole prompt per model and burn out before finishing.
Routing Beats One-Size-Fits-All
The biggest win came from sending each document to the model that handled it best, not from finding one model for everything. Architecture-aware routing turned the migration from a compromise into an improvement.
The Test Set Was the Real Asset
The two hundred verified contracts did more than validate the migration; they became permanent infrastructure that de-risks every future model change. The migration's lasting value was as much the test set as the cost savings.
What They Would Do Differently
Build the Test Set First
In hindsight, the team wished they had assembled the frozen test set before deciding anything else. They built it partway through, after the cheap model's first failure already cost them a day of confused debugging. With the set in hand from the start, that failure would have been a clear, measured result rather than a surprise.
Budget Time for the Reasoning Model's Quirks
The reasoning model's need for a cleaner, sparser prompt caught them off guard, because every instinct from years of chat-model work pushed toward adding instructions, not removing them. They underestimated how much unlearning the reasoning family required. Next time they would test reasoning models with a deliberately minimal prompt first.
- Assemble verified test data before evaluating any model
- Expect reasoning models to want less instruction, not more
- Treat the first naive port as a diagnostic step, not a setback
Communicate the Plan to the Client Early
The team kept the migration internal until results were in, which made the eventual cost report land well but left the client uninformed during the bumpy middle. They concluded that sharing the approach earlier would have built confidence and set realistic expectations about the diagnostic phase. The robustness reporting habit this points toward is covered in Building a Repeatable Workflow for Prompt Sensitivity and Robustness Testing.
Frequently Asked Questions
Why did the team route across three models instead of picking one?
Because different documents had different difficulty, and no single model was best for all of them at the right cost. Routing easy documents to a cheap model, ambiguous ones to a reasoning model, and the hardest to the incumbent cut average cost while protecting accuracy where it mattered.
What was the first thing they did before adding new models?
They refactored the existing prompt to separate the extraction intent from the model-specific scaffolding. This made the prompt adaptable rather than welded to its original model, and it turned later adaptation into editing a thin layer instead of rebuilding the prompt for each architecture.
Why did the cheap model fail on the first attempt?
Run with the original scaffolding, it dropped fields on ambiguous documents and sometimes wrapped output in prose, scoring below the incumbent. The naive port ignored the model's different defaults. An explicit output contract and routing it to only low-ambiguity documents recovered its accuracy.
What change made the reasoning model succeed?
Removing the step-by-step cue carried over from the original prompt. The reasoning model already reasons internally, so the explicit cue hurt it. With a clean problem statement it matched the incumbent's accuracy on hard documents at lower latency.
What were the measurable outcomes?
A meaningful reduction in average per-document cost and lower average latency, with accuracy on the frozen test set holding even with the incumbent across the full document mix. The client's mandate to cut cost and latency without losing accuracy was met.
What was the most durable benefit of the migration?
The frozen test set of two hundred verified contracts. Beyond validating this migration, it became permanent infrastructure that de-risks every future model change, making it straightforward to evaluate a fourth model. The lasting asset was as much the test set as the cost savings.
Key Takeaways
- A client mandate to cut cost and latency without losing accuracy drove the cross-model migration.
- Refactoring to separate intent from scaffolding before touching new models made the rest tractable.
- A naive port to a cheap model failed; an explicit contract plus difficulty-based routing recovered accuracy.
- Removing a leftover step-by-step cue let the reasoning model match the incumbent on hard documents.
- The frozen test set of verified contracts became permanent infrastructure that de-risks future model changes.
- In hindsight they would build the test set first, expect reasoning models to want less, and brief the client earlier.