AGENCYSCRIPT
CoursesEnterpriseBlog
đź‘‘FoundersSign inJoin Waitlist
AGENCYSCRIPT

Governed Certification Framework

The operating system for AI-enabled agency building. Certify judgment under constraint. Standards over scale. Governance over shortcuts.

Stay informed

Governance updates, certification insights, and industry standards.

Products

  • Platform
  • Certification
  • Launch Program
  • Vault
  • The Book

Certification

  • Foundation (AS-F)
  • Operator (AS-O)
  • Architect (AS-A)
  • Principal (AS-P)

Resources

  • Blog
  • Verify Credential
  • Enterprise
  • Partners
  • Pricing

Company

  • About
  • Contact
  • Careers
  • Press
© 2026 Agency Script, Inc.·
Privacy PolicyTerms of ServiceCertification AgreementSecurity

Standards over scale. Judgment over volume. Governance over shortcuts.

On This Page

The SituationA Single Expensive ModelThe ConstraintWhy a Single Model Was a LiabilityThe DecisionRoute Across Three ModelsSeparate Intent From Implementation FirstThe ExecutionBuilding the Frozen Test SetFirst Contact With the Cheap ModelThe RecoveryValidating the Reasoning ModelThe OutcomeMeasurable ResultsA More Resilient SystemThe LessonsRefactor Before You MigrateRouting Beats One-Size-Fits-AllThe Test Set Was the Real AssetWhat They Would Do DifferentlyBuild the Test Set FirstBudget Time for the Reasoning Model's QuirksCommunicate the Plan to the Client EarlyFrequently Asked QuestionsWhy did the team route across three models instead of picking one?What was the first thing they did before adding new models?Why did the cheap model fail on the first attempt?What change made the reasoning model succeed?What were the measurable outcomes?What was the most durable benefit of the migration?Key Takeaways
Home/Blog/One Team Migrated a Prompt Across Three Models
General

One Team Migrated a Prompt Across Three Models

A

Agency Script Editorial

Editorial Team

·January 7, 2020·8 min read
prompting across different model architecturesprompting across different model architectures case studyprompting across different model architectures guideprompt engineering

This is the story of a single prompt and the journey it took from working on one model to working reliably across three. The names and numbers are illustrative, but the arc is faithful to how this kind of migration actually unfolds: a clear business reason, a confident start, an unpleasant surprise, a disciplined recovery, and a measurable improvement that justified the effort.

The team was a mid-sized agency running a document-extraction feature for a client. The feature pulled structured data from contracts, and it ran on a single capable but expensive model. The mandate from the client was simple: cut cost and improve latency without losing accuracy. That mandate set the whole migration in motion.

What follows is the situation they faced, the decision they made, how they executed it, what it produced, and the lessons that generalize. If you want the conceptual map behind their moves, The Complete Guide to Prompting Across Different Model Architectures covers it.

The Situation

A Single Expensive Model

The extraction feature lived on one large model. It was accurate but slow and costly at the client's volume. As usage grew, the per-document cost became the line item the client wanted reduced, and latency complaints from end users were mounting.

The Constraint

Accuracy could not drop. The extracted fields fed downstream automation, so an error propagated. Any cheaper or faster model had to match the incumbent's accuracy on the team's real documents, not on a benchmark.

Why a Single Model Was a Liability

Beyond cost, depending on one model meant one point of failure and zero negotiating leverage. The team recognized that a prompt able to run across architectures was strategically safer, an insight reinforced by the brittleness risks in Stress-Testing Prompts Before They Reach a Client.

The Decision

Route Across Three Models

The team decided to support three models: keep the expensive one for the hardest documents, add a cheaper fast model for routine ones, and evaluate a reasoning model for the ambiguous middle. Routing by document difficulty would cut average cost while protecting accuracy where it mattered.

Separate Intent From Implementation First

Before touching any new model, they refactored the existing prompt to separate the extraction intent from the model-specific scaffolding. This made the prompt adaptable instead of welded to its original model, the foundational move from Cross-Model Prompting Principles Worth Defending.

  • They wrote the field-extraction intent in model-neutral language
  • They isolated format and reasoning scaffolding into a swappable layer
  • This refactor alone took a day and paid off repeatedly

The Execution

Building the Frozen Test Set

They assembled two hundred real contracts with hand-verified correct extractions as a frozen test set. Every model would face the identical set, making the comparison fair. This set became the spine of the whole migration.

First Contact With the Cheap Model

The cheap fast model, run with the original scaffolding, produced a nasty surprise: it dropped fields on ambiguous documents and occasionally wrapped output in explanatory prose. Accuracy on the frozen set came in well below the incumbent. The naive port had failed.

The Recovery

Rather than abandon the cheap model, the team diagnosed the specific gaps. They added an explicit output contract with null-on-missing, which stopped the dropped fields and the prose. They restricted the cheap model to the routine documents the routing layer identified as low-ambiguity. On that subset, its accuracy matched the incumbent.

  • An explicit contract closed the formatting and dropped-field gaps
  • Routing kept the cheap model on documents it handled well
  • The reasoning model, tested on ambiguous documents, beat the cheap model there

Validating the Reasoning Model

For the ambiguous middle, the reasoning model shone, but only after the team removed the step-by-step cue carried over from the original prompt. With a clean problem statement, it matched the incumbent's accuracy on hard documents at lower latency. The over-instruction trap, caught and removed.

The Outcome

Measurable Results

After routing was live, the team measured a meaningful drop in average per-document cost and a reduction in average latency, while accuracy on the frozen test set held even with the incumbent across the full document mix. The client mandate was met without an accuracy regression.

A More Resilient System

Beyond the numbers, the team now had a prompt that ran across three architectures and a frozen test set that made adding a fourth straightforward. The single-model liability was gone. The recurring-validation habit they adopted is the discipline detailed in Building a Repeatable Workflow for Prompt Sensitivity and Robustness Testing.

The Lessons

Refactor Before You Migrate

Separating intent from implementation before touching new models was what made the rest tractable. Teams that skip this step rewrite the whole prompt per model and burn out before finishing.

Routing Beats One-Size-Fits-All

The biggest win came from sending each document to the model that handled it best, not from finding one model for everything. Architecture-aware routing turned the migration from a compromise into an improvement.

The Test Set Was the Real Asset

The two hundred verified contracts did more than validate the migration; they became permanent infrastructure that de-risks every future model change. The migration's lasting value was as much the test set as the cost savings.

What They Would Do Differently

Build the Test Set First

In hindsight, the team wished they had assembled the frozen test set before deciding anything else. They built it partway through, after the cheap model's first failure already cost them a day of confused debugging. With the set in hand from the start, that failure would have been a clear, measured result rather than a surprise.

Budget Time for the Reasoning Model's Quirks

The reasoning model's need for a cleaner, sparser prompt caught them off guard, because every instinct from years of chat-model work pushed toward adding instructions, not removing them. They underestimated how much unlearning the reasoning family required. Next time they would test reasoning models with a deliberately minimal prompt first.

  • Assemble verified test data before evaluating any model
  • Expect reasoning models to want less instruction, not more
  • Treat the first naive port as a diagnostic step, not a setback

Communicate the Plan to the Client Early

The team kept the migration internal until results were in, which made the eventual cost report land well but left the client uninformed during the bumpy middle. They concluded that sharing the approach earlier would have built confidence and set realistic expectations about the diagnostic phase. The robustness reporting habit this points toward is covered in Building a Repeatable Workflow for Prompt Sensitivity and Robustness Testing.

Frequently Asked Questions

Why did the team route across three models instead of picking one?

Because different documents had different difficulty, and no single model was best for all of them at the right cost. Routing easy documents to a cheap model, ambiguous ones to a reasoning model, and the hardest to the incumbent cut average cost while protecting accuracy where it mattered.

What was the first thing they did before adding new models?

They refactored the existing prompt to separate the extraction intent from the model-specific scaffolding. This made the prompt adaptable rather than welded to its original model, and it turned later adaptation into editing a thin layer instead of rebuilding the prompt for each architecture.

Why did the cheap model fail on the first attempt?

Run with the original scaffolding, it dropped fields on ambiguous documents and sometimes wrapped output in prose, scoring below the incumbent. The naive port ignored the model's different defaults. An explicit output contract and routing it to only low-ambiguity documents recovered its accuracy.

What change made the reasoning model succeed?

Removing the step-by-step cue carried over from the original prompt. The reasoning model already reasons internally, so the explicit cue hurt it. With a clean problem statement it matched the incumbent's accuracy on hard documents at lower latency.

What were the measurable outcomes?

A meaningful reduction in average per-document cost and lower average latency, with accuracy on the frozen test set holding even with the incumbent across the full document mix. The client's mandate to cut cost and latency without losing accuracy was met.

What was the most durable benefit of the migration?

The frozen test set of two hundred verified contracts. Beyond validating this migration, it became permanent infrastructure that de-risks every future model change, making it straightforward to evaluate a fourth model. The lasting asset was as much the test set as the cost savings.

Key Takeaways

  • A client mandate to cut cost and latency without losing accuracy drove the cross-model migration.
  • Refactoring to separate intent from scaffolding before touching new models made the rest tractable.
  • A naive port to a cheap model failed; an explicit contract plus difficulty-based routing recovered accuracy.
  • Removing a leftover step-by-step cue let the reasoning model match the incumbent on hard documents.
  • The frozen test set of verified contracts became permanent infrastructure that de-risks future model changes.
  • In hindsight they would build the test set first, expect reasoning models to want less, and brief the client earlier.

Search Articles

Categories

OperationsSalesDeliveryGovernance

Popular Tags

prompt engineeringai fundamentalsai toolsthe difference between AIMLagency operationsagency growthenterprise sales

Share Article

A

Agency Script Editorial

Editorial Team

The Agency Script editorial team delivers operational insights on AI delivery, certification, and governance for modern agency operators.

Related Articles

General

Prompt Quality Decides Whether AI Earns Its Keep

Prompt quality is the single biggest variable in whether AI delivers real work or expensive noise. The model matters, the platform matters — but the prompt you write determines whether you get a first

A
Agency Script Editorial
June 1, 2026·10 min read
General

Counting the Real Cost of Every Token You Send

Tokens and context windows sit at the intersection of AI capability and operational cost—yet most business cases treat them as technical footnotes. That's a mistake that costs real money. Every time y

A
Agency Script Editorial
June 1, 2026·10 min read
General

Rolling Out AI Hallucinations Across a Team

Most teams discover AI hallucinations the hard way — a confident-sounding wrong answer makes it into a client deliverable, a legal brief, or a published report. The damage isn't just to the output; it

A
Agency Script Editorial
June 1, 2026·11 min read

Ready to certify your AI capability?

Join the professionals building governed, repeatable AI delivery systems.

Explore Certification