Case Study: Open vs Closed Source AI Models in Practice

The cleanest way to understand the open-versus-closed decision is to watch a team live through it. This is a composite case study built from patterns we see repeatedly: a SaaS company that started entirely on a closed API, grew into a cost problem, and worked out which parts of their workload belonged on open models and which did not.

The details are illustrative rather than from a single named company, but the arc, the decisions, and the trade-offs are real. The point is the reasoning at each stage and how the team avoided the traps that sink less disciplined groups.

The Situation

The company sells a content operations platform. Their flagship AI feature does three things: it classifies incoming documents, it extracts structured fields, and it drafts summaries for human review. At launch, all three ran through a single closed frontier API.

For the first year this was the right call. The team was small, volume was modest, and the closed API let them ship the feature in weeks rather than building infrastructure. They had no reason to consider open models, and considering them would have been premature optimization.

Where They Started

All three tasks on one closed frontier model.
Per-token API billing, growing with usage.
No model abstraction; the provider's SDK was called throughout the codebase.

The Trigger

Eighteen months in, two things changed at once. Usage scaled roughly tenfold as larger customers onboarded, and the monthly API bill became one of the company's largest variable costs. At the same time, an enterprise prospect demanded that their documents never leave a controlled environment.

The flagship feature was now both a cost problem and a deal blocker. The team could no longer treat the model decision as settled. They needed to revisit it with discipline, not panic.

The Decision Process

Rather than declaring "we are going open" or negotiating harder with their provider, the team characterized each of their three tasks separately. This was the pivotal move. Treating the workload as one monolith would have led to a worse answer.

What They Found

Classification: High volume, routine, low difficulty. A strong candidate for a cheap open model.
Extraction: Very high volume, well-defined, latency-sensitive. The biggest cost driver and an ideal self-hosting target.
Summarization: Lower volume, higher difficulty, quality-critical for the human reviewers. Best kept on the frontier model.

They built an eval set of about 80 real examples per task and ran a bake-off, scoring on cost per successful task rather than per token, exactly as our step-by-step approach recommends.

The Execution

First, the team did the unglamorous work they had skipped at launch: they abstracted every model call behind a thin internal interface. This took a couple of weeks and made everything that followed a contained change rather than codebase-wide surgery.

Then they migrated classification and extraction to a fine-tuned open model on managed GPU hosting—getting open-weight economics without owning raw infrastructure. Summarization stayed on the closed frontier model. For the enterprise prospect with residency requirements, they offered a fully self-hosted open-model deployment as a premium tier.

The Sequence That Worked

Abstract the model interface first.
Migrate the highest-volume, lowest-difficulty tasks to open.
Keep the hardest, quality-critical task on the closed frontier model.
Offer self-hosted open weights as a compliance tier for regulated customers.

The Outcome

The per-task cost on the two migrated workloads dropped substantially, because extraction and classification at their volume were far cheaper on owned open models than on per-token API pricing. Summarization quality stayed high because they did not force it onto a cheaper model that failed their reviewers' bar.

The residency-bound enterprise deal closed, because the self-hosted option gave an architectural guarantee no contractual promise could match. Critically, the thin interface meant that when a better open model shipped months later, swapping it in was a one-day change validated against their existing eval set.

The Lessons

The team's biggest insight was that "open versus closed" was the wrong framing. The right framing was "which model for which task." Decomposing the workload turned an unwinnable all-or-nothing debate into three clear, evidence-based decisions.

Their second lesson was that the infrastructure work they skipped at launch—the model abstraction—was correctly skipped. Doing it earlier would have been premature. Doing it exactly when they hit the migration was right on time. For the broader playbook behind these moves, see our best practices article, and to avoid the pitfalls they sidestepped, the common mistakes guide.

What Could Have Gone Wrong

It is worth naming the failure paths the team avoided, because each was a plausible alternate history. Had they reacted to the cost spike by declaring "we are going open" across the board, they would have forced summarization onto a cheaper model and degraded the quality their human reviewers depended on. The downstream cost of unreliable summaries—reviewer rework and customer complaints—would likely have exceeded the API savings.

Had they instead tried to self-host all three tasks on raw GPUs to maximize savings, they would have taken on operational burden their team was not staffed to carry. The likely result is the classic trap: senior engineers pulled into inference firefighting, reliability incidents, and "savings" consumed by salary. Choosing managed open hosting for the migrated tasks let them capture most of the economic benefit without that burden.

The third averted mistake was subtler. Without the eval set, they would have had no objective way to confirm that the open models actually matched the closed model on classification and extraction. They might have migrated on faith, shipped a quality regression, and not noticed until customers did. The eval set turned a risky leap into a measured, reversible step.

How They Operate Now

A year past the migration, the team's posture is a stable routed portfolio. New AI features start on the closed frontier model for speed, exactly as the flagship did. As each feature matures and its volume and requirements clarify, they run it through the same task decomposition and bake-off, and route its components to wherever the evidence points.

The thin interface and the living eval set have become permanent infrastructure rather than one-time projects. Model migrations that once felt threatening are now routine maintenance scheduled into ordinary engineering cycles. The original all-or-nothing debate never recurred, because the team replaced it with a repeatable process.

Frequently Asked Questions

Why didn't the team go open from the start?

Because at launch their volume was modest and their team was small. Building self-hosting infrastructure before proving the feature would have been wasted effort. Starting closed let them ship fast and validate value before paying any infrastructure tax.

What was the single most important decision?

Decomposing the workload into three separate tasks instead of treating it as one. That move converted an ideological all-or-nothing debate into three evidence-based decisions, each resolved by measuring cost per successful task against a real eval set.

Why keep summarization on the closed model?

Because it was lower volume, higher difficulty, and quality-critical for human reviewers. The cost savings from moving it to open would have been small, and the quality risk was real. The decision was driven by that specific task's properties, not a blanket policy.

How did the model abstraction pay off later?

When a better open model shipped, swapping it in was a single-day change validated against the existing eval set, rather than a codebase-wide rewrite. The abstraction turned future model migrations from crises into routine maintenance.

Key Takeaways

Starting fully closed was correct while volume was low and the team was small.
The cost and a residency-bound deal jointly triggered a disciplined revisit.
Decomposing the workload into separate tasks was the decision that unlocked everything.
Migrating high-volume routine tasks to open while keeping hard tasks closed optimized both cost and quality.
Abstracting the model interface turned later migrations into one-day, eval-validated changes.

The Situation

Where They Started

All three tasks on one closed frontier model.
Per-token API billing, growing with usage.
No model abstraction; the provider's SDK was called throughout the codebase.

The Trigger

The flagship feature was now both a cost problem and a deal blocker. The team could no longer treat the model decision as settled. They needed to revisit it with discipline, not panic.

The Decision Process

What They Found

Classification: High volume, routine, low difficulty. A strong candidate for a cheap open model.
Extraction: Very high volume, well-defined, latency-sensitive. The biggest cost driver and an ideal self-hosting target.
Summarization: Lower volume, higher difficulty, quality-critical for the human reviewers. Best kept on the frontier model.

They built an eval set of about 80 real examples per task and ran a bake-off, scoring on cost per successful task rather than per token, exactly as our step-by-step approach recommends.

The Execution

The Sequence That Worked

Abstract the model interface first.
Migrate the highest-volume, lowest-difficulty tasks to open.
Keep the hardest, quality-critical task on the closed frontier model.
Offer self-hosted open weights as a compliance tier for regulated customers.

The Outcome

The Lessons

What Could Have Gone Wrong

How They Operate Now

Frequently Asked Questions

Why didn't the team go open from the start?

What was the single most important decision?

Why keep summarization on the closed model?

How did the model abstraction pay off later?

Key Takeaways

Starting fully closed was correct while volume was low and the team was small.
The cost and a residency-bound deal jointly triggered a disciplined revisit.
Decomposing the workload into separate tasks was the decision that unlocked everything.
Migrating high-volume routine tasks to open while keeping hard tasks closed optimized both cost and quality.
Abstracting the model interface turned later migrations into one-day, eval-validated changes.

Case Study: Open vs Closed Source AI Models in Practice

The Situation

Where They Started

The Trigger

The Decision Process

What They Found

The Execution

The Sequence That Worked

The Outcome

The Lessons

What Could Have Gone Wrong

How They Operate Now

Frequently Asked Questions

Why didn't the team go open from the start?

What was the single most important decision?

Why keep summarization on the closed model?

How did the model abstraction pay off later?

Key Takeaways

Agency Script Editorial

Related Articles

Prompt Quality Decides Whether AI Earns Its Keep

Counting the Real Cost of Every Token You Send

Rolling Out AI Hallucinations Across a Team

Ready to certify your AI capability?

Case Study: Open vs Closed Source AI Models in Practice

The Situation

Where They Started

The Trigger

The Decision Process

What They Found

The Execution

The Sequence That Worked

The Outcome

The Lessons

What Could Have Gone Wrong

How They Operate Now

Frequently Asked Questions

Why didn't the team go open from the start?

What was the single most important decision?

Why keep summarization on the closed model?

How did the model abstraction pay off later?

Key Takeaways

Agency Script Editorial

Related Articles

Prompt Quality Decides Whether AI Earns Its Keep

Counting the Real Cost of Every Token You Send

Rolling Out AI Hallucinations Across a Team

Ready to certify your AI capability?