Concrete AI Stack Decisions, Walked Through Front to Back

Principles for choosing an AI tech stack only become useful when you watch them collide with real constraints. A rule like "default to hosted models" sounds clean until you meet a team whose data cannot leave their building. The interesting learning lives in specific situations, where the trade-offs are forced and the right answer is not the textbook one. Examples teach what abstractions cannot.

This article walks through several distinct scenarios. Each names the constraints up front, traces the choices made at each layer, and is honest about what worked and what did not. The teams and details are illustrative composites, not case files, but the trade-offs are the real ones you will face. Read them as patterns you can map onto your own situation rather than recipes to copy.

Scenario One: A Support Chatbot on a Tight Budget

A small company wants a chatbot that answers customer questions from their help documentation. The volume is moderate, the answers must be grounded in their docs, and the budget is genuinely tight.

The choices and why

Model: a smaller, cheaper hosted model, because the task is constrained and a frontier model would burn budget for capability the problem does not need.
Data: retrieval over the help docs, because the answers must come from their content, not the model's general knowledge.
Glue: minimal explicit code, since the flow is a simple retrieve-then-answer.

What worked was letting the constrained problem justify a cheap model. What nearly failed was an early instinct to skip retrieval, which produced confident answers that were not actually from the docs. Adding grounding fixed it. This mirrors the priority on data quality in Everything That Goes Into an AI Tech Stack Decision.

The lesson hidden in the near-failure

The instructive part is not that the cheap model worked; it is why the first version felt fine and was actually broken. Without retrieval, the model answered from general knowledge and sounded authoritative, so a casual demo looked successful. The failure only surfaced when someone asked about a policy specific to the company and got a plausible answer that was simply made up. This is the trap of judging an AI system by whether it produces fluent output rather than correct output. The team's recovery came from grounding every answer in the actual documentation and rejecting answers the docs did not support, which traded a little fluency for a lot of trustworthiness.

Scenario Two: A Regulated Firm With Data That Cannot Leave

A financial services team wants to summarize internal documents, but compliance forbids sending that data to an external API. The textbook default of hosted models is off the table.

The choices and why

Model: a self-hosted open model, because data residency is a hard, non-negotiable constraint.
Data: retrieval over internal documents, kept entirely inside their environment.
Glue: more substantial, since self-hosting brings operational responsibilities the previous scenario avoided.

What worked was recognizing early that the residency constraint overrode the convenience of hosted models, and budgeting for the operational complexity instead of being surprised by it. The cost was real, but the alternative was no project at all.

The trap this scenario avoided

The failure mode lurking here is the team that wants the convenience of a hosted model so badly that it tries to argue its way around the residency constraint, perhaps by anonymizing data or routing it through some intermediary. That path usually ends in a compliance problem that is far more expensive than the operational burden of self-hosting would have been. This team avoided it by accepting the constraint as fixed and designing within it from the start. The discipline is to treat a genuine hard constraint as a wall, not an obstacle to be negotiated, and to budget honestly for what working inside that wall costs rather than pretending the wall is softer than it is.

Scenario Three: A High-Volume Classification Task

A product team needs to classify millions of incoming items per day into categories. Accuracy matters, but so does the per-item cost, because volume is enormous.

The choices and why

Model: a small, fast hosted model, because at this volume per-request cost dominates the economics.
Data: no retrieval, since classification works from the item content alone.
Glue: explicit code with careful batching and cost tracking.

What worked was treating cost per request as a first-class design constraint from the start. An earlier prototype used a larger model and looked great in testing, but the unit economics would have been ruinous at full volume. Catching that before launch saved the project, a failure mode flagged in Seven Stack Choices That Quietly Sink AI Projects.

Scenario Four: A Prototype That Over-Built Too Early

A team building an internal research assistant started with a heavy framework, a vector database, and a complex multi-step pipeline before they had a single user.

What went wrong

The elaborate stack was hard to debug, slow to change, and solved problems the actual usage never presented. Most queries did not even need retrieval. The team spent weeks maintaining complexity that delivered no value.

The correction

They tore it back to a single model call with simple code, then added components only as real usage demanded them. The rebuilt version shipped faster and broke less. The lesson is that starting minimal and growing into complexity beats starting complex and pruning.

Why pruning is harder than growing

It is worth dwelling on why the rebuild went better than the original, because the asymmetry is counterintuitive. Adding a component when real usage demands it is easy: you know exactly what problem it solves and you can verify it helps. Removing a component you added speculatively is hard, because by the time you suspect it is unnecessary, other parts of the system have grown to depend on it, and you cannot be sure tearing it out is safe. Speculative complexity entangles itself. This is the structural reason starting minimal wins: growth is additive and verifiable, while pruning is subtractive and risky. The team that over-built paid for that asymmetry in weeks of wasted work.

Reading Across the Scenarios

Lined up together, the scenarios show a consistent pattern: the constraint that dominates is the one that should drive the stack. Budget drove the first, data residency the second, per-request cost the third, and the absence of a real constraint should have driven the fourth toward simplicity.

The transferable lesson

Name your dominant constraint before choosing anything.
Let that constraint override textbook defaults when it genuinely conflicts.
Add complexity only when a real constraint demands it, never preemptively.

The right stack is rarely the most capable one. It is the one whose choices line up with the constraint that actually matters for your situation.

Frequently Asked Questions

Why did the budget scenario use a smaller model?

Because the task was constrained enough that a smaller model handled it reliably, and a frontier model would have spent budget on capability the problem never required. The constrained problem justified the cheaper choice.

When is self-hosting actually the right call?

When a hard constraint like data residency or extreme scale economics makes hosted models unworkable. It brings real operational cost, so it should be forced by necessity rather than chosen for control you do not need.

How does per-request cost change high-volume design?

At millions of requests, a small per-item cost difference compounds into the dominant line item. It pushes you toward smaller, faster models and careful batching, even when a larger model would be marginally more accurate.

What was the core mistake in the over-built prototype?

Adding complexity before any real usage justified it. Frameworks, vector stores, and pipelines became maintenance burden for problems that never materialized. Starting minimal would have shipped faster.

Are these real companies?

They are illustrative composites built from common patterns, not specific case files. The constraints and trade-offs are real and recurring even though the particular teams are not named.

How do I find my dominant constraint?

Ask what single factor, if ignored, would kill the project: budget, data residency, latency, scale, or accuracy. The factor with the hardest limit is usually the one that should drive your stack.

Key Takeaways

A tight budget can justify a smaller model when the task is constrained enough to handle it.
A hard data-residency constraint can override the hosted-model default and force self-hosting.
At high volume, per-request cost becomes the dominant design constraint and favors small, fast models.
Over-building before real usage exists creates maintenance burden with no payoff; start minimal.
Across scenarios, the dominant constraint is what should drive the stack, not the most capable option.
Identify the single factor that would kill the project if ignored, and let it lead your decisions.

Scenario One: A Support Chatbot on a Tight Budget

A small company wants a chatbot that answers customer questions from their help documentation. The volume is moderate, the answers must be grounded in their docs, and the budget is genuinely tight.

The choices and why

Model: a smaller, cheaper hosted model, because the task is constrained and a frontier model would burn budget for capability the problem does not need.
Data: retrieval over the help docs, because the answers must come from their content, not the model's general knowledge.
Glue: minimal explicit code, since the flow is a simple retrieve-then-answer.

The lesson hidden in the near-failure

Scenario Two: A Regulated Firm With Data That Cannot Leave

A financial services team wants to summarize internal documents, but compliance forbids sending that data to an external API. The textbook default of hosted models is off the table.

The choices and why

Model: a self-hosted open model, because data residency is a hard, non-negotiable constraint.
Data: retrieval over internal documents, kept entirely inside their environment.
Glue: more substantial, since self-hosting brings operational responsibilities the previous scenario avoided.

The trap this scenario avoided

Scenario Three: A High-Volume Classification Task

A product team needs to classify millions of incoming items per day into categories. Accuracy matters, but so does the per-item cost, because volume is enormous.

The choices and why

Model: a small, fast hosted model, because at this volume per-request cost dominates the economics.
Data: no retrieval, since classification works from the item content alone.
Glue: explicit code with careful batching and cost tracking.

Scenario Four: A Prototype That Over-Built Too Early

A team building an internal research assistant started with a heavy framework, a vector database, and a complex multi-step pipeline before they had a single user.

What went wrong

The correction

Why pruning is harder than growing

Reading Across the Scenarios

The transferable lesson

Name your dominant constraint before choosing anything.
Let that constraint override textbook defaults when it genuinely conflicts.
Add complexity only when a real constraint demands it, never preemptively.

The right stack is rarely the most capable one. It is the one whose choices line up with the constraint that actually matters for your situation.

Frequently Asked Questions

Why did the budget scenario use a smaller model?

When is self-hosting actually the right call?

How does per-request cost change high-volume design?

What was the core mistake in the over-built prototype?

Are these real companies?

They are illustrative composites built from common patterns, not specific case files. The constraints and trade-offs are real and recurring even though the particular teams are not named.

How do I find my dominant constraint?

Ask what single factor, if ignored, would kill the project: budget, data residency, latency, scale, or accuracy. The factor with the hardest limit is usually the one that should drive your stack.

Key Takeaways

A tight budget can justify a smaller model when the task is constrained enough to handle it.
A hard data-residency constraint can override the hosted-model default and force self-hosting.
At high volume, per-request cost becomes the dominant design constraint and favors small, fast models.
Over-building before real usage exists creates maintenance burden with no payoff; start minimal.
Across scenarios, the dominant constraint is what should drive the stack, not the most capable option.
Identify the single factor that would kill the project if ignored, and let it lead your decisions.

Concrete AI Stack Decisions, Walked Through Front to Back

Scenario One: A Support Chatbot on a Tight Budget

The choices and why

The lesson hidden in the near-failure

Scenario Two: A Regulated Firm With Data That Cannot Leave

The choices and why

The trap this scenario avoided

Scenario Three: A High-Volume Classification Task

The choices and why

Scenario Four: A Prototype That Over-Built Too Early

What went wrong

The correction

Why pruning is harder than growing

Reading Across the Scenarios

The transferable lesson

Frequently Asked Questions

Why did the budget scenario use a smaller model?

When is self-hosting actually the right call?

How does per-request cost change high-volume design?

What was the core mistake in the over-built prototype?

Are these real companies?

How do I find my dominant constraint?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

Case Study: Large Language Models in Practice

Thirty-Second Wins Breed False Confidence With LLMs

Ready to certify your AI capability?

Concrete AI Stack Decisions, Walked Through Front to Back

Scenario One: A Support Chatbot on a Tight Budget

The choices and why

The lesson hidden in the near-failure

Scenario Two: A Regulated Firm With Data That Cannot Leave

The choices and why

The trap this scenario avoided

Scenario Three: A High-Volume Classification Task

The choices and why

Scenario Four: A Prototype That Over-Built Too Early

What went wrong

The correction

Why pruning is harder than growing

Reading Across the Scenarios

The transferable lesson

Frequently Asked Questions

Why did the budget scenario use a smaller model?

When is self-hosting actually the right call?

How does per-request cost change high-volume design?

What was the core mistake in the over-built prototype?

Are these real companies?

How do I find my dominant constraint?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

Case Study: Large Language Models in Practice

Thirty-Second Wins Breed False Confidence With LLMs

Ready to certify your AI capability?