One Team's Path Through an AI Stack Decision

A case study is more useful than a checklist because it shows the decisions in motion, including the wrong turns. Real stack choices are not made in the clean order a guide implies; they are made under pressure, with incomplete information, and revised when reality pushes back. Watching a team work through that mess teaches things a tidy framework cannot.

This is the story of one team, a mid-sized operations group, choosing an AI tech stack for a document-processing tool. The team and specifics are an illustrative composite rather than a named account, but the arc is the real shape these projects take: a stalled situation, a forced decision, a rocky execution, a measurable result, and lessons that only became clear in hindsight. Follow the arc and the abstract advice from other articles becomes concrete.

The Situation: Stalled and Drowning in Documents

The operations team processed thousands of supplier documents by hand each month, extracting a handful of fields from each. It was slow, error-prone, and the volume was growing faster than the team. Someone proposed an AI tool, and the project promptly stalled because nobody knew how to choose the stack.

Why it stalled

Every team member had a different favorite model from something they had read.
Nobody had written down what the tool actually needed to do.
The conversation kept circling tools instead of the problem.

The stall was not a tooling problem. It was the absence of a starting point, the exact gap that a clear problem statement closes.

The Decision: Writing the Problem Down

The breakthrough came from refusing to discuss tools for one meeting and instead writing the problem in a single paragraph. The task was field extraction from semi-structured documents. The accuracy bar was high because errors flowed into payments. The volume was a few thousand documents a day. The budget was modest.

How the paragraph decided things

Once written, the paragraph did most of the selecting. The accuracy bar and moderate volume pointed to a capable hosted model rather than the cheapest option or a self-hosted build. The need to extract from the documents' own content pointed to including their text directly rather than relying on general knowledge. The discipline mirrored Step by Step Through an AI Tech Stack Decision.

The Execution: Rockier Than Planned

The first build was a hosted model, the document text fed in directly, and simple code as glue. It worked on clean documents and fell apart on messy ones, which were the majority. The team's instinct was to switch to a more powerful model.

The wrong turn and the recovery

Switching models barely helped, because the problem was not model capability. It was that messy documents needed better input handling before the model ever saw them. The team added a preprocessing step to normalize the document text, and accuracy jumped. The lesson, that input quality often beats model capability, echoes the data-layer priority in Everything That Goes Into an AI Tech Stack Decision.

Why the model reflex is so common

The team's first instinct to reach for a bigger model is worth examining, because almost everyone has it. When an AI system underperforms, the model is the most visible component and the easiest to swap, so it becomes the default suspect. But the model is rarely where the problem lives. More often the issue is upstream, in what the model is being fed, or downstream, in how the output is being checked. The team here lost time chasing the visible suspect before examining the actual culprit. The habit worth building is to ask, before touching the model, whether the input was clean and the question well-posed, because fixing those is usually cheaper and more effective than upgrading the engine.

Adding the Missing Piece: Observability

The early version had no way to see why a given extraction failed. When an error reached a payment, the team could not trace it. They added logging of each document, the prompt, and the extracted result, plus a daily sample review.

What observability changed

Failures became diagnosable from evidence instead of guesswork.
The sample review caught a category of silent errors the team had not known existed.
Trust in the tool grew because problems were visible and fixable.

The team later said skipping observability at the start was their biggest regret, because the early blind period cost them weeks of confusion.

How the blind period actually hurt

It is worth being concrete about what the missing observability cost. During the early weeks, a supplier complained that a payment was wrong. The team wanted to know which document produced the bad extraction and why, but they had logged nothing, so they could not reconstruct what the system had seen. They ended up re-running documents by hand to reproduce the error, a process that took days and shook confidence in the tool just as it was gaining adoption. Once logging existed, the same class of complaint became a five-minute lookup: pull the document, see the prompt and the extracted result, and identify the cause. The contrast between days of guesswork and a five-minute lookup is the entire argument for building observability before you need it.

The Outcome: Measurable and Real

After the preprocessing fix and observability addition, the tool processed the daily volume with an error rate below the manual baseline, and the team that had been drowning was reassigned to higher-value work. The cost per document, tracked from soon after launch, stayed within the modest budget.

The numbers that mattered

Processing time per document dropped dramatically against the manual baseline.
The error rate fell below what humans had achieved by hand.
Cost per document stayed within budget at full volume.

These were not hypothetical projections. They were measured, which was only possible because observability had finally been built in. A claim of success you can put a number behind is worth far more than a vague sense that the tool is helping, both for the team's own confidence and for justifying the project to the people who funded it.

The Lessons, in Hindsight

The team distilled their experience into a few hard-won lessons. Write the problem before discussing tools. When output is bad, suspect the input before the model. Build observability from the start, not after the first painful incident. And let measured results, not intuition, tell you whether the stack is working.

What they would do differently

Start with observability rather than adding it under duress.
Resist the reflex to fix quality by upgrading the model.
Trust the written problem statement to settle tool debates faster.

The arc from stalled to shipped came down to discipline, not cleverness. The tools were never the hard part. The hard part was the order of decisions and the willingness to suspect the right layer when things broke.

Frequently Asked Questions

Why did writing the problem down break the stall?

Because the stall came from debating tools with no shared target. A written problem statement gave the team a fixed specification to measure options against, which settled debates that intuition kept reopening.

Why did switching to a more powerful model not help?

Because the failures came from messy input the model never had a fair chance with, not from a lack of model capability. Fixing the input through preprocessing solved what a bigger model could not.

What did observability actually change?

It turned untraceable failures into diagnosable ones and surfaced a category of silent errors the team had not known existed. It also made the measurable outcomes possible, since you cannot measure what you do not log.

Was the outcome really measurable?

Yes, because logging and cost tracking were in place. Processing time, error rate against the manual baseline, and cost per document were all measured rather than estimated.

Is this a real company?

It is an illustrative composite drawn from common patterns, not a named account. The arc and the lessons reflect what these projects genuinely look like.

What was the team's single biggest regret?

Skipping observability at the start. The early blind period cost them weeks of confusion that thorough logging would have prevented, which is why they would build it in first next time.

Key Takeaways

The project stalled on the absence of a problem statement, not on a tooling gap.
Writing the problem in one paragraph let it settle tool debates and select the stack.
When output was poor, the fix was better input handling, not a more powerful model.
Skipping observability created a blind period the team called their biggest regret.
Measurable outcomes in time, error rate, and cost were only possible because logging existed.
The decisive factor was the order of decisions and suspecting the right layer, not cleverness.

The Situation: Stalled and Drowning in Documents

Why it stalled

Every team member had a different favorite model from something they had read.
Nobody had written down what the tool actually needed to do.
The conversation kept circling tools instead of the problem.

The stall was not a tooling problem. It was the absence of a starting point, the exact gap that a clear problem statement closes.

The Decision: Writing the Problem Down

How the paragraph decided things

The Execution: Rockier Than Planned

The wrong turn and the recovery

Why the model reflex is so common

Adding the Missing Piece: Observability

What observability changed

Failures became diagnosable from evidence instead of guesswork.
The sample review caught a category of silent errors the team had not known existed.
Trust in the tool grew because problems were visible and fixable.

The team later said skipping observability at the start was their biggest regret, because the early blind period cost them weeks of confusion.

How the blind period actually hurt

The Outcome: Measurable and Real

The numbers that mattered

Processing time per document dropped dramatically against the manual baseline.
The error rate fell below what humans had achieved by hand.
Cost per document stayed within budget at full volume.

The Lessons, in Hindsight

What they would do differently

Start with observability rather than adding it under duress.
Resist the reflex to fix quality by upgrading the model.
Trust the written problem statement to settle tool debates faster.

Frequently Asked Questions

Why did writing the problem down break the stall?

Why did switching to a more powerful model not help?

Because the failures came from messy input the model never had a fair chance with, not from a lack of model capability. Fixing the input through preprocessing solved what a bigger model could not.

What did observability actually change?

Was the outcome really measurable?

Yes, because logging and cost tracking were in place. Processing time, error rate against the manual baseline, and cost per document were all measured rather than estimated.

Is this a real company?

It is an illustrative composite drawn from common patterns, not a named account. The arc and the lessons reflect what these projects genuinely look like.

What was the team's single biggest regret?

Skipping observability at the start. The early blind period cost them weeks of confusion that thorough logging would have prevented, which is why they would build it in first next time.

Key Takeaways

The project stalled on the absence of a problem statement, not on a tooling gap.
Writing the problem in one paragraph let it settle tool debates and select the stack.
When output was poor, the fix was better input handling, not a more powerful model.
Skipping observability created a blind period the team called their biggest regret.
Measurable outcomes in time, error rate, and cost were only possible because logging existed.
The decisive factor was the order of decisions and suspecting the right layer, not cleverness.

One Team's Path Through an AI Stack Decision

The Situation: Stalled and Drowning in Documents

Why it stalled

The Decision: Writing the Problem Down

How the paragraph decided things

The Execution: Rockier Than Planned

The wrong turn and the recovery

Why the model reflex is so common

Adding the Missing Piece: Observability

What observability changed

How the blind period actually hurt

The Outcome: Measurable and Real

The numbers that mattered

The Lessons, in Hindsight

What they would do differently

Frequently Asked Questions

Why did writing the problem down break the stall?

Why did switching to a more powerful model not help?

What did observability actually change?

Was the outcome really measurable?

Is this a real company?

What was the team's single biggest regret?

Key Takeaways

Agency Script Editorial

Related Articles

Prompt Quality Decides Whether AI Earns Its Keep

Counting the Real Cost of Every Token You Send

Rolling Out AI Hallucinations Across a Team

Ready to certify your AI capability?

One Team's Path Through an AI Stack Decision

The Situation: Stalled and Drowning in Documents

Why it stalled

The Decision: Writing the Problem Down

How the paragraph decided things

The Execution: Rockier Than Planned

The wrong turn and the recovery

Why the model reflex is so common

Adding the Missing Piece: Observability

What observability changed

How the blind period actually hurt

The Outcome: Measurable and Real

The numbers that mattered

The Lessons, in Hindsight

What they would do differently

Frequently Asked Questions

Why did writing the problem down break the stall?

Why did switching to a more powerful model not help?

What did observability actually change?

Was the outcome really measurable?

Is this a real company?

What was the team's single biggest regret?

Key Takeaways

Agency Script Editorial

Related Articles

Prompt Quality Decides Whether AI Earns Its Keep

Counting the Real Cost of Every Token You Send

Rolling Out AI Hallucinations Across a Team

Ready to certify your AI capability?