How One Team Rebuilt a Failing AI Assistant

The fastest way to understand context engineering is to watch it rescue something that was failing. This case study follows a single AI feature—an internal support assistant—from the moment it was nearly scrapped through the redesign that made it dependable. The organization is composite and the details illustrative, but every decision reflects patterns that recur across real teams.

The arc is deliberately complete: the situation as it stood, the diagnosis that reframed the problem, the decisions that followed, the execution, and the measurable outcome. The point is not the specific feature but the method, which transfers to almost any context-driven AI system you might build.

What makes this worth reading is that the team's first three instincts were all wrong, and each wrong instinct is a common one. Watching them get redirected toward the actual cause is the lesson.

The Situation

A company had built an assistant to help its support staff answer customer questions by drawing on internal documentation. In testing it impressed everyone. In production it embarrassed them.

The Symptoms

Staff reported confident answers that cited outdated policies, occasional invented procedures, and a tendency to lose track during longer exchanges. Trust eroded quickly, and several agents stopped using it entirely.

The First Instincts

The team's first three proposals were to switch to a larger model, rewrite the system prompt, and lower the temperature setting. All plausible, all aimed at the model rather than the context. Each proposal also carried a cost: a larger model meant higher spend on every request, a prompt rewrite meant days of trial and error, and a temperature change risked making correct answers less varied without addressing why wrong ones appeared. None of the three had any evidence behind it beyond the shared assumption that a misbehaving AI must mean a weak model.

Why the Assumption Was Tempting

The assumption felt reasonable because the demo had worked. If the same model performed well in testing and poorly in production, the natural conclusion was that production was somehow harder and needed more horsepower. The flaw in that reasoning was invisible until someone looked: production was not harder for the model, it was harder for the context pipeline, which behaved differently against real data than against the curated test inputs.

The Diagnosis

Before spending on a bigger model, one engineer insisted on a simple step: read the exact context the assistant received for ten real failures.

What the Context Revealed

Retrieval was returning superseded policy documents alongside current ones
The rule to answer only from current policy sat buried mid-context
Long conversations had pushed the system instructions out of the window entirely

None of the failures were model intelligence problems. Every one was a context problem. This reframing is the heart of the discipline described in Master Context Engineering Without Guesswork.

The Decisions

With the real causes visible, the team set three priorities in order of impact.

Fix Retrieval First

Because retrieval set the ceiling on every answer, it came first. The team tagged documents with effective dates and filtered retrieval to current versions only. This is precisely the reasoning argued in Context Engineering Habits That Hold Up in Production.

Reposition Critical Rules

The answer-only-from-current-policy rule moved to the start of the system block and was restated immediately before each question, occupying high-attention positions.

Manage Conversation History

Verbatim history was replaced with a running summary that preserved the customer's account, intent, and prior resolutions while dropping filler, so the system instructions never fell out again.

The Execution

The team resisted doing everything at once. They built a regression set of the original ten failures plus fifteen typical and adversarial cases, then applied changes one at a time.

Measuring Each Change

After the retrieval fix, stale-policy errors in the regression set disappeared. After repositioning, the model stopped answering from outside the provided documents. After history management, the long-conversation failures resolved. Each change was verified before the next began, so no fix masked a new regression. The step-by-step method they followed mirrors Build Reliable Context One Step at a Time.

What They Did Not Change

Notably, they never switched the model, rewrote the prompt wholesale, or touched temperature. The original first instincts would have cost money and changed nothing, because none addressed the actual causes. This is the quiet result that makes the case worth telling: the three changes the team almost made would have looked like progress, consumed real budget, and left every failure mode in place. Restraint—doing only what the evidence justified—was as important as the fixes themselves.

The Outcome

The regression set, which had started with the assistant failing most cases, ended with it passing all of them. More importantly, the failure modes that had eroded trust—stale citations, invented procedures, lost context—stopped appearing in daily use.

The Lasting Change

The team kept the regression set as a living artifact. Every new failure reported by staff was reproduced, traced to context, fixed, and added to the set, so the same problem could not silently return. The assistant moved from abandoned to relied upon without a single model upgrade.

The Transferable Lesson

The turnaround came entirely from controlling what the model could see. That is the whole of context engineering, and it generalizes far beyond support assistants. The specific traps they hit are catalogued in 7 Common Mistakes with Context Engineering.

What the Team Would Do Differently

Looking back, the team identified choices that would have saved weeks if made at the start.

Inspect Context Before Spending

Their first reflex was to budget for a larger model. Had they read the failing contexts on day one, they would have seen immediately that capability was never the issue. The cheapest diagnostic step—reading what the model actually received—was the one they nearly skipped, and it was the one that mattered.

Build the Regression Set Earlier

The regression set proved so valuable in the redesign that the team wished it had existed from the first release. With it, the original failures would have been caught before reaching staff, and the erosion of trust that nearly killed the project would never have started.

Order Fixes by Leverage

Fixing retrieval first, then positioning, then history was not arbitrary—it followed the order in which each layer constrained the others. Tackling the highest-leverage layer first meant later fixes built on solid ground instead of compensating for an unstable one. Recognizing that ordering up front would have made the redesign faster still.

Frequently Asked Questions

Why did inspecting the context matter so much?

It replaced speculation with evidence. The team's first instincts blamed the model, but reading the actual context for failing cases showed the model was reasoning correctly over flawed inputs. Inspection redirected effort from expensive, ineffective changes to the cheap, effective ones that addressed real causes.

Why fix retrieval before anything else?

Retrieval sets the ceiling on answer quality. If the model is grounding on the wrong documents, improving instructions, positioning, or the model itself cannot help, because the underlying evidence is wrong. Fixing the highest-leverage layer first prevents wasted effort on layers that depend on it.

Would a bigger model have helped at all?

Not with these failures. Every problem was a context gap—stale retrieval, a misplaced rule, lost history. A larger model would have reasoned over the same flawed context and produced the same kinds of errors, at higher cost. Capability was never the bottleneck.

How did the regression set change the team's process?

It turned every fix into a permanent guarantee and every new failure into a reusable test. Instead of fixing problems ad hoc and hoping they stayed fixed, the team accumulated proof that earlier wins survived later changes. The set became the backbone of ongoing reliability.

Is this method specific to support assistants?

No. The method—inspect context for failures, fix the highest-leverage layer first, verify with a regression set—applies to any AI feature whose behavior depends on assembled context. The support assistant is just a concrete vehicle for a general discipline.

Key Takeaways

The team's first instincts blamed the model; the real causes were all context
Reading the exact context for failing cases reframed the entire problem
Retrieval was fixed first because it sets the ceiling on answer quality
Repositioning a rule and summarizing history resolved the remaining failures
No model upgrade, prompt rewrite, or temperature change was needed
A living regression set turned the fix into lasting, verifiable reliability

What makes this worth reading is that the team's first three instincts were all wrong, and each wrong instinct is a common one. Watching them get redirected toward the actual cause is the lesson.

The Situation

A company had built an assistant to help its support staff answer customer questions by drawing on internal documentation. In testing it impressed everyone. In production it embarrassed them.

The Symptoms

The First Instincts

Why the Assumption Was Tempting

The Diagnosis

Before spending on a bigger model, one engineer insisted on a simple step: read the exact context the assistant received for ten real failures.

What the Context Revealed

Retrieval was returning superseded policy documents alongside current ones
The rule to answer only from current policy sat buried mid-context
Long conversations had pushed the system instructions out of the window entirely

None of the failures were model intelligence problems. Every one was a context problem. This reframing is the heart of the discipline described in Master Context Engineering Without Guesswork.

The Decisions

With the real causes visible, the team set three priorities in order of impact.

Fix Retrieval First

Reposition Critical Rules

The answer-only-from-current-policy rule moved to the start of the system block and was restated immediately before each question, occupying high-attention positions.

Manage Conversation History

Verbatim history was replaced with a running summary that preserved the customer's account, intent, and prior resolutions while dropping filler, so the system instructions never fell out again.

The Execution

The team resisted doing everything at once. They built a regression set of the original ten failures plus fifteen typical and adversarial cases, then applied changes one at a time.

Measuring Each Change

What They Did Not Change

The Outcome

The Lasting Change

The Transferable Lesson

What the Team Would Do Differently

Looking back, the team identified choices that would have saved weeks if made at the start.

Inspect Context Before Spending

Build the Regression Set Earlier

Order Fixes by Leverage

Frequently Asked Questions

Why did inspecting the context matter so much?

Why fix retrieval before anything else?

Would a bigger model have helped at all?

How did the regression set change the team's process?

Is this method specific to support assistants?

Key Takeaways

The team's first instincts blamed the model; the real causes were all context
Reading the exact context for failing cases reframed the entire problem
Retrieval was fixed first because it sets the ceiling on answer quality
Repositioning a rule and summarizing history resolved the remaining failures
No model upgrade, prompt rewrite, or temperature change was needed
A living regression set turned the fix into lasting, verifiable reliability

How One Team Rebuilt a Failing AI Assistant

The Situation

The Symptoms

The First Instincts

Why the Assumption Was Tempting

The Diagnosis

What the Context Revealed

The Decisions

Fix Retrieval First

Reposition Critical Rules

Manage Conversation History

The Execution

Measuring Each Change

What They Did Not Change

The Outcome

The Lasting Change

The Transferable Lesson

What the Team Would Do Differently

Inspect Context Before Spending

Build the Regression Set Earlier

Order Fixes by Leverage

Frequently Asked Questions

Why did inspecting the context matter so much?

Why fix retrieval before anything else?

Would a bigger model have helped at all?

How did the regression set change the team's process?

Is this method specific to support assistants?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

Case Study: Large Language Models in Practice

Thirty-Second Wins Breed False Confidence With LLMs

Ready to certify your AI capability?

How One Team Rebuilt a Failing AI Assistant

The Situation

The Symptoms

The First Instincts

Why the Assumption Was Tempting

The Diagnosis

What the Context Revealed

The Decisions

Fix Retrieval First

Reposition Critical Rules

Manage Conversation History

The Execution

Measuring Each Change

What They Did Not Change

The Outcome

The Lasting Change

The Transferable Lesson

What the Team Would Do Differently

Inspect Context Before Spending

Build the Regression Set Earlier

Order Fixes by Leverage

Frequently Asked Questions

Why did inspecting the context matter so much?

Why fix retrieval before anything else?

Would a bigger model have helped at all?

How did the regression set change the team's process?

Is this method specific to support assistants?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

Case Study: Large Language Models in Practice

Thirty-Second Wins Breed False Confidence With LLMs

Ready to certify your AI capability?