This is the story of one system, told end to end: the situation it started in, the decision its team faced, what they actually built, the measurable result, and the lessons that generalize. The product is a composite drawn from common patterns, with illustrative numbers, but every problem and fix in it is one teams hit repeatedly. The point is to show how context-length thinking plays out as a sequence of real choices, not a checklist.
The system was an internal research assistant. Analysts asked it questions, and it answered using a library of about 2,500 internal reports, each 5,000 to 15,000 tokens, plus the ongoing conversation. It ran on a model with a 128,000-token window. On paper that window sounded generous. In practice the assistant was quietly unreliable, and nobody could say why.
The Situation: Confident, Wrong, and Unpredictable
The complaints were consistent. The assistant gave answers that sounded authoritative but cited details that were not in the reports, or missed details that clearly were. Worse, the same question asked twice could yield answers of noticeably different quality. There were no errors in the logs. The system ran clean and lied.
A first instinct, common and wrong, was to blame the model and propose upgrading to something with a larger window. The team paused to diagnose instead. They instrumented the assistant to log the exact token count and composition of every prompt it sent.
What they found explained everything:
- The system pasted as many reports as would fit into the 128,000-token window per query, often a dozen documents.
- Conversation history was appended unbounded and never summarized.
- There was no relevance ranking; reports were included in whatever order the database returned them.
The window was not too small. It was being filled with mostly irrelevant content, while the genuinely relevant report sat somewhere in the middle of the prompt, the dead zone where models attend to content least reliably.
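The logging that surfaced these findings can be sketched in a few lines. This is a minimal illustration, not the team's actual code: `count_tokens` is a whitespace stand-in for the model's real tokenizer, and the component names and the `prompt_audit` logger name are hypothetical.

```python
import json
import logging
from typing import Callable

logger = logging.getLogger("prompt_audit")  # hypothetical logger name


def count_tokens(text: str) -> int:
    # Stand-in: whitespace word count. Swap in the model's real tokenizer.
    return len(text.split())


def composition_record(
    parts: dict[str, list[str]],
    counter: Callable[[str], int] = count_tokens,
) -> dict:
    """Summarize how many tokens each prompt component contributes."""
    tokens = {
        name: sum(counter(text) for text in texts)
        for name, texts in parts.items()
    }
    total = sum(tokens.values())
    return {
        "total_tokens": total,
        "tokens": tokens,
        # Fraction of the prompt taken by each component.
        "share": {
            name: round(n / total, 3) if total else 0.0
            for name, n in tokens.items()
        },
        # Number of items per component, e.g. how many reports were pasted.
        "item_counts": {name: len(texts) for name, texts in parts.items()},
    }


def log_prompt(parts: dict[str, list[str]]) -> dict:
    """Emit one structured log line per prompt sent to the model."""
    record = composition_record(parts)
    logger.info(json.dumps(record))
    return record
```

Calling `log_prompt` with the pieces of each prompt just before the model call yields one JSON line per query, which is enough to see at a glance when reports or history dominate the window.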
The Decision: Retrieve and Prioritize, Do Not Enlarge
The team had two paths. Path one: upgrade to a larger window and keep stuffing. Path two: stop stuffing and send only relevant, well-ordered content.
They chose path two, for reasons spelled out in the best practices article. A larger window would have made the bill worse and the lost-in-the-middle problem more severe, not less. The real issue was relevance density: the fraction of the prompt that actually pertained to the question was tiny. No window size fixes low relevance density. Retrieval does.
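Relevance density is a simple ratio, and a little arithmetic shows why enlarging the window moves it the wrong way. The sketch below uses the illustrative prompt sizes from this case study. The 5,000 relevant tokens in the stuffed case, the doubling of the prompt under a larger window, and the treatment of all 6,000 retrieved tokens as relevant are assumptions made only for the example.

```python
def relevance_density(relevant_tokens: int, total_tokens: int) -> float:
    """Fraction of the prompt that actually pertains to the question."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return relevant_tokens / total_tokens


# Assumed: one relevant 5,000-token report inside a 110,000-token prompt.
stuffed = relevance_density(5_000, 110_000)    # about 4.5 percent

# Assumed: a larger window stuffed to twice the size, same relevant report.
enlarged = relevance_density(5_000, 220_000)   # about 2.3 percent

# Assumed: 6,000 retrieved tokens, all relevant, in a 9,000-token prompt.
retrieved = relevance_density(6_000, 9_000)    # about 66.7 percent
```

Under these assumptions, doubling the stuffed prompt halves the density, while retrieval raises it by more than an order of magnitude.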
The decision, stated plainly: rebuild context assembly around retrieval and explicit prioritization, on the existing model.
The Execution: Four Concrete Changes
Chunk and index the report library
The 2,500 reports were split into passages along section boundaries, each self-contained, a few hundred tokens. Splitting on sections rather than fixed length mattered; an earlier fixed-length attempt cut tables and procedures mid-step and produced exactly the confident wrong answers they were trying to kill. This mirrors the chunking lesson from the real-world examples.
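A minimal sketch of section-boundary splitting follows. The case study does not say how the reports were formatted, so the sketch assumes Markdown-style heading lines (`#` through `######`) mark section starts. The function name `split_on_sections` is hypothetical.

```python
import re

# Assumption: sections begin with Markdown-style heading lines.
HEADING = re.compile(r"^#{1,6} .+$", re.MULTILINE)


def split_on_sections(report: str) -> list[str]:
    """Split a report into passages, one per section, heading included."""
    starts = [match.start() for match in HEADING.finditer(report)]
    # Keep any text that appears before the first heading as its own passage.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    starts.append(len(report))
    chunks = [report[a:b].strip() for a, b in zip(starts, starts[1:])]
    return [chunk for chunk in chunks if chunk]
```

Because cuts fall only at headings, a table or a numbered procedure stays in one passage. A section much longer than a few hundred tokens would still need a secondary split, for instance at paragraph breaks, which this sketch leaves out.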
Retrieve only the top passages per query
At query time, the system pulled the handful of passages most relevant to the question, around 6,000 tokens total, instead of a dozen whole reports. Relevance density went from a few percent to the majority of the prompt.
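The retrieval step might look like the sketch below. The case study does not name the ranking method, so `overlap_score` is a deliberately crude word-overlap stand-in for whatever similarity measure a real system would use, such as embedding similarity. `count_tokens` is likewise a whitespace stand-in for the model's tokenizer. The 6,000-token default comes from the figure above.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Stand-in: whitespace word count. Swap in the model's real tokenizer.
    return len(text.split())


def overlap_score(question: str, passage: str) -> float:
    # Stand-in relevance score: share of question words found in the passage.
    q_words = set(question.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & set(passage.lower().split())) / len(q_words)


def retrieve(
    question: str,
    passages: list[str],
    budget: int = 6000,
    counter: Callable[[str], int] = count_tokens,
) -> list[str]:
    """Return the best-scoring passages, most relevant first, within budget."""
    scored = sorted(
        ((overlap_score(question, p), p) for p in passages),
        key=lambda pair: pair[0],
        reverse=True,
    )
    picked, used = [], 0
    for score, passage in scored:
        if score == 0:
            break  # nothing relevant remains
        cost = counter(passage)
        if used + cost <= budget:
            picked.append(passage)
            used += cost
    return picked
```

The returned list is already in rank order, which is what a downstream assembly step needs in order to drop the least relevant passages first.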
Bound and summarize conversation history
History was bounded by a token threshold. Once it crossed 60 percent of the working budget, older turns were summarized into a running synopsis while recent turns stayed verbatim. This followed the trigger described in the framework article.
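The trigger can be sketched as a small class. This is an illustration under stated assumptions: `count_tokens` is a whitespace stand-in for a real tokenizer, `placeholder_summarizer` stands in for a model call that would write the synopsis, and `keep_recent` (how many turns stay verbatim) is a hypothetical parameter the case study does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable


def count_tokens(text: str) -> int:
    # Stand-in: whitespace word count. Swap in the model's real tokenizer.
    return len(text.split())


def placeholder_summarizer(items: list[str]) -> str:
    # Stand-in: a real system would ask a model to condense these items.
    return f"[synopsis of {len(items)} earlier items]"


@dataclass
class History:
    budget: int                      # working token budget
    trigger: float = 0.6             # summarize past 60 percent of budget
    keep_recent: int = 4             # assumed: turns kept verbatim
    summarizer: Callable[[list[str]], str] = placeholder_summarizer
    synopsis: str = ""
    turns: list[str] = field(default_factory=list)

    def tokens(self) -> int:
        return count_tokens(self.synopsis) + sum(
            count_tokens(turn) for turn in self.turns
        )

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        over = self.tokens() > self.trigger * self.budget
        cut = len(self.turns) - self.keep_recent
        if over and cut > 0:
            old, self.turns = self.turns[:cut], self.turns[cut:]
            # Fold the previous synopsis in so nothing is silently dropped.
            prior = [self.synopsis] if self.synopsis else []
            self.synopsis = self.summarizer(prior + old)
```

Keying the trigger to tokens rather than to a turn count means a few very long turns trip it as readily as many short ones.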
Add a prioritized assembly function and a pre-send guard
All prompt construction moved into one function that ordered content by priority: system instructions first, the user's question last, retrieved passages ranked by relevance in between. A guard counted the assembled prompt before every call and shed the lowest-ranked passages if it ran over budget. The step-by-step approach describes this guard pattern directly.
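A sketch of such an assembly function with its guard follows. It assumes the passages arrive already ranked, most relevant first, and uses a whitespace `count_tokens` as a stand-in for the real tokenizer. The text above does not say where history sits, so placing it just before the question is an assumption of this sketch, as is the name `assemble_prompt`.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    # Stand-in: whitespace word count. Swap in the model's real tokenizer.
    return len(text.split())


def assemble_prompt(
    system: str,
    ranked_passages: list[str],  # most relevant first
    history: str,
    question: str,
    budget: int,
    counter: Callable[[str], int] = count_tokens,
) -> str:
    """Build the prompt in priority order, shedding low-ranked passages to fit."""
    # Content that is never shed: instructions, history, and the question.
    fixed_cost = sum(counter(part) for part in (system, history, question))
    if fixed_cost > budget:
        raise ValueError("fixed content alone exceeds the token budget")

    kept = list(ranked_passages)
    passage_cost = sum(counter(p) for p in kept)
    # Pre-send guard: drop from the tail (lowest rank) until the prompt fits.
    while kept and fixed_cost + passage_cost > budget:
        passage_cost -= counter(kept.pop())

    # System instructions first, question last, passages in between.
    # Assumption: history goes just before the question.
    parts = [system, *kept, history, question]
    return "\n\n".join(part for part in parts if part)
```

Raising an error when the fixed content alone exceeds the budget is a design choice of the sketch: it makes an oversized history or system prompt fail loudly instead of being truncated without notice.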
The Outcome: Measurable and Clear
The team tracked answer accuracy with a fixed evaluation set of questions whose correct answers they knew.
- Average prompt size dropped from roughly 110,000 tokens to about 9,000.
- Per-query cost fell by a large multiple, because they were no longer paying long-context prices for irrelevant text.
- Latency improved noticeably, since the model processed far less input.
- Answer accuracy on the evaluation set rose substantially, and the maddening run-to-run variability largely disappeared.
The single most telling result: they never changed the model. The same model that produced confident wrong answers produced reliable ones once the context was budgeted and prioritized. The intelligence was always there; the system was drowning it in noise.
The Lessons That Generalize
Several takeaways transfer to almost any context-bound system.
- A large window is not a license to fill it. Relevance density, not raw size, drives answer quality.
- The dead zone is real. Burying the relevant passage in the middle of a stuffed prompt is functionally the same as not including it.
- Silent failures need instrumentation. Nothing in the logs flagged the problem until the team logged prompt composition. The fix started with measurement.
- When relevance density is low, retrieval beats enlargement. Sending less, better-ranked content outperformed sending more.
- One assembly path with priorities makes degradation graceful. When the budget is tight, the system sheds noise, not signal.
The common mistakes guide catalogs each of these failures in isolation; this case study is what they look like when they all hit one system at once.
What the Team Almost Got Wrong
It is worth dwelling on the path not taken, because it is the path most teams default to. The original proposal was a one-line fix: swap in a model with a larger window. That change would have shipped in an afternoon, looked productive, and made everything worse. The bill would have climbed, latency would have grown, and the relevant passage would have sat even deeper inside an even larger stuffed prompt. The team would have spent more money to make the core problem harder to see.
What saved them was a discipline they imposed before touching anything: measure first, change second. They refused to act on the intuition that the window was too small until the logs proved otherwise, and the logs proved the opposite. That sequence, diagnose before prescribing, is the most portable lesson in the whole story. Context-length problems masquerade as model problems, and the only way to tell them apart is instrumentation.
How They Validated the Fix
The team did not trust their own optimism. They built a fixed evaluation set of representative questions with known correct answers and ran it before and after every change. That gave them a hard accuracy number rather than a vibe. They also kept the prompt-composition logging in place permanently, so that future regressions, such as a new feature quietly stuffing the window again, would surface in the metrics rather than in user complaints. The evaluation set plus the logging turned context quality into something they could defend with data in a review, not assert from memory.
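A fixed evaluation set of this kind needs very little machinery. The sketch below is an assumption-laden illustration: `answer_fn` is a hypothetical callable wrapping the assistant, and `contains_expected` is a crude substring grader standing in for whatever correctness check a real team would use.

```python
from typing import Callable, Iterable


def contains_expected(answer: str, expected: str) -> bool:
    # Stand-in grader: case-insensitive substring match.
    return expected.lower() in answer.lower()


def accuracy(
    answer_fn: Callable[[str], str],
    eval_set: Iterable[tuple[str, str]],
    grader: Callable[[str, str], bool] = contains_expected,
) -> float:
    """Fraction of evaluation questions the assistant answers correctly."""
    cases = list(eval_set)
    if not cases:
        raise ValueError("evaluation set is empty")
    correct = sum(grader(answer_fn(question), expected)
                  for question, expected in cases)
    return correct / len(cases)
```

Running `accuracy` over the same `eval_set` before and after each change gives two numbers that can be compared directly, provided the questions and the grader stay fixed between runs.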
Frequently Asked Questions
Why not just upgrade to a model with a bigger window?
Because the window was never the constraint. The prompt was full of irrelevant content, so a bigger window would have let the team stuff in even more noise while making cost, latency, and the lost-in-the-middle effect worse. The fix was sending less, more relevant content.
How did the team discover the real problem?
By instrumenting the assistant to log the exact token count and composition of every prompt. The logs were clean of errors, so the problem was invisible until they could see that prompts were huge and dominated by irrelevant reports. Diagnosis started with measurement.
What made the retrieval rebuild succeed where stuffing failed?
Relevance density. Retrieval sent about 6,000 tokens of passages chosen for the specific question, so most of the prompt was actually relevant. Chunking along section boundaries kept each passage self-contained, which eliminated the mid-procedure splits that had caused wrong answers.
Did answer quality really improve without changing the model?
Yes. Accuracy on a fixed evaluation set rose substantially and run-to-run variability dropped, using the same model throughout. The model's reasoning had always been adequate; it was being buried under irrelevant context that retrieval and prioritization removed.
What was the most important structural change?
Moving all prompt construction into a single prioritized assembly function with a pre-send guard. That ensured the system always sent the highest-priority content and, when over budget, shed the least relevant passages rather than truncating arbitrarily.
Key Takeaways
- The assistant's confident wrong answers came from stuffing a large window with mostly irrelevant reports, not from a weak model.
- The relevant passage was buried in the prompt's middle, where attention is weakest, effectively erasing it.
- Diagnosis required instrumenting prompt size and composition, because logs showed no errors.
- Retrieval of well-ranked, section-aligned passages raised relevance density from a few percent to the majority of the prompt.
- Bounding history with token-threshold summarization and a prioritized assembly guard made the system degrade gracefully.
- Accuracy rose and variability fell with no model change, while average prompt size dropped from about 110,000 tokens to about 9,000.