Context limits feel abstract until you watch them shape a real system. The same window size that is plenty for one use case is hopeless for another, and the right strategy depends entirely on the shape of the content. This article works through specific scenarios, with rough token numbers, to show what made each one fit or fail.
These are representative patterns drawn from common application types, not a single company's data. The numbers are illustrative but realistic, chosen to make the trade-offs visible. As you read, notice that the deciding factor is almost never raw model intelligence. It is how the content maps onto the available budget.
For the underlying concepts, the complete guide covers tokens, the shared window, and the three core strategies these examples apply.
Example 1: A Customer Support Chatbot That Forgot
A support team built a chatbot on a model with a 16,000-token window. It worked beautifully in testing, where conversations were short. In production, longer troubleshooting sessions began losing the thread; by the twentieth exchange, the bot would re-ask for the customer's order number it had been given ten turns earlier.
The cause was unbounded history. Each turn added to the prompt until the oldest turns, including the order number, were silently dropped to fit.
What fixed it: A rolling summary. Once history crossed roughly 9,000 tokens, older turns were compressed into a synopsis that explicitly preserved key facts like the order number, while recent turns stayed verbatim. The bot stopped forgetting because the important details now survived compression by design. This is the summarize strategy from the framework article in action.
Example 2: Contract Analysis That Fit, Barely
A legal-tech tool summarized commercial contracts. Most contracts ran 20,000 to 30,000 tokens, comfortably inside a 100,000-token window. The team correctly chose to send each contract whole rather than chunk it, because a contract must be understood as a unit. Clauses reference each other; retrieving fragments would miss cross-references.
The lesson here is the inverse of Example 1. When a document is a coherent whole and fits with margin, fitting it is the right call, and retrieval would actively hurt by fragmenting the very relationships that matter.
Where it got tense: A handful of contracts with extensive exhibits pushed past 80,000 tokens. The team added a pre-send guard that detected oversize contracts and routed them to a clause-by-clause pass instead of failing. The step-by-step approach describes exactly this kind of guard.
Example 3: Searching a 4,000-Page Knowledge Base
An internal help desk wanted answers drawn from a 4,000-page documentation set, far larger than any window. There was never a question of fitting it; the entire corpus was perhaps 3 million tokens.
The approach: Retrieval. The docs were split into passages of a few hundred tokens each, indexed, and at query time the system pulled the top handful of relevant passages, roughly 3,000 tokens total, into the prompt. The model answered from those passages.
What made it work: Chunk quality and ranking. Early versions chunked by fixed character count, splitting tables and procedures mid-step, which produced confident wrong answers. Re-chunking along section boundaries, so each chunk was self-contained, fixed most of the errors. The window was never the bottleneck; retrieval quality was.
Example 4: A Code Assistant Misjudging Token Counts
A team building a code-review assistant estimated capacity using the three-quarters-of-a-word rule. Their test files fit fine. In production, files dense with minified code, long import blocks, and JSON config tokenized at nearly twice their estimate, and requests started getting rejected.
The fix: Switching from estimation to measuring with the model's actual tokenizer, plus lowering the per-request file limit to leave margin. The deeper lesson, covered in the common mistakes guide, is that code and structured data break the word-count heuristic badly.
Example 5: Choosing the Wrong Window Size
A high-volume product moderated short user posts, a few hundred tokens each, at large scale. The team defaulted to a model with a very large window because it seemed safest. The bill was several times higher than necessary, and latency was worse, because they were paying long-context prices for tiny inputs.
The fix: Moving the moderation task to a smaller, faster window sized to the actual content. Cost dropped sharply with no loss in quality. The reasoning, expanded in the best practices article, is that window size should match the job, and bigger is not free.
Comparing the Strategies Side by Side
Lined up, the pattern is clear:
- Fit it worked for contracts: a coherent document inside the window with margin.
- Summarize it worked for the support chatbot: growing history where the gist matters more than exact wording.
- Retrieve it worked for the knowledge base: a corpus orders of magnitude larger than any window.
- Measure, do not estimate mattered everywhere, but especially for code and structured data.
- Right-size the window saved money where inputs were small and volume was high.
None of these is a default. Each example's outcome followed from matching the strategy to the shape of the content and the budget available.
Example 6: A Meeting Assistant That Summarized Itself Into Trouble
A meeting-notes assistant transcribed and summarized hour-long calls. Transcripts ran 12,000 to 18,000 tokens, well inside the window, so the team sent each whole and asked for a summary. That worked. The trouble came when users started asking follow-up questions about past meetings in the same thread.
Each follow-up reappended the full transcript plus the growing question-and-answer history. By the fourth or fifth follow-up, the thread had accumulated multiple full transcripts and was brushing the window ceiling, with answers starting to clip.
What fixed it: Summarizing each transcript once, up front, and keeping only the summary in the running context. The full transcript was retrieved on demand only when a question genuinely required verbatim detail. This combined the summarize and retrieve strategies: compress by default, fetch the source when precision was needed. It is a reminder that even content which fits individually can overflow when it accumulates across a conversation.
A Note on Mixed Workloads
Real products rarely fit one pattern. A single assistant might handle short chats, occasional large document uploads, and corpus search in the same session. The systems that hold up treat each path separately: small window for chat, fit-or-route for documents, retrieval for corpus search, all behind one interface. Trying to force every path through a single oversized window is how the mistakes in Example 4 and Example 5 happen at once. The deciding habit, across all six examples, is to look at the shape of the content before choosing a strategy, never the other way around.
Frequently Asked Questions
Why did fitting the whole document work for contracts but not the knowledge base?
A contract is a coherent unit whose clauses reference each other, and it fit inside the window with margin, so sending it whole preserved those relationships. The knowledge base was millions of tokens, orders of magnitude larger than any window, so retrieval was the only option that scaled.
How did the support chatbot keep forgetting if the model was capable?
Model capability was never the issue. The conversation history grew past the window, so the oldest turns, including key facts, were dropped to make room. Summarizing older turns while preserving important details fixed it without any model change.
Why did the code assistant's token estimates fail?
Because it estimated from word count, and code with minified blocks, long imports, and JSON tokenizes at far more tokens per visible character than prose. Measuring with the actual tokenizer and lowering the file limit restored reliability.
Is a bigger window ever the wrong choice?
Yes. For the high-volume moderation case, the oversized window multiplied cost and latency for inputs that were only a few hundred tokens. Matching window size to the actual content size saved money with no quality loss.
What is the single most common cause of these failures?
Mismatching the strategy to the shape of the content. Unbounded history needs summarization, huge corpora need retrieval, coherent fitting documents should be sent whole, and small inputs deserve small windows. Most failures trace back to applying the wrong one.
Key Takeaways
- Unbounded conversation history caused a chatbot to forget; rolling summarization that preserved key facts fixed it.
- A coherent document that fits with margin should be sent whole; retrieval would fragment important cross-references.
- A corpus orders of magnitude larger than any window requires retrieval, and chunk quality matters more than window size.
- Code and structured data tokenize far heavier than prose, so measure with the real tokenizer instead of estimating.
- Defaulting to the largest window wastes cost and latency on small inputs; match window size to the content.
- Mixed workloads should route each content type to its own strategy behind one interface.