Advice about prompt compression stays abstract until you see a real prompt shrink. Principles like "compress what repeats" and "prefer selection over condensation" sound reasonable on paper but click only when you watch them applied to a concrete prompt with a concrete result. This article walks through six scenarios drawn from common patterns, each showing what was cut, what was deliberately kept, and whether the compression held up.
The scenarios deliberately include failures, because they teach as much as the wins. A compression that looked clean and quietly broke an edge case is more instructive than three that worked. Read each for the decision, not just the outcome.
None of these depend on special tooling. They are the everyday situations where prompts get long and compression earns its keep. For each one, pay attention to two things: what the team removed and, just as importantly, what they chose to keep. The keep decisions are where the judgment lives, because almost anyone can delete tokens—the skill is knowing which tokens were carrying the task.
Scenario 1: The Bloated System Prompt
A support assistant ran a system prompt of several long paragraphs on every request.
What was done
- The three-paragraph tone-and-style guidance was collapsed into five terse rules.
- A repeated instruction stated in two places was merged into one.
- A polite preamble asking the model to be helpful was deleted entirely.
The result
Token count dropped by roughly a third, and because the prompt ran on every request, the savings compounded across all traffic. Quality held on the test set—the rules survived; only the wording shrank. This is the canonical case for the rule to compress what repeats.
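To make the compounding concrete, here is a back-of-the-envelope calculation. Only the "roughly a third" reduction comes from the scenario; the prompt size and request volume are hypothetical figures chosen for illustration.

```python
def daily_tokens_saved(original_tokens: int, reduction: float, requests_per_day: int) -> int:
    """Tokens saved per day when a fixed prompt shrinks by `reduction` (a fraction between 0 and 1)."""
    saved_per_request = round(original_tokens * reduction)
    return saved_per_request * requests_per_day


# Hypothetical figures: a 900-token system prompt cut by a third, at 50,000 requests per day.
saved = daily_tokens_saved(original_tokens=900, reduction=1 / 3, requests_per_day=50_000)
print(saved)  # 15000000
```

The per-request saving is small; the multiplier is what makes a system prompt worth tightening before anything else.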
Scenario 2: The Padded Retrieval Block
A research assistant retrieved twelve passages and pasted all of them into the prompt.
What was done
- Passages below a relevance threshold were dropped before they reached the prompt.
- The remaining passages were reordered so the strongest sat first.
- Nothing was rewritten—this was pure selection.
The result
The evidence block shrank by half, latency improved, and answer quality actually rose because the decisive passage was no longer buried among marginal ones. Selection beat condensation here, which is why it is the recommended first move and why it connects to the article "Retrieval-Grounded Prompting Is About to Become the Default."
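The selection step can be sketched in a few lines. This is a minimal illustration, assuming the retriever already attaches a relevance score to each passage; the `Passage` type, the `select_passages` helper, and the 0.5 threshold are hypothetical, not details from the scenario.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    score: float  # relevance score from the retriever; higher means more relevant


def select_passages(passages, threshold=0.5, max_keep=None):
    """Drop passages below the threshold and put the strongest first. No text is rewritten."""
    kept = [p for p in passages if p.score >= threshold]
    kept.sort(key=lambda p: p.score, reverse=True)
    return kept if max_keep is None else kept[:max_keep]


passages = [
    Passage("marginal background", 0.31),
    Passage("decisive finding", 0.92),
    Passage("supporting detail", 0.64),
]
for p in select_passages(passages, threshold=0.5):
    print(p.score, p.text)
```

The threshold is the judgment call here: set it from a test set rather than by eye, since a cutoff that is too aggressive can drop the decisive passage along with the marginal ones.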
Scenario 3: The Endless Conversation
A multi-turn assistant carried the entire conversation history into every turn.
What was done
- Early turns that no longer bore on the current question were dropped.
- A short running summary replaced the raw text of resolved sub-topics.
- The most recent turns were kept verbatim, since they carried live context.
The result
History tokens fell sharply per turn while continuity held, because the summary preserved what mattered and discarded only the resolved detail. The risk—summarizing away a fact a later turn needed—was managed by keeping recent turns intact.
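One way to assemble such a history is sketched below. It assumes the running summary is produced elsewhere (by hand or by a separate summarization step) and passed in as a string; the `build_history` name and the `keep_recent` default are hypothetical.

```python
def build_history(turns, summary, keep_recent=4):
    """Return prompt lines: an optional summary of older turns, then recent turns verbatim.

    turns: list of (role, text) tuples in chronological order.
    summary: short text covering resolved sub-topics from the older turns.
    """
    recent = turns[-keep_recent:] if keep_recent > 0 else []
    lines = []
    # Only include the summary when some turns were actually left out.
    if summary and len(turns) > len(recent):
        lines.append(f"Summary of earlier conversation: {summary}")
    lines.extend(f"{role}: {text}" for role, text in recent)
    return lines


turns = [
    ("user", "My order arrived damaged."),
    ("assistant", "Sorry about that. A replacement is on its way."),
    ("user", "Thanks. Separately, can I change my email?"),
    ("assistant", "Yes, under account settings."),
]
for line in build_history(turns, "Damaged order resolved with a replacement.", keep_recent=2):
    print(line)
```

Being generous with `keep_recent` is the conservative choice: it costs a few tokens but lowers the chance that a fact a later turn needs survives only in the summary.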
Scenario 4: The Over-Compressed Instruction (A Failure)
A classification prompt was tightened from a full sentence to a two-word label.
What was done
- "Classify the message by intent and, if it is a complaint, flag it for escalation" became "Classify intent."
- The escalation clause was dropped as if it were filler.
The result
Tokens fell, but the model stopped flagging complaints, because the dropped clause was a constraint, not packaging. The compression looked clean and quietly broke an edge case, exactly the failure cataloged in "Seven Ways Prompt Compression Quietly Backfires." The fix was to restore the constraint while keeping the tighter phrasing elsewhere.
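A cheap static check could have flagged this cut before any model call. The sketch below assumes the team keeps a hand-written list of terms that each constraint depends on; the `missing_constraints` helper and the term list are hypothetical. A keyword match is crude and does not replace a behavioral test, but it catches a clause that vanished outright.

```python
def missing_constraints(compressed_prompt, required_terms):
    """Return the required terms that no longer appear in the compressed prompt."""
    text = compressed_prompt.lower()
    return [term for term in required_terms if term.lower() not in text]


original = "Classify the message by intent and, if it is a complaint, flag it for escalation"
compressed = "Classify intent."

# Hand-picked stems for the constraints the original prompt carried.
required = ["intent", "complaint", "escalat"]
print(missing_constraints(original, required))    # []
print(missing_constraints(compressed, required))  # ['complaint', 'escalat']
```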
Scenario 5: The Model-Assisted Rewrite
A long onboarding prompt was condensed by asking a model to shorten it.
What was done
- The original was handed to a model with an instruction to preserve all requirements.
- The fluent, shorter result was checked against the baseline before adoption.
The result
The rewrite was clean on most inputs but had dropped one rarely-triggered rule, caught only because the baseline test set happened to include the edge case. The lesson is not to avoid model-assisted compression but to verify it, since fluency hides omissions. This is the disciplined version of the step-by-step process applied to an automated rewrite.
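The verification step might look like the sketch below. It assumes a `run_model(prompt, item)` callable that wraps whatever model API is in use and that outputs are deterministic enough to compare exactly; real outputs often need a looser comparison. The `fake_model` stub and its contractor rule are invented purely so the example runs without a model.

```python
def find_regressions(run_model, baseline_prompt, candidate_prompt, test_inputs):
    """Run both prompts over the same inputs and report where the outputs differ."""
    regressions = []
    for item in test_inputs:
        expected = run_model(baseline_prompt, item)
        actual = run_model(candidate_prompt, item)
        if actual != expected:
            regressions.append((item, expected, actual))
    return regressions


def fake_model(prompt, message):
    # Stand-in for a real model call: applies a rule only if the prompt still states it.
    if "contractor" in message and "contractor" in prompt.lower():
        return "route-to-legal"
    return "standard-flow"


baseline = "Onboard new hires. Contractors must be routed to legal."
candidate = "Onboard new hires."  # fluent, shorter, and missing the rare rule
inputs = ["new full-time engineer", "new contractor in design"]

print(find_regressions(fake_model, baseline, candidate, inputs))
# [('new contractor in design', 'route-to-legal', 'standard-flow')]
```

The check is only as good as the inputs: without the contractor case in `inputs`, the list comes back empty and the omission ships.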
Scenario 6: The Few-Shot Prompt With Too Many Examples
A data-extraction prompt included eight worked examples to teach the model the output format.
What was done
- The examples were ranked by how much each one taught beyond the others.
- Redundant examples that demonstrated the same pattern were removed.
- Two examples that covered distinct edge cases were kept, plus one canonical case.
The result
The prompt shrank substantially while extraction accuracy held, because the model had not needed eight examples to learn a format three could teach. The lesson is that examples are tokens too, and past a small number they add cost without adding instruction. The discriminating question for each example was simple: does this teach something the kept examples do not? If not, it was redundant, and redundant examples are among the easiest high-value cuts in a few-shot prompt.
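The discriminating question can be turned into a simple greedy pass. This sketch assumes each example has been tagged by hand with the patterns it demonstrates; the tags, example names, and `select_examples` helper are hypothetical, and the scenario does not say the team ranked examples this way.

```python
def select_examples(examples):
    """Greedily keep examples that teach at least one pattern not yet covered.

    examples: dict mapping example name -> set of pattern tags it demonstrates.
    """
    covered = set()
    chosen = []
    remaining = dict(examples)
    while remaining:
        # Pick the example that adds the most patterns not covered so far.
        name, patterns = max(remaining.items(), key=lambda kv: len(kv[1] - covered))
        if not patterns - covered:
            break  # every remaining example repeats something already demonstrated
        chosen.append(name)
        covered |= patterns
        del remaining[name]
    return chosen


examples = {
    "canonical_a": {"basic_format"},
    "canonical_b": {"basic_format"},
    "missing_value_a": {"null_handling"},
    "missing_value_b": {"null_handling"},
    "nested_record": {"nested_fields"},
}
print(select_examples(examples))
# ['canonical_a', 'missing_value_a', 'nested_record']
```

The tagging is where the judgment sits; the selection itself is mechanical once each example's distinct contribution is written down.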
Why this scenario surprises people
Teams add examples defensively—one more can only help, the thinking goes. But examples have sharply diminishing returns once the format is established, and each one is paid for on every request in a high-frequency prompt. The team here treated examples like any other content: include the ones that teach something distinct, drop the ones that merely repeat a pattern already demonstrated. Applying the selection mindset to examples, not just to retrieved passages, is what made this cut both safe and substantial.
Reading the Pattern Across All Six
Step back and the six scenarios sort into two buckets. The wins (system prompt, retrieval block, conversation, few-shot examples) all came from removing material the task genuinely did not need: filler, marginal passages, resolved history, redundant demonstrations. The failures (the over-tightened classification instruction and the model-assisted rewrite that dropped a rule) came from removing something that looked optional but carried a constraint.
That is the whole discipline in one sentence: compress the redundant, preserve the load-bearing, and use a baseline test to tell which is which. Every scenario here is a variation on that single judgment call, and the teams that get compression right are simply the ones who make that call with measurement instead of intuition.
Frequently Asked Questions
Why did selection improve quality in the retrieval example?
Because removing marginal passages stopped the decisive one from being buried in a long context block, where models can lose track of it. Fewer, better-ranked passages let the model attend to what mattered. Selection saved tokens and sharpened focus at the same time, which is why it is the preferred first move.
How was the conversation summarized without losing context?
By keeping the most recent turns verbatim, where live context lives, and replacing only resolved sub-topics with a short running summary. The risk is summarizing away something a later turn needs, which is managed by being conservative about what counts as resolved and keeping recent turns intact.
What made the classification compression fail?
The cut removed a constraint—the instruction to flag complaints for escalation—while treating it as filler. Constraints change the output if dropped; packaging does not. The model dutifully stopped flagging complaints, and only the test set caught it, which is why constraints must survive every tightening.
Should I avoid model-assisted compression after the rewrite example?
No—it remains useful for long inputs that resist manual tightening. The example shows it must be verified against a baseline that includes edge cases, because a fluent rewrite can silently drop a rarely-triggered rule. Use it, but never adopt the output without checking it.
Key Takeaways
- Bloated system prompts compress well by tightening wording into rules, and the savings multiply because they run on every request.
- Selecting fewer, better-ranked passages can save tokens and improve quality by surfacing the decisive evidence.
- Long conversations compress by summarizing resolved sub-topics while keeping recent turns verbatim.
- Over-tightening an instruction can drop a constraint and silently break an edge case—constraints are not filler.
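- Model-assisted rewrites are useful but must be verified against a baseline that includes edge cases, since fluency hides omissions.
- Few-shot examples are tokens too: keep the ones that teach something distinct and drop those that repeat a pattern already demonstrated.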