AGENCYSCRIPT
CoursesEnterpriseBlog
πŸ‘‘FoundersSign inJoin Waitlist
AGENCYSCRIPT

Governed Certification Framework

The operating system for AI-enabled agency building. Certify judgment under constraint. Standards over scale. Governance over shortcuts.

Stay informed

Governance updates, certification insights, and industry standards.

Products

  • Platform
  • Certification
  • Launch Program
  • Vault
  • The Book

Certification

  • Foundation (AS-F)
  • Operator (AS-O)
  • Architect (AS-A)
  • Principal (AS-P)

Resources

  • Blog
  • Verify Credential
  • Enterprise
  • Partners
  • Pricing

Company

  • About
  • Contact
  • Careers
  • Press
Β© 2026 Agency Script, Inc.Β·
Privacy PolicyTerms of ServiceCertification AgreementSecurity

Standards over scale. Judgment over volume. Governance over shortcuts.

On This Page

The Limits of Single-Shot RetrievalQuery and Document MismatchMulti-Hop QuestionsReranking and Two-Stage RetrievalRetrieve Broadly, Rank PreciselyWhy It MattersQuery Transformation TechniquesAgentic and Iterative RetrievalRetrieve, Evaluate, Retrieve AgainKnowing When to StopEdge Cases That Break Naive PipelinesContext Assembly as Its Own DisciplineOrdering and Position EffectsContext CompressionStructured ContextFrequently Asked QuestionsWhen is reranking worth the extra cost?How do I handle questions that span multiple documents?Is agentic retrieval ready for production?What is the most overlooked advanced failure?Key Takeaways
Home/Blog/Past the Tutorials: The Hard Parts of Feeding Models Context
General

Past the Tutorials: The Hard Parts of Feeding Models Context

A

Agency Script Editorial

Editorial Team

Β·November 21, 2023Β·9 min read
context engineeringcontext engineering advancedcontext engineering guideprompt engineering

The gap between a context pipeline that demos well and one that holds up in production is wide, and it is filled with problems the tutorials skip. Once you have the basics working, the naive approach of embedding a query, fetching the top chunks, and stuffing them into a prompt starts failing on the queries that matter most: the ambiguous ones, the multi-part ones, the ones where the answer lives across several documents.

This piece assumes you already understand the fundamentals. We are past chunk size and prompt structure. The focus here is the techniques and edge cases that separate practitioners from people who followed a quickstart, and the nuance that only shows up when real users hit your system with questions you did not anticipate.

If you are still building your first pipeline, Your First Real Context Engineering Win, Step by Step is the better starting point. Come back when single-shot retrieval stops being enough.

The Limits of Single-Shot Retrieval

The default pipeline assumes one query maps cleanly to one set of relevant chunks. Real queries break that assumption constantly.

Query and Document Mismatch

Users ask questions in language that does not match how documents are written. A user asks "why is my account locked" while the document says "authentication lockout occurs after repeated failed attempts." Embedding similarity helps but does not fully bridge this gap. The advanced fix is query transformation: rewriting or expanding the user's query before retrieval so it better matches the corpus vocabulary.

Multi-Hop Questions

Some questions require chaining facts across documents. "Which of our enterprise plans includes the feature the support team recommended for compliance?" needs information from at least two places, and a single retrieval pass rarely gathers both. These questions demand either decomposition into sub-queries or iterative retrieval.

Reranking and Two-Stage Retrieval

A single retrieval step forces an impossible trade-off: fetch few chunks and risk missing the answer, or fetch many and drown the model in noise. Two-stage retrieval resolves it.

Retrieve Broadly, Rank Precisely

First retrieve a generous candidate set using fast vector search, prioritizing recall. Then apply a more expensive, more accurate reranking model to that candidate set, prioritizing precision. The reranker sees the full query and each candidate together and scores relevance far more accurately than embedding distance alone. You get the recall of broad retrieval and the precision of careful ranking.

Why It Matters

Embedding similarity is a coarse proxy for relevance. A passage can be semantically near a query and still not answer it. The reranking stage is where many production systems recover the accuracy a single-stage pipeline leaves on the table, which is why it appears repeatedly in Context Engineering: Best Practices That Actually Work.

Query Transformation Techniques

When the gap between how users ask and how documents are written is wide, transform the query before it ever hits the index.

  • Query expansion. Generate several phrasings of the question and retrieve for each, then merge results. This catches relevant chunks a single phrasing would miss.
  • Hypothetical answers. Have the model draft a hypothetical answer to the query, then embed that answer to retrieve against. A well-formed answer often sits closer to the relevant documents than the bare question.
  • Decomposition. Split a complex question into sub-questions, retrieve for each, and assemble the combined context. This is the foundation for handling multi-hop queries.

Each technique adds calls and latency, so apply them where query complexity justifies the cost rather than universally.

Agentic and Iterative Retrieval

The frontier of context engineering hands control of retrieval to the model itself, turning a fixed pipeline into a loop.

Retrieve, Evaluate, Retrieve Again

Instead of one pass, the model retrieves, judges whether the result is sufficient, and searches again with a refined query if not. This handles open-ended questions where you cannot know in advance what to fetch. The cost is real: more calls, more latency, and more orchestration logic to keep the loop from spinning.

Knowing When to Stop

The hard part of iterative retrieval is termination. A loop that searches forever burns budget; one that stops too early returns incomplete answers. Production systems set a maximum iteration count and a confidence threshold, then accept that some queries will hit the ceiling. Designing these guardrails is genuinely advanced work and a place where measurement is indispensable.

Edge Cases That Break Naive Pipelines

The failures that reach production are rarely the ones you tested. These recur across systems.

  • Conflicting sources. Two retrieved documents disagree. A naive prompt presents both and lets the model pick arbitrarily. A robust system surfaces the conflict or applies recency and authority rules.
  • Empty retrieval. No chunk is relevant. The system must recognize this and say so, not hallucinate an answer from the nearest irrelevant material.
  • Stale context. A document was updated but the index was not re-embedded, so retrieval serves outdated information confidently. This is an operational failure that quiet pipelines hide for weeks.
  • Permission leakage. Retrieval surfaces a document the user should not see. This is a security failure, not a quality one, and it is covered in depth in The Hidden Risks of Context Engineering (and How to Manage Them).

Context Assembly as Its Own Discipline

Beyond retrieval, the way you assemble the final context, the ordering, structuring, and compression of what reaches the model, is an advanced lever most teams underuse.

Ordering and Position Effects

Models do not weigh every part of a long input equally. Material at the very start and very end of the context tends to get more attention than material in the middle. Advanced practitioners exploit this deliberately, placing the most critical retrieved chunk where the model is most likely to use it and avoiding burying the answer in the middle of a long assembly. This is not a universal law across all models, which is exactly why you measure it on your own stack rather than trusting folklore.

Context Compression

When retrieval returns relevant but verbose material, passing it verbatim wastes tokens and dilutes signal. Compressing each chunk, summarizing or extracting only the query-relevant portion before assembly, lets you fit more relevant information into the same budget. The trade-off is an extra processing step and the risk that compression drops something important, so it earns its place where token budget is tight and the source material is padded.

Structured Context

Handing the model a wall of concatenated text is the naive assembly. Structuring the context, labeling sources, separating retrieved evidence from instructions, marking which chunks are most authoritative, helps the model reason over the material and makes prompt injection harder. This structure is also what enables provenance, since each claim can be traced back to a labeled source.

Frequently Asked Questions

When is reranking worth the extra cost?

Almost always once you are past prototype. Embedding similarity is a coarse relevance signal, and reranking recovers meaningful accuracy for a modest latency cost on a small candidate set. The exception is extremely latency-sensitive paths where even a few hundred milliseconds is unacceptable; there you tune retrieval harder instead.

How do I handle questions that span multiple documents?

Decompose the question into sub-queries, retrieve for each, and assemble the combined context, or use iterative retrieval that searches, evaluates, and searches again. Single-shot retrieval will reliably miss one of the required pieces on multi-hop questions, so detecting that a query is multi-hop and routing it accordingly is the real skill.

Is agentic retrieval ready for production?

For the right problems, yes, with guardrails. Open-ended research-style questions benefit from iterative retrieval, but you must cap iterations, set a confidence threshold, and monitor cost closely. For simple lookups it is wasteful. Treat it as a tool for a specific class of hard queries, not a default architecture.

What is the most overlooked advanced failure?

Stale context from an index that was not re-embedded after source documents changed. It is silent, it serves confidently wrong answers, and it can persist for weeks because the pipeline looks healthy. Re-indexing discipline and freshness monitoring are unglamorous but catch a failure that pure quality testing misses.

Key Takeaways

  • Single-shot retrieval breaks on ambiguous, multi-part, and multi-hop queries; advanced work is about handling exactly those.
  • Two-stage retrieval, broad recall followed by precise reranking, recovers accuracy that single-stage pipelines leave on the table.
  • Query transformation, expansion, hypothetical answers, and decomposition, bridges the gap between user language and document language.
  • Agentic retrieval handles open-ended questions through iterate-evaluate loops, but demands termination guardrails and close cost monitoring.
  • Production failures are edge cases: conflicting sources, empty retrieval, stale context, and permission leakage each need deliberate handling.

Search Articles

Categories

OperationsSalesDeliveryGovernance

Popular Tags

prompt engineeringai fundamentalsai toolsthe difference between AIMLagency operationsagency growthenterprise sales

Share Article

A

Agency Script Editorial

Editorial Team

The Agency Script editorial team delivers operational insights on AI delivery, certification, and governance for modern agency operators.

Related Articles

General

Rolling Out AI Hallucinations Across a Team

Most teams discover AI hallucinations the hard way β€” a confident-sounding wrong answer makes it into a client deliverable, a legal brief, or a published report. The damage isn't just to the output; it

A
Agency Script Editorial
June 1, 2026Β·11 min read
General

Case Study: Large Language Models in Practice

Most teams that fail with large language models don't fail because the technology doesn't work. They fail because they treat deployment as a one-time event rather than a discipline β€” pick a model, wri

A
Agency Script Editorial
June 1, 2026Β·11 min read
General

Thirty-Second Wins Breed False Confidence With LLMs

Working with large language models is deceptively easy to start and surprisingly hard to do well. You can get a useful output in thirty seconds, which creates a false confidence that compounds over ti

A
Agency Script Editorial
June 1, 2026Β·10 min read

Ready to certify your AI capability?

Join the professionals building governed, repeatable AI delivery systems.

Explore Certification