Chunk and Embed Is Not Enough: RAG With the Why

Generic RAG advice tells you to chunk your documents and use a vector database. Useful RAG advice tells you why, when, and what to do when it fails. This article is the second kind. Every practice below comes with the reasoning behind it, because a practice you do not understand is a practice you will misapply the moment your situation differs from the tutorial.

These practices are opinionated. Where there is a genuine trade-off, I will name it and tell you which side to favor by default. You can deviate once you understand the cost, but copying them as-is should give you a system that holds up under real traffic rather than one that only looks good in a demo.

Optimize Retrieval Before You Touch the Model

The single highest-leverage belief in RAG is that retrieval, not generation, is your bottleneck. A capable model handed the right context writes a correct answer; handed the wrong context it writes a fluent wrong one. So spend your engineering effort upstream of the model.

This reorders your whole priority list. Before you evaluate which model to use, make sure your system reliably surfaces the right chunks. Most teams that think they have a model problem actually have a retrieval problem, a point we expand on in the common mistakes article.

Use hybrid search as the default

Vector search captures meaning but misses exact strings; keyword search nails exact strings but misses paraphrases. Combining them covers both, and the combination beats either alone on almost any realistic query mix. Make hybrid search your baseline, not an optimization you add later.
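One common way to combine the two result lists is reciprocal rank fusion, which needs only the rank positions and not comparable scores. The sketch below is a minimal illustration, not a prescribed method: the function name, the chunk ids, and the constant k=60 (a conventional default) are all assumptions for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

vector_hits = ["c3", "c1", "c7"]    # hypothetical ids from vector search
keyword_hits = ["c1", "c9", "c3"]   # hypothetical ids from keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Chunks that appear in both lists rise to the top, which is the behavior you want from a hybrid baseline: agreement between the two retrievers counts for more than a high rank in only one.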

Always Rerank Before Generation

Initial retrieval is fast and approximate because it searches across your entire corpus. That speed comes at the cost of precision, so the genuinely best chunk often lands at position eight rather than position one.

A reranker fixes this. It takes your top twenty to fifty candidates and rescores each one against the query using a cross-encoder that reads query and chunk together, which is typically far more accurate than the vector distance used in initial search. You run it on only a few dozen candidates rather than the whole corpus, so the latency cost is small relative to the quality gain. Retrieve wide with fast search, then narrow hard with a reranker, and pass only the top few chunks forward.
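The retrieve-wide, narrow-hard pattern can be sketched as a function that takes any pairwise scorer. In the example below, overlap_score is a deliberately crude stand-in for a real cross-encoder so the code runs without a model; the function names and sample chunks are assumptions for illustration only.

```python
from typing import Callable, List, Tuple

def rerank(query: str, candidates: List[str],
           score: Callable[[str, str], float],
           top_n: int = 5) -> List[Tuple[str, float]]:
    """Rescore every candidate against the query and keep the best top_n."""
    scored = [(chunk, score(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: share of query words found in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

candidates = [
    "Invoices are emailed monthly.",
    "To reset your password, open account settings.",
    "Password rules require twelve characters.",
]
top = rerank("how to reset password", candidates, overlap_score, top_n=2)
for chunk, s in top:
    print(f"{s:.2f}  {chunk}")
```

In a real system you would swap overlap_score for a call into your reranking model and leave the rest unchanged. Keeping the scorer pluggable also makes it easy to compare rerankers against the same candidate sets.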

Engineer the Prompt for Grounding

The prompt is where you decide how the model behaves with the context you fetched. Vague prompts let the model drift back to its training memory and hallucinate. Explicit prompts keep it honest.

Instruct the model to answer using only the provided context.
Instruct it to say it does not know when the context is insufficient, and mean it.
Instruct it to cite which chunk supports each claim.

That last one does double duty. Citations let users verify answers, and forcing the model to point at a source tends to reduce unsupported claims. Treat citations as a quality mechanism, not just a UI nicety.

Keep the instruction at the top, the context in the middle, the question at the end

Prompt structure matters more than teams expect. Put your behavioral instructions first so they frame everything that follows, place the retrieved context in the middle clearly delimited, and end with the user's question so it is the freshest thing in the model's attention. Label each chunk with its source so the model can cite by reference rather than guessing. Small structural choices here change how reliably the model stays grounded, and they cost nothing to get right.
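The instruction, context, question ordering can be sketched as a small prompt builder. This is one possible layout under stated assumptions: the wording of the instructions, the [S1] label scheme, the context delimiters, and the file names are all hypothetical choices, not a required format.

```python
def build_grounded_prompt(question, chunks):
    """Assemble a prompt: instructions first, labeled context, question last.

    chunks: list of (source_label, text) pairs, already reranked.
    """
    instructions = (
        "Answer using only the context below.\n"
        "If the context is insufficient, say you do not know.\n"
        "Cite the source label, such as [S1], for each claim."
    )
    # Label every chunk so the model can cite by reference.
    context_blocks = [
        f"[S{i}] (source: {source})\n{text}"
        for i, (source, text) in enumerate(chunks, start=1)
    ]
    context = "\n\n".join(context_blocks)
    return (
        f"{instructions}\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )

prompt = build_grounded_prompt(
    "How long do refunds take?",
    [("billing-faq.md", "Refunds are issued within 5 business days."),
     ("terms.md", "Refunds apply to annual plans only.")],
)
print(prompt)
```

Building the prompt in one function keeps the structure consistent across every call, so a change to the grounding instructions can be evaluated once rather than hunted down in several places.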

Measure Everything With a Real Evaluation Set

You cannot improve what you cannot measure, and RAG is uniquely good at hiding its failures because wrong answers are fluent and confident. A labeled evaluation set is non-negotiable.

Build at least fifty question-and-source pairs that reflect real usage, then measure two layers on every change. Retrieval metrics like recall at k tell you whether the right chunk was fetched. Generation metrics like faithfulness and answer relevance tell you whether the model used it correctly. Run this suite before and after every change so you can prove improvement instead of guessing. The step-by-step guide shows how to assemble this set during your first build.
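For the retrieval layer, recall at k is simple enough to compute yourself. The sketch below assumes one labeled source per question, in which case recall at k is the fraction of questions whose source shows up in the top k results. The function name, document ids, and canned results are hypothetical stand-ins for your own retriever and evaluation set.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose labeled source appears in the top k results.

    eval_set: list of (question, expected_source_id) pairs.
    retrieve: function mapping a question to a ranked list of source ids.
    """
    if not eval_set:
        raise ValueError("evaluation set is empty")
    hits = sum(
        1 for question, expected in eval_set
        if expected in retrieve(question)[:k]
    )
    return hits / len(eval_set)

# Hypothetical canned results standing in for a real retriever.
canned = {
    "How do I reset my password?": ["doc-auth", "doc-billing"],
    "When are invoices sent?": ["doc-auth", "doc-faq", "doc-billing"],
}
eval_set = [
    ("How do I reset my password?", "doc-auth"),
    ("When are invoices sent?", "doc-billing"),
]
score_at_1 = recall_at_k(eval_set, canned.get, k=1)
score_at_3 = recall_at_k(eval_set, canned.get, k=3)
print(score_at_1, score_at_3)
```

Comparing the same metric at several values of k is informative: a large gap between recall at 1 and recall at 3 suggests the right chunk is being found but ranked too low, which points at reranking rather than indexing.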

Exploit Metadata and Filtering

Most corpora have structure you are throwing away: document type, date, product, department, access level. Capture these as metadata at index time and filter on them at query time.

The payoff is precision and safety. Scope a query to the right product line and you stop pulling plausible-but-wrong chunks from a different product. Filter by access level and you stop leaking documents a user should not see. Filtering narrows the search space before similarity even runs, which improves both relevance and security at once. Decide your metadata schema before you index, because adding fields later means re-indexing everything.
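Filtering before ranking can be sketched in a few lines. The example assumes a hypothetical schema with a product field and a numeric access_level, and it uses precomputed similarity values so it runs without an embedding model; a real vector store would apply the filter as part of the query.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    text: str
    product: str
    access_level: int   # assumed convention: higher means more restricted
    similarity: float   # precomputed for the example only

def filtered_search(chunks: List[Chunk], product: str,
                    user_level: int, top_k: int = 3) -> List[Chunk]:
    """Drop out-of-scope and unauthorized chunks, then rank what remains."""
    allowed = [
        c for c in chunks
        if c.product == product and c.access_level <= user_level
    ]
    return sorted(allowed, key=lambda c: c.similarity, reverse=True)[:top_k]

corpus = [
    Chunk("Enterprise SSO setup", "enterprise", 1, 0.91),
    Chunk("Starter plan SSO limits", "starter", 1, 0.95),
    Chunk("Enterprise internal pricing notes", "enterprise", 3, 0.97),
]
results = filtered_search(corpus, product="enterprise", user_level=1)
print([c.text for c in results])
```

Note that the two excluded chunks have the highest similarity scores. Without the filter, the wrong-product chunk and the restricted chunk would both have outranked the correct answer.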

A practical way to find your schema is to look at how users naturally qualify their questions. If they say "in the billing docs" or "for the enterprise plan," those qualifiers are your metadata fields. Capturing them lets the system honor a scope the user already implied, instead of searching the entire corpus and hoping the right region wins on similarity alone.

Right-Size Your Context Window

More chunks feels safer and is usually worse. Models lose accuracy when relevant facts sit in the middle of a long context, and irrelevant chunks distract the model toward wrong conclusions while inflating cost and latency.

Favor precision over volume. After reranking, pass three to five strong chunks rather than twenty mediocre ones. If you genuinely need broad coverage, summarize or compress retrieved content rather than dumping raw text. The goal is a context that is dense with signal, not padded with hopeful guesses.
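One way to enforce precision over volume is to cap the number of chunks and also drop anything the reranker scored weakly. The sketch below is an assumption-laden illustration: the function name, the cap of five, and the min_score threshold of 0.5 are hypothetical, and a sensible threshold depends entirely on your reranker's score scale.

```python
def select_context(scored_chunks, max_chunks=5, min_score=0.5):
    """Keep at most max_chunks chunks whose reranker score clears min_score.

    scored_chunks: list of (text, reranker_score) pairs in any order.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    strong = [text for text, score in ranked if score >= min_score]
    return strong[:max_chunks]

scored = [("a", 0.9), ("b", 0.2), ("c", 0.7), ("d", 0.55), ("e", 0.1)]
print(select_context(scored))                 # cap not reached, floor applies
print(select_context(scored, max_chunks=2))   # cap applies
```

The score floor matters when retrieval finds little of value: passing two strong chunks is better than padding the context up to the cap with weak ones.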

Treat the Index as a Living System

A RAG system is only as fresh as its index. Stale documents produce stale answers, and retired documents that linger get cited as if still valid.

Build the update path before launch. Re-embed changed documents, purge retired ones, and version your metadata so an answer can be traced to the document revision it came from. An index you set up once and forget will quietly degrade into a liability, eroding the user trust that grounding was supposed to earn. For where this fits in a launch sequence, see the checklist.
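A simple update path compares a content hash stored at index time with the hash of the current document. The sketch below only plans the work; the function names and document ids are hypothetical, and the actual re-embedding and deletion calls depend on your vector store.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_index_update(indexed: dict, current_docs: dict):
    """Decide which documents to re-embed and which to purge.

    indexed: doc_id -> hash stored at last indexing.
    current_docs: doc_id -> current text.
    Returns (to_embed, to_purge) as sorted lists of doc ids.
    """
    # New or changed documents need (re-)embedding.
    to_embed = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed.get(doc_id) != content_hash(text)
    )
    # Documents that no longer exist upstream must leave the index.
    to_purge = sorted(doc_id for doc_id in indexed if doc_id not in current_docs)
    return to_embed, to_purge

indexed = {
    "pricing": content_hash("Plan A costs $10."),
    "legacy-api": content_hash("v1 endpoints"),
}
current = {"pricing": "Plan A costs $12.", "onboarding": "Welcome guide"}
to_embed, to_purge = plan_index_update(indexed, current)
print(to_embed, to_purge)
```

Storing the hash alongside each chunk's metadata also gives you the revision trace described above: an answer's citation can be matched to the exact document version it was embedded from.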

Frequently Asked Questions

What is the single most important best practice?

Optimize retrieval before the model, backed by a real evaluation set. Together they ensure you fix actual problems rather than imagined ones. Nearly every other practice flows from accepting that retrieval quality, not model choice, drives RAG performance.

Is reranking worth the added latency?

Almost always. You run the reranker on a few dozen candidates, not your whole corpus, so the cost is modest, and the precision gain of lifting the right chunk into the top few positions is large. If latency is critical, rerank fewer candidates rather than dropping the step.

How big should my evaluation set be?

Start with fifty pairs that mirror real questions and grow from there. Even a small set catches regressions that spot-checking misses. The point is consistency: run the same set on every change so improvements and regressions are visible rather than felt.

Should I use metadata filtering on every query?

Use it whenever queries map to a known subset of documents, which is most of the time. Filtering improves relevance and can enforce access control. The main cost is designing the metadata schema up front, which is well worth it because retrofitting metadata means re-indexing.

How do I know if my context window is too full?

If accuracy drops as you add chunks, or the model cites irrelevant material, your context is too full. Reduce to the top few reranked chunks and measure. Dense, precise context almost always outperforms large, padded context.

Key Takeaways

Fix retrieval before touching the model; it is the real bottleneck.
Make hybrid search and reranking your defaults, not afterthoughts.
Engineer prompts to ground answers, admit uncertainty, and cite sources.
A labeled evaluation set is non-negotiable because RAG hides its failures.
Use metadata filtering for precision and access control, and plan the schema early.
Favor dense, precise context over volume, and keep the index continuously fresh.

Agency Script Editorial
