Run Your RAG System Through This Before You Ship It

Before you ship a retrieval augmented generation system, run it through this checklist. It is organized by phase, from data preparation to launch readiness, and every item carries a short justification so you can judge which apply to your situation and which you can safely defer. Treat it as a working tool, not a wall of best intentions. Copy it, check items off, and use the unchecked ones as your to-do list.

The items are ordered roughly the way you should tackle them, because earlier phases gate later ones. There is no point tuning generation if your chunks are incoherent, and no point launching if you cannot measure whether you regressed. If you want the reasoning in depth behind any item, the best practices guide expands on most of them.

A word on how to use this in a team setting. Assign each phase an owner and review the checklist together before launch, because the phases interact. The person tuning retrieval needs to know how chunking was done; the person writing the prompt needs to know what metadata is available for citations. Walking the list as a group surfaces the gaps that individual owners miss in isolation, and it gives everyone a shared definition of what "done" means for this system.

Data Preparation

This phase sets the ceiling on everything downstream. Get it wrong and no amount of clever retrieval recovers.

Documents converted to clean text. Confirm PDFs and HTML extracted without scrambling tables or multi-column layouts. Garbled input produces garbled answers no model can fix.
Boilerplate removed. Strip repeated headers, footers, and navigation so they do not pollute embeddings and waste retrieval slots.
Source and recency verified. Confirm you are indexing current, authoritative documents, not stale or duplicate copies that will surface as confident wrong answers.
Duplicates and near-duplicates handled. Multiple copies of the same content waste retrieval slots and can crowd out other relevant chunks. Deduplicate before indexing so a single query does not return five versions of the same paragraph.

Chunking

The chunk is the smallest unit retrieval can return, so its quality is your precision ceiling.

Chunked on natural boundaries. Split on headings, paragraphs, and logical blocks before falling back to fixed sizes, so each chunk is a coherent thought.
Chunk size tuned, not copied. Start near 500 tokens with overlap, then adjust against your evaluation set. Dense reference content often wants smaller chunks; narrative content wants larger.
Chunks spot-read for coherence. Read ten random chunks. Each should stand alone as a complete idea. Fragments or sprawling multi-topic chunks mean you are not done.
Structure preserved where it matters. For tables, code, and lists, confirm chunking did not sever them mid-structure. A chunk that ends in the middle of a table or a function is far less useful than one that respects the unit's boundary.

Embedding and Storage

Single embedding model committed. The model used to index must match the one used to query, or similarity scores are meaningless.
Original text stored with vectors. You need the source text for the prompt and for citations, not just the vector.
Metadata captured at index time. Record source, section, date, product, and access level now. Retrofitting metadata means re-indexing everything, so design the schema before the first load.

Retrieval

This is where most quality is won, so do not shortcut it.

Hybrid search enabled. Combine vector and keyword search so exact terms like codes and names are not missed. Pure vector search reliably fails on specifics, a trap detailed in the common mistakes.
Reranking in place. Retrieve a wide candidate set, then rerank with a cross-encoder and keep only the top few. Initial search is fast and approximate; reranking is what gets the best chunk into the prompt.
Metadata filtering applied. Scope retrieval to the relevant subset and enforce access control before similarity search runs. This improves both relevance and security.

Generation

Grounding instruction present. The prompt explicitly tells the model to answer only from the provided context.
Uncertainty handling defined. The model is instructed to admit when context is insufficient and, where relevant, to escalate rather than fabricate.
Citations returned. Every answer points to its source chunks so users can verify, which also measurably reduces unsupported claims.
Context size right-sized. You pass a few precise chunks, not twenty mediocre ones, because models lose accuracy when relevant facts are buried in long context.

Evaluation

The phase teams skip and regret. RAG hides its failures behind fluent prose, so measurement is the only honest signal.

Evaluation set built. At least fifty question-and-source pairs reflecting real usage. This is your control panel; without it every change is a guess. The step-by-step guide shows how to assemble it.
Retrieval metrics tracked. Recall and precision at k tell you whether the right chunk was fetched at all.
Generation metrics tracked. Faithfulness and answer relevance tell you whether the model used the context correctly.
Regression check automated. The eval suite runs on every change so improvements are proven and regressions are caught before they ship.

Launch Readiness

Index update path built. You can re-embed changed documents and purge retired ones. A stale index produces stale answers no matter how good the pipeline is.
Document versioning in metadata. You can trace any answer to the document revision it came from, which matters for debugging and for regulated domains.
Failure and escalation behavior tested. You have confirmed what happens when retrieval returns nothing relevant, and that it degrades gracefully rather than fabricating.
Cost and latency measured. You know the per-query cost and response time under realistic load, not just in a single-user demo.
Monitoring and logging in place. You capture queries, retrieved chunks, and answers so you can audit failures after launch. Production reveals query patterns no test set anticipated, and logs are how you find and fix them.

Frequently Asked Questions

Which checklist items are truly non-negotiable?

The evaluation set, hybrid search, grounding instructions, and the index update path. These four prevent the failure modes that most often sink RAG deployments. The rest are strongly recommended, but skipping any of these four tends to produce a system that breaks in production.

Can I launch without reranking?

Yes, for a first version with hybrid search and a small k. Add reranking when your evaluation shows the right chunk is being retrieved but ranked too low to enter the prompt. It is a precision upgrade rather than a launch blocker.

How early should I build the evaluation set?

Before you tune anything. It is the instrument that tells you whether each change helped or hurt, and RAG's fluent wrong answers make eyeballing unreliable. Building it late means you have been optimizing blind the whole time.

What if I cannot capture rich metadata?

Capture at least source and date. Even minimal metadata enables citations and freshness checks. Richer fields like product and access level unlock filtering, but designing the schema early matters more than completeness, because adding fields later forces a re-index.

Is this checklist still valid as models improve?

Yes. Better models reduce some generation errors but do nothing for retrieval, chunking, freshness, or evaluation, which remain the dominant failure sources. The checklist targets the systems problems around the model, and those persist regardless of model quality.

Key Takeaways

Work the checklist in phase order; earlier phases gate later ones.
Data cleanup and coherent chunking set the ceiling on everything downstream.
Hybrid search, reranking, and metadata filtering are the retrieval levers that matter.
Ground generation, handle uncertainty, cite sources, and keep context precise.
The evaluation set is non-negotiable because RAG hides its failures.
Confirm the index update path, versioning, escalation, and cost before launch.

Data Preparation

This phase sets the ceiling on everything downstream. Get it wrong and no amount of clever retrieval recovers.

Documents converted to clean text. Confirm PDFs and HTML extracted without scrambling tables or multi-column layouts. Garbled input produces garbled answers no model can fix.
Boilerplate removed. Strip repeated headers, footers, and navigation so they do not pollute embeddings and waste retrieval slots.
Source and recency verified. Confirm you are indexing current, authoritative documents, not stale or duplicate copies that will surface as confident wrong answers.
Duplicates and near-duplicates handled. Multiple copies of the same content waste retrieval slots and can crowd out other relevant chunks. Deduplicate before indexing so a single query does not return five versions of the same paragraph.

Chunking

The chunk is the smallest unit retrieval can return, so its quality is your precision ceiling.

Chunked on natural boundaries. Split on headings, paragraphs, and logical blocks before falling back to fixed sizes, so each chunk is a coherent thought.
Chunk size tuned, not copied. Start near 500 tokens with overlap, then adjust against your evaluation set. Dense reference content often wants smaller chunks; narrative content wants larger.
Chunks spot-read for coherence. Read ten random chunks. Each should stand alone as a complete idea. Fragments or sprawling multi-topic chunks mean you are not done.
Structure preserved where it matters. For tables, code, and lists, confirm chunking did not sever them mid-structure. A chunk that ends in the middle of a table or a function is far less useful than one that respects the unit's boundary.

Embedding and Storage

Single embedding model committed. The model used to index must match the one used to query, or similarity scores are meaningless.
Original text stored with vectors. You need the source text for the prompt and for citations, not just the vector.
Metadata captured at index time. Record source, section, date, product, and access level now. Retrofitting metadata means re-indexing everything, so design the schema before the first load.

Retrieval

This is where most quality is won, so do not shortcut it.

Hybrid search enabled. Combine vector and keyword search so exact terms like codes and names are not missed. Pure vector search reliably fails on specifics, a trap detailed in the common mistakes.
Reranking in place. Retrieve a wide candidate set, then rerank with a cross-encoder and keep only the top few. Initial search is fast and approximate; reranking is what gets the best chunk into the prompt.
Metadata filtering applied. Scope retrieval to the relevant subset and enforce access control before similarity search runs. This improves both relevance and security.

Generation

Grounding instruction present. The prompt explicitly tells the model to answer only from the provided context.
Uncertainty handling defined. The model is instructed to admit when context is insufficient and, where relevant, to escalate rather than fabricate.
Citations returned. Every answer points to its source chunks so users can verify, which also measurably reduces unsupported claims.
Context size right-sized. You pass a few precise chunks, not twenty mediocre ones, because models lose accuracy when relevant facts are buried in long context.

Evaluation

The phase teams skip and regret. RAG hides its failures behind fluent prose, so measurement is the only honest signal.

Evaluation set built. At least fifty question-and-source pairs reflecting real usage. This is your control panel; without it every change is a guess. The step-by-step guide shows how to assemble it.
Retrieval metrics tracked. Recall and precision at k tell you whether the right chunk was fetched at all.
Generation metrics tracked. Faithfulness and answer relevance tell you whether the model used the context correctly.
Regression check automated. The eval suite runs on every change so improvements are proven and regressions are caught before they ship.

Launch Readiness

Index update path built. You can re-embed changed documents and purge retired ones. A stale index produces stale answers no matter how good the pipeline is.
Document versioning in metadata. You can trace any answer to the document revision it came from, which matters for debugging and for regulated domains.
Failure and escalation behavior tested. You have confirmed what happens when retrieval returns nothing relevant, and that it degrades gracefully rather than fabricating.
Cost and latency measured. You know the per-query cost and response time under realistic load, not just in a single-user demo.
Monitoring and logging in place. You capture queries, retrieved chunks, and answers so you can audit failures after launch. Production reveals query patterns no test set anticipated, and logs are how you find and fix them.

Frequently Asked Questions

Which checklist items are truly non-negotiable?

Can I launch without reranking?

How early should I build the evaluation set?

What if I cannot capture rich metadata?

Is this checklist still valid as models improve?

Key Takeaways

Work the checklist in phase order; earlier phases gate later ones.
Data cleanup and coherent chunking set the ceiling on everything downstream.
Hybrid search, reranking, and metadata filtering are the retrieval levers that matter.
Ground generation, handle uncertainty, cite sources, and keep context precise.
The evaluation set is non-negotiable because RAG hides its failures.
Confirm the index update path, versioning, escalation, and cost before launch.

Run Your RAG System Through This Before You Ship It

Data Preparation

Chunking

Embedding and Storage

Retrieval

Generation

Evaluation

Launch Readiness

Frequently Asked Questions

Which checklist items are truly non-negotiable?

Can I launch without reranking?

How early should I build the evaluation set?

What if I cannot capture rich metadata?

Is this checklist still valid as models improve?

Key Takeaways

Agency Script Editorial

Related Articles

Prompt Quality Decides Whether AI Earns Its Keep

Counting the Real Cost of Every Token You Send

Rolling Out AI Hallucinations Across a Team

Ready to certify your AI capability?

Run Your RAG System Through This Before You Ship It

Data Preparation

Chunking

Embedding and Storage

Retrieval

Generation

Evaluation

Launch Readiness

Frequently Asked Questions

Which checklist items are truly non-negotiable?

Can I launch without reranking?

How early should I build the evaluation set?

What if I cannot capture rich metadata?

Is this checklist still valid as models improve?

Key Takeaways

Agency Script Editorial

Related Articles

Prompt Quality Decides Whether AI Earns Its Keep

Counting the Real Cost of Every Token You Send

Rolling Out AI Hallucinations Across a Team

Ready to certify your AI capability?