When Only One Engineer Can Actually Run Your RAG System

The difference between a RAG demo and a RAG product is whether anyone but the original engineer can run it. Most RAG systems are tribal knowledge: one person knows why the chunk size is 350, why there's a reranker, why the index rebuilds on Tuesdays. When that person is on vacation and quality drops, nobody can diagnose it.

This article turns RAG into a documented workflow — a sequence of stages, each with defined inputs, outputs, and an owner, so the work is repeatable and hand-off-able. Think of it as the assembly line behind the system. For the conceptual foundations, see A Framework for Retrieval Augmented Generation; here we're concerned with the mechanics of who does what, in what order, and what each stage hands to the next.

Stage 1: Ingestion

Input: Raw source documents. Output: Cleaned, normalized text with metadata. Owner: Data engineer.

Ingestion is where most quality problems are born, silently. A PDF parsed badly, a table flattened into gibberish, an HTML page that brought along its navigation menu — all of it poisons retrieval downstream.

The repeatable version of this stage:

Define a parser per source type (PDF, HTML, Markdown, database export).
Strip boilerplate — headers, footers, navigation — but preserve tables and structure.
Attach metadata to every document: source, last-updated date, access level, author.

Document the parsing rules. When a new source type appears, the workflow tells you exactly what to add rather than leaving the next engineer to reverse-engineer it.
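One way to make "a parser per source type" concrete is a small registry that fails loudly on unknown types. The sketch below is illustrative only: `ParsedDocument`, `PARSERS`, `register_parser`, and `ingest` are hypothetical names, and the Markdown parser is a placeholder rather than a real cleaning rule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict


@dataclass
class ParsedDocument:
    text: str
    source: str
    last_updated: date
    access_level: str
    author: str


# Hypothetical registry: one parser function per source type.
PARSERS: Dict[str, Callable[[str], str]] = {}


def register_parser(source_type: str):
    """Decorator that records a parser under its source type."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        PARSERS[source_type] = func
        return func
    return decorator


@register_parser("markdown")
def parse_markdown(raw: str) -> str:
    # Placeholder rule for illustration: only trims surrounding whitespace.
    return raw.strip()


def ingest(raw: str, source_type: str, *, source: str, last_updated: date,
           access_level: str, author: str) -> ParsedDocument:
    """Parse one raw document and attach the required metadata."""
    if source_type not in PARSERS:
        # Fail loudly so a new source type forces a documented parser.
        raise ValueError(f"No parser registered for source type {source_type!r}")
    return ParsedDocument(
        text=PARSERS[source_type](raw),
        source=source,
        last_updated=last_updated,
        access_level=access_level,
        author=author,
    )


doc = ingest("  # Title\n\nBody text.  ", "markdown", source="wiki/intro.md",
             last_updated=date(2024, 1, 15), access_level="internal",
             author="docs-team")
```

Because metadata fields are required keyword arguments, a document cannot enter the pipeline without a source, date, access level, and author.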

A useful discipline is to spot-check parsed output before it ever reaches the index. Pull ten random parsed documents and read them as a human would. If a table came through as a wall of disconnected numbers or a PDF lost its section headings, you've caught the problem at the cheapest possible point. Garbage that enters at ingestion is nearly impossible to detect later — it just produces subtly wrong answers that look plausible.
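The spot-check can be as simple as a seeded random sample, so two reviewers can look at the same documents. This is a minimal sketch; `sample_for_review` and `preview` are hypothetical helper names, and the 200-character preview width is an arbitrary choice.

```python
import random
from typing import List, Optional


def sample_for_review(parsed_texts: List[str], k: int = 10,
                      seed: Optional[int] = None) -> List[str]:
    """Return up to k randomly chosen parsed documents for human review."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(parsed_texts, min(k, len(parsed_texts)))


def preview(text: str, width: int = 200) -> str:
    """Collapse whitespace and truncate so a reviewer can skim quickly."""
    return " ".join(text.split())[:width]


corpus = [f"Parsed document number {i}" for i in range(25)]
for text in sample_for_review(corpus, seed=7):
    print(preview(text))
```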

Stage 2: Chunking

Input: Cleaned text. Output: Chunks with overlap and metadata. Owner: ML engineer.

Chunking is the most consequential decision in the whole workflow and the one most often left undocumented. The chunks this stage produces are exactly what retrieval returns later.

Make the chunking strategy explicit

Record the chunk size and overlap, and the reason for them.
Chunk on semantic boundaries — paragraphs, sections — not blind character counts.
Keep tables and lists intact within a single chunk wherever possible.

If your chunk size is 350 tokens because that scored best on the evaluation set, write that down. A documented "why" is the difference between a tunable system and a frozen one nobody dares touch.
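A sketch of what recording the strategy next to the code might look like: a config object that carries its own rationale, plus a chunker that splits on paragraph boundaries. Everything here is an assumption for illustration. `ChunkingConfig` and `chunk_paragraphs` are hypothetical names, overlap is measured in whole paragraphs, and whitespace-separated words stand in for tokens (a real system would count with its embedding model's tokenizer).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChunkingConfig:
    max_tokens: int
    overlap_paragraphs: int
    rationale: str  # the documented "why" travels with the numbers


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def chunk_paragraphs(text: str, config: ChunkingConfig) -> List[str]:
    """Group paragraphs into chunks without splitting inside a paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for para in paragraphs:
        n = count_tokens(para)
        if current and current_tokens + n > config.max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the trailing paragraphs forward as overlap.
            keep = config.overlap_paragraphs
            current = current[-keep:] if keep > 0 else []
            current_tokens = sum(count_tokens(p) for p in current)
        # A paragraph longer than max_tokens is kept whole, not split.
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


config = ChunkingConfig(
    max_tokens=350,
    overlap_paragraphs=1,
    rationale="350 scored best on the evaluation set",
)
```

Note that this simple version can exceed `max_tokens` when the carried overlap plus the next paragraph is large; a production chunker would need an explicit rule for that case.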

Stage 3: Embedding and indexing

Input: Chunks. Output: Vectors in the index. Owner: ML engineer.

This stage converts chunks to vectors and stores them. The repeatable concern here is versioning.

Pin the embedding model version. Changing models means re-embedding everything — that has to be a deliberate, documented event.
Store the model version alongside the vectors so you always know what produced them.
Define the re-embedding trigger: which document changes force a refresh.

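The hidden trap is mixing vectors from two embedding models in one index. Vectors from different models live in different vector spaces, so similarity scores between them are meaningless and retrieval quietly breaks. Storing the model version with every vector lets you detect the mix and refuse it.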
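The versioning rules can be enforced at write time instead of relying on convention. The following is a minimal in-memory sketch, not a real vector store: `VersionedIndex` and `StoredVector` are hypothetical names, the model version strings are made up, and using a content hash as the re-embedding trigger is one assumed policy among several.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class StoredVector:
    vector: List[float]
    model_version: str  # stored alongside the vector
    content_hash: str


@dataclass
class VersionedIndex:
    model_version: str  # the single pinned embedding model for this index
    records: Dict[str, StoredVector] = field(default_factory=dict)

    def add(self, chunk_id: str, text: str, vector: List[float],
            model_version: str) -> None:
        if model_version != self.model_version:
            # Refuse to mix vectors from different embedding models.
            raise ValueError(
                f"Index is pinned to {self.model_version!r}, "
                f"got a vector from {model_version!r}"
            )
        self.records[chunk_id] = StoredVector(
            vector, model_version, content_hash(text)
        )

    def needs_reembedding(self, chunk_id: str, current_text: str) -> bool:
        """Assumed trigger: re-embed when the chunk is new or its text changed."""
        stored = self.records.get(chunk_id)
        return stored is None or stored.content_hash != content_hash(current_text)


index = VersionedIndex(model_version="example-embed-v1")
index.add("chunk-1", "Refund policy text", [0.1, 0.2, 0.3], "example-embed-v1")
```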

Stage 4: Retrieval

Input: User query. Output: Ranked relevant chunks. Owner: Backend engineer.

This is the runtime core. The workflow needs the retrieval logic documented as a configuration, not buried in code.

Specify the search method — vector, keyword, or hybrid — and why.
Specify top-k and whether a reranker runs.
Specify the metadata filters applied, especially access-control filters.

Retrieval Augmented Generation: Best Practices That Actually Work covers how to tune these. The workflow point is that retrieval settings should be visible and changeable without an archaeology dig through the codebase.
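Retrieval settings as configuration might look like the sketch below: a validated object loaded from a JSON document that lives in version control. The field names, the example values (top-k of 8, the filter key), and the `load_retrieval_config` helper are all hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Dict

ALLOWED_METHODS = {"vector", "keyword", "hybrid"}


@dataclass(frozen=True)
class RetrievalConfig:
    search_method: str
    search_method_rationale: str
    top_k: int
    use_reranker: bool
    metadata_filters: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.search_method not in ALLOWED_METHODS:
            raise ValueError(f"Unknown search method: {self.search_method!r}")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")
        if not self.search_method_rationale.strip():
            # The "why" is mandatory, not optional documentation.
            raise ValueError("search_method_rationale must not be empty")


def load_retrieval_config(raw_json: str) -> RetrievalConfig:
    """Build a validated config from a JSON document."""
    return RetrievalConfig(**json.loads(raw_json))


EXAMPLE_JSON = """
{
  "search_method": "hybrid",
  "search_method_rationale": "Keyword search catches exact product codes",
  "top_k": 8,
  "use_reranker": true,
  "metadata_filters": {"access_level": "internal"}
}
"""

retrieval_config = load_retrieval_config(EXAMPLE_JSON)
```

Rejecting an empty rationale at load time means a settings change cannot ship without its reason attached.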

Stage 5: Generation

Input: Query plus retrieved chunks. Output: Grounded answer. Owner: Backend engineer.

The retrieved chunks and the query get assembled into a prompt and sent to the model.

Keep the prompt template in version control, not hardcoded inline.
Instruct the model to answer only from the provided context and to cite sources.
Define the fallback when no relevant context exists: refuse, don't guess.

The most relevant passages in your context should usually go closest to the question, because many models are sensitive to where content sits in the prompt. Document that ordering choice so it survives the next refactor.

One more workflow concern lives here: assembling provenance. Each retrieved chunk should carry its source metadata into the prompt so the answer can cite where it came from. If the generation stage receives bare text with no source attached, you can't produce citations, and citations are what let users trust and verify the output. Wire provenance through from ingestion to generation as a first-class requirement, not an afterthought.
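The generation rules above (grounded answers, a refusal fallback, position-aware ordering, and provenance) can be sketched in one prompt-assembly function. The names `RetrievedChunk` and `build_prompt`, the 0.5 relevance cutoff, and the template wording are assumptions; the inline `PROMPT_TEMPLATE` constant stands in for a template file kept in version control.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RetrievedChunk:
    text: str
    source: str   # provenance carried through from ingestion
    score: float  # relevance score from retrieval, higher is better


# Stand-in for a versioned template file.
PROMPT_TEMPLATE = (
    "Answer using only the context below and cite sources by their [n] label.\n"
    "If the context does not contain the answer, say you do not know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)


def build_prompt(question: str, chunks: List[RetrievedChunk],
                 min_score: float = 0.5) -> Optional[str]:
    """Return a grounded prompt, or None when no relevant context exists."""
    relevant = [c for c in chunks if c.score >= min_score]
    if not relevant:
        return None  # the caller should refuse instead of guessing
    # Ascending order puts the most relevant chunk closest to the question.
    ordered = sorted(relevant, key=lambda c: c.score)
    context = "\n".join(
        f"[{i}] (source: {c.source}) {c.text}"
        for i, c in enumerate(ordered, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)


example_chunks = [
    RetrievedChunk("Refunds take 5 days.", "policies/refunds.md", 0.9),
    RetrievedChunk("Contact support by email.", "help/contact.md", 0.6),
    RetrievedChunk("Office holiday schedule.", "hr/holidays.md", 0.1),
]
prompt = build_prompt("How long do refunds take?", example_chunks)
```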

Stage 6: Evaluation and monitoring

Input: Live answers and the evaluation set. Output: Quality metrics and alerts. Owner: Quality owner.

A workflow without a feedback loop isn't repeatable — it's just a pipeline that decays on schedule.

Run the evaluation set on every change to any earlier stage.
Monitor retrieval recall and answer faithfulness on a sample of production traffic.
Alert when a metric drops below its threshold and route the alert to the owner of the responsible stage.

A Step-by-Step Approach to Retrieval Augmented Generation covers building the evaluation set itself. The workflow contribution is that evaluation is a recurring stage, not a launch checkbox.
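Threshold alerts with routing can be expressed as a small table plus one check function. In this sketch the thresholds (0.80 and 0.90), the metric-to-stage mapping, and the names `METRIC_RULES` and `check_metrics` are illustrative assumptions; the owners follow the stage owners listed in this article.

```python
from typing import Dict, List

STAGE_OWNERS = {
    "retrieval": "Backend engineer",
    "generation": "Backend engineer",
}

# Assumed mapping from each monitored metric to a threshold and a stage.
METRIC_RULES = {
    "retrieval_recall": {"threshold": 0.80, "stage": "retrieval"},
    "answer_faithfulness": {"threshold": 0.90, "stage": "generation"},
}


def check_metrics(metrics: Dict[str, float]) -> List[str]:
    """Return one alert message per missing or below-threshold metric."""
    alerts: List[str] = []
    for name, rule in METRIC_RULES.items():
        owner = STAGE_OWNERS[rule["stage"]]
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name}: no value reported; notify {owner}")
        elif value < rule["threshold"]:
            alerts.append(
                f"{name}={value:.2f} is below {rule['threshold']:.2f}; "
                f"route to {owner} ({rule['stage']} stage)"
            )
    return alerts


alerts = check_metrics({"retrieval_recall": 0.72, "answer_faithfulness": 0.95})
```

Treating a missing metric as an alert matters: a monitoring job that silently stops reporting looks identical to a healthy system otherwise.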

Stage 7: Handoff documentation

Input: The whole workflow. Output: A runbook. Owner: Whoever built it.

The final stage is what makes everything above repeatable: a runbook that captures every decision and how to operate the system.

A diagram of all stages with inputs, outputs, and owners.
The "why" behind every tuning decision — chunk size, top-k, embedding model.
A troubleshooting guide mapping symptoms to the stage that owns them.

Without this stage, you have a working system and a single point of failure who happens to be a person. The examples of RAG in practice show how teams structure these runbooks.
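A troubleshooting guide can start as structured data that is easy to search and review. The entries below are examples assembled from failure modes mentioned in this article, not a complete table; `RunbookEntry`, `TROUBLESHOOTING`, and `find_entries` are hypothetical names.

```python
from typing import List, NamedTuple


class RunbookEntry(NamedTuple):
    symptom: str
    stage: str
    owner: str
    first_check: str


TROUBLESHOOTING: List[RunbookEntry] = [
    RunbookEntry(
        symptom="Answers quote tables as disconnected numbers",
        stage="Ingestion",
        owner="Data engineer",
        first_check="Spot-check parsed output for the affected source type",
    ),
    RunbookEntry(
        symptom="Retrieval quality dropped after an embedding model change",
        stage="Embedding and indexing",
        owner="ML engineer",
        first_check="Confirm every vector in the index has the same model version",
    ),
    RunbookEntry(
        symptom="Answers have no citations",
        stage="Generation",
        owner="Backend engineer",
        first_check="Verify source metadata reaches the prompt",
    ),
]


def find_entries(keyword: str) -> List[RunbookEntry]:
    """Case-insensitive keyword search over symptom descriptions."""
    needle = keyword.lower()
    return [e for e in TROUBLESHOOTING if needle in e.symptom.lower()]


matches = find_entries("citations")
```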

Frequently Asked Questions

Why document the workflow if the code already works?

Code shows what happens, not why. The why (why this chunk size, why a reranker, why this top-k) lives in someone's head until you write it down. When quality drops six months later, the documented reasoning is what lets anyone on the team diagnose it, not just one specific engineer.

What's the minimum viable documentation?

A stage diagram with owners, the tuning rationale for chunk size and retrieval settings, and a symptom-to-stage troubleshooting table. That's enough for a new engineer to operate and debug the system. You can grow it from there, but those three pieces prevent the single-point-of-failure problem.

How do I keep the workflow from drifting from reality?

Treat the runbook as code: update it in the same pull request that changes behavior. If someone changes the chunk size without updating the rationale, the documentation is already stale. Reviewing doc updates together with behavior changes is what keeps the documentation honest.
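One way to enforce this is a check in continuous integration that fails when behavior-defining files change without the runbook. The sketch assumes a hypothetical repository layout (`config/`, `prompts/`, `docs/runbook.md`) and that the CI system supplies the list of changed file paths.

```python
from typing import Iterable

# Hypothetical layout: adjust these paths to the actual repository.
WATCHED_PREFIXES = ("config/", "prompts/")
RUNBOOK_PATH = "docs/runbook.md"


def runbook_is_stale(changed_files: Iterable[str]) -> bool:
    """True when behavior-defining files changed but the runbook did not."""
    files = list(changed_files)
    behavior_changed = any(f.startswith(WATCHED_PREFIXES) for f in files)
    return behavior_changed and RUNBOOK_PATH not in files


# A pull request that changes chunking settings without touching the runbook.
stale = runbook_is_stale(["config/chunking.json", "src/chunker.py"])
```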

Who owns the workflow when it crosses teams?

Each stage has an owner, but one person — usually the quality owner — should own the workflow as a whole and the evaluation loop that ties stages together. Distributed stage ownership with no overall owner leads to gaps at the seams between stages, which is exactly where bugs hide.

Can this workflow scale to multiple RAG applications?

Yes, and that's the payoff. Once ingestion, chunking, embedding, and retrieval are documented as reusable stages, a second application reuses most of the pipeline and only swaps the corpus and prompt. The undocumented version forces you to rebuild from scratch every time.

Key Takeaways

A repeatable RAG workflow is a sequence of stages, each with defined inputs, outputs, and an owner.
Ingestion and chunking decisions are where most quality is won or lost — document the rules and the rationale.
Version the embedding model and never mix vectors from different models in one index.
Keep prompt templates and retrieval settings in version control, not hardcoded.
Evaluation and monitoring are a recurring stage, not a launch checkbox.
A handoff runbook turns a single-engineer system into a team-owned, repeatable product.

Agency Script Editorial
