
Turn Context Work Into a Process Anyone Can Run


Agency Script Editorial

Editorial Team

October 6, 2023 · 8 min read
Tags: context engineering, context engineering workflow, context engineering guide, prompt engineering

There is a moment in most AI projects when the person who understands the context pipeline takes a vacation, and everything quietly stops improving. The retrieval breaks in a way nobody else can diagnose. A new requirement arrives and sits untouched because only one person knows how the pieces fit. The work was never a workflow; it was a skill living in a single head.

A repeatable workflow fixes this. The aim is not to slow anyone down with bureaucracy. It is to make context engineering legible, so a competent teammate can pick up a task, follow defined stages, and produce consistent results without reverse-engineering someone else's intuition. A good workflow turns a craft into a process you can staff, audit, and improve.

This article lays out that workflow stage by stage, with the artifacts each stage produces. By the end you should be able to map your own process onto it and find the gaps where work currently depends on one person's memory.

Stage One: Define the Job

Every context task starts with a clear statement of what good output looks like. Skipping this stage is the root cause of most thrashing later, because without a target you cannot tell whether a change helped.

Write the task contract

Capture three things in plain language:

  • The question or request the system must handle.
  • What a correct, complete answer contains.
  • What sources are allowed to inform it.

This contract becomes the reference everyone returns to when they disagree about whether output is acceptable. It is short, but writing it forces clarity that vague ambition hides. The article "A Framework for Context Engineering" shows how the contract anchors the broader structure.
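One way to keep the contract from living only in a chat thread is to store it as a small structured record. The sketch below is illustrative: the class name, field names, and the sample values are assumptions, not part of any established format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskContract:
    """Plain-language contract for one context task (hypothetical shape)."""
    request: str                          # the question or request the system must handle
    answer_must_contain: tuple[str, ...]  # what a correct, complete answer contains
    allowed_sources: tuple[str, ...]      # sources allowed to inform the answer

    def disallowed(self, used_sources: list[str]) -> list[str]:
        """Return any sources that were used but are not allowed by the contract."""
        return [s for s in used_sources if s not in self.allowed_sources]


# Made-up example values, for illustration only.
contract = TaskContract(
    request="Answer customer questions about the refund policy.",
    answer_must_contain=("refund window", "how to file a claim"),
    allowed_sources=("policy_handbook", "support_faq"),
)
```

A check such as `contract.disallowed(["support_faq", "old_wiki"])` gives reviewers something concrete to point at when they disagree about whether an output stayed within bounds.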

Collect real examples

Gather actual requests from logs or stakeholders, not invented ones. Real examples expose the messy phrasings and edge cases that synthetic samples miss. These examples seed your evaluation set later, so collecting them now pays off twice.

Stage Two: Build the Evaluation Set

Before changing anything in the pipeline, build the instrument that tells you whether changes work. Teams that skip this stage end up arguing about output quality from memory, which is unreliable and slow.

Pair questions with ground truth

For each example, record the correct answer or the source passage that should inform it. This lets you check two things independently: did the right information reach the context, and did the model use it correctly. Separating those signals is what makes debugging fast.

Keep it versioned

Store the evaluation set in version control alongside the pipeline. When the set changes, the change is visible and reviewable. An evaluation set that drifts silently is as dangerous as no set at all. Our guide "A Step-by-Step Approach to Context Engineering" covers building these sets in practice.
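As a rough sketch of what a versionable evaluation set might look like, the example below stores one case per line as JSON, sorted by ID, so that a diff in version control shows exactly which cases changed. The `EvalCase` fields, the helper names, and the sample cases are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalCase:
    case_id: str
    question: str
    expected_answer: str      # ground-truth answer for the question
    expected_source_id: str   # passage that should inform the answer


def dump_cases(cases: list[EvalCase]) -> str:
    """Serialize cases as JSON Lines, sorted by ID so diffs stay stable."""
    ordered = sorted(cases, key=lambda c: c.case_id)
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in ordered) + "\n"


def load_cases(text: str) -> list[EvalCase]:
    """Parse JSON Lines text back into EvalCase records."""
    return [EvalCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Made-up cases, for illustration only.
cases = [
    EvalCase("q2", "How do I file a claim?", "Use the claims form.", "faq-12"),
    EvalCase("q1", "What is the refund window?", "30 days.", "policy-3"),
]
```

Because each case carries both the expected answer and the expected source, the same file can later serve retrieval checks and answer checks.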

Stage Three: Assemble the Context

This is the stage most people think of as the whole job, but it only works when the prior stages are in place. Assembly is where you decide what the model actually sees.

Establish the assembly order

Define a fixed structure for the payload so it is predictable and debuggable:

  1. System instructions and role definition.
  2. Tool definitions, if any.
  3. Retrieved content, most relevant first.
  4. Conversation history, compressed as needed.
  5. The current user request.

A consistent order means that when something breaks, you know where to look. Random assembly makes every bug a fresh mystery.
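The fixed order above can be sketched as a small assembly function. This is a minimal illustration, not a prescribed implementation: the section names, the bracketed headers, and the `(score, text)` shape for retrieved passages are assumptions.

```python
# Fixed assembly order, mirroring the numbered list above.
SECTION_ORDER = ("system", "tools", "retrieved", "history", "user_request")


def order_retrieved(passages: list[tuple[float, str]]) -> str:
    """Join retrieved passages with the highest relevance score first."""
    ranked = sorted(passages, key=lambda p: p[0], reverse=True)
    return "\n".join(text for _, text in ranked)


def assemble_context(sections: dict[str, str]) -> str:
    """Build the payload in the fixed order, skipping empty sections."""
    unknown = set(sections) - set(SECTION_ORDER)
    if unknown:
        raise ValueError(f"Unknown sections: {sorted(unknown)}")
    parts = []
    for name in SECTION_ORDER:
        text = sections.get(name, "").strip()
        if text:
            parts.append(f"[{name}]\n{text}")
    return "\n\n".join(parts)
```

Rejecting unknown section names is a deliberate choice in this sketch: anything that reaches the model has to pass through a named slot, which keeps the payload debuggable.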

Set the budget per section

Assign each section a token allowance so no single part can crowd out the rest. The retrieved content allowance is usually the one to guard most carefully, since it tends to balloon. The guide "Context Engineering: Best Practices That Actually Work" details budgeting tradeoffs.
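A per-section budget might be enforced along these lines. The budget numbers are placeholders, and whitespace-separated words stand in for tokens here; a real pipeline would presumably count with the tokenizer of the model it targets.

```python
# Hypothetical allowances per section; tune these for your own model and task.
BUDGETS = {"system": 300, "tools": 200, "retrieved": 1500, "history": 600, "user_request": 400}


def count_tokens(text: str) -> int:
    """Crude proxy: count whitespace-separated words instead of real tokens."""
    return len(text.split())


def trim_to_budget(text: str, budget: int) -> str:
    """Keep the first `budget` words and drop the rest."""
    words = text.split()
    if len(words) <= budget:
        return text
    return " ".join(words[:budget])


def apply_budgets(sections: dict[str, str], budgets: dict[str, int]) -> dict[str, str]:
    """Trim every section to its own allowance so none can crowd out the others."""
    return {name: trim_to_budget(text, budgets[name]) for name, text in sections.items()}
```

Trimming from the end suits retrieved content that is already ordered most relevant first, since the least relevant material is dropped. Conversation history may call for the opposite choice (keeping the most recent turns), which this sketch does not attempt.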

Stage Four: Test Against the Set

With the pipeline assembled, run it against the evaluation set and read the results as data, not anecdotes.

Measure two layers separately

  • Retrieval quality: did the needed passage appear, and where?
  • Answer quality: given the context, was the output correct?

A drop in answer quality with healthy retrieval points to assembly or instruction problems. A drop in retrieval quality points upstream. This separation tells you which stage to revisit instead of guessing.
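The two-layer reading can be sketched as two independent checks plus a small diagnosis step. The function names are hypothetical, and the substring match used for answer correctness is a deliberately simple stand-in for whatever grading method a team actually uses.

```python
from typing import Optional


def retrieval_rank(retrieved_ids: list[str], expected_id: str) -> Optional[int]:
    """Return the 1-based position of the needed passage, or None if it is missing."""
    try:
        return retrieved_ids.index(expected_id) + 1
    except ValueError:
        return None


def answer_correct(answer: str, expected: str) -> bool:
    """Simplistic check: the expected answer appears in the output, ignoring case."""
    return expected.strip().lower() in answer.strip().lower()


def diagnose(rank: Optional[int], correct: bool) -> str:
    """Map the two signals to the stage that most likely needs attention."""
    if rank is None:
        return "retrieval: needed passage missing, look upstream"
    if not correct:
        return "assembly or instructions: passage present but answer wrong"
    return "ok"
```

For example, `diagnose(retrieval_rank(["a", "b"], "b"), False)` points at assembly or instructions, because the passage was retrieved but the answer was still wrong.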

Record the baseline

Save the scores. Every future change is judged against this baseline, so a missing baseline means you can never prove progress. This record also becomes the evidence you show stakeholders that the system is improving.
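A baseline can be as simple as a JSON file of metric names and scores. The sketch below assumes scores are floats keyed by metric name; the file layout and function names are illustrative.

```python
import json
from pathlib import Path


def save_baseline(path: str, scores: dict[str, float]) -> None:
    """Write the baseline scores to a JSON file."""
    Path(path).write_text(json.dumps(scores, indent=2, sort_keys=True), encoding="utf-8")


def compare_to_baseline(path: str, scores: dict[str, float]) -> dict[str, float]:
    """Return current minus baseline for each metric present in both."""
    baseline = json.loads(Path(path).read_text(encoding="utf-8"))
    return {k: round(scores[k] - baseline[k], 4) for k in baseline if k in scores}
```

A positive delta means the metric improved relative to the baseline, and a negative one means it regressed. Committing the baseline file next to the pipeline keeps the evidence reviewable.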

Stage Five: Iterate and Document

Now you improve, but in a controlled way that preserves the ability to learn from each change.

Change one thing at a time

Adjust a single variable, rerun the evaluation set, and compare to the baseline. Bundling changes destroys your ability to attribute cause. This discipline feels slow for one cycle and pays back across dozens.
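If the pipeline configuration is a flat dictionary, the one-variable rule can be checked mechanically before an evaluation run. This is a sketch under that assumption; the helper names and config keys are hypothetical.

```python
_MISSING = object()  # sentinel so an absent key differs from a key set to None


def changed_keys(old: dict, new: dict) -> list[str]:
    """List the configuration keys whose values differ between two versions."""
    keys = set(old) | set(new)
    return sorted(k for k in keys if old.get(k, _MISSING) != new.get(k, _MISSING))


def require_single_change(old: dict, new: dict) -> str:
    """Return the one changed key, or raise if zero or several keys changed."""
    diff = changed_keys(old, new)
    if len(diff) != 1:
        raise ValueError(f"Expected exactly one change, found {len(diff)}: {diff}")
    return diff[0]
```

Running `require_single_change` ahead of each evaluation run makes a bundled change fail loudly instead of quietly muddying the comparison with the baseline.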

Log every change

For each iteration, record what you changed, why, and the resulting scores. This log is the institutional memory that lets a new person understand how the system reached its current state without interviewing the original author.
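An append-only log file is one lightweight way to keep this record. The sketch below writes one JSON object per line with the three items named above; the field names and file format are assumptions.

```python
import json
from pathlib import Path


def log_change(path: str, changed: str, why: str, scores: dict[str, float]) -> dict:
    """Append one iteration record (what changed, why, resulting scores)."""
    entry = {"changed": changed, "why": why, "scores": scores}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


def read_log(path: str) -> list[dict]:
    """Load every iteration record in the order it was written."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]
```

Because entries are only ever appended, the file reads top to bottom as the history of how the system reached its current state.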

Stage Six: Hand It Off

The final stage is the one teams skip most, and it is the whole point of building a workflow. A process that only its author can run is not a workflow.

Write the runbook

Document how to run the evaluation set, where the pipeline configuration lives, and how to diagnose the common failure modes. A teammate should be able to follow it and resolve a routine issue without escalating.

Rehearse the handoff

Have someone other than the author run a full cycle while the author watches silently. Every question they ask reveals a gap in the documentation. Fill those gaps until the cycle runs cleanly. The discipline only sticks when the handoff is tested, not assumed.

Wiring the Stages Into a Loop

The six stages are not a one-time march. They form a loop that the system runs through repeatedly as requirements change and sources evolve. Treating them as a single pass is a common reason workflows decay after launch.

The maintenance cycle

Once a system is live, the loop tightens. New failure cases surface from real traffic, and those cases feed back into the evaluation set in stage two. A change in source documents triggers a fresh assembly review in stage three. Each loop leaves the evaluation set richer and the runbook more accurate, which is how the system improves rather than merely holding steady.

Knowing when to loop

Two triggers should start a new cycle: a measured drop in evaluation scores, and any change to the underlying sources or model. Waiting for user complaints means looping too late. The guide "A Step-by-Step Approach to Context Engineering" describes catching these triggers early.
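The two triggers can be expressed as a small check that runs after each scheduled evaluation. The `tolerance` value and the function signature are assumptions for illustration; what counts as a meaningful drop depends on how noisy your scores are.

```python
def cycle_triggers(
    baseline: dict[str, float],
    current: dict[str, float],
    sources_changed: bool,
    model_changed: bool,
    tolerance: float = 0.02,  # hypothetical noise threshold
) -> list[str]:
    """Return the reasons to start a new cycle; an empty list means no trigger fired."""
    reasons = []
    for metric, base in baseline.items():
        cur = current.get(metric)
        if cur is not None and base - cur > tolerance:
            reasons.append(f"{metric} dropped from {base} to {cur}")
    if sources_changed:
        reasons.append("sources changed")
    if model_changed:
        reasons.append("model changed")
    return reasons
```

Returning the reasons rather than a bare boolean gives whoever picks up the cycle a starting point for which stage to revisit.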

Keeping the loop affordable

Each pass should reuse the artifacts from the last one. The task contract rarely changes, the evaluation set grows incrementally, and the runbook gets edited rather than rewritten. Because the structure persists, a maintenance loop costs a fraction of the initial setup, which is what makes the discipline sustainable rather than a burden teams abandon.

Frequently Asked Questions

How long does it take to set up this workflow?

The first pass through all six stages for a single task typically takes a few days, most of it spent building the evaluation set. Subsequent tasks reuse the structure and move much faster. The upfront cost is real but one-time.

Can I skip the evaluation set if I am moving fast?

You can, but you will move fast in an unknown direction. Without measurement you cannot tell improvement from regression, and you will eventually spend more time chasing phantom problems than the evaluation set would have cost to build.

What if my context changes for every user?

The workflow still applies. Your evaluation set captures representative cases rather than every possible one. The assembly order and budgeting remain fixed even when the retrieved content varies per request.

How do I keep the workflow from becoming bureaucracy?

Keep the artifacts lightweight. A task contract can be a paragraph; a runbook can be a single page. The discipline is in consistency, not volume. If a document is not actively used during debugging, trim it.

Who should own the runbook?

The person who most recently ran a full cycle owns keeping it current. Rotating that responsibility ensures the documentation reflects how the process actually works rather than how it worked at launch.

Key Takeaways

  • A workflow turns context engineering from a one-person skill into a staffable process.
  • Start by defining what good output looks like before touching the pipeline.
  • Build a versioned evaluation set first so every later change can be measured.
  • Use a fixed assembly order and per-section token budgets for predictability.
  • Change one variable at a time and log every iteration to preserve cause and effect.
  • The handoff stage is the point; rehearse it until a teammate can run a full cycle alone.


The Agency Script editorial team delivers operational insights on AI delivery, certification, and governance for modern agency operators.
