Most teams arrive at numerical prompting through improvisation. Someone needs a model to total a column, hand-tunes a prompt until it works on their test case, and ships it. The prompt lives in one person's head, nobody can reproduce why it works, and when it breaks in production there is no defined process for diagnosing it. A workflow fixes this by turning a private craft into a documented procedure that survives the person who built it.
The aim of this article is to give you a concrete workflow with named stages, the artifacts each stage produces, and the gates that prevent a broken prompt from advancing. A good workflow is boring on purpose: it makes the right behavior the path of least resistance and makes skipping verification feel wrong.
This is distinct from a playbook, which collects plays for particular situations. A workflow is the standing process that any team member can pick up and run for any numerical task, producing the same artifacts and meeting the same bar every time, regardless of who is at the keyboard.
Stage One: Specify the Numerical Task
Before any prompt is written, define what correct looks like.
The artifact: a task specification
A short document that states the inputs, the expected output type, the units and currency, the rounding rules, and the tolerance for error. Without this, "correct" is whatever the demo happened to produce.
What goes in it
- Input format and range, including messy or edge-case inputs.
- Output shape: a single number, a table, a structured object.
- Units, currency, and rounding rules stated explicitly.
- The stakes, which determine how heavy the rest of the workflow needs to be.
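One way to keep the specification from drifting back into prose is to record it as a small data structure. The sketch below is illustrative only: `TaskSpec`, its field names, and the invoice example are assumptions made for this example, not part of any library or of a prescribed format.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical task specification; every field name is illustrative."""

    name: str
    input_description: str  # format and range, including messy inputs
    output_shape: str       # "single number", "table", or "structured object"
    unit: str               # unit or currency code, stated explicitly
    decimal_places: int     # precision of the final answer
    rounding: str           # a rounding mode from the decimal module
    tolerance: Decimal      # largest acceptable absolute error
    stakes: str             # "low" or "high"; sizes the rest of the workflow

    def round_result(self, value: Decimal) -> Decimal:
        """Apply the specification's rounding rule to a computed value."""
        quantum = Decimal(1).scaleb(-self.decimal_places)
        return value.quantize(quantum, rounding=self.rounding)

    def within_tolerance(self, actual: Decimal, expected: Decimal) -> bool:
        """True when the absolute error does not exceed the tolerance."""
        return abs(actual - expected) <= self.tolerance


# Invented example: totaling invoice line items in US dollars.
invoice_spec = TaskSpec(
    name="invoice_total",
    input_description="1 to 500 line amounts, possibly with blank rows",
    output_shape="single number",
    unit="USD",
    decimal_places=2,
    rounding=ROUND_HALF_UP,
    tolerance=Decimal("0.00"),
    stakes="high",
)
```

Writing the rounding rule and tolerance as fields means later stages can call `round_result` and `within_tolerance` instead of re-deciding what "correct" means.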
Stage Two: Draft the Computation Strategy
Decide where the actual arithmetic will happen.
The default: model plans, code computes
For anything above trivial, the workflow defaults to the model emitting a calculation that deterministic code executes. The draft prompt should ask for an expression or code, not a final number.
Decisions to record
- Whether computation is inline, tool-assisted, or fully code-based.
- Which intermediate values must be labeled and checked.
- How units are carried through each step.
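To illustrate "model plans, code computes", the sketch below evaluates an arithmetic expression of the kind a model might emit, using Python's `ast` and `decimal` modules. The function name `evaluate_expression` and the decision to allow only `+`, `-`, `*`, and `/` are assumptions made for this example; a real task may need a different whitelist.

```python
import ast
import operator
from decimal import Decimal, InvalidOperation

# Only these arithmetic operators are accepted in a model-emitted expression.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_expression(expression: str) -> Decimal:
    """Evaluate plain arithmetic with Decimal and reject everything else."""
    source = expression.strip()

    def visit(node: ast.AST) -> Decimal:
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            # Read the literal as written so 0.10 never passes through a float.
            literal = ast.get_source_segment(source, node)
            try:
                return Decimal(literal)
            except (InvalidOperation, TypeError) as exc:
                raise ValueError(f"unsupported literal: {literal!r}") from exc
        raise ValueError(f"unsupported element: {type(node).__name__}")

    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError as exc:
        raise ValueError("not a valid expression") from exc
    return visit(tree.body)


print(evaluate_expression("(1200.50 + 349.99) * 1.08"))  # 1674.5292
```

Function calls, names, and attribute access all raise `ValueError`, so the model's output is treated as data to be checked rather than code to be trusted.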
Stage Three: Build the Prompt
Now write the prompt against the specification.
Structure the prompt for verifiability
- Ask for assumptions to be stated before any calculation.
- Request labeled intermediate values for multi-step work, following the approach in "Breaking Hard Tasks Into Prompts a Model Can Handle".
- Where method selection matters, invite reasoning, drawing on "Why Think Step by Step Quietly Changes What Models Can Do".
Keep presentation out
The prompt at this stage produces numbers and expressions, not formatted reports. Formatting is a later, separate concern so that polish never precedes verification.
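A prompt built along these lines might be assembled as follows. This is a minimal sketch: the function name `build_numeric_prompt`, the section labels, and the exact wording are assumptions chosen for illustration, and they would need tuning against a real evaluation set.

```python
def build_numeric_prompt(task: str, data: str, unit: str) -> str:
    """Assemble a prompt that asks for checkable pieces, not a formatted report."""
    return "\n".join([
        f"Task: {task}",
        f"All amounts are in {unit}.",
        "",
        "Data:",
        data,
        "",
        "Respond in exactly this order:",
        "1. ASSUMPTIONS: list every assumption before calculating.",
        "2. STEPS: one line per intermediate value, as `label = expression`.",
        "3. EXPRESSION: a single arithmetic expression for the final answer.",
        "Do not compute the final number and do not format a report.",
    ])


prompt = build_numeric_prompt(
    task="Total the invoice line items.",
    data="12.50\n7.25\n30.00",
    unit="USD",
)
print(prompt)
```

Keeping the requested output to assumptions, labeled steps, and one expression leaves every piece open to inspection before any formatting is applied.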
Stage Four: Assemble an Evaluation Set
A prompt is not done until it has tests.
The artifact: a labeled test set
A collection of representative inputs paired with known-correct outputs, including edge cases and inputs that previously broke things.
How to build it well
- Cover the full input range, not just the happy path.
- Include adversarial cases: large numbers, unusual units, mixed currencies.
- Record the correct answer and how it was derived, so failures are diagnosable.
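A labeled test set can be as simple as a list of records plus a coverage check. In the sketch below, `EvalCase`, the category names, and the three sample cases are all hypothetical; the point is only that each case carries its expected answer and its derivation.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Set


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical evaluation case; field names are illustrative."""

    case_id: str
    raw_input: str
    expected: Decimal
    derivation: str  # how the expected answer was obtained
    category: str    # "happy_path", "edge", "adversarial", or "regression"


# Assumed minimum coverage for this example.
REQUIRED_CATEGORIES = {"happy_path", "edge", "adversarial"}


def missing_categories(cases: List[EvalCase]) -> Set[str]:
    """Return the required categories that no case covers yet."""
    return REQUIRED_CATEGORIES - {case.category for case in cases}


eval_set = [
    EvalCase("sum-001", "12.50, 7.25, 30.00", Decimal("49.75"),
             "added by hand, checked in a spreadsheet", "happy_path"),
    EvalCase("sum-002", "", Decimal("0.00"),
             "empty input totals zero by definition in the spec", "edge"),
    EvalCase("sum-003", "999999999999.99, 0.01", Decimal("1000000000000.00"),
             "added by hand using integer cents", "adversarial"),
]

print(missing_categories(eval_set))  # set()
```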
Stage Five: Run the Verification Gate
This is the gate that prevents a broken prompt from advancing.
What the gate checks
- Accuracy across the evaluation set, against the tolerance from the specification.
- Consistency, by sampling each input several times and checking agreement.
- Independent re-derivation of a sample of answers using a different method.
The rule
If the prompt fails the gate, it returns to Stage Three. Nothing ships on improvisation alone; whatever ships has cleared the gate.
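The three checks can be sketched as a single function. Everything here is an assumption for illustration: `run_gate` and `GateReport` are invented names, `solve` stands in for whatever wraps the prompt and its computation, and `rederive` stands in for an independent method. For brevity the sketch re-derives every case, where a real gate might re-derive only a sample.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Callable, Dict, List

Solver = Callable[[str], Decimal]


@dataclass
class GateReport:
    accuracy_failures: List[str] = field(default_factory=list)
    consistency_failures: List[str] = field(default_factory=list)
    rederivation_failures: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not (self.accuracy_failures
                    or self.consistency_failures
                    or self.rederivation_failures)


def run_gate(cases: Dict[str, Decimal], solve: Solver, rederive: Solver,
             tolerance: Decimal, samples: int = 3) -> GateReport:
    """Check accuracy, consistency, and independent re-derivation per case."""
    report = GateReport()
    for raw_input, expected in cases.items():
        answers = [solve(raw_input) for _ in range(samples)]
        # Accuracy: every sampled answer must be within tolerance.
        if any(abs(answer - expected) > tolerance for answer in answers):
            report.accuracy_failures.append(raw_input)
        # Consistency: repeated samples must agree with each other.
        if max(answers) - min(answers) > tolerance:
            report.consistency_failures.append(raw_input)
        # Re-derivation: a different method must reach the same answer.
        if abs(rederive(raw_input) - answers[0]) > tolerance:
            report.rederivation_failures.append(raw_input)
    return report


def solve_by_sum(text: str) -> Decimal:
    """Stand-in for the prompt pipeline: sums comma-separated amounts."""
    return sum((Decimal(part) for part in text.split(",")), Decimal(0))


def rederive_in_cents(text: str) -> Decimal:
    """Independent method: add integer cents, then convert back."""
    cents = sum(int(Decimal(part) * 100) for part in text.split(","))
    return Decimal(cents) / 100


cases = {"10.00,20.00,5.50": Decimal("35.50"), "0.10,0.20": Decimal("0.30")}
report = run_gate(cases, solve_by_sum, rederive_in_cents, Decimal("0.00"))
print(report.passed)  # True
```

A prompt whose report has `passed` equal to `False` goes back for rework, and the three failure lists say which check it missed.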
Stage Six: Hand Off and Document
The final stage makes the workflow repeatable by someone else.
The artifact: a runbook entry
A short record linking the specification, the prompt, the evaluation set, and the verification results, so the next person can understand and modify the prompt safely.
What the handoff includes
- The specification and the rationale for the computation strategy.
- The prompt and its known limitations.
- The evaluation set and the most recent gate results.
- A note on how to re-run the gate after any change.
Stage Seven: Monitor and Feed Back
A workflow does not end at deployment.
Close the loop
- Capture production errors and add them to the evaluation set.
- Re-run the gate on a schedule, since model updates can shift behavior.
- Treat every new failure as a permanent test case so the same mistake is never paid for twice.
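Closing the loop can be as small as one function that turns a production failure into a permanent case. This is a sketch under assumptions: the name `add_regression_case`, the dictionary keys, and the sample failure are invented for illustration, and a real team would likely persist the set to a file or repository.

```python
from typing import Dict, List


def add_regression_case(eval_set: List[Dict[str, str]], raw_input: str,
                        correct_output: str, derivation: str) -> bool:
    """Append a production failure as a permanent case.

    Returns False when the same input is already in the set.
    """
    if any(case["raw_input"] == raw_input for case in eval_set):
        return False
    eval_set.append({
        "raw_input": raw_input,
        "expected": correct_output,
        "derivation": derivation,
        "category": "regression",
    })
    return True


eval_set: List[Dict[str, str]] = []
added = add_regression_case(
    eval_set,
    raw_input="1,200.50; 349.99",
    correct_output="1550.49",
    derivation="recomputed by hand after a thousands separator was misread",
)
print(added, len(eval_set))  # True 1
```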
Stage Eight: Assign Ownership and Cadence
A workflow without named owners is a wish, not a process.
Who owns what
Each stage needs a responsible role: someone owns the specification, someone owns prompt design, someone owns the evaluation set and gate. When ownership is implicit, stages get skipped under deadline pressure, and the skipped stage is almost always verification.
The cadence to set
- Re-run the verification gate on a fixed schedule, not only on changes, because model behavior can drift.
- Review the evaluation set periodically to retire stale cases and add new ones.
- Hold a short recurring review of production numerical errors so the loop stays closed.
Common Pitfalls in Adoption
Knowing where teams stall makes the workflow stick.
Where it tends to break down
- Treating the specification as optional, which leaves "correct" undefined and verification toothless.
- Building the evaluation set only from happy-path inputs, so the gate passes prompts that fail on edge cases.
- Letting presentation creep into earlier stages, so formatting masks unverified numbers.
Keeping it healthy
- Make the specification a required artifact, not a nicety, before any prompt is written.
- Seed the evaluation set with adversarial inputs from the start.
- Hold the line that nothing ships without clearing the gate, regardless of deadline.
Frequently Asked Questions
Is this workflow overkill for a simple totaling task?
The specification stage tells you. A low-stakes single-step total can take an abbreviated path, while a high-stakes multi-step calculation justifies the full process. The workflow scales down as readily as it scales up.
What makes this repeatable rather than just thorough?
The artifacts. Because the stages together leave behind a specification, a prompt, an evaluation set, and a runbook entry, a different person can pick up the work and reproduce it without reverse-engineering anyone's intuition.
Where does verification actually live?
In the gate at Stage Five, which checks accuracy, consistency, and independent re-derivation. Crucially, the re-derivation uses a different method than the original so the check is genuinely independent.
How do I stop the evaluation set from going stale?
Feed production failures back into it continuously and re-run the gate on a schedule. The set should encode every real mistake, growing into a durable record of your edge cases.
Can this workflow be automated?
The gate and the monitoring stages automate well, since they reduce to scripted comparisons against known answers. Specification and prompt design stay human, because they encode judgment about stakes and correctness.
What is the smallest version worth adopting?
Specification, a code-based computation strategy, and a verification gate. Even a minimal evaluation set with a handful of labeled inputs catches the most common silent errors.
Who should own the workflow if the team is small?
Ownership is about clarity, not headcount. One person can hold several stages, but write down who is accountable for the specification, the prompt, and the gate. The stage that gets quietly skipped under pressure is almost always verification, so guard that ownership most carefully.
How do I prevent the workflow from becoming bureaucratic?
Scale it to stakes. The specification stage explicitly sizes the process to the request, so a low-stakes total takes a short path while a high-stakes calculation justifies the full sequence. The artifacts exist to make handoff possible, not to generate paperwork for its own sake.
Key Takeaways
- A workflow turns private prompt craft into a documented, hand-off-able process.
- Specify units, rounding, and tolerance before writing a single prompt.
- Default to the model planning and code computing for anything above trivial.
- No prompt ships without clearing a verification gate for accuracy, consistency, and independent re-derivation.
- Each stage produces an artifact so a different person can reproduce and extend the work.
- Feed production failures back into the evaluation set so mistakes are paid for only once.