Worked Scenarios Where Staged Reasoning Made the Call

Principles are easier to remember when you have seen them in action. This article walks through specific scenarios where staged reasoning prompts were used, showing the actual structure that worked and, just as usefully, the structure that did not. The goal is to make the abstract advice concrete enough to copy.

Each example below follows the same shape: the task, the prompt structure, what happened, and the lesson. The scenarios are drawn from common categories (pricing decisions, diagnostic reasoning, and multi-constraint planning) so that at least one should resemble work you actually do.

Pay attention to the small structural choices in each case. The difference between a prompt that works and one that fails is rarely dramatic. It is usually one missing input, one step in the wrong place, or one buried conclusion.

Example One: Choosing Between Pricing Tiers

A common task: decide which of several plans is cheapest given expected usage.

The prompt that worked

The successful prompt listed every number explicitly, then named four steps: compute the cost of each plan at current usage, compute it at projected usage, identify the break-even point, and recommend. The conclusion went under a "Recommendation" heading.

Why it succeeded

The numbers were all present, so the model never had to guess. The steps were ordered so the comparison came after the individual costs were known. And the labeled recommendation made the answer trivial to extract. This mirrors the construction in the step-by-step approach.
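As a rough illustration, a prompt with this shape could be assembled as follows. This is a minimal sketch, not the original prompt: the function name, the plan names, the base-fee-plus-per-unit pricing model, and every figure are assumptions made for the example.

```python
def build_pricing_prompt(plans, current_units, projected_units):
    """Assemble a staged pricing prompt with every number stated explicitly.

    `plans` maps a plan name to (monthly_base_fee, per_unit_fee). That pricing
    model is an assumption made for this sketch.
    """
    lines = ["Inputs:"]
    for name, (base, per_unit) in plans.items():
        lines.append(f"- {name}: base fee {base:.2f} per month, {per_unit:.2f} per unit")
    lines.append(f"- Current usage: {current_units} units per month")
    lines.append(f"- Projected usage: {projected_units} units per month")
    lines.append("")
    # The four named steps, ordered so the comparison follows the individual costs.
    lines.append("Steps (complete them in order):")
    lines.append("1. Compute the monthly cost of each plan at current usage.")
    lines.append("2. Compute the monthly cost of each plan at projected usage.")
    lines.append("3. Identify the break-even usage between the plans.")
    lines.append("4. Recommend one plan.")
    lines.append("")
    # A labeled conclusion makes the answer easy to extract.
    lines.append("Put your final answer under the heading: Recommendation")
    return "\n".join(lines)


# Hypothetical figures, for illustration only.
prompt = build_pricing_prompt(
    {"Starter": (20.0, 0.50), "Team": (100.0, 0.10)},
    current_units=150,
    projected_units=400,
)
print(prompt)
```

The point of building the prompt from data rather than typing it by hand is that a missing number becomes a missing argument, which fails loudly instead of silently.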

Example Two: The Same Task, Done Badly

The same pricing problem failed when structured poorly.

What went wrong

The failing version asked simply, "Which plan is cheaper for a growing team?" without the actual numbers and without named steps. The model produced confident, well-written reasoning built on usage figures it invented.

The lesson

The output looked authoritative and was wrong, because it reasoned correctly from false premises. Missing inputs do not produce obviously broken answers; they produce plausible ones, which is more dangerous. This is exactly the failure the common mistakes article warns about.
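One way to guard against this, sketched here under assumptions, is to refuse to send a prompt until every required input is present. The field names below are hypothetical; a real task would define its own list.

```python
# Hypothetical field names for the pricing task; adjust to the task at hand.
REQUIRED_INPUTS = ("plan_prices", "current_usage", "projected_usage")


def missing_inputs(inputs, required=REQUIRED_INPUTS):
    """Return the required fields that are absent or set to None."""
    return [name for name in required if inputs.get(name) is None]


# The failing request from this example supplied none of the figures.
vague_request = {"plan_prices": None, "current_usage": None}
gaps = missing_inputs(vague_request)
if gaps:
    print("Do not send yet; missing:", ", ".join(gaps))
```

A check like this cannot tell whether the supplied numbers are correct, only that they exist, but that alone removes the case where the model has to invent them.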

Example Three: Diagnosing a Slow Process

A diagnostic task: figure out why a workflow is taking too long.

The prompt that worked

The prompt supplied the steps in the workflow with their durations, then instructed: list each step's time, identify the steps above a threshold, check whether any depend on each other, and propose the single highest-impact fix.

Why it succeeded

By forcing the model to list every step's duration and flag the slow ones before proposing a fix, the prompt prevented it from latching onto the first plausible cause. The dependency check stopped it from recommending a fix that another step would undo. Ordered reasoning beat intuition.
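The same four steps can be written out as ordinary code, which makes the ordering easy to see. This is a sketch only: the workflow, its durations, the threshold, and the rule for choosing the fix (the slowest flagged step that is not waiting on another flagged step) are all assumptions made for illustration.

```python
def diagnose(durations, depends_on, threshold):
    """Follow the four steps in order: list, flag, check dependencies, pick one fix."""
    # Step 1: list each step's time, slowest first.
    ranked = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    # Step 2: keep only the steps above the threshold.
    slow = [name for name, minutes in ranked if minutes > threshold]
    # Step 3: for each slow step, note which other slow steps it waits on.
    blocked = {
        name: [dep for dep in depends_on.get(name, []) if dep in slow]
        for name in slow
    }
    # Step 4 (assumed rule): fix the slowest step that is not waiting on another slow step.
    candidates = [name for name in slow if not blocked[name]]
    fix = candidates[0] if candidates else None
    return ranked, slow, blocked, fix


# Hypothetical workflow, durations in minutes.
durations = {"intake": 5, "review": 45, "approval": 30, "publish": 10}
depends_on = {"approval": ["review"], "publish": ["approval"]}

ranked, slow, blocked, fix = diagnose(durations, depends_on, threshold=20)
print("Slow steps:", slow)
print("Proposed fix:", fix)
```

Because the fix is chosen only after the list and the dependency map exist, there is no way for the last step to skip ahead of the evidence.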

Example Four: Planning Under Multiple Constraints

A planning task: schedule work given a budget, a deadline, and a team size.

The prompt that worked

The prompt marked which constraints were hard ("must finish by the deadline") and which were soft ("prefer to stay under budget"). It then asked the model to list candidate plans, eliminate any violating a hard constraint, and rank survivors by how well they met the soft ones.

Why it succeeded

Distinguishing hard from soft constraints was the decisive move. Without it, the model traded away the deadline to save money, which was unacceptable. With it, the reasoning respected the non-negotiables first. The best practices guide explains why this distinction matters so much.
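The eliminate-then-rank logic the prompt asked for looks like this when spelled out. The candidate plans and figures are hypothetical, and scoring the soft constraint by budget overrun is one assumed choice among several reasonable ones.

```python
# Hypothetical candidate plans.
plans = [
    {"name": "A", "weeks": 10, "cost": 80_000},
    {"name": "B", "weeks": 14, "cost": 60_000},
    {"name": "C", "weeks": 12, "cost": 95_000},
]
DEADLINE_WEEKS = 12  # hard constraint: must finish by the deadline
BUDGET = 90_000      # soft constraint: prefer to stay under budget


def choose(plans, deadline_weeks, budget):
    """Drop plans that break the hard constraint, then rank by the soft one."""
    # Hard constraint first: a late plan is eliminated no matter how cheap it is.
    survivors = [p for p in plans if p["weeks"] <= deadline_weeks]
    # Soft constraint second: the smaller the budget overrun, the better.
    return sorted(survivors, key=lambda p: max(0, p["cost"] - budget))


ranked_plans = choose(plans, DEADLINE_WEEKS, BUDGET)
print([p["name"] for p in ranked_plans])  # ['A', 'C']
```

Plan B is the cheapest of the three and never reaches the ranking, which is exactly the behavior the hard label is meant to produce.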

Example Five: When Splitting Calls Paid Off

A multi-stage task that one prompt handled poorly.

The single-prompt version

A request to read a long document, extract the key facts, analyze them, and write a recommendation in one call produced reasoning that drifted, mixing extraction errors into the analysis with no way to tell where things went wrong.

The split version

Breaking it into three calls (extract, analyze, recommend) made each stage testable. When the recommendation was off, it was obvious whether the fault was bad extraction or bad analysis. Reliability rose and debugging time fell, illustrating the pipeline approach in the framework article.
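A minimal sketch of the three-call structure follows. The `call_model` parameter stands in for whatever client you actually use, and `fake_model` is a placeholder so the example runs on its own; neither is a real library API, and the prompt wording is illustrative.

```python
def run_pipeline(document, call_model):
    """Run extract, analyze, recommend as separate calls and keep every intermediate."""
    facts = call_model("Extract the key facts from this document:\n" + document)
    analysis = call_model("Analyze these facts:\n" + facts)
    recommendation = call_model(
        "Write a recommendation based on this analysis:\n" + analysis
    )
    # Returning all three outputs is what makes each stage inspectable.
    return {"facts": facts, "analysis": analysis, "recommendation": recommendation}


def fake_model(prompt):
    """Placeholder for a real model call; it only echoes the instruction line."""
    return "[output for: " + prompt.splitlines()[0] + "]"


stages = run_pipeline("Document text goes here.", fake_model)
for name, output in stages.items():
    print(name, "->", output)
```

Keeping the intermediate outputs, rather than only the final recommendation, is the design choice that pays off: when the result is wrong, you read the stages in order and stop at the first one that is.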

Example Six: A Self-Check That Caught a Real Error

A calculation task where self-review earned its tokens.

The setup

After producing a multi-step cost calculation, the prompt asked the model to re-add the figures and confirm the total matched, flagging any discrepancy.

What happened

On one run the model had transposed two digits mid-calculation. The targeted self-check caught the mismatch and corrected it. A vague "double-check your work" would likely have missed it; the specific instruction to re-add the figures is what made the check effective.
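The check the prompt asked the model to perform is the same one you can run outside the model when the figures are available. In this sketch the line items and the transposed total are invented for illustration, and `check_total` is a hypothetical helper name.

```python
from decimal import Decimal


def check_total(line_items, reported_total):
    """Re-add the figures and report whether they match the stated total."""
    recomputed = sum((Decimal(item) for item in line_items), Decimal("0"))
    return recomputed == Decimal(reported_total), recomputed


# Hypothetical figures: the reported total has two digits transposed (1740.75 vs 1704.75).
matches, recomputed = check_total(["1240.50", "389.00", "75.25"], "1740.75")
if not matches:
    print("Discrepancy: figures re-add to", recomputed)
```

`Decimal` is used instead of floats so that the comparison is exact for currency-style values.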

Example Seven: A Comparison Across Many Criteria

A research task: rank five vendors against eight weighted criteria.

The prompt that worked

The prompt listed all five vendors, all eight criteria with their weights, and the relevant data for each vendor. It then instructed the model to score each vendor on each criterion first, apply the weights second, sum the weighted scores third, and rank fourth, with the ranked table as the labeled output.

Why it succeeded

The decisive choice was forcing the scoring to finish before any ranking began. When earlier drafts asked for a ranking directly, the model anchored on the first vendor it found appealing and reasoned backward to justify it. Separating scoring from ranking removed that shortcut, because by the time the model ranked, the scores already existed and constrained the answer. This is the same dependency discipline the checklist enforces.
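The score, weight, sum, rank order can be shown with a scaled-down example. It uses three vendors and two criteria rather than five and eight, and the vendor names, scores, and integer weights are all hypothetical.

```python
def rank_vendors(scores, weights):
    """Apply weights to finished scores, sum them, and only then rank."""
    # Step 1 is the `scores` input itself: every vendor is scored on every criterion.
    # Step 2: apply the weights.
    weighted = {
        vendor: {criterion: score * weights[criterion] for criterion, score in per.items()}
        for vendor, per in scores.items()
    }
    # Step 3: sum the weighted scores per vendor.
    totals = {vendor: sum(per.values()) for vendor, per in weighted.items()}
    # Step 4: rank, highest total first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data; integer weights keep the arithmetic exact.
weights = {"price": 6, "support": 4}
scores = {
    "Vendor A": {"price": 3, "support": 5},
    "Vendor B": {"price": 5, "support": 3},
    "Vendor C": {"price": 4, "support": 4},
}

ranking = rank_vendors(scores, weights)
print(ranking)  # [('Vendor B', 42), ('Vendor C', 40), ('Vendor A', 38)]
```

The ranking function never sees a vendor name until the totals exist, which is the code-level equivalent of removing the shortcut described above.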

What the Examples Have in Common

Across seven scenarios, the same handful of moves kept appearing.

Inputs first, always

Every successful prompt put all the facts in front of the model before asking it to reason. Every failure traced back, at least in part, to an input left in the author's head. If there is a single lesson the examples teach, it is that a model cannot reason its way around a missing fact; it can only guess.

Structure beats exhortation

None of the working prompts relied on telling the model to be careful or thorough. They relied on naming steps and ordering them so each built on the last. Exhortation is a wish; structure is an instruction. The cases where structure replaced exhortation are the cases that worked, which is precisely the argument the best practices guide makes from principle.

Legible failures are recoverable

In the failing examples, the most damaging trait was not the error itself but its invisibility. A wrong answer dressed in clean prose gets used. The prompts that surfaced their reasoning let a human catch the error before it cost anything, which is the entire practical case for staged reasoning.

Frequently Asked Questions

What is the single most common reason these prompts fail?

Missing inputs. When the model lacks a needed fact, it does not refuse; it invents a plausible value and reasons from it. The result looks correct and is not, which is why supplying every input explicitly matters so much.

Why does ordering the steps make such a difference?

Because later steps often depend on earlier results. If the model is asked to compare before it has computed the things being compared, it either guesses or contradicts itself. Correct order lets each step build on solid prior work.

When should I split a task into multiple calls instead of one prompt?

When the task has distinct stages and you need to know which stage failed. Splitting makes each stage independently testable, which is worth the extra calls for anything high-stakes or hard to debug as a single block.

Do self-checks always catch errors?

No, and vague ones rarely do. A self-check works when it targets a specific failure, like re-adding figures to catch arithmetic slips. Generic instructions to review tend to produce a confident "looks good" without finding anything.

How do I distinguish hard constraints from soft ones in a prompt?

State it directly. Label non-negotiables as requirements that must be met and preferences as goals to optimize after the requirements hold. The model will respect the distinction if you make it explicit rather than leaving it implied.

Key Takeaways

Listing every number and constraint explicitly is what separated working prompts from confident-but-wrong ones across these examples.
Ordering steps so comparisons come after computations prevents the model from guessing or contradicting itself.
Marking hard versus soft constraints lets the model respect non-negotiables before optimizing preferences.
Splitting multi-stage tasks into separate calls makes each stage testable and reveals exactly where a result went wrong.
Targeted self-checks, like re-adding figures, catch real errors that vague review instructions miss.
The gap between a prompt that works and one that fails is usually one missing input, one misordered step, or one buried conclusion.

Agency Script Editorial
