How a Procurement Team Rebuilt Its Vendor Comparisons

A case study is more useful than a list of principles when you can watch the principles meet resistance. The account below follows a mid-sized procurement function that started using a language model to compare vendors, got burned by confident wrong answers, and rebuilt its approach into something the rest of the company would actually trust. The names and exact numbers are generalized, but the arc and the mistakes are representative of how comparison prompting matures inside a real team.

What makes this worth reading is not that the tool worked. It is the specific sequence of decisions that moved it from a clever toy to a reliable part of the process, and the points where the team nearly abandoned it.

The Situation

The procurement team evaluated three to five vendors per category across roughly a dozen categories a year. Each evaluation meant assembling pricing, capability, and risk information into a comparison the buying committee could act on. The work was slow and inconsistent, with each analyst structuring it differently.

The first attempt

An analyst started pasting vendor materials into a model and asking, "Compare these vendors and tell us which to pick." The early outputs looked great—clean tables, decisive recommendations. The committee liked them until one recommendation pointed to a vendor whose pricing the model had simply invented. Trust collapsed almost overnight.

The Decision

Rather than abandon the approach, the team treated the failure as a prompting problem, not a tooling problem. They set one rule: the model could structure and reason about a comparison, but it could not be the source of any factual claim that drove a decision.

Reframing the model's job

This reframing changed everything downstream. The model's role became organizing evidence and surfacing trade-offs, while humans owned the facts. That division mirrors the discipline described in Habits That Make AI Comparisons Hold Up Under Pressure.

The Execution

The team built a standard two-pass prompt and a fixed criteria template.

Pass one: structured analysis, no verdict

Every evaluation began with a prompt that named the category's ranked criteria—total cost of ownership, implementation effort, security posture, and exit risk—and required the model to fill a table with the source or assumption behind each cell, leaving blanks where it lacked evidence and marking any uncertain figure. Crucially, this pass forbade a recommendation.

Pass two: scoped recommendation

A second prompt fed the completed, human-verified table back and asked for a recommendation under the specific conditions of the deal—budget ceiling, timeline, and team capacity. This kept the verdict from anchoring the analysis, the failure mode detailed in Seven Ways Comparison Prompts Quietly Go Wrong.

The verification step

Between passes, an analyst confirmed every load-bearing number against the vendor's actual documentation. Blanks became a checklist. This step took twenty minutes and was the reason the committee's trust returned.

The Outcome

Over two quarters, the team ran fourteen category evaluations through the new process.

Measurable results

Evaluation cycle time dropped because the structure was consistent and the committee no longer re-litigated formatting. More important, not a single recommendation was overturned for a factual error after the verification step became mandatory. The comparisons became something the committee read with confidence rather than suspicion.

The cultural shift

The quieter win was consistency. Because every analyst used the same criteria template and two-pass structure, comparisons across categories became legible to anyone. New analysts ramped faster because the method was explicit rather than personal. The team later formalized it using the structure in A Repeatable Method for Structuring Comparison Prompts.

The Lessons

The model is a structurer, not a source

The single decision that saved the effort was refusing to let the model own facts. Everything else followed from it.

Blanks are a feature

Forcing the model to leave gaps rather than guess turned the comparison into a diligence map. The team came to value the blanks as much as the filled cells.

Trust is rebuilt with process, not promises

After the early failure, no assurance would have restored the committee's confidence. A visible, repeatable verification step did.

What Nearly Derailed It

The redesign was not smooth, and the obstacles are as instructive as the wins.

The temptation to skip verification under deadline

The first time a category evaluation ran up against a hard deadline, an analyst was tempted to skip the verification step and trust the model's figures to save twenty minutes. The team lead held the line, reasoning that a single unverified recommendation could undo the trust they had spent a quarter rebuilding. That discipline under pressure is what separated a durable process from a fair-weather one. The verification step only protects you if it is non-negotiable.

Over-templating the criteria

An early version of the criteria template was too rigid, forcing every category through identical axes even when a category had a genuinely unique concern. Analysts started quietly working around it, which defeated the consistency the template was meant to provide. The fix was a core set of mandatory criteria plus room for category-specific additions—structure where it helped, flexibility where rigidity caused evasion.

How the Process Spread

A method that works in one corner tends to migrate, and this one did.

From procurement to adjacent teams

Once the buying committee trusted procurement's comparisons, other functions noticed. The legal team adapted the two-pass structure for comparing contract options; an engineering group used it for build-versus-buy decisions. What transferred was not the specific criteria but the shape: rank the axes, structure the evidence, verify the facts, then decide. That portable shape is essentially the model laid out in A Repeatable Method for Structuring Comparison Prompts, arrived at through experience rather than theory.

Why the shape traveled but the details did not

The teams that adopted the method successfully resisted copying procurement's exact criteria template. A contract comparison cares about different things than a vendor comparison, so importing procurement's specific axes would have been a category error of its own. What was genuinely reusable was the sequence and the discipline—rank, structure, verify, decide—not the content poured into it. The teams that tried to clone the whole template wholesale found it awkward; the ones that took the shape and filled it with their own axes found it natural. The lesson the organization eventually absorbed was that a method generalizes through its structure, while its specifics stay local to each decision type.

Frequently Asked Questions

What caused the initial loss of trust?

A recommendation built on an invented price. Because the table looked authoritative, no one questioned the number until after the fact. One confident fabrication was enough to discredit the whole approach.

Why did the team split the prompt into two passes?

To stop an early recommendation from biasing the analysis, and to insert a human verification step between structuring the evidence and acting on it. The split made the reasoning inspectable before any verdict.

How long did the verification step add?

About twenty minutes per evaluation. That cost was trivial against the alternative of re-running discredited comparisons and rebuilding committee trust from scratch.

Did standardizing criteria reduce flexibility?

Slightly, but the consistency gain outweighed it. Analysts could still add category-specific criteria; the template just guaranteed the core axes were always present and ranked the same way.

What would the team do differently in hindsight?

Start with the "model structures, humans own facts" rule from day one. The early failure was avoidable; they simply trusted the clean output before testing whether the facts behind it were real.

Is this approach overkill for low-stakes comparisons?

For quick internal choices, yes. The full two-pass, verified process is calibrated to decisions a committee acts on. Lighter comparisons can drop the verification step while keeping the ranked criteria.

How did the team measure whether the new process worked?

Mainly by tracking how often a recommendation was later overturned for a factual error and how much the committee re-argued comparisons it did not trust. Both fell sharply once verification became mandatory and criteria became consistent. Those two signals—reversal rate and re-litigation—were the clearest evidence that the redesign had moved the comparisons from suspect to dependable.

Key Takeaways

A single fabricated figure can destroy trust in AI comparisons overnight.
Reframing the model as a structurer of evidence, not a source of facts, was the decisive fix.
A two-pass prompt separated analysis from recommendation and created room for verification.
Forcing blanks instead of guesses turned the comparison into a diligence checklist.
Standard criteria templates made comparisons consistent and legible across the team.
Trust returned through a visible verification step, not through better-looking output.

The Situation

The first attempt

The Decision

Reframing the model's job

The Execution

The team built a standard two-pass prompt and a fixed criteria template.

Pass one: structured analysis, no verdict

Pass two: scoped recommendation

The verification step

The Outcome

Over two quarters, the team ran fourteen category evaluations through the new process.

Measurable results

The cultural shift

The Lessons

The model is a structurer, not a source

The single decision that saved the effort was refusing to let the model own facts. Everything else followed from it.

Blanks are a feature

Forcing the model to leave gaps rather than guess turned the comparison into a diligence map. The team came to value the blanks as much as the filled cells.

Trust is rebuilt with process, not promises

After the early failure, no assurance would have restored the committee's confidence. A visible, repeatable verification step did.

What Nearly Derailed It

The redesign was not smooth, and the obstacles are as instructive as the wins.

The temptation to skip verification under deadline

Over-templating the criteria

How the Process Spread

A method that works in one corner tends to migrate, and this one did.

From procurement to adjacent teams

Why the shape traveled but the details did not

Frequently Asked Questions

What caused the initial loss of trust?

Why did the team split the prompt into two passes?

How long did the verification step add?

About twenty minutes per evaluation. That cost was trivial against the alternative of re-running discredited comparisons and rebuilding committee trust from scratch.

Did standardizing criteria reduce flexibility?

Slightly, but the consistency gain outweighed it. Analysts could still add category-specific criteria; the template just guaranteed the core axes were always present and ranked the same way.

What would the team do differently in hindsight?

Start with the "model structures, humans own facts" rule from day one. The early failure was avoidable; they simply trusted the clean output before testing whether the facts behind it were real.

Is this approach overkill for low-stakes comparisons?

How did the team measure whether the new process worked?

Key Takeaways

A single fabricated figure can destroy trust in AI comparisons overnight.
Reframing the model as a structurer of evidence, not a source of facts, was the decisive fix.
A two-pass prompt separated analysis from recommendation and created room for verification.
Forcing blanks instead of guesses turned the comparison into a diligence checklist.
Standard criteria templates made comparisons consistent and legible across the team.
Trust returned through a visible verification step, not through better-looking output.

How a Procurement Team Rebuilt Its Vendor Comparisons

The Situation

The first attempt

The Decision

Reframing the model's job

The Execution

Pass one: structured analysis, no verdict

Pass two: scoped recommendation

The verification step

The Outcome

Measurable results

The cultural shift

The Lessons

The model is a structurer, not a source

Blanks are a feature

Trust is rebuilt with process, not promises

What Nearly Derailed It

The temptation to skip verification under deadline

Over-templating the criteria

How the Process Spread

From procurement to adjacent teams

Why the shape traveled but the details did not

Frequently Asked Questions

What caused the initial loss of trust?

Why did the team split the prompt into two passes?

How long did the verification step add?

Did standardizing criteria reduce flexibility?

What would the team do differently in hindsight?

Is this approach overkill for low-stakes comparisons?

How did the team measure whether the new process worked?

Key Takeaways

Agency Script Editorial

Related Articles

Prompt Quality Decides Whether AI Earns Its Keep

Counting the Real Cost of Every Token You Send

Rolling Out AI Hallucinations Across a Team

Ready to certify your AI capability?

How a Procurement Team Rebuilt Its Vendor Comparisons

The Situation

The first attempt

The Decision

Reframing the model's job

The Execution

Pass one: structured analysis, no verdict

Pass two: scoped recommendation

The verification step

The Outcome

Measurable results

The cultural shift

The Lessons

The model is a structurer, not a source

Blanks are a feature

Trust is rebuilt with process, not promises

What Nearly Derailed It

The temptation to skip verification under deadline

Over-templating the criteria

How the Process Spread

From procurement to adjacent teams

Why the shape traveled but the details did not

Frequently Asked Questions

What caused the initial loss of trust?

Why did the team split the prompt into two passes?

How long did the verification step add?

Did standardizing criteria reduce flexibility?

What would the team do differently in hindsight?

Is this approach overkill for low-stakes comparisons?

How did the team measure whether the new process worked?

Key Takeaways

Agency Script Editorial

Related Articles

Prompt Quality Decides Whether AI Earns Its Keep

Counting the Real Cost of Every Token You Send

Rolling Out AI Hallucinations Across a Team

Ready to certify your AI capability?