Watching One Prompt Change Across Five Settings

Abstract advice about temperature only sticks once you see it applied to tasks you recognize. This piece walks through several concrete scenarios, the kind of work people actually use models for, and shows how the right setting shifts dramatically from one to the next. For each, you get the task, the setting that worked, and the reasons other settings failed.

The scenarios are illustrative rather than exhaustive, but they cover the main shapes of work: extracting facts, generating ideas, writing in a voice, producing structured data, powering a conversation, summarizing, and drafting code. Once you see the pattern across these, you can place a new task by analogy.

Read these as worked problems. The point is not the specific numbers but the reasoning that justifies them.

Scenario 1: Pulling Data Out of Messy Text

The task: extract the order number, total, and ship date from a forwarded email full of signatures and quoted replies.

What Worked

A temperature near 0 was correct. The task has exactly one right answer per field, and any deviation is an error. At low temperature the model reliably returned the same fields in the same format every run.

Why Higher Settings Failed

At moderate temperature, the model occasionally reformatted dates or paraphrased the total, breaking downstream parsing.
At high temperature, it sometimes invented a plausible-looking but absent field.

This is the textbook case for determinism. The foundational guide frames why correct-answer tasks belong at the low end.
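
One way to make these failures visible is to validate every extracted field strictly, so a reformatted date or paraphrased total fails loudly instead of slipping into downstream parsing. The sketch below is a minimal illustration only. The field names (`order_number`, `total`, `ship_date`) and the expected formats (plain decimal totals, `YYYY-MM-DD` dates) are assumptions, not something the scenario specifies.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed field names; adjust to whatever your prompt asks the model to return.
EXPECTED_FIELDS = {"order_number", "total", "ship_date"}


def validate_extraction(fields: dict) -> list[str]:
    """Return a list of problems found in a model's extracted fields."""
    problems = []

    # Missing or invented fields both count as errors.
    for name in sorted(EXPECTED_FIELDS - set(fields)):
        problems.append(f"missing field: {name}")
    for name in sorted(set(fields) - EXPECTED_FIELDS):
        problems.append(f"unexpected field: {name}")

    # A paraphrased total such as "$1,204.50" will not parse as a plain decimal.
    if "total" in fields:
        try:
            Decimal(str(fields["total"]))
        except InvalidOperation:
            problems.append(f"total is not a plain decimal: {fields['total']!r}")

    # A reformatted date such as "March 3, 2024" fails the assumed format.
    if "ship_date" in fields:
        try:
            datetime.strptime(str(fields["ship_date"]), "%Y-%m-%d")
        except ValueError:
            problems.append(f"ship_date is not YYYY-MM-DD: {fields['ship_date']!r}")

    return problems
```

Running a check like this over repeated runs at each candidate temperature gives a concrete error count to compare, rather than an impression.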

Scenario 2: Naming a New Product

The task: generate twenty candidate names for a productivity app, given a short brief.

What Worked

A high temperature, around 1.0 to 1.2, with several samples generated. The value of naming comes from range — you want to see options you would not have thought of, then pick.

Why Lower Settings Failed

At low temperature, the model returned safe, obvious names that clustered around the same few words.
The variety that makes a naming session useful only appeared once the model was willing to reach for less-likely tokens.

This scenario shows why creative work pairs high temperature with curation, a practice covered in the best-practices guide.
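
The curation half of that pairing can start very simply: pool the candidates from several high-temperature runs, then filter before a person picks. The following is a hypothetical sketch. The `curate` helper, the length cap, and the idea of treating case-insensitive matches as duplicates are all assumptions made for illustration.

```python
def curate(candidates: list[str], max_len: int = 12) -> list[str]:
    """Deduplicate name candidates case-insensitively and drop overly long ones.

    Order is preserved so earlier samples win ties.
    """
    seen = set()
    kept = []
    for name in candidates:
        cleaned = name.strip()
        key = cleaned.lower()
        # Skip blanks, repeats, and names longer than the assumed cap.
        if not cleaned or key in seen or len(cleaned) > max_len:
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept
```

The filter only narrows the pool. Choosing the winner remains a human judgment.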

Scenario 3: Writing an On-Brand Explainer

The task: write a three-paragraph explainer in a specific, friendly brand voice.

What Worked

A moderate temperature, around 0.6 to 0.7. The task wants fluency and a little personality without drifting off-message or off-voice.

Why the Extremes Failed

At very low temperature, the prose read stiff and mechanical, losing the warmth the brand needed.
At high temperature, the voice became inconsistent and the model occasionally introduced claims that were not in the brief.

The middle band is where much real writing lives — enough freedom to sound human, enough control to stay on target.

Scenario 4: Generating Structured JSON

The task: produce a JSON object with a fixed schema for an API payload.

What Worked

A temperature near 0. Structured output has rigid rules, and any token that breaks the format is a hard failure, not a stylistic quirk.

Why Higher Settings Failed

At moderate temperature, the model occasionally added an unrequested field or changed a key's casing.
At high temperature, malformed brackets and hallucinated values appeared.

This mirrors one of the common mistakes: using a creative-task setting on a structured-output task. The fix is simply to drop the temperature.
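
Dropping the temperature reduces these failures, but a cheap format check still helps catch the ones that remain. Below is a minimal sketch using only Python's standard `json` module. The key names in `EXPECTED_KEYS` are a hypothetical schema, not one taken from the scenario.

```python
import json

# Hypothetical schema: the exact set of keys the API payload must contain.
EXPECTED_KEYS = {"user_id", "plan", "active"}


def check_payload(raw: str) -> list[str]:
    """Return a list of format problems in a model-produced JSON string."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Malformed brackets and similar breakage land here.
        return [f"malformed JSON: {exc.msg}"]

    if not isinstance(payload, dict):
        return ["top-level value is not an object"]

    problems = []
    # A key with changed casing shows up as one missing plus one unexpected key.
    for key in sorted(EXPECTED_KEYS - set(payload)):
        problems.append(f"missing key: {key}")
    for key in sorted(set(payload) - EXPECTED_KEYS):
        problems.append(f"unexpected key: {key}")
    return problems
```

This checks key names only. Validating value types would need a fuller schema than the sketch assumes.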

Scenario 5: Powering a Support Assistant

The task: a customer-facing chatbot answering account questions.

What Worked

A low-to-moderate temperature, around 0.3 to 0.5. The assistant needs to sound natural but must not improvise facts about a customer's account.

Why the Reasoning Is Subtle

Too low, and the assistant felt robotic, hurting the experience.
Too high, and it occasionally phrased speculation as fact, which is unacceptable when the answers affect a real account.

The right setting balances a natural tone against the cost of an output that sounds confident but is wrong — exactly the kind of judgment the step-by-step process helps you make deliberately.

Scenario 6: Summarizing a Long Document

The task: condense a ten-page report into a one-paragraph executive summary.

What Worked

A low-to-moderate temperature, around 0.3 to 0.5. Summarization rewards faithfulness to the source while allowing enough fluency that the result reads as prose rather than a list of fragments.

Why the Extremes Failed

At very low temperature, summaries sometimes lifted phrasing verbatim and read choppily, stitching source sentences together.
At high temperature, the model occasionally introduced an emphasis or implication the source did not support, which is a subtle but serious failure in a summary meant to be trusted.

Summarization is deceptively close to extraction but not identical: it needs a little more freedom to compress smoothly, while still staying anchored to what the document actually says.
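
The "lifted phrasing" failure at very low temperature can be roughly quantified by measuring how many of the summary's word sequences appear verbatim in the source. This is only a crude heuristic sketched for illustration. The `verbatim_overlap` helper and the choice of three-word sequences are assumptions, and the score says nothing about whether a summary is faithful.

```python
def verbatim_overlap(summary: str, source: str, n: int = 3) -> float:
    """Fraction of the summary's word n-grams that also appear in the source."""

    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    summary_grams = ngrams(summary)
    if not summary_grams:
        # Summary shorter than n words: nothing to compare.
        return 0.0
    return len(summary_grams & ngrams(source)) / len(summary_grams)
```

A score near 1.0 suggests the summary is stitched from source sentences, while a very low score is a cue to check that the summary has not drifted from what the document says.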

Scenario 7: Drafting Code From a Description

The task: turn a plain-language description of a function into working code.

What Worked

A low temperature, near 0 to 0.2. Code has a correct structure, and the model should converge on the most likely correct implementation rather than explore stylistic variety.

Why Higher Settings Failed

At moderate temperature, the model sometimes chose an unusual but plausible approach that introduced subtle bugs.
At high temperature, it produced code that looked reasonable but referenced functions or patterns that did not fit the described intent.

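When variety is genuinely wanted in code, say to see alternative approaches, the right move is to generate a few low-temperature samples and compare, not to raise the temperature and hope. Keep the setting slightly above 0 for this, since samples drawn at 0 tend to come back nearly identical.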
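
Comparing samples works best when the comparison is mechanical. The sketch below runs each candidate against a handful of input/output cases and reports which ones pass. It is a hypothetical helper, not a recommended harness: it uses `exec`, which should only ever be applied to code you trust or have sandboxed, and the function name and test cases are assumptions supplied by the caller.

```python
def passing_candidates(candidates: list[str], func_name: str, cases: list[tuple]) -> list[int]:
    """Return indices of candidate source strings whose function passes every case.

    Each case is (args_tuple, expected_result).
    """
    passing = []
    for index, source in enumerate(candidates):
        namespace: dict = {}
        try:
            # WARNING: exec runs arbitrary code; sandbox anything model-generated.
            exec(source, namespace)
            func = namespace[func_name]
            if all(func(*args) == expected for args, expected in cases):
                passing.append(index)
        except Exception:
            # Syntax errors, missing functions, and runtime errors all disqualify.
            continue
    return passing
```

Candidates that survive the cases can then be compared for readability, which is a much smaller decision than reviewing every sample from scratch.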

Reading the Pattern Across Scenarios

Lining up these seven cases reveals the underlying logic.

The Spectrum by Task Type

Correct-answer and structured tasks sit at the low end for reliability.
Voice-and-fluency tasks sit in the middle.
Open-ended ideation sits at the high end, paired with curation.
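
The spectrum can be written down as a small lookup of starting ranges, using the figures from the scenarios above. This is a sketch for convenience, not a prescription. The task-type labels are made up for the example, and "near 0" is encoded as 0.0 to 0.1, which is an interpretation rather than a number given in the text.

```python
# Starting ranges from the scenarios above, keyed by hypothetical task labels.
# "Near 0" is encoded as (0.0, 0.1); that upper bound is an assumption.
STARTING_RANGES = {
    "extraction": (0.0, 0.1),
    "structured_output": (0.0, 0.1),
    "code": (0.0, 0.2),
    "support_assistant": (0.3, 0.5),
    "summarization": (0.3, 0.5),
    "brand_writing": (0.6, 0.7),
    "naming": (1.0, 1.2),
}


def starting_range(task_type: str) -> tuple[float, float]:
    """Look up the (low, high) starting temperature range for a task type."""
    try:
        return STARTING_RANGES[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None
```

A table like this records where to begin a sweep. It does not replace the sweep itself.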

Placing a New Task

When you meet an unfamiliar task, ask where it falls on this spectrum: does it have a correct answer, does it need a voice, or does it want range? That single placement gets you to the right neighborhood, and a quick sweep refines from there. The step-by-step process formalizes this placement into a routine you can run on any task.
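
A quick sweep of this kind can be sketched as a loop over a few temperatures, scoring several samples at each. Everything model-specific here is an assumption: `generate` stands for whatever function calls your model at a given temperature, and `score` for whatever quality check fits the task. The `fake_generate` stub exists only so the sketch runs offline.

```python
from typing import Callable


def sweep(
    generate: Callable[[float], str],
    score: Callable[[str], float],
    temperatures: list[float],
    samples_per_setting: int = 3,
) -> dict[float, float]:
    """Average a quality score over a few samples at each temperature."""
    results = {}
    for temperature in temperatures:
        scores = [score(generate(temperature)) for _ in range(samples_per_setting)]
        results[temperature] = sum(scores) / len(scores)
    return results


# Stand-in for a real model call; output length is a placeholder "score".
def fake_generate(temperature: float) -> str:
    return "x" * int(temperature * 10)


print(sweep(fake_generate, len, [0.0, 0.5, 1.0]))
```

In practice the scorer matters more than the loop: a validator error count, a pass rate on test cases, or a human rating each gives the sweep something meaningful to compare.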

Watch for Hidden Mixed Tasks

Several of these scenarios looked simple but were actually mixtures. A support assistant blends a voiced tone with a fact-bound constraint; a summary blends extraction with fluent compression. When a task pulls in two directions, no single temperature serves both halves well. The most reliable move is to recognize the tension early and, where possible, split the work into separate calls with their own settings rather than forcing one compromise value onto a task that secretly wants two.
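
Splitting a mixed task might look like the two-step sketch below, where a fact-bound step runs at temperature 0 and a voiced step runs at a moderate setting. The `call_model` parameter is a hypothetical function of the form `(prompt, temperature) -> text` that you would supply for your own provider. The prompts and the 0.6 setting are illustrative assumptions.

```python
from typing import Callable

# Hypothetical signature for a model call: (prompt, temperature) -> text.
ModelCall = Callable[[str, float], str]


def answer_support_question(question: str, account_record: str, call_model: ModelCall) -> str:
    """Answer in two calls so each half of the task gets its own temperature."""
    # Step 1: fact-bound extraction, run deterministically.
    facts = call_model(
        f"Extract only the facts relevant to this question: {question}\n\n{account_record}",
        0.0,
    )
    # Step 2: voiced reply at a moderate setting, restricted to the extracted facts.
    return call_model(
        f"Write a friendly reply to '{question}' using only these facts:\n{facts}",
        0.6,
    )
```

Passing the model call in as a parameter keeps the sketch independent of any particular API and makes the two settings easy to test.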

Frequently Asked Questions

Why does the same model need such different settings?

Because temperature controls variety, and different tasks need different amounts of variety. A data extractor wants none; a naming session wants a lot. The model is the same, but what counts as good output is completely different.
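
The mechanism behind this is commonly described as dividing the model's raw token scores by the temperature before converting them to probabilities. The sketch below shows that standard formulation on made-up scores. The logit values are invented for illustration, and individual providers may implement or bound the parameter differently.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities after scaling by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is usually handled as greedy choice")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.2)
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At the lower setting nearly all the probability sits on the top-scoring token, and at the higher setting it spreads toward the less likely ones, which is the variety a naming session relies on and an extractor cannot afford.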

Are these exact numbers the ones I should use?

Treat them as starting neighborhoods, not exact prescriptions. Your prompt, model, and quality bar differ from these examples, so run a quick sweep to refine. The reasoning behind each placement matters more than the digits.

How do I handle a task that mixes types?

Split it if you can. If one step extracts data and another writes prose, run them as separate calls with separate settings. Forcing one temperature onto a mixed task usually compromises both halves.

Why pair high temperature with multiple samples?

Because creative value comes from range and selection. A single high-temperature draw might be brilliant or might miss; generating several and picking the best is how you reliably capture the upside.

What setting should a customer-facing assistant use?

Usually low-to-moderate, enough to sound natural but not so high that it improvises facts. The exact point depends on how costly a confidently wrong answer is in your context.

Key Takeaways

Data extraction and structured output belong at the low end for reliability and format integrity.
Naming and ideation belong at the high end, paired with generating multiple candidates and curating.
On-brand writing and support assistants sit in the middle, balancing voice against the cost of wrong output.
Place a new task by asking whether it has a correct answer, needs a voice, or wants range.
The specific numbers are starting neighborhoods; the reasoning behind each placement is what transfers.

Agency Script Editorial
