Plenty of writing explains what temperature is. Far less explains exactly what to do, in order, when you sit down to tune it for a real task. This guide is that missing procedure. It assumes you already grasp the basic idea that low temperature means consistency and high temperature means variety, and it focuses entirely on the workflow.
The process below is deliberately mechanical. You can follow it on a task you have never tuned before and, once the prompt is stable, arrive at a defensible setting in about fifteen minutes. It works the same whether you are tuning a summarizer, a code generator, or a brainstorming assistant.
Treat it as a recipe. Run the steps in sequence, resist the urge to skip ahead, and write down what you learn so you never have to rediscover it.
Step 1: Define What Good Output Looks Like
You cannot tune toward a target you have not named. Before touching any setting, write a one-sentence description of a successful output for this task.
Make It Observable
Vague goals like "good summary" are useless for tuning. Concrete goals like "three bullet points, no opinions, each under fifteen words" give you something to check against. The more observable your criteria, the faster the rest goes.
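One way to make the criteria checkable is to turn them into a small function. The sketch below is an illustration, not a complete validator: it encodes the "three bullet points, no opinions, each under fifteen words" example, and the `OPINION_MARKERS` phrase list is a hypothetical, deliberately crude stand-in for "no opinions".

```python
# Hypothetical marker phrases; a crude proxy for "no opinions".
OPINION_MARKERS = ("i think", "in my opinion", "unfortunately", "sadly")

def check_summary(text: str, bullets: int = 3, max_words: int = 15) -> list[str]:
    """Return a list of criterion violations; an empty list means the output passes."""
    problems = []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Treat lines starting with "-" or "*" as bullet items.
    items = [ln[1:].strip() for ln in lines if ln.startswith(("-", "*"))]
    if len(items) != bullets:
        problems.append(f"expected {bullets} bullets, found {len(items)}")
    for i, item in enumerate(items, start=1):
        words = len(item.split())
        if words >= max_words:  # "under fifteen words" means 14 or fewer
            problems.append(f"bullet {i} has {words} words")
        if any(marker in item.lower() for marker in OPINION_MARKERS):
            problems.append(f"bullet {i} contains an opinion marker")
    return problems

good = "- Revenue rose four percent.\n- Costs were flat.\n- Hiring resumes in March."
bad = "- I think revenue rose.\n- Costs were flat."
print(check_summary(good))
print(check_summary(bad))
```

A check like this will not catch everything a human reader would, but it makes the later sweep faster to read because the mechanical failures are flagged for you.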
Decide the Failure Cost
Note how bad a wrong output is. If a single bad answer is cheap because a human reviews everything, you can afford more variety. If a bad answer ships straight to a user, you need reliability. This judgment sets your direction before you ever touch a number.
Step 2: Lock the Prompt First
Sampling controls operate on top of whatever distribution your prompt creates. Tuning temperature on a weak prompt is like adjusting the volume on a broken recording.
Get the Prompt Working at a Neutral Setting
Set temperature to a moderate value, around 0.7, and iterate on the prompt until the output is consistently in the right ballpark. Only once the prompt is solid should you start moving the dial. The full guide to sampling control explains why the prompt and the parameters behave as one combined system.
Step 3: Pick One Control to Tune
Decide up front whether you are tuning temperature or top-p, and leave the other at its neutral default. Changing both at once makes results impossible to interpret.
Default to Temperature
For most tasks, tune temperature and hold top-p near 1.0. Temperature gives a smoother, more intuitive range of behavior. Reserve top-p tuning for cases where you specifically need to cut off the unlikely tail of word choices while keeping some variety among the likely ones.
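To see why the two controls are hard to interpret when moved together, it can help to look at what each does to a token distribution. The sketch below uses the common textbook formulation (softmax over logits divided by temperature, then nucleus truncation). The logits are made up, and real inference stacks may differ in details such as the order in which the two are applied.

```python
import math

def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_top_p(probs: list[float], top_p: float) -> list[float]:
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
cool = apply_temperature(logits, 0.3)   # sharpens: the top token dominates
warm = apply_temperature(logits, 1.3)   # flattens: more mass on alternatives
clipped = apply_top_p(apply_temperature(logits, 1.0), 0.8)  # tail tokens drop to zero

for name, dist in (("T=0.3", cool), ("T=1.3", warm), ("top_p=0.8", clipped)):
    print(name, [round(p, 3) for p in dist])
```

Temperature reshapes the whole distribution gradually, while top-p removes options outright, which is one reason the first is the easier dial to reason about.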
Step 4: Run a Structured Sweep
Now you experiment, but in an organized way rather than by random guessing.
Hold Everything Else Fixed
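- Use one representative prompt, and keep the model, top-p, and output length limit unchanged.
- Generate output at several temperatures: try 0.0, 0.3, 0.7, 1.0, and 1.3. If your provider caps temperature at 1.0, stop there.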
- For creative tasks, generate two or three samples per setting since variety is the point.
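The sweep described above can be sketched as a short loop. This is a minimal harness under stated assumptions: `generate` is a hypothetical callable you would wrap around your own model client, and `fake_generate` is a stand-in so the example runs offline.

```python
from typing import Callable

# Assumed signature for your own model wrapper: (prompt, temperature, top_p) -> text
Generate = Callable[[str, float, float], str]

def sweep(generate: Generate, prompt: str, temperatures: list[float],
          samples: int = 1, top_p: float = 1.0) -> dict[float, list[str]]:
    """Collect `samples` outputs per temperature, holding the prompt and top_p fixed."""
    return {
        t: [generate(prompt, t, top_p) for _ in range(samples)]
        for t in temperatures
    }

def fake_generate(prompt: str, temperature: float, top_p: float) -> str:
    """Stand-in for a real model call; replace with your client of choice."""
    return f"[T={temperature}, top_p={top_p}] output for: {prompt}"

results = sweep(fake_generate, "Summarize the release notes.",
                [0.0, 0.3, 0.7, 1.0, 1.3], samples=2)

# Print the outputs grouped by setting so they can be read side by side.
for t, outputs in results.items():
    print(f"--- temperature {t} ---")
    for text in outputs:
        print(text)
```

Keeping the sweep in one function makes it harder to change a second variable by accident between runs.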
Read Them as a Group
Lay the outputs side by side and judge them against the observable criteria from Step 1. You are looking for the point where quality peaks and then starts to degrade. This mirrors the comparison technique in our examples walkthrough.
Step 5: Choose the Setting at the Edge of Degradation
The best setting is usually just before the output starts to break down, biased slightly toward the safe side.
Bias Toward Reliability
If two settings produce similar quality, choose the lower one. Lower temperature means fewer surprises in production, and you rarely regret a touch more consistency. If your task genuinely rewards range, bias upward instead — but make that a conscious choice, not a default.
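If you score each setting while reading the sweep, the tie-breaking rule can be written down explicitly. The sketch below assumes you have assigned a quality score between 0 and 1 to each temperature. The scores shown are made-up placeholders, and the `tolerance` that defines "similar quality" is an assumption you would set yourself.

```python
def choose_temperature(scores: dict[float, float], tolerance: float = 0.05,
                       prefer_low: bool = True) -> float:
    """Among settings within `tolerance` of the best score, pick the lowest
    temperature (or the highest, if the task rewards range)."""
    best = max(scores.values())
    candidates = [t for t, s in scores.items() if best - s <= tolerance]
    return min(candidates) if prefer_low else max(candidates)

# Placeholder scores from a hypothetical sweep, not real measurements.
scores = {0.0: 0.70, 0.3: 0.88, 0.7: 0.90, 1.0: 0.74, 1.3: 0.40}

chosen = choose_temperature(scores)                     # biased toward reliability
ranged = choose_temperature(scores, prefer_low=False)   # conscious choice to bias upward
print(chosen, ranged)
```

Making `prefer_low` an explicit argument mirrors the point above: biasing upward should be something you opt into.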
Sanity-Check With Fresh Inputs
Your sweep used one prompt. Before committing, run your chosen setting against two or three different inputs to confirm it holds up. A setting that looks great on one example sometimes falls apart on others.
Step 6: Lock It In and Document
A setting you cannot reproduce is a setting you will re-tune endlessly.
Record the Decision
- Write down the task, the chosen temperature, and the held top-p value.
- Note the date and the prompt version it was tuned against.
- Add it to your team's working checklist so the default is shared, not trapped in one person's memory.
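A record like the one listed above fits in a few fields. The sketch below is one possible shape, not a required format. The field names and example values are hypothetical, and the `model` field is an added assumption, included because a model change is one of the re-tune conditions.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class TuningRecord:
    task: str
    temperature: float
    top_p: float
    prompt_version: str
    model: str      # assumption: not in the list above, but useful for re-tune checks
    tuned_on: str   # ISO date string

record = TuningRecord(
    task="release-note summarizer",   # hypothetical task name
    temperature=0.3,
    top_p=1.0,
    prompt_version="v4",              # hypothetical version label
    model="example-model-1",          # hypothetical model identifier
    tuned_on=date.today().isoformat(),
)

# One JSON line per decision is easy to append to a shared file or checklist.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```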
Set a Re-Tune Trigger
Settings can drift out of relevance when you change models or rewrite the prompt. Note the conditions that should prompt a fresh sweep, so you re-tune on purpose rather than discovering the problem in production. The best-practices reference covers when re-tuning is worth the effort.
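The trigger itself can be a simple comparison between what was recorded and what is currently deployed. This is a minimal sketch; the dictionary keys are hypothetical names, and you would extend `keys` with whatever other conditions you decide should force a fresh sweep.

```python
def needs_retune(recorded: dict, current: dict,
                 keys: tuple[str, ...] = ("model", "prompt_version")) -> list[str]:
    """Return the names of the fields that changed since the setting was tuned."""
    return [k for k in keys if recorded.get(k) != current.get(k)]

# Hypothetical values for illustration.
recorded = {"model": "example-model-1", "prompt_version": "v4", "temperature": 0.3}
current = {"model": "example-model-2", "prompt_version": "v4"}

changed = needs_retune(recorded, current)
if changed:
    print("Re-run the sweep; changed:", ", ".join(changed))
```

Running a check like this at deploy time is one way to re-tune on purpose rather than by accident.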
A Worked Example of the Full Sequence
Walking the steps once on a concrete task makes the abstract procedure feel ordinary.
The Task
Suppose you are building a feature that drafts polite reply suggestions for a shared inbox. The replies must sound human and warm but must never invent commitments the business has not made.
The Steps Applied
- Step 1: Good output is a two-to-three sentence reply, friendly, with no promises about timelines or refunds. A bad reply that invents a commitment is expensive, so reliability matters.
- Step 2: Set temperature to 0.7 and iterate on the prompt until the tone is consistently warm and the replies stay generic about commitments, then lock the prompt.
- Step 3: Tune temperature, hold top-p at 1.0.
- Step 4: Sweep 0.2, 0.4, 0.6, and 0.8 against five representative inbox messages, two samples each. Using several messages here folds the fresh-input check from Step 5 into the sweep.
- Step 5: Quality peaks at 0.4, where replies are warm but anchored. At 0.6 and above, the model occasionally implies a timeline. Choose 0.4.
- Step 6: Record the task, 0.4, top-p 1.0, the prompt version, and today's date in the shared list.
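For this task, the expensive failure is mechanical enough to flag automatically while reading the sweep. The sketch below is an assumption-laden illustration: the patterns in `COMMITMENT_PATTERNS` are hypothetical examples of timeline and refund language, and they will miss phrasings a human reviewer would catch. Warmth still has to be judged by eye.

```python
import re

# Hypothetical patterns for commitments the replies must never invent.
COMMITMENT_PATTERNS = [
    r"\brefund",
    r"\bwithin \d+ (?:hours?|days?|weeks?)\b",
    r"\bby (?:tomorrow|monday|tuesday|wednesday|thursday|friday)\b",
]

def sentence_count(text: str) -> int:
    """Rough sentence count: split on runs of terminal punctuation."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])

def reply_problems(reply: str) -> list[str]:
    """Return violations of the Step 1 criteria that can be checked mechanically."""
    problems = []
    n = sentence_count(reply)
    if not 2 <= n <= 3:
        problems.append(f"{n} sentences, expected 2 to 3")
    for pattern in COMMITMENT_PATTERNS:
        if re.search(pattern, reply, flags=re.IGNORECASE):
            problems.append(f"possible commitment: {pattern}")
    return problems

ok_reply = "Thanks so much for reaching out! We have passed your note to the team."
bad_reply = "Thanks for your patience. We will issue a refund within 3 days."
print(reply_problems(ok_reply))
print(reply_problems(bad_reply))
```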
The whole pass takes about fifteen minutes and produces a setting you can defend. The same sequence underpins the Anchor-Range-Verify framework, which names these moves for teams that run them often.
Handling Tasks That Resist a Clear Answer
Not every task produces a clean peak, and the process needs a fallback for those cases.
When Outputs Look Similar Across the Range
If quality barely changes from 0.2 to 0.8, the task is insensitive to temperature. Choose a low-to-moderate value for reliability and stop; further tuning buys nothing. Insensitivity is a finding, not a failure.
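If you scored the sweep, insensitivity shows up as a small spread between the best and worst setting. A minimal sketch follows. The scores are made-up placeholders, and the `max_spread` threshold is an assumption you would calibrate to how noisy your own scoring is.

```python
def is_insensitive(scores: dict[float, float], max_spread: float = 0.05) -> bool:
    """True when the best and worst settings score within `max_spread` of each other."""
    return max(scores.values()) - min(scores.values()) <= max_spread

# Placeholder scores, not real measurements.
flat = {0.2: 0.81, 0.4: 0.83, 0.6: 0.82, 0.8: 0.80}
peaked = {0.2: 0.70, 0.4: 0.90, 0.6: 0.78, 0.8: 0.55}

print(is_insensitive(flat), is_insensitive(peaked))
```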
When the Right Answer Depends on the Input
Some tasks want different behavior for different inputs — terse for simple questions, expansive for complex ones. A single temperature cannot do both. The fix is usually to split the task or route inputs to different settings rather than forcing one compromise value, a judgment the best-practices guide treats as a prompt-and-setting design decision.
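Routing can start as a very small piece of code. The sketch below is purely illustrative: the word-count heuristic in `classify`, the route names, and the temperature values are all hypothetical placeholders. In practice each route would get its own sweep.

```python
# Hypothetical routes; each would be tuned separately with its own sweep.
ROUTES = {
    "simple": {"temperature": 0.2, "top_p": 1.0},
    "complex": {"temperature": 0.7, "top_p": 1.0},
}

def classify(message: str, word_threshold: int = 25) -> str:
    """Crude stand-in for a real classifier: short inputs count as simple."""
    return "simple" if len(message.split()) < word_threshold else "complex"

def settings_for(message: str) -> dict[str, float]:
    """Look up the sampling settings for the route this message falls into."""
    return ROUTES[classify(message)]

print(settings_for("What time do you open?"))
print(settings_for(" ".join(["detail"] * 40)))
```

Keeping the routes in one table also gives you a natural place to record which settings were tuned against which kind of input.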
Frequently Asked Questions
How long should this process take?
For a typical task, ten to fifteen minutes once your prompt is stable. Most of that time is reading outputs from the sweep, not waiting on the model. If it is taking much longer, your success criteria are probably too vague.
Can I skip the sweep and just use a recommended number?
You can start from a recommended number, but skipping the sweep means you never learn where your specific task degrades. The sweep is cheap and it is what turns a guess into a justified default. Do not skip it for anything you will use repeatedly.
What if quality never clearly peaks?
If outputs look similar across the whole range, the task is probably insensitive to temperature, which is fine. Pick a low-to-moderate value for reliability and move on. Not every task needs careful tuning.
Should I tune for each prompt or each task?
Tune per task, not per individual prompt. Settings that work for a task type usually transfer across similar prompts. Re-tune only when you change models or substantially rewrite the prompt.
Do I need multiple samples per setting?
For tasks with a single correct output, such as extraction or classification, one sample per setting is usually enough. For creative tasks where variety is the goal, generate two or three per setting so you can judge the range, not just a single lucky or unlucky draw.
Key Takeaways
- Define observable success criteria and the cost of a bad output before touching any setting.
- Lock the prompt at a neutral temperature first; never tune sampling on a weak prompt.
- Tune one control at a time, defaulting to temperature with top-p held near 1.0.
- Run a structured sweep, read outputs as a group, and choose the setting just before quality degrades.
- Document the chosen setting with its prompt version and set a trigger for re-tuning when the model or prompt changes.