A checklist earns its place only if it is short enough to actually run and specific enough to catch real problems. This one is built to be used, not admired. Each item is something you can verify in seconds, and each comes with a one-line justification so you understand why it is on the list rather than following it blindly.
Use it as a pre-flight pass before any model task goes into regular use, and again whenever you change the model or substantially rewrite the prompt. It is organized into four short phases that follow the natural order of setting up a task: clarify, configure, verify, and record.
Copy it into your own notes and adapt the defaults to your workload. The structure matters more than the exact numbers.
Phase 1: Clarify the Task
Before any setting makes sense, you need to know what you are aiming at.
Items
- Wrote a one-sentence definition of good output. You cannot tune toward an unnamed target; this makes success observable.
- Decided whether the task has a correct answer. This single judgment points you toward a low or a high temperature before anything else.
- Noted the cost of a bad output. A cheap, human-reviewed mistake permits more variety; an expensive one demands reliability.
- Confirmed it is a single task, not a mix. Mixed tasks deserve separate calls with separate settings, as the examples guide shows.
Phase 2: Configure the Controls
With the target clear, set the controls deliberately rather than by default.
Items
- Locked the prompt at a neutral setting first. Sampling sits on top of the prompt's distribution; tune the prompt before the dial.
- Chose one control to tune, not two. Changing both at once makes results uninterpretable; this is the core idea in the step-by-step process.
- Held the other control at its neutral default. Usually that means top-p near 1.0 while you tune temperature.
- Placed the temperature in the right band for the task. Low for correct-answer and structured work, moderate for voice, high for ideation.
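The "tune one control, hold the other neutral" items can be sketched as a small configuration grid. This is a minimal illustration, not a prescribed implementation: the names SamplingConfig and temperature_grid are hypothetical, and the example temperatures are placeholders rather than recommendations.

```python
from dataclasses import dataclass

NEUTRAL_TOP_P = 1.0  # assumed neutral default, matching "top-p near 1.0" above


@dataclass(frozen=True)
class SamplingConfig:
    """One candidate setting to try (hypothetical helper type)."""
    temperature: float
    top_p: float


def temperature_grid(temperatures, top_p=NEUTRAL_TOP_P):
    """Build one config per temperature while holding top_p fixed."""
    return [SamplingConfig(temperature=t, top_p=top_p) for t in temperatures]


# Placeholder temperatures; only the temperature varies across configs.
configs = temperature_grid([0.0, 0.3, 0.7])
for cfg in configs:
    print(cfg)
```

Because top_p is fixed in one place, any difference between runs can be attributed to temperature alone.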
Phase 3: Verify Before Shipping
A setting that looks right on one example is not yet trustworthy.
Items
- Ran a sweep across several temperatures. Reading outputs as a group reveals where quality peaks and degrades.
- Chose the setting just before degradation. Bias slightly toward the safe side; reliability rarely disappoints.
- Tested the chosen setting on fresh inputs. A setting tuned on one prompt must hold up across several before you trust it.
- Confirmed structured output stays well-formed. For JSON or code, verify the format does not break, which is a common mistake.
- Set a fixed seed, where your API offers one, if reproducibility matters. Temperature 0 alone does not guarantee identical runs.
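The sweep items above can be sketched as a loop that collects several outputs per temperature so they can be read side by side. Everything here is an assumption for illustration: fake_generate is a stand-in for whatever model call you actually use, its canned outputs are invented, and the seed parameter only mimics a fixed-seed option.

```python
import random


def fake_generate(prompt, temperature, seed):
    """Stand-in for a real model call (hypothetical).

    Higher temperature draws from a wider pool of canned outputs.
    Temperature is assumed to lie between 0 and 1 in this toy.
    """
    rng = random.Random(seed)
    pool = ["answer A", "answer B", "answer C", "answer D"]
    width = 1 + int(temperature * (len(pool) - 1))
    return rng.choice(pool[:width])


def sweep(generate, prompt, temperatures, samples_per_setting=5):
    """Collect several outputs per temperature so they can be read as a group."""
    results = {}
    for t in temperatures:
        results[t] = [generate(prompt, t, seed=i) for i in range(samples_per_setting)]
    return results


results = sweep(fake_generate, "Summarize the ticket.", [0.0, 0.5, 1.0])
for t, outputs in results.items():
    print(f"temperature={t}: {len(set(outputs))} distinct of {len(outputs)}")
```

To use it with a real model, pass your own function in place of fake_generate, keeping the same three arguments. Then read the grouped outputs to find where quality starts to degrade.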
Phase 4: Record and Schedule
A setting you cannot reproduce is a setting you will retune forever.
Items
- Documented task, temperature, top-p, and prompt version together. This turns one person's tuning into a shared, reusable asset.
- Stored it where the team can find it. Settings trapped in a script or someone's memory drift invisibly.
- Defined the re-tune trigger. Tie re-tuning to model upgrades and prompt rewrites, not the calendar, per the best-practices guide.
- Noted the date of the tuning. Future readers need to know how current the decision is.
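One possible shape for the record is a small structure serialized to JSON and stored wherever the team keeps shared configuration. This is a sketch under assumptions: the field names are hypothetical and every value shown is illustrative, not a real tuning result.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class TuningRecord:
    """Fields mirror the Phase 4 items; the names themselves are assumptions."""
    task: str
    temperature: float
    top_p: float
    prompt_version: str
    tuned_on: str         # ISO date of the tuning
    retune_trigger: str   # event that should prompt a re-tune


record = TuningRecord(
    task="ticket summarization",  # illustrative values only
    temperature=0.3,
    top_p=1.0,
    prompt_version="v1",
    tuned_on=date(2024, 1, 15).isoformat(),
    retune_trigger="model upgrade or prompt rewrite",
)

serialized = json.dumps(asdict(record), indent=2)
print(serialized)
```

Keeping the settings, prompt version, and date in one record means none of them can drift independently of the others.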
A Quick-Reference Band Chart
Beyond the procedural items, it helps to keep a band chart at hand so you are not re-deriving the right neighborhood from scratch each time. These bands are starting anchors, not final answers, but they save a step.
The Bands
- Near 0 (deterministic): data extraction, classification, structured JSON, code generation, factual lookup. Anything where a single token out of place is a hard error belongs here.
- 0.2 to 0.5 (anchored but natural): customer-facing assistants, fact-bound summaries, conversational replies that must not improvise. Natural tone without license to wander.
- 0.5 to 0.8 (voiced): explainers, on-brand marketing prose, documentation with personality. Enough freedom to sound human, enough control to stay on message.
- 0.9 and above (exploratory): brainstorming, naming, fiction, multiple-candidate ideation. Pair with generating several samples and curating.
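If you prefer the chart in a form a script can consult, a lookup table is enough. The sketch below copies the bands as written; the band keys, the task-to-band mapping, and the choice to represent "near 0" as exactly 0.0 are assumptions made for illustration.

```python
# Starting bands from the chart: (low, high); None marks an open upper end.
# "Near 0" is represented as 0.0 here, which is an assumption.
BANDS = {
    "deterministic": (0.0, 0.0),
    "anchored": (0.2, 0.5),
    "voiced": (0.5, 0.8),
    "exploratory": (0.9, None),
}

# A few example tasks taken from the chart, mapped to their band.
TASK_TO_BAND = {
    "data extraction": "deterministic",
    "classification": "deterministic",
    "customer-facing assistant": "anchored",
    "explainer": "voiced",
    "brainstorming": "exploratory",
}


def starting_band(task):
    """Return (band name, (low, high)) as a starting anchor for a sweep."""
    name = TASK_TO_BAND[task]
    return name, BANDS[name]


print(starting_band("classification"))
print(starting_band("brainstorming"))
```

The result is only a starting neighborhood; the Phase 3 sweep still decides the final value.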
How to Use the Chart
Treat the chart as your Phase 1 shortcut: once you have decided whether the task has a correct answer, the chart points you to a band, and the Phase 3 sweep refines within it. Cross-check your placement against the examples guide, which works each band through a real scenario so you can confirm by analogy.
Common Reasons an Item Gets Skipped
Knowing why people skip items is the best defense against skipping them yourself.
The Usual Excuses
- "The prompt is obviously fine." It usually is not; locking it at a neutral setting takes a minute and prevents tuning on a moving target.
- "I already know the right number." Borrowed numbers are hypotheses. The verify phase is what turns a hypothesis into a justified default, and it is cheap.
- "This is just a quick task." Quick tasks become permanent ones. An undocumented setting on a throwaway prompt becomes the silent default nobody can explain six months later.
Each skipped item maps to a known failure mode, which is why the checklist exists in the first place.
Using the Checklist Well
A checklist is a tool, and tools can be misused.
Keep It Honest
Resist checking an item you did not actually do. The value comes from the verification, not the checkmark. An item you skip should be visibly skipped, with a reason, so the gap is known rather than hidden.
Adapt the Defaults
The bands and defaults here are starting points drawn from the foundational guide. Your workload may justify different defaults; what should not change is the discipline of clarifying, configuring, verifying, and recording in that order.
Adapting the Checklist to Different Task Types
A single checklist serves every task, but the emphasis shifts depending on what you are building. Knowing where to lean saves time.
For Deterministic Tasks
When the task has a correct answer, the verify phase concentrates on format integrity and reproducibility. The sweep is narrow (you are confirming the setting is low enough), and the structured-output and fixed-seed items carry the most weight. Tone and variety items matter little here.
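For JSON output, the two checks that matter most here can be sketched in a few lines: how often the output parses, and whether repeated runs agree. The helper names and the sample strings are hypothetical; in practice the list would hold real outputs from repeated runs on the same input.

```python
import json


def well_formed_rate(outputs):
    """Fraction of outputs that parse as JSON."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)


def is_reproducible(outputs):
    """True when repeated runs of the same input produced identical text."""
    return len(set(outputs)) == 1


# Invented sample: the third output is truncated and fails to parse.
sample_outputs = ['{"label": "spam"}', '{"label": "spam"}', '{"label": "spam"']
print(well_formed_rate(sample_outputs))
print(is_reproducible(sample_outputs))
```

A well-formed rate below 1.0 at your chosen setting is a signal to lower the temperature or tighten the prompt before shipping.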
For Creative Tasks
When the task wants range, the configure and verify phases shift toward generating multiple candidates. You judge the spread of outputs rather than a single result, and the success criterion is about the quality of the best candidate and the usefulness of the variety, not about consistency. The record phase still matters, because a setting that produces good range is worth reusing.
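Judging a spread rather than a single result can be made slightly more concrete with two small helpers: one that measures how many candidates are actually distinct, and one that picks the best under some scoring function. Both are illustrative assumptions; in particular, the score used below (string length) is only a placeholder for human judgment or a real quality metric, and the candidate names are invented.

```python
def distinct_ratio(candidates):
    """Share of candidates that differ after trimming and lowercasing."""
    normalized = [c.strip().lower() for c in candidates]
    return len(set(normalized)) / len(normalized)


def best_candidate(candidates, score):
    """Return the candidate with the highest score (first one wins ties)."""
    return max(candidates, key=score)


# Invented candidates; "Lumen" and "lumen " count as the same idea.
names = ["Lumen", "lumen ", "Brightpath", "Northlight"]
print(distinct_ratio(names))
print(best_candidate(names, score=len))
```

A low distinct ratio at a high temperature suggests the prompt, not the dial, is constraining the range.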
For Voiced Tasks
For on-brand or conversational work, the verify phase is mostly about weighing natural tone against the cost of drift. You are looking for the setting that sounds natural without licensing the model to wander off-message, which is the central balancing act of every voiced task.
Turning the Checklist Into a Habit
A checklist used once and forgotten provides little value. The goal is to make it reflexive.
Lower the Friction
Keep the checklist short enough to run from memory after a few uses. The four phases (clarify, configure, verify, record) are designed to be memorable so that even without the document in front of you, the order stays intact. The underlying logic comes straight from the foundational guide, so understanding the reasoning makes the steps stick.
Review at Natural Checkpoints
Tie a full run to natural moments: shipping a task, upgrading a model, or onboarding someone to the workload. Anchoring the checklist to events you already pause for means it gets run when it matters most, rather than being an extra chore you skip under pressure.
Frequently Asked Questions
How often should I run this whole checklist?
Run it fully before a task enters regular use, and again whenever you change the model or substantially rewrite the prompt. For minor tweaks, the verify and record phases alone are usually enough.
Can I shorten the checklist?
You can, but keep at least one item from each phase: clarify the target, configure one control, verify with a sweep, and record the result. Dropping a whole phase is where problems slip through.
Why is documentation on the checklist at all?
Because undocumented settings drift and get silently overwritten, producing inconsistent quality nobody can explain. Recording the decision is what makes it durable and shareable across a team.
What if I do not have time for a full sweep?
A minimal sweep of even three temperatures is far better than none. The verification step is what separates a justified default from a guess, so compress it rather than skipping it entirely.
Does this apply to creative tasks too?
Yes, with the bands shifted upward. Creative tasks still benefit from clarifying the goal, generating multiple candidates, and recording the setting that produced the best range.
Key Takeaways
- Clarify the task first: define good output, decide if it has a correct answer, and note the cost of failure.
- Configure deliberately: lock the prompt, tune one control, hold the other neutral, and place temperature in the right band.
- Verify with a sweep, choose the setting just before degradation, and test on fresh inputs before shipping.
- Record task, settings, prompt version, and date in a shared place, and trigger re-tuning on model or prompt changes.
- The checklist's value is the verification, not the checkmark; keep it honest and adapt the defaults to your workload.