Most conversation bugs are not exotic. They are the same handful of failures repeating: the assistant forgets a fact, re-asks a question, contradicts an earlier decision, or resolves a pronoun to the wrong thing. A good checklist catches these before they reach a user, and it does so without requiring you to rediscover the underlying theory each time.
This is that checklist. It is organized by the lifecycle of state in a multi-turn assistant — how you represent state, how you inject it, how you update it, and how you guard against drift. Each item includes a one-line justification so you can decide whether it applies to your situation rather than following it blindly.
Use it as a review gate before shipping a conversational feature, or as a triage guide when an existing assistant starts behaving strangely. The items assume you already understand the basics covered in The Shortest Honest Path to Working Dialogue State in Prompts.
One word on how to use a checklist like this without it becoming theater: do not run every item every time. Identify the failure you are seeing — re-asking, contradicting, looping — and jump to the section that addresses it. The full sweep is for new features and major changes; the targeted sweep is for triage.
Representing State
How you model state determines how reliably the rest of the system behaves.
The representation checklist
- State is structured, not prose. Use named fields, not a paragraph describing what happened — the model parses fields far more reliably than narrative.
- Every field has a defined value space. A payment_status that can only be pending, authorized, or completed constrains the model better than a free-text status.
- Null is explicit. Unfilled slots should read null, not be absent, so the model can distinguish "not yet collected" from "irrelevant."
- The focused entity is named. If users use pronouns, maintain a focused_entity field so references resolve consistently.
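The representation items can be sketched as a small typed state object. This is a minimal illustration, not a prescribed schema: the DialogueState class, the shipping_address slot, and the order_1042 identifier are hypothetical names introduced for the example. Only payment_status (with its three values) and focused_entity come from the checklist.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional


class PaymentStatus(Enum):
    # Closed value space: any other string is rejected at construction.
    PENDING = "pending"
    AUTHORIZED = "authorized"
    COMPLETED = "completed"


@dataclass
class DialogueState:
    # Every slot is declared and defaults to None, so an unfilled slot
    # serializes as an explicit null instead of being absent.
    payment_status: Optional[PaymentStatus] = None
    shipping_address: Optional[str] = None  # hypothetical slot
    focused_entity: Optional[str] = None    # what "it" or "that one" refers to

    def to_json(self) -> str:
        data = asdict(self)
        if self.payment_status is not None:
            data["payment_status"] = self.payment_status.value
        return json.dumps(data, indent=2)


state = DialogueState(
    payment_status=PaymentStatus("pending"),
    focused_entity="order_1042",  # hypothetical identifier
)
print(state.to_json())
```

Because shipping_address is declared but unset, it appears in the output as null, which is what lets the model tell "not yet collected" apart from "irrelevant." Passing an unlisted status such as "refunded" to PaymentStatus raises a ValueError instead of silently entering state.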
Injecting State Into the Prompt
State that exists in your database but never reaches the prompt does nothing.
The injection checklist
- State appears in a labeled block. A clear header like CURRENT STATE: lets the model locate it instantly.
- Only relevant state is injected. Send what the current turn needs, not the entire object, to keep the prompt lean.
- Backend truth overrides inferred truth. When your system knows a fact, inject the fact verbatim rather than letting the model guess. This mirrors the source-of-truth discipline in A Reusable Model for Tracking Dialogue State in Prompts.
- The instruction references state by field name. "Do not re-present offers in offers_declined" beats vague guidance like "avoid repeating yourself."
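One way the injection items could fit together is sketched below, assuming state lives in a plain dictionary. The function names render_state_block and build_prompt, the plan and internal_notes fields, and the sample values are all hypothetical. The CURRENT STATE: header and the offers_declined instruction are taken from the checklist.

```python
import json


def render_state_block(state, needed_fields):
    # Inject only the fields this turn needs; a field missing from the
    # store is rendered as an explicit null rather than dropped.
    subset = {name: state.get(name) for name in needed_fields}
    return "CURRENT STATE:\n" + json.dumps(subset, indent=2)


def build_prompt(state, needed_fields, user_message):
    # The instruction names the field it depends on instead of giving
    # vague guidance such as "avoid repeating yourself".
    instruction = "Do not re-present offers in offers_declined."
    return "\n\n".join([
        render_state_block(state, needed_fields),
        instruction,
        "User: " + user_message,
    ])


# Hypothetical backend record; values are copied verbatim, never inferred.
backend_state = {
    "plan": "annual",
    "offers_declined": ["10_percent_discount"],
    "internal_notes": "not needed this turn",
}

prompt = build_prompt(
    backend_state,
    ["plan", "offers_declined"],
    "What else can you offer?",
)
print(prompt)
```

The caller chooses which fields a turn needs, so internal_notes stays in storage and never reaches the model.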
Updating State Across Turns
State is only useful if it tracks reality as the conversation moves.
The update checklist
- Updates come from authoritative events. Payment status changes when the payment system says so, not when the model claims success.
- User revisions overwrite cleanly. When a user changes an earlier answer, replace the slot and flag dependents for re-confirmation.
- State updates happen before the next prompt is built. Stale state injected into the next turn is a common source of drift.
- Out-of-order input is tolerated. Users do not fill slots in your preferred sequence; accept fields whenever they arrive.
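The update items can be illustrated with two small functions. This is a sketch under stated assumptions: the DEPENDENTS map, the needs_reconfirmation field, the event shape with a source key, and the travel-booking slots are hypothetical and stand in for whatever your system actually uses.

```python
# Hypothetical dependency map: changing a key invalidates the listed slots.
DEPENDENTS = {"destination": ["quoted_price"]}


def apply_user_slots(state, slots):
    """Accept slots in any order; overwrite revisions and flag dependents."""
    updated = dict(state)
    flagged = set(updated.get("needs_reconfirmation", []))
    for name, value in slots.items():
        previous = updated.get(name)
        if previous is not None and previous != value:
            # The user revised an earlier answer: dependents must be re-confirmed.
            flagged.update(DEPENDENTS.get(name, []))
        updated[name] = value
    updated["needs_reconfirmation"] = sorted(flagged)
    return updated


def apply_payment_event(state, event):
    """Change payment_status only on an event from the payment system."""
    if event.get("source") != "payment_system":
        return dict(state)  # a model claiming success is ignored
    updated = dict(state)
    updated["payment_status"] = event["status"]
    return updated


state = {"destination": None, "travel_date": None,
         "quoted_price": None, "payment_status": "pending"}

# The date arrives before the destination; both are accepted as they come.
state = apply_user_slots(state, {"travel_date": "2025-03-01"})
state = apply_user_slots(state, {"destination": "Lisbon"})
state["quoted_price"] = 420  # set by a hypothetical pricing backend

# The user changes their mind; the quote is flagged for re-confirmation.
state = apply_user_slots(state, {"destination": "Porto"})

state = apply_payment_event(state, {"source": "model", "status": "completed"})
state = apply_payment_event(state, {"source": "payment_system", "status": "authorized"})
# Only now, with state fully updated, would the next prompt be built.
print(state)
```

Both functions return a new dictionary rather than mutating their input, which makes it easier to compare the state before and after a turn when debugging drift.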
Guarding Against Drift and Contradiction
The failures users notice most are contradictions, so guard against them explicitly.
The guardrail checklist
- Negative constraints are anchored to state. Tell the model what not to do based on specific fields — do not re-ask, do not re-suggest, do not re-charge.
- Decisions are sticky. Once renewal_status is agreed, downstream prompts must not reopen the objection-handling path.
- Conflicts resolve toward the system, not the model. If the model's output disagrees with canonical state, the system value wins.
- You log the exact injected state. When something goes wrong, you need to see what the model actually received, not what you assumed.
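A rough sketch of the guardrail items follows. The function names, the logger name, and the wording of the generated constraints are assumptions made for illustration. The fields renewal_status, offers_declined, and payment_status come from the checklist.

```python
import json
import logging

logger = logging.getLogger("dialogue_state")  # hypothetical logger name


def negative_constraints(state):
    """Derive 'do not' instructions from specific fields of canonical state."""
    lines = []
    if state.get("renewal_status") == "agreed":
        lines.append("renewal_status is agreed: do not reopen objection handling.")
    if state.get("offers_declined"):
        lines.append("Do not re-suggest offers in offers_declined.")
    if state.get("payment_status") == "completed":
        lines.append("payment_status is completed: do not charge again.")
    return lines


def reconcile(canonical, model_claims):
    """Keep canonical values; report the fields where the model disagreed."""
    conflicts = sorted(
        name for name, claimed in model_claims.items()
        if name in canonical and canonical[name] != claimed
    )
    return dict(canonical), conflicts


def log_injected_state(turn_id, state):
    # Log the exact payload the model received, not a reconstruction of it.
    payload = json.dumps(state, sort_keys=True)
    logger.info("turn=%s injected_state=%s", turn_id, payload)
    return payload


canonical = {"renewal_status": "agreed", "offers_declined": ["upgrade"],
             "payment_status": "authorized"}
log_injected_state(7, canonical)

# The model claims the payment completed; the system value still wins.
final_state, conflicts = reconcile(canonical, {"payment_status": "completed"})
print(negative_constraints(final_state), conflicts)
```

Returning the list of conflicting fields alongside the unchanged canonical state gives you something to alert on, since a model that repeatedly disagrees with the system is itself a signal worth investigating.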
Validating Before You Ship
A checklist is only as good as the testing behind it.
The validation checklist
- You have replay tests for known-bad conversations. Capture past failures and assert they no longer recur.
- You test the long-conversation case. Many state bugs appear only after a dozen or more turns, once the history has grown.
- You measure, not just eyeball. Track re-ask rate and contradiction rate, drawing on Reading the Signal: Metrics for Dialogue State in Prompts.
- You decided build versus buy deliberately. Whether you hand-roll state or adopt a framework should be a choice, informed by Tooling That Tracks Conversation State Across Prompt Turns.
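The replay and measurement items can be combined in a small check. The definition used here is an assumption for the sake of the sketch: a turn counts as a re-ask when the assistant asks for a slot that was already filled at that point. The turn record shape (filled_slots, asked_slots) and the sample conversations are hypothetical.

```python
def re_asked_turns(turns):
    """Return indexes of turns that ask for a slot that was already filled."""
    return [
        index for index, turn in enumerate(turns)
        if set(turn["asked_slots"]) & set(turn["filled_slots"])
    ]


def re_ask_rate(turns):
    """Fraction of assistant turns containing at least one re-ask."""
    if not turns:
        return 0.0
    return len(re_asked_turns(turns)) / len(turns)


# Hypothetical recording of a known-bad conversation: turn 1 re-asks for email.
known_bad = [
    {"filled_slots": [], "asked_slots": ["email"]},
    {"filled_slots": ["email"], "asked_slots": ["email"]},
    {"filled_slots": ["email"], "asked_slots": ["address"]},
    {"filled_slots": ["email", "address"], "asked_slots": []},
]

# The same conversation replayed after a fix.
replayed = [
    {"filled_slots": [], "asked_slots": ["email"]},
    {"filled_slots": ["email"], "asked_slots": ["address"]},
    {"filled_slots": ["email", "address"], "asked_slots": []},
]

print(re_ask_rate(known_bad))
# A replay test asserts that the captured failure no longer recurs.
assert re_asked_turns(replayed) == []
```

The same functions work on a long synthetic conversation, so one helper can back both the replay test and the long-conversation test.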
Triage: Matching Symptoms to Checklist Sections
When an existing assistant misbehaves, you do not want the whole checklist; you want the few items that fix the symptom in front of you. The symptom-to-section map below turns the checklist into a fast diagnostic.
Symptom-to-section map
- "It keeps asking me for things I already told it." Go to the injection section. The fact is in storage but is not reaching the prompt, or the instruction is not referencing it by field name.
- "It suggested something I already tried." Go to the update section. You are likely not maintaining an attempted-actions list, so the model has no record to avoid.
- "It contradicted what it said earlier." Go to the guardrails section. A finalized decision is not being treated as sticky, or the model's output is overriding canonical state.
- "It got confused about which thing I meant." Go to the representation section. You probably lack a named focused-entity field for pronoun resolution.
Why symptom-driven triage beats a full sweep
Running every item against a live incident wastes time and buries the actual cause under noise. The failures of multi-turn assistants are specific enough that the symptom almost always points to one section. The metrics in Reading the Signal: Metrics for Dialogue State in Prompts let you confirm the diagnosis with a number rather than a hunch.
Adapting the Checklist to Your Stakes
A checklist that ignores context becomes bureaucracy. The right number of items to enforce scales with how much state your conversations carry and how costly errors are.
Calibrating effort
- Low-stakes, short conversations: enforce the injection basics and skip most guardrails. The transcript approach from Transcript, Summary, or Slots: Deciding How Prompts Hold State may even let you skip structured state entirely.
- High-stakes or long conversations: enforce the full list, especially the guardrails and validation items, because the cost of a single contradiction is high.
- Agentic assistants that take actions: add extra weight to the update and guardrail sections, since a repeated action can mean a duplicated charge or a doubled API call.
The checklist is a tool, not a ritual. Apply the items that earn their place and drop the ones that do not.
Frequently Asked Questions
Do I need every item on this checklist?
No. Short, low-stakes assistants can skip the heavier guardrails. The checklist is a menu; apply items in proportion to conversation length and error cost.
What is the single most important item?
Injecting backend truth verbatim. The moment the model has to infer a fact your system already knows, you have introduced an avoidable failure point.
How do I handle a field the model is unsure about?
Keep it null and instruct the model to collect it. Ambiguity should resolve toward asking, not toward guessing and proceeding.
Should the checklist run automatically or manually?
Both. Use it as a manual review gate for new features and encode the testable items — re-ask rate, contradiction rate — into automated checks.
How does this differ from a regular QA checklist?
It targets the specific failure modes of multi-turn state: forgetting, re-asking, contradicting, and mis-resolving references. Generic QA rarely probes these.
What if my assistant has no persistent state at all?
Then most of this does not apply, and that is fine. The checklist scales with the amount of state a conversation must carry.
Key Takeaways
- Represent state as structured, named fields with explicit value spaces and explicit nulls.
- Inject state in a labeled block and have instructions reference fields by name.
- Update state from authoritative events before building the next prompt, and tolerate out-of-order input.
- Anchor negative constraints to state to prevent re-asking, re-suggesting, and contradiction.
- Validate with replay tests, long-conversation tests, and measured re-ask and contradiction rates.
- Apply checklist items in proportion to conversation length and the cost of errors.