Concrete Scenarios That Reveal Whether Your Dialogue State Holds

Most conversation bugs are not exotic. They are the same handful of failures repeating: the assistant forgets a fact, re-asks a question, contradicts an earlier decision, or resolves a pronoun to the wrong thing. A good checklist catches these before they reach a user, and it does so without requiring you to rediscover the underlying theory each time.

This is that checklist. It is organized by the lifecycle of state in a multi-turn assistant: how you represent state, how you inject it, how you update it, how you guard against drift, and how you validate the result. Each item includes a one-line justification so you can decide whether it applies to your situation rather than following it blindly.

Use it as a review gate before shipping a conversational feature, or as a triage guide when an existing assistant starts behaving strangely. The items assume you already understand the basics covered in The Shortest Honest Path to Working Dialogue State in Prompts.

One word on how to use a checklist like this without it becoming theater: do not run every item every time. Identify the failure you are seeing — re-asking, contradicting, looping — and jump to the section that addresses it. The full sweep is for new features and major changes; the targeted sweep is for triage.

Representing State

How you model state determines how reliably the rest of the system behaves.

The representation checklist

State is structured, not prose. Use named fields, not a paragraph describing what happened; models tend to read labeled fields more reliably than they read narrative.
Every field has a defined value space. A payment_status that can only be pending, authorized, or completed constrains the model better than a free-text status.
Null is explicit. Unfilled slots should read null, not be absent, so the model can distinguish "not yet collected" from "irrelevant."
The focused entity is named. If users use pronouns, maintain a focused_entity field so references resolve consistently.
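As a rough illustration of the four items above, here is a minimal sketch of a state object in Python. The payment_status values and the focused_entity field come from the checklist; the shipping_city slot, the class names, and the to_json helper are hypothetical and exist only for this example.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class PaymentStatus(str, Enum):
    # Closed value space: anything outside these three values is rejected.
    PENDING = "pending"
    AUTHORIZED = "authorized"
    COMPLETED = "completed"


@dataclass
class DialogueState:
    # Every slot is declared up front. None means "not yet collected".
    payment_status: Optional[PaymentStatus] = None
    shipping_city: Optional[str] = None   # hypothetical slot for illustration
    focused_entity: Optional[str] = None  # what "it" or "that one" refers to


def to_json(state: DialogueState) -> str:
    # asdict keeps keys whose value is None, so unfilled slots
    # serialize as an explicit null instead of disappearing.
    raw = asdict(state)
    clean = {k: (v.value if isinstance(v, Enum) else v) for k, v in raw.items()}
    return json.dumps(clean, sort_keys=True)
```

Constructing PaymentStatus from a string outside the declared set raises ValueError, which is one simple way to keep free-text values out of a constrained field.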

Injecting State Into the Prompt

State that exists in your database but never reaches the prompt does nothing.

The injection checklist

State appears in a labeled block. A clear header like CURRENT STATE: lets the model locate it instantly.
Only relevant state is injected. Send what the current turn needs, not the entire object, to keep the prompt lean.
Backend truth overrides inferred truth. When your system knows a fact, inject the fact verbatim rather than letting the model guess. This mirrors the source-of-truth discipline in A Reusable Model for Tracking Dialogue State in Prompts.
The instruction references state by field name. "Do not re-present offers in offers_declined" beats vague guidance like "avoid repeating yourself."
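One way the injection items might look in code is sketched below. The CURRENT STATE: header and the offers_declined instruction are taken from the checklist; the function names, the INSTRUCTIONS and USER labels, and the example fields are assumptions made for illustration, not a prescribed prompt layout.

```python
import json
from typing import Any, Dict, Iterable


def render_state_block(state: Dict[str, Any], relevant_fields: Iterable[str]) -> str:
    """Render only the fields this turn needs under a labeled header."""
    lines = ["CURRENT STATE:"]
    for name in relevant_fields:
        # json.dumps renders None as null, so unfilled slots stay explicit.
        lines.append(f"{name}: {json.dumps(state.get(name))}")
    return "\n".join(lines)


def build_prompt(state: Dict[str, Any], relevant_fields: Iterable[str], user_message: str) -> str:
    # The instruction names the field it depends on instead of giving vague guidance.
    instruction = "Do not re-present offers in offers_declined."
    return "\n\n".join([
        render_state_block(state, relevant_fields),
        "INSTRUCTIONS:\n" + instruction,
        "USER:\n" + user_message,
    ])
```

Passing the list of relevant fields explicitly keeps the decision about what reaches the prompt in your code, where it can be reviewed and logged.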

Updating State Across Turns

State is only useful if it tracks reality as the conversation moves.

The update checklist

Updates come from authoritative events. Payment status changes when the payment system says so, not when the model claims success.
User revisions overwrite cleanly. When a user changes an earlier answer, replace the slot and flag dependents for re-confirmation.
State updates happen before the next prompt is built. Stale state injected into the next turn is a common and easily missed cause of drift.
Out-of-order input is tolerated. Users do not fill slots in your preferred sequence; accept fields whenever they arrive.
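The update rules above can be sketched as a single function that you call before building the next prompt. This is a simplified illustration under stated assumptions: the DEPENDENTS map, the SYSTEM_OWNED set, the source labels, and the needs_reconfirmation field are all hypothetical names, not part of any particular framework.

```python
from typing import Any, Dict, List

# Hypothetical dependency map: revising a key invalidates the listed slots.
DEPENDENTS: Dict[str, List[str]] = {
    "shipping_city": ["delivery_date", "shipping_cost"],
}

# Fields that only an authoritative backend event may change.
SYSTEM_OWNED = {"payment_status"}


def apply_update(state: Dict[str, Any], field: str, value: Any, source: str) -> Dict[str, Any]:
    """Return a new state dict. source is 'user', 'model', or 'system'."""
    if field in SYSTEM_OWNED and source != "system":
        # The model or user claiming success does not change backend truth.
        return state
    previous = state.get(field)
    new_state = {**state, field: value}  # overwrite cleanly, in any order
    if previous is not None and previous != value:
        # A revision of an already filled slot flags its dependents.
        flagged = set(state.get("needs_reconfirmation", []))
        flagged.update(DEPENDENTS.get(field, []))
        new_state["needs_reconfirmation"] = sorted(flagged)
    return new_state
```

Because the function accepts any field at any time, a user who volunteers a delivery date before a city is handled the same way as one who follows the expected order.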

Guarding Against Drift and Contradiction

The failures users notice most are contradictions, so guard against them explicitly.

The guardrail checklist

Negative constraints are anchored to state. Tell the model what not to do based on specific fields — do not re-ask, do not re-suggest, do not re-charge.
Decisions are sticky. Once renewal_status is set to agreed, downstream prompts must not reopen the objection-handling path.
Conflicts resolve toward the system, not the model. If the model's output disagrees with canonical state, the system value wins.
You log the exact injected state. When something goes wrong, you need to see what the model actually received, not what you assumed.
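A minimal sketch of the last three guardrails follows, assuming the model returns proposed field values that your code merges into canonical state. The reconcile and log_injected_state functions, the locked_fields parameter, and the logger name are hypothetical; the renewal_status example comes from the checklist.

```python
import json
import logging
from typing import Any, Dict, Set, Tuple

logger = logging.getLogger("dialogue_state")  # hypothetical logger name


def log_injected_state(turn_id: int, state_block: str) -> None:
    # Record exactly what the model received, not what you assumed it received.
    logger.info("turn=%d injected_state=%s", turn_id, json.dumps(state_block))


def reconcile(
    canonical: Dict[str, Any],
    proposed: Dict[str, Any],
    locked_fields: Set[str],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Merge model-proposed values into canonical state.

    locked_fields holds finalized decisions and backend-owned facts.
    When a proposal disagrees with a filled, locked field, the canonical
    value wins and the proposal is returned in the rejected dict.
    """
    merged = dict(canonical)
    rejected: Dict[str, Any] = {}
    for name, value in proposed.items():
        current = canonical.get(name)
        if name in locked_fields and current is not None and current != value:
            rejected[name] = value
        else:
            merged[name] = value
    return merged, rejected
```

Returning the rejected proposals separately gives you something concrete to log or count when you later investigate a contradiction.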

Validating Before You Ship

A checklist is only as good as the testing behind it.

The validation checklist

You have replay tests for known-bad conversations. Capture past failures and assert they no longer recur.
You test the long-conversation case. Many state bugs only appear after a dozen turns when history grows.
You measure, not just eyeball. Track re-ask rate and contradiction rate, drawing on Reading the Signal: Metrics for Dialogue State in Prompts.
You decided build versus buy deliberately. Whether you hand-roll state or adopt a framework should be a choice, informed by Tooling That Tracks Conversation State Across Prompt Turns.
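To make the measurement item concrete, here is one possible way to compute a re-ask rate over a recorded conversation. The definition used here (the share of assistant questions that target a slot already filled in the injected state) is an assumption for illustration, as is the record format with state and asked_for keys; the asked_for label would have to come from a reviewer or a classifier.

```python
from typing import Any, Dict, List


def re_ask_rate(turns: List[Dict[str, Any]]) -> float:
    """Fraction of assistant questions that ask for an already filled slot.

    Each turn is a dict with:
      'state'     - the state injected for that turn
      'asked_for' - the slot the assistant asked about, or None
    """
    asks = [t for t in turns if t["asked_for"] is not None]
    if not asks:
        return 0.0
    re_asks = [t for t in asks if t["state"].get(t["asked_for"]) is not None]
    return len(re_asks) / len(asks)


# A replay-style check: a captured known-bad conversation, in which the
# assistant asked for the city again after the user had supplied it.
known_bad = [
    {"state": {"shipping_city": None}, "asked_for": "shipping_city"},
    {"state": {"shipping_city": "Austin"}, "asked_for": "shipping_city"},
]
```

In a replay test you would regenerate the assistant turns with the fixed prompt, relabel them, and assert that the rate for the captured conversation has dropped to zero.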

Triage: Matching Symptoms to Checklist Sections

When an existing assistant misbehaves, you do not want the whole checklist. You want the few items that fix the symptom in front of you. This symptom-indexed map turns the checklist into a fast diagnostic.

Symptom-to-section map

"It keeps asking me for things I already told it." Go to the injection section. The fact is in storage but is not reaching the prompt, or the instruction is not referencing it by field name.
"It suggested something I already tried." Go to the update section. You are likely not maintaining an attempted-actions list, so the model has no record of what to avoid.
"It contradicted what it said earlier." Go to the guardrails section. A finalized decision is not being treated as sticky, or the model's output is overriding canonical state.
"It got confused about which thing I meant." Go to the representation section. You probably lack a named focused-entity field for pronoun resolution.

Why symptom-driven triage beats a full sweep

Running every item against a live incident wastes time and buries the actual cause under noise. The failures of multi-turn assistants are specific enough that the symptom almost always points to one section. The metrics in Reading the Signal: Metrics for Dialogue State in Prompts let you confirm the diagnosis with a number rather than a hunch.

Adapting the Checklist to Your Stakes

A checklist that ignores context becomes bureaucracy. The right number of items to enforce scales with how much state your conversations carry and how costly errors are.

Calibrating effort

Low-stakes, short conversations: enforce the injection basics and skip most guardrails. The transcript approach from Transcript, Summary, or Slots: Deciding How Prompts Hold State may even let you skip structured state entirely.
High-stakes or long conversations: enforce the full list, especially the guardrails and validation items, because the cost of a single contradiction is high.
Agentic assistants that take actions: add extra weight to the update and guardrail sections, since a repeated action can mean a duplicated charge or a doubled API call.

The checklist is a tool, not a ritual. Apply the items that earn their place and drop the ones that do not.

Frequently Asked Questions

Do I need every item on this checklist?

No. Short, low-stakes assistants can skip the heavier guardrails. The checklist is a menu; apply items in proportion to conversation length and error cost.

What is the single most important item?

Injecting backend truth verbatim. The moment the model has to infer a fact your system already knows, you have introduced an avoidable failure point.

How do I handle a field the model is unsure about?

Keep it null and instruct the model to collect it. Ambiguity should resolve toward asking, not toward guessing and proceeding.

Should the checklist run automatically or manually?

Both. Use it as a manual review gate for new features and encode the testable items — re-ask rate, contradiction rate — into automated checks.

How does this differ from a regular QA checklist?

It targets the specific failure modes of multi-turn state: forgetting, re-asking, contradicting, and mis-resolving references. Generic QA rarely probes these.

What if my assistant has no persistent state at all?

Then most of this does not apply, and that is fine. The checklist scales with the amount of state a conversation must carry.

Key Takeaways

Represent state as structured, named fields with explicit value spaces and explicit nulls.
Inject state in a labeled block and have instructions reference fields by name.
Update state from authoritative events before building the next prompt, and tolerate out-of-order input.
Anchor negative constraints to state to prevent re-asking, re-suggesting, and contradiction.
Validate with replay tests, long-conversation tests, and measured re-ask and contradiction rates.
Apply checklist items in proportion to conversation length and the cost of errors.

Agency Script Editorial
