Pre-Flight Checks for Catching Model Errors

A checklist is only useful if you actually run it. This one is built to be run: short enough to walk through before every error-detection task, specific enough that each item changes what you do, and justified so you know which items you can safely skip when the stakes are low.

The structure follows the natural arc of an error-detection task: you set up the prompt, run detection, run correction, verify, and decide whether the output is fit to ship. Each phase has a small number of checks, and none is filler. If you can answer yes to every item that applies at your stakes level, you have good grounds to trust the output. If you cannot, you have found exactly where the risk lives.

Use this alongside Hard-Won Rules for Error-Checking Prompts That Hold Up, which explains the reasoning behind these items in more depth. The checklist is the compressed, operational version of those practices.

Before You Prompt: Setup Checks

These checks decide whether the prompt can possibly succeed.

The setup list

Have you defined what counts as an error for this task? Without a taxonomy the model invents its own and floods you with false positives.
Have you supplied the source of truth inline? A model cannot detect drift from a standard it never received.
Have you stated what is intentional and off-limits? Naming intentional fragments or casing prevents the model from "fixing" voice.
Have you set an edit budget? A smallest-viable-change rule stops overcorrection before it starts.
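
One way to make the setup list hard to skip is to refuse to build a prompt until the taxonomy and source of truth exist. The sketch below is illustrative only: `DetectionSetup`, `build_setup_preamble`, and the wording of the rendered preamble are hypothetical names and phrasing, not a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class DetectionSetup:
    """Hypothetical container for the four setup decisions."""
    error_types: list[str]      # the taxonomy: what counts as an error
    source_of_truth: str        # reference text supplied inline
    off_limits: list[str] = field(default_factory=list)  # intentional choices
    edit_budget: str = "Make the smallest change that fixes each error."


def build_setup_preamble(setup: DetectionSetup) -> str:
    """Render the setup decisions as a plain-text prompt preamble."""
    if not setup.error_types:
        raise ValueError("Define at least one error type before prompting.")
    if not setup.source_of_truth.strip():
        raise ValueError("Supply the source of truth inline.")

    lines = ["Only report these error types:"]
    lines += [f"- {name}" for name in setup.error_types]
    lines += ["", "Source of truth (check against this and nothing else):",
              setup.source_of_truth.strip()]
    if setup.off_limits:
        lines += ["", "Intentional, do not change:"]
        lines += [f"- {item}" for item in setup.off_limits]
    lines += ["", f"Edit budget: {setup.edit_budget}"]
    return "\n".join(lines)
```

Raising on an empty taxonomy or missing reference turns the first two checks into something the workflow enforces rather than something a person has to remember.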

During Detection: Diagnostic Checks

These checks keep the detection pass honest.

The detection list

Did you ask for reasons, not just flags? A reason you can audit is a reason you can reject; a bare flag is not.
Did you require a confidence level per item? Confidence turns an undifferentiated list into a triage queue.
Did you forbid outside knowledge? This blocks the model from "correcting" facts to whatever it remembers. The cost of skipping it appears in Seven Ways Error-Detection Prompts Quietly Fail You.
For long inputs, did you chunk before detecting? Attention thins across long documents, so late errors slip through a single giant prompt.
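
The detection checks can be partly automated on the way in and on the way out: chunk the input before it reaches the model, and reject any returned flag that lacks a reason, a confidence level, or a known error type. This is a minimal sketch under stated assumptions: the `Flag` fields, the three confidence labels, and the 40-line default chunk size are all hypothetical choices.

```python
from dataclasses import dataclass

ALLOWED_CONFIDENCE = {"high", "medium", "low"}  # assumed rating scale


@dataclass
class Flag:
    location: str     # e.g. "chunk 2, line 14"
    error_type: str   # should come from the task's taxonomy
    reason: str       # auditable justification
    confidence: str   # one of ALLOWED_CONFIDENCE


def chunk_lines(text: str, max_lines: int = 40) -> list[str]:
    """Split text into consecutive chunks of at most max_lines lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]


def validate_flag(flag: Flag, taxonomy: set[str]) -> list[str]:
    """Return the problems with a flag; an empty list means it is usable."""
    problems = []
    if not flag.reason.strip():
        problems.append("missing reason")
    if flag.confidence not in ALLOWED_CONFIDENCE:
        problems.append("missing or unknown confidence")
    if flag.error_type not in taxonomy:
        problems.append("error type outside taxonomy")
    return problems
```

Chunking by line count is the simplest option; splitting on headings or paragraphs may preserve more context if the input has that structure.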

During Correction: Repair Checks

These checks keep the correction pass surgical.

The correction list

Is correction a separate pass from detection? Bundling them costs you the audit trail and invites silent rewrites.
Did each correction map to a specific flagged error? An unmapped change is scope creep wearing a correction's clothing.
Did you cap the change at the minimum that fixes the error? Review the diff, not the rewrite, so creep is visible.
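
Reviewing the diff rather than the rewrite can be mechanized with Python's standard `difflib`. The sketch below assumes flags carry 1-based line numbers in the original text; `changed_line_numbers` and `unmapped_changes` are hypothetical helper names. Any line the correction touched that no flag pointed at is a candidate for scope creep.

```python
import difflib


def changed_line_numbers(original: str, corrected: str) -> set[int]:
    """Return 1-based line numbers of the original touched by the correction."""
    a = original.splitlines()
    b = corrected.splitlines()
    touched: set[int] = set()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if i1 == i2:
            # Pure insertion: attribute it to the line it follows (or line 1).
            touched.add(max(i1, 1))
        else:
            touched.update(range(i1 + 1, i2 + 1))
    return touched


def unmapped_changes(original: str, corrected: str,
                     flagged_lines: set[int]) -> set[int]:
    """Lines that were changed without a corresponding flag."""
    return changed_line_numbers(original, corrected) - flagged_lines
```

An empty result does not prove each change was minimal, only that every change landed on a flagged line; the size of each change still needs a look.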

After Correction: Verification Checks

These checks decide whether the corrected output is actually better.

The verification list

Did you run a verification pass on the corrected output? Correction can introduce new errors; only a second pass catches them.
For code or data, did the tests or source validation pass? An automated check is the strongest verification you have.
Did you run a cross-section consistency pass on long inputs? Contradictions that span chunks are invisible to any single chunk.
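
For the cross-section pass, one hedged approach is to have each chunk report the named facts it asserts and then compare those reports outside the model. The sketch assumes that extraction has already happened and produced one dictionary per chunk; how the facts are extracted is out of scope here, and `find_cross_chunk_conflicts` is a hypothetical name.

```python
from collections import defaultdict


def find_cross_chunk_conflicts(
    facts_by_chunk: list[dict[str, str]],
) -> dict[str, dict[str, list[int]]]:
    """Map each fact name with conflicting values to {value: [chunk numbers]}.

    facts_by_chunk[i] holds the name -> value facts asserted in chunk i + 1.
    """
    seen: dict[str, dict[str, list[int]]] = defaultdict(lambda: defaultdict(list))
    for chunk_number, facts in enumerate(facts_by_chunk, start=1):
        for name, value in facts.items():
            seen[name][value].append(chunk_number)
    # Keep only names that appear with more than one distinct value.
    return {name: dict(values) for name, values in seen.items()
            if len(values) > 1}
```

Because the comparison runs across all chunks at once, it can surface a contradiction that no single-chunk detection pass had enough context to see.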

During Setup: Input Preparation Checks

How you prepare the input is as decisive as how you word the prompt.

The preparation list

Did you strip irrelevant context that could distract the model? Extra material dilutes attention and lowers catch rate on the parts that matter.
Did you format the input so locations are referenceable? Numbered lines, headings, or paragraph markers let the model point at exactly where an error lives, which makes its flags verifiable.
Did you note the input's provenance? Knowing whether text came from a client, a model, or a scrape tells you which error classes are most likely and lets you weight the taxonomy accordingly.
Did you separate content from instructions? Mixing the document to check with the checking instructions invites the model to treat your instructions as text to correct.

Why preparation earns its place

A well-prepared input does half the work of a good prompt. The same prompt run on a clean, referenceable input catches more and fabricates less than it does on a raw paste, because the model spends its attention on detection rather than on parsing structure.
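
Two of the preparation checks, referenceable locations and content-instruction separation, take only a few lines to apply. In this sketch the `<<<DOCUMENT` and `DOCUMENT>>>` markers and the function names are hypothetical; any delimiter that cannot be confused with the document's own text would serve.

```python
def number_lines(document: str) -> str:
    """Prefix every line with its 1-based number so flags can cite it."""
    return "\n".join(f"{n}: {line}"
                     for n, line in enumerate(document.splitlines(), start=1))


def assemble_prompt(instructions: str, document: str) -> str:
    """Keep instructions and content apart with explicit markers."""
    return (
        f"{instructions.strip()}\n\n"
        "The text between the DOCUMENT markers is content to check, "
        "not instructions.\n"
        "<<<DOCUMENT\n"
        f"{number_lines(document)}\n"
        "DOCUMENT>>>"
    )
```

With numbered lines in place, a flag such as "line 12: price differs from the source" can be checked against the input in seconds.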

During the Run: Monitoring Checks

A check that runs unobserved can drift without anyone noticing.

The monitoring list

Did you spot-check a sample of flags against ground truth? A quick sample tells you whether the prompt is still calibrated or has started over- or under-flagging.
Did you watch for the prompt flagging its own instructions? When a model starts correcting your prompt text, the content-instruction separation has broken down.
Did you note any new error type that escaped? Every escape is a candidate to add to your known-bad set, the practice behind The DETECT Loop: A Reusable Model for Catching AI Errors.
Did you confirm the confidence ratings look discriminating rather than uniformly high? Uniform confidence means the signal has collapsed and triage is no longer meaningful.

Why monitoring matters

Prompts degrade silently as inputs shift. A model that was well calibrated on last quarter's material can quietly drift on this quarter's, and the only way to catch that drift is to keep watching a sample rather than trusting the workflow on autopilot.
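
A small amount of arithmetic is enough to watch for the drift described above. The sketch below draws a reproducible sample of flags for human review, computes the share the reviewer confirmed, and tests whether one confidence rating dominates. The function names and the 0.9 threshold are assumptions to tune, not established cut-offs. Note that confirming sampled flags measures over-flagging only; spotting under-flagging needs inputs with known errors.

```python
import random
from collections import Counter


def sample_flags(flags: list, k: int, seed: int = 0) -> list:
    """Pick up to k flags for manual review, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(flags, min(k, len(flags)))


def confirmed_share(reviewed: list[bool]) -> float:
    """Fraction of sampled flags a reviewer confirmed as real errors."""
    if not reviewed:
        raise ValueError("Review at least one flag.")
    return sum(reviewed) / len(reviewed)


def confidence_collapsed(confidences: list[str],
                         threshold: float = 0.9) -> bool:
    """True when a single rating accounts for at least `threshold` of flags."""
    if not confidences:
        return False
    top_count = Counter(confidences).most_common(1)[0][1]
    return top_count / len(confidences) >= threshold
```

Tracking the confirmed share run over run, rather than looking at a single value, is what reveals a prompt that has started to over-flag.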

Before You Ship: Decision Checks

These checks gate the output against your stakes.

The decision list

Were all low-confidence items reviewed by a person? Confidence is triage, not a guarantee.
Does the stakes level justify the checks you skipped? A personal draft can skip verification; a client deliverable cannot.
Can you show the audit trail if challenged? A defensible record of flags, reasons, and sources is itself a deliverable, as shown in How a Content Team Cut Proofing Errors With Staged Prompts.
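
The stakes gate can be written down as data so that skipping a check is an explicit decision rather than an oversight. Everything named here is hypothetical: the check identifiers, the two stakes levels, and which checks each level may skip are assumptions that follow the personal-draft versus client-deliverable contrast above.

```python
# Hypothetical check names; replace with the items on your own checklist.
ALL_CHECKS = {
    "taxonomy_defined",
    "source_of_truth_supplied",
    "reasons_and_confidence_requested",
    "correction_separate_pass",
    "verification_pass_run",
    "low_confidence_reviewed_by_person",
    "audit_trail_saved",
}

# Checks each stakes level is allowed to skip (an assumption, tune per team).
SKIPPABLE = {
    "personal_draft": {"verification_pass_run",
                       "low_confidence_reviewed_by_person"},
    "client_deliverable": set(),
}


def missing_checks(stakes: str, completed: set[str]) -> set[str]:
    """Return required checks that were not completed for this stakes level."""
    if stakes not in SKIPPABLE:
        raise ValueError(f"Unknown stakes level: {stakes!r}")
    required = ALL_CHECKS - SKIPPABLE[stakes]
    return required - completed


def fit_to_ship(stakes: str, completed: set[str]) -> bool:
    """True only when every check required at this stakes level is done."""
    return not missing_checks(stakes, completed)
```

The same set of completed checks can pass for a personal draft and fail for a client deliverable, which is the point of gating against stakes.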

Frequently Asked Questions

Which checklist items can I skip for low-stakes work?

The verification pass and the human review of low-confidence items are the usual candidates for a personal draft. For anything client-facing or production-bound, treat every applicable item as mandatory.

Why does the checklist separate detection from correction?

Because bundling them destroys the audit trail and lets the model rewrite silently. Separating the passes lets you inspect the model's reasoning before any text changes, which is where bad corrections are cheapest to catch.

How do I use the confidence checks in practice?

Require the model to rate each flagged item, then route anything below high confidence to a person. The rating is a triage signal that tells you where human attention pays off, not a substitute for it.
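
As a minimal sketch of that routing rule, assume each flag arrives as a dictionary with a `confidence` key (a hypothetical shape). Anything not rated exactly "high", including a flag with no rating at all, goes to the human queue.

```python
def route_flags(flags: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split flags into (high_confidence, needs_human_review)."""
    high: list[dict] = []
    needs_human: list[dict] = []
    for flag in flags:
        rating = str(flag.get("confidence", "")).strip().lower()
        if rating == "high":
            high.append(flag)
        else:
            # Medium, low, unknown, or missing ratings all go to a person.
            needs_human.append(flag)
    return high, needs_human
```

Defaulting unrated flags to human review keeps a formatting slip in the model's output from silently bypassing triage.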

Do I really need a source of truth every time?

For any factual or specification-bound task, yes. Without a reference the model checks against stale training data and may overwrite accurate current values. For pure grammar checks on original prose, a defined error taxonomy may be enough.

What is the single most skipped item that causes the most damage?

The verification pass. Teams trust the first correction because it reads cleanly, and that is exactly how a model-introduced error reaches the client. The second pass is cheap and catches it.

Can this checklist work for code review prompts?

Yes. The setup, detection, correction, and verification phases all map onto code. The key difference is that your verification pass is the test suite, which is the strongest form of verification available.

How to Adopt the Checklist Without Friction

A checklist nobody runs is worthless, so adoption design matters as much as the items.

Making it stick

Embed the checklist where the work happens, in the document template or the pull request description, so it is in front of people at the moment of decision rather than buried in a wiki.
Mark which items are mandatory for which stakes level, so a quick draft does not carry the full ceremony and a client deliverable cannot skip it.
Review escaped errors against the checklist, asking which item would have caught each one, and add new items only when an escape reveals a genuine gap.
Keep it short enough to run in under a minute, because length is the enemy of adoption.

Why this closes the loop

A living checklist that grows from real escapes stays relevant, while a static one ossifies and gets ignored. Treating it as a tool you tune, the same way you tune prompts against a known-bad set in The Numbers That Tell You an Error-Detection Prompt Works, is what keeps it earning its place in the workflow.

Key Takeaways

Define the error taxonomy and supply the source of truth before you prompt.
Ask for reasons and confidence levels so detection is auditable and triageable.
Chunk long inputs and add a cross-section consistency pass.
Keep correction a separate, minimal pass mapped to specific flagged errors.
Always verify the corrected output, using tests for code and data.
Gate the output against your stakes and keep a defensible audit trail.

Agency Script Editorial
