A checklist is only useful if you actually run it, so this one is built to be run. Every item is a yes-or-no check with a short justification, grouped by the stage of the project where it applies. Work through it before you trust a model's numbers, and again before you deploy. Most generalization disasters fail at least one of these checks, usually one of the early ones.
This is a working tool, not reading material. Keep it open next to your project and tick items off. The justifications exist so that when you are tempted to skip a step, you remember the specific cost. That cost is almost always paid in production, where it is most expensive to fix.
The checklist is current for 2026 practice, but the fundamentals it encodes are stable. Tooling changes; the bias-variance trade-off does not.
Stage 1: Data and Split Setup
Get this stage wrong and every later number is meaningless. These checks come first because leakage and bad splits invalidate everything downstream.
- [ ] Data is split into train, validation, and a final test set. You need three partitions: one to learn, one to tune, one to judge.
- [ ] The test set is locked and untouched until the end. Every peek contaminates it and inflates your estimate.
- [ ] All preprocessing is fit on training data only. Fitting scalers, imputers, or encoders on the full dataset leaks validation and test statistics into training and hides overfitting.
- [ ] No feature encodes the target or future information. Leaked features produce models that detect the answer instead of predicting it.
- [ ] Time-dependent data uses a chronological split. Random splits train on the future and leak; they are silently invalid for time series.
The leakage checks here prevent the most common silent failure, expanded in 7 Common Mistakes with AI Model Overfitting and Underfitting.
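The chronological-split and train-only preprocessing checks can be sketched in a few lines. This is a minimal illustration using only the standard library, not a substitute for a real pipeline; the function names, the 60/20/20 proportions, and the `"t"` and `"x"` field names are all assumptions made for the example.

```python
from statistics import mean, pstdev

def chronological_split(rows, train_frac=0.6, val_frac=0.2):
    """Split time-ordered rows into train/validation/test without shuffling."""
    ordered = sorted(rows, key=lambda r: r["t"])  # oldest first
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]

def fit_scaler(train_rows, key="x"):
    """Learn mean and standard deviation from the training partition only."""
    values = [r[key] for r in train_rows]
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 for a constant feature
    return mu, sigma

def apply_scaler(rows, scaler, key="x"):
    """Standardize a partition using statistics learned elsewhere."""
    mu, sigma = scaler
    return [(r[key] - mu) / sigma for r in rows]

# Hypothetical toy data: ten observations with a timestamp and one feature.
rows = [{"t": t, "x": float(t * 2)} for t in range(10)]
train, val, test = chronological_split(rows)

scaler = fit_scaler(train)             # statistics come from train only
train_x = apply_scaler(train, scaler)
val_x = apply_scaler(val, scaler)      # validation reuses the train statistics
```

The test partition is split off here but deliberately never scaled or inspected; under this sketch it would only be touched at the final evaluation.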
Stage 2: Baseline and Diagnosis
Before you tune anything, establish where you sit on the bias-variance spectrum. Skipping diagnosis is how people apply the wrong fix.
- [ ] A simple baseline model was trained and recorded. Without a reference you cannot tell whether complexity helps.
- [ ] Training error and validation error are both recorded. The gap between them is your primary diagnostic.
- [ ] You have classified the model as overfitting, underfitting, or well-fit. High error on both training and validation means underfitting; low training error with a large gap to validation error means overfitting.
- [ ] Learning curves were inspected, not just final numbers. Curves reveal whether more data would help.
- [ ] Fold-to-fold variance from cross-validation was checked. Wide swings signal high variance that a decent average can hide.
The diagnostic logic behind these checks is laid out fully in The Complete Guide to AI Model Overfitting and Underfitting.
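The classification step above can be written down as a small rule. The sketch below is one possible encoding, not a standard; `acceptable_error` and `gap_tol` are hypothetical thresholds that you would set for your own metric and problem.

```python
def diagnose(train_error, val_error, acceptable_error, gap_tol=0.05):
    """Classify a model from its training and validation error.

    acceptable_error: the error level you consider "high" (assumption).
    gap_tol: how large a train/validation gap you tolerate (assumption).
    """
    # High training error means the model cannot even fit the data it saw.
    # Validation error is normally at least as high, so both are high.
    if train_error > acceptable_error:
        return "underfitting"
    # Training looks fine but validation is much worse: the model memorized.
    if val_error - train_error > gap_tol:
        return "overfitting"
    return "well-fit"

# Hypothetical error rates for three models.
print(diagnose(0.30, 0.32, acceptable_error=0.10))  # underfitting
print(diagnose(0.02, 0.20, acceptable_error=0.10))  # overfitting
print(diagnose(0.06, 0.08, acceptable_error=0.10))  # well-fit
```

A model with both high training error and a large gap is reported as underfitting here, on the assumption that capacity should be addressed first; that ordering is a design choice, not a rule.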
Stage 3: Treatment Matched to Diagnosis
Apply only the fixes your diagnosis points to. The remedies for overfitting and underfitting are opposites, so the wrong one makes things worse.
If Overfitting
- [ ] More data was considered first. It is the most durable variance reducer.
- [ ] Regularization was tuned against validation. L2, dropout, or early stopping, applied with intent, not reflex.
- [ ] Model capacity was reduced if appropriate. Shallower, fewer parameters, lower degree.
If Underfitting
- [ ] Model capacity was increased. More layers, more trees, richer features.
- [ ] Features were engineered to add missing signal. The model cannot use information it never receives.
- [ ] Regularization was reduced, not added. Penalizing a too-simple model deepens the failure.
In Both Cases
- [ ] Only one change was made at a time, with re-measurement after each. Batching changes destroys your ability to attribute cause.
The full sequence is in A Step-by-Step Approach to AI Model Overfitting and Underfitting.
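The one-change-at-a-time rule is easy to enforce mechanically if each experiment records its configuration. The following is a small sketch under that assumption; the function names and the configuration keys (`max_depth`, `l2`, `n_features`) are hypothetical.

```python
def changed_keys(previous, current):
    """Return the configuration keys whose values differ between two runs."""
    keys = set(previous) | set(current)
    return sorted(k for k in keys if previous.get(k) != current.get(k))

def check_single_change(previous, current):
    """Raise if a new experiment changes more than one setting at once."""
    diff = changed_keys(previous, current)
    if len(diff) > 1:
        raise ValueError(f"Multiple changes in one run: {diff}")
    return diff

# Hypothetical experiment configurations.
baseline = {"max_depth": 8, "l2": 0.0, "n_features": 20}
run_a = {"max_depth": 8, "l2": 0.1, "n_features": 20}   # only l2 changed
run_b = {"max_depth": 4, "l2": 0.1, "n_features": 35}   # two settings changed

print(check_single_change(baseline, run_a))  # ['l2']
```

Calling `check_single_change(run_a, run_b)` raises, because both `max_depth` and `n_features` moved in the same step and any change in validation error could not be attributed to either one.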
Stage 4: Validation Robustness
A single split can mislead. These checks confirm your evaluation is stable enough to trust.
- [ ] k-fold cross-validation was used, not a single split (or a time-aware variant for chronological data). It gives a stable estimate and exposes variance.
- [ ] Metrics match the problem. Accuracy hides failure on imbalanced data; use precision, recall, F1, or AUC as appropriate.
- [ ] Performance was checked per segment, not just in aggregate. A model can underfit one segment while overfitting another.
- [ ] Validation data resembles production data. A validation set that shares a flaw with training hides the problem.
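Two of these checks, fold-to-fold spread and per-segment performance, lend themselves to a short illustration. The sketch below uses only the standard library; the helper names, the fold scores, and the toy labels are all made up for the example. The folds are contiguous and unshuffled, so for i.i.d. data you would typically shuffle first, and for time series you would use a forward-chaining scheme instead.

```python
from collections import defaultdict
from statistics import mean, pstdev

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def summarize_folds(scores):
    """Report the average alongside the spread a single number would hide."""
    return {"mean": mean(scores), "std": pstdev(scores),
            "range": max(scores) - min(scores)}

def recall_by_segment(y_true, y_pred, segments):
    """Recall on the positive class (label 1), computed per segment."""
    hits, positives = defaultdict(int), defaultdict(int)
    for truth, pred, seg in zip(y_true, y_pred, segments):
        if truth == 1:
            positives[seg] += 1
            hits[seg] += int(pred == 1)
    return {seg: hits[seg] / positives[seg] for seg in positives}

folds = list(kfold_indices(10, 3))

# Hypothetical per-fold scores: a decent mean with one weak fold.
fold_summary = summarize_folds([0.91, 0.90, 0.62])

# Hypothetical predictions: aggregate recall is 0.5, but segment "b" gets none.
y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0]
segments = ["a", "a", "a", "b", "b", "b"]
per_segment = recall_by_segment(y_true, y_pred, segments)
```

In this toy case the aggregate hides a complete failure on one segment, which is the situation the per-segment check is meant to surface.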
Stage 5: Pre-Deployment and Beyond
The final gate before shipping, plus the checks that keep the model honest after launch.
- [ ] The final test set was evaluated exactly once. This number is your honest production estimate.
- [ ] No tuning happened after looking at the test set. Tuning against it turns the estimate into fiction.
- [ ] A monitoring plan tracks live performance against the offline estimate. Distribution drift erodes generalization over time.
- [ ] A retraining trigger is defined. A silently decaying model is fit to a past that no longer exists.
These deployment-stage checks reflect the durable practices in AI Model Overfitting and Underfitting: Best Practices That Actually Work.
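A retraining trigger can be as simple as comparing live scores with the offline test estimate. The sketch below shows one possible rule, assuming a higher-is-better metric reported once per monitoring window; `tolerance` and `patience` are hypothetical defaults, not recommended values.

```python
def should_retrain(offline_score, live_scores, tolerance=0.05, patience=3):
    """Return True when live performance stays below the offline estimate.

    tolerance: allowed drop below the offline score (assumption).
    patience: consecutive bad windows required before triggering (assumption).
    """
    threshold = offline_score - tolerance
    consecutive = 0
    for score in live_scores:
        # Reset the counter on any healthy window so one blip does not trigger.
        consecutive = consecutive + 1 if score < threshold else 0
        if consecutive >= patience:
            return True
    return False

# Hypothetical weekly scores against an offline estimate of 0.90.
print(should_retrain(0.90, [0.89, 0.84, 0.83, 0.82]))  # True: three bad weeks in a row
print(should_retrain(0.90, [0.84, 0.88, 0.84, 0.83]))  # False: the streak was broken
```

Requiring several consecutive bad windows trades detection speed for fewer false alarms; a stricter setup might also monitor input distributions rather than waiting for labels.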
How to Use This Checklist in Practice
A checklist works only as a habit, so build it into your workflow rather than treating it as a one-time review. Run Stage 1 and Stage 2 as you set up the project, not at the end, because catching a leakage or split problem after weeks of modeling means throwing that work away. Treat the early stages as gates: do not proceed until they pass.
Two Passes, Two Moments
Run the checklist twice. The first pass happens during development, after your data is split and before you start tuning, to confirm the foundation is sound. The second pass happens just before deployment, to confirm the treatment, validation, and pre-deployment items are all satisfied.
Between those passes, keep the treatment items visible while you iterate, because the one-change-at-a-time rule and the diagnosis-before-treatment rule are easy to forget under pressure. A printed or pinned copy nearby is worth more than a checklist you read once and close.
Make Failures Loud
When an item fails, stop and fix it rather than noting it and moving on. The whole value of a checklist is that it forces failures into the open before they reach production. A check you waive under deadline pressure is exactly the check that breaks the model later, usually at the worst possible time.
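One way to make failures loud is to encode each stage as a gate that raises instead of logging. The following is a minimal sketch of that idea; `ChecklistFailure`, `enforce_gate`, and the item names are hypothetical, and how each boolean gets computed is left to your project.

```python
class ChecklistFailure(Exception):
    """Raised when one or more checklist items have not passed."""

def enforce_gate(stage, results):
    """Raise if any item in a stage failed; results maps item name to bool."""
    failed = [item for item, passed in results.items() if not passed]
    if failed:
        raise ChecklistFailure(f"{stage} blocked by: {', '.join(failed)}")
    return True

# Hypothetical Stage 1 results gathered earlier in the pipeline.
stage_1 = {
    "three-way split": True,
    "test set locked": True,
    "preprocessing fit on train only": False,
}

try:
    enforce_gate("Stage 1", stage_1)
except ChecklistFailure as exc:
    message = str(exc)
    print(message)  # Stage 1 blocked by: preprocessing fit on train only
```

Run as part of a training script or CI job, a gate like this stops the pipeline at the failed item rather than letting a waived check travel to production.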
Frequently Asked Questions
Do I need to run every item on every project?
The Stage 1 and Stage 2 checks are non-negotiable on every project, because they prevent silent invalidation of your results. The treatment and monitoring checks scale with stakes: a quick prototype may skip post-deployment monitoring, but anything customer-facing should run the full list.
What is the most commonly failed check?
Fitting preprocessing on training data only, and its cousin, locking the test set. Both are skipped under time pressure and both silently inflate your scores. They produce no error message, so the only protection is deliberately verifying them, which is why they appear first.
How is this checklist different for 2026 versus earlier?
The tooling around automated cross-validation, drift monitoring, and pipeline enforcement has matured, making several checks easier to enforce structurally rather than by hand. The underlying principles, the bias-variance trade-off and the need for honest validation, are unchanged and will remain so.
Can I automate parts of this checklist?
Yes. Preprocessing-in-pipeline, cross-validation, and drift monitoring are all automatable, and automating them is the best way to ensure they actually happen. The diagnostic judgment, classifying overfitting versus underfitting and deciding when to stop, still benefits from human review.
What should I do if a check fails late in the project?
Stop and fix it before proceeding, even if it means rework. A failed Stage 1 check in particular invalidates everything after it, so continuing is wasted effort. Late failures are painful, but shipping a model built on a failed check is far more painful when it breaks in production.
Key Takeaways
- Run Stage 1 and Stage 2 checks on every project; they prevent silent invalidation.
- Lock the test set and fit all preprocessing on training data only.
- Diagnose overfitting versus underfitting before applying any treatment.
- Match the fix to the diagnosis and change one thing at a time.
- Use cross-validation and problem-appropriate metrics, and check per segment.
- Evaluate the test set once, then monitor for drift and define a retraining trigger.