A good checklist turns a fuzzy decision into a sequence of yes-or-no questions you can answer with evidence. This one is built to be used, not admired. Each item comes with a short justification so you understand why it matters and can defend the answer it produces.
Run the checklist top to bottom for each workload you are deciding on. Do not treat your whole AI footprint as one decision; the best answer often differs per task. Keep your completed checklist as the written rationale behind your choice.
Before You Compare Any Model
These items establish whether you even have the information needed to decide. Skipping them is how teams end up arguing in circles.
Workload Definition
- Documented daily volume and burstiness. Volume drives the cost answer more than any other factor; without it you are guessing.
- Defined latency budget, including worst case. A model that passes in isolation can fail under real concurrency.
- Classified data sensitivity. Whether data falls under residency or regulatory rules can disqualify entire options immediately.
- Rated task difficulty. Frontier-level reasoning and routine extraction lead to opposite recommendations.
Hard Constraints That Eliminate Options
These are the disqualifiers. Check them early so you never waste a bake-off on a model that can never qualify.
Constraint Screen
- Does any data require an architectural residency guarantee? If yes, standard closed APIs are out; only self-hosted open weights satisfy it.
- Is there a launch deadline that rules out building infrastructure? If yes, self-hosting open models is out for now.
- Does the model's license permit your commercial use and scale? Some open-weight licenses cap usage or restrict applications; read the actual terms.
- Do your auditors accept contractual privacy guarantees? If not, a closed no-retention tier will not clear your compliance bar.
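The screen above can be written down as a small function so the answers and their consequences are recorded together. This is a minimal sketch, not a compliance tool: the class name, field names, and the two path labels (`closed_api`, `self_hosted_open`) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintAnswers:
    """Yes/no answers to the four screening questions (hypothetical field names)."""
    needs_architectural_residency: bool
    deadline_rules_out_infrastructure: bool
    open_license_permits_use: bool
    auditors_accept_contractual_privacy: bool


def screen(answers: ConstraintAnswers) -> dict[str, list[str]]:
    """Return disqualifying reasons per path; an empty list means still eligible."""
    reasons: dict[str, list[str]] = {"closed_api": [], "self_hosted_open": []}
    if answers.needs_architectural_residency:
        reasons["closed_api"].append("data needs an architectural residency guarantee")
    if not answers.auditors_accept_contractual_privacy:
        reasons["closed_api"].append("auditors do not accept contractual privacy guarantees")
    if answers.deadline_rules_out_infrastructure:
        reasons["self_hosted_open"].append("launch deadline rules out building infrastructure")
    if not answers.open_license_permits_use:
        reasons["self_hosted_open"].append("license does not permit this use or scale")
    return reasons


# Example: residency is required, everything else is fine.
result = screen(ConstraintAnswers(True, False, True, True))
for path, why in result.items():
    print(path, "eligible" if not why else f"out: {'; '.join(why)}")
```

If both paths come back with reasons, that is a finding too: a constraint has to move before any bake-off is worth running.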
Cost Modeling
Getting the cost comparison wrong is the most common source of regret. These items force an honest comparison rather than a tempting shortcut.
Total Cost of Ownership
- Closed path priced at projected volume. Per-token cost times realistic monthly tokens.
- Open path priced including GPUs and engineering time. The infrastructure bill plus the senior-engineer hours to run it; this is where teams fool themselves.
- Compared on cost per successful task, not per token. A cheaper model that fails more often is not cheaper. Our common mistakes article explains this trap.
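The arithmetic behind these three items is short enough to sketch. The function names are illustrative, and every number in the example is hypothetical; substitute your own volume, prices, rates, and measured success rates.

```python
def closed_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Closed path: per-token price times realistic monthly tokens."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def open_monthly_cost(gpu_cost: float, engineer_hours: float, hourly_rate: float) -> float:
    """Open path: infrastructure bill plus the engineering time to run it."""
    return gpu_cost + engineer_hours * hourly_rate


def cost_per_successful_task(monthly_cost: float, tasks_per_month: int, success_rate: float) -> float:
    """Divide spend by the tasks that succeeded, not by the tasks attempted."""
    if tasks_per_month <= 0 or not 0 < success_rate <= 1:
        raise ValueError("tasks_per_month must be positive and success_rate in (0, 1]")
    return monthly_cost / (tasks_per_month * success_rate)


# Hypothetical inputs, for illustration only.
closed = cost_per_successful_task(closed_monthly_cost(300_000_000, 5.00), 100_000, 0.95)
open_path = cost_per_successful_task(open_monthly_cost(2_000, 20, 150), 100_000, 0.90)
print(f"closed: ${closed:.4f} per successful task")
print(f"open:   ${open_path:.4f} per successful task")
```

The success rate should come from your own bake-off, not from a vendor claim or a public benchmark.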
Evaluation and Selection
You cannot choose on benchmarks or vibes. These items make the choice evidence-based.
Evidence Gathering
- Built an eval set of 30 to 100 real examples. This measures your task, not a public leaderboard's task.
- Ran a bake-off including at least one open and one closed candidate. A real comparison, not a single-option justification.
- Scored quality, latency at concurrency, and consistency on edge cases. Quality on easy inputs hides the failures that matter.
The step-by-step approach walks through building this eval set and bake-off in order.
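As a rough sketch of what a bake-off harness might look like, the code below scores each candidate on overall pass rate, pass rate on edge cases, and 95th-percentile latency while requests run concurrently. Everything here is an assumption for illustration: candidates are plain callables from prompt to output, each example carries its own correctness check, and the thread pool only approximates real production concurrency.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Example:
    prompt: str
    is_correct: Callable[[str], bool]  # task-specific check on the model output
    is_edge_case: bool = False


def _timed_call(model: Callable[[str], str], example: Example) -> tuple[bool, float]:
    start = time.perf_counter()
    output = model(example.prompt)
    return example.is_correct(output), time.perf_counter() - start


def run_bakeoff(
    candidates: dict[str, Callable[[str], str]],
    examples: list[Example],
    concurrency: int = 4,
) -> dict[str, dict[str, float]]:
    """Score every candidate on the same examples, issued concurrently."""
    if not examples:
        raise ValueError("the eval set is empty")
    results: dict[str, dict[str, float]] = {}
    for name, model in candidates.items():
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            outcomes = list(pool.map(lambda ex: _timed_call(model, ex), examples))
        passed = [ok for ok, _ in outcomes]
        edge = [ok for (ok, _), ex in zip(outcomes, examples) if ex.is_edge_case]
        latencies = sorted(seconds for _, seconds in outcomes)
        p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
        results[name] = {
            "pass_rate": sum(passed) / len(passed),
            "edge_pass_rate": sum(edge) / len(edge) if edge else float("nan"),
            "p95_latency_s": latencies[p95_index],
        }
    return results
```

In use, each candidate would be a small wrapper around one open and one closed model, and the examples would be the 30 to 100 real cases from your own workload.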
Operational Readiness
This section is where the open path quietly succeeds or fails. Be honest.
Can You Actually Run It?
- Confirmed your team can own inference reliability. GPU availability, autoscaling, observability, and patching are ongoing, not one-time.
- Considered managed open-model hosting as a middle ground. Open-weight benefits without raw infrastructure ownership, if your team lacks operational muscle.
- Planned for version migration on both paths. Closed providers deprecate on their schedule; self-hosting lets you defer upgrades, but the migration work is still yours when it can no longer wait.
Architecture and Future-Proofing
These items protect the decision from going stale or becoming expensive to change.
Build for Change
- Abstracted model calls behind a thin interface. Makes swapping models a contained change rather than codebase surgery.
- Designed for a routed portfolio if workloads differ. Cheapest capable model per task, not one model for everything.
- Scheduled a review every three to six months. Capability and pricing move fast; a stale decision quietly becomes wrong.
For the structure behind a routed portfolio, see our framework article.
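One way to picture the thin interface and the routed portfolio together is the sketch below. The names `TextModel`, `StubModel`, and `ModelRouter`, along with the task labels, are hypothetical; a real backend would wrap a provider SDK or a self-hosted endpoint behind the same single method.

```python
from typing import Protocol


class TextModel(Protocol):
    """The thin interface: application code depends on this, not on a vendor SDK."""

    def complete(self, prompt: str) -> str: ...


class StubModel:
    """Stand-in backend for illustration only."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


class ModelRouter:
    """Send each task type to its assigned model, falling back to a default."""

    def __init__(self, routes: dict[str, TextModel], default: TextModel) -> None:
        self._routes = dict(routes)
        self._default = default

    def complete(self, task: str, prompt: str) -> str:
        return self._routes.get(task, self._default).complete(prompt)


router = ModelRouter(
    routes={"extraction": StubModel("small-open")},
    default=StubModel("frontier-closed"),
)
print(router.complete("extraction", "Pull the invoice total."))
print(router.complete("planning", "Draft a migration plan."))
```

With this shape, swapping a model means changing one entry in `routes`, and callers never learn which backend answered.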
Decision Triggers That Should Reopen the Checklist
A model decision is not permanent, and certain events should send you back through the list. Watching for these prevents a once-correct choice from quietly becoming wrong.
Reopen When Any of These Happen
- Volume crosses an order of magnitude. The cost math that favored a closed API at one million tokens a day can flip decisively at ten million. A tenfold change in usage is the single most common trigger for a migration.
- A new compliance requirement lands. An enterprise prospect or a regulatory change can introduce a residency requirement that disqualifies your current path overnight, regardless of how well it performed before.
- A major model release shifts the frontier. When a new open-weight model closes the gap with the closed frontier on your task, your capability assumption may no longer hold and the cheaper option may now suffice.
- Provider pricing or deprecation changes. A price increase or a deprecated version forces a re-evaluation whether you wanted one or not, so it is better to run it deliberately than reactively.
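These triggers are simple enough to check mechanically at each review. The following is a sketch under stated assumptions: the snapshot fields are invented for illustration, "order of magnitude" is read as a tenfold change in either direction, and the capability flag is something you would set by hand after re-running your own eval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionSnapshot:
    """Facts recorded when the checklist was last completed (hypothetical fields)."""
    daily_tokens: float
    compliance_requirements: frozenset[str]
    price_per_million_tokens: float
    model_version_deprecated: bool = False
    open_model_matches_closed_on_task: bool = False


def reopen_reasons(baseline: DecisionSnapshot, current: DecisionSnapshot) -> list[str]:
    """List every trigger that fired since the baseline; empty means no reopen."""
    reasons: list[str] = []
    if baseline.daily_tokens > 0 and current.daily_tokens > 0:
        ratio = current.daily_tokens / baseline.daily_tokens
        if ratio >= 10 or ratio <= 0.1:
            reasons.append("volume changed by an order of magnitude")
    new_rules = current.compliance_requirements - baseline.compliance_requirements
    if new_rules:
        reasons.append("new compliance requirement: " + ", ".join(sorted(new_rules)))
    if current.open_model_matches_closed_on_task and not baseline.open_model_matches_closed_on_task:
        reasons.append("a new release changes the capability assumption")
    if (
        current.price_per_million_tokens != baseline.price_per_million_tokens
        or current.model_version_deprecated
    ):
        reasons.append("provider pricing or deprecation changed")
    return reasons


baseline = DecisionSnapshot(1_000_000, frozenset(), 5.0)
today = DecisionSnapshot(10_000_000, frozenset({"eu-residency"}), 5.0)
print(reopen_reasons(baseline, today))
```

Storing the baseline snapshot alongside the completed checklist makes this comparison possible without relying on anyone's memory.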
Common Ways the Checklist Gets Misused
Even a good checklist fails when applied carelessly. Two patterns recur. The first is checking items off optimistically without evidence: marking "team can own inference reliability" as yes because it feels achievable, not because anyone has actually run production inference. The cure is to demand proof for each item, not aspiration.
The second is treating the checklist as a one-time gate rather than a living document. Teams run it once at launch, file it away, and never revisit. The triggers above exist precisely because the right answer drifts. Keep your completed checklist where you will see it at each scheduled review, and update the answers rather than starting from a blank slate each time.
Turning the Checklist Into a Team Habit
A checklist that lives in one person's head fails the moment that person changes roles. To make it durable, attach it to your process. Add it as a required artifact in your design-review template, so no AI workload ships without a completed checklist attached. This converts the checklist from optional discipline into a standard everyone follows by default.
The payoff compounds. Once several workloads have documented checklists, you have an institutional record of why each model decision was made. New team members can read the rationale instead of relitigating it, and your scheduled reviews become quick updates rather than fresh investigations. The checklist stops being a one-off tool and becomes the connective tissue of how your organization reasons about models.
Make It Stick
- Embed it in design reviews so a completed checklist is required before any AI workload ships.
- Store completed checklists together as a searchable record of past decisions and their rationale.
- Assign an owner for the recurring review so it never quietly lapses.
Frequently Asked Questions
Should I run this checklist once for my whole company?
No. Run it per workload. The best answer for a high-volume extraction pipeline differs from the answer for a low-volume frontier-reasoning task. Treating everything as one decision produces a worse outcome for most of your tasks.
Which checklist item do teams most often skip?
Pricing the open path with engineering time included. Teams price the GPUs and stop, ignoring the senior-engineer hours that dominate the real cost. That omission is the single most common source of "open turned out more expensive" regret.
Do I need every item checked before deciding?
Not all of them. The hard-constraint and cost items are non-negotiable, and the eval and bake-off items are essential before any production commitment. The future-proofing items, like the thin interface, are strongly recommended but can be staged.
How do I keep the checklist relevant in 2026 and beyond?
The structure is durable because it is built on workload properties, not specific models. As new models ship, only your bake-off candidates change. Re-running the checklist on your review schedule keeps the decision current without rewriting it.
Key Takeaways
- Run the checklist per workload, not once for the whole organization.
- Define volume, latency, data sensitivity, and difficulty before comparing any model.
- Screen hard constraints early to eliminate disqualified options immediately.
- Price the open path with engineering time and compare on cost per successful task.
- Abstract model calls, plan for migration, and schedule a recurring review.