Public Scores Cannot Tell You What Works on Your Tasks

Public benchmarks tell you which models are roughly in contention. They cannot tell you which one will perform best on your actual work, because they weren't built from your tasks. The only way to answer that question is to run your own evaluation. This guide gives you the concrete, ordered steps to do exactly that, starting from nothing today.

You don't need a research team or a dedicated platform to begin. A spreadsheet, a clear rubric, and a few dozen real tasks will outperform any public leaderboard for deciding what to ship. The discipline is in doing the steps in order and resisting the urge to skip the unglamorous ones, like writing the rubric before you look at any outputs.

Follow these steps in sequence. Each builds on the last, and skipping ahead is the most common way evaluations produce confident but wrong conclusions.

Step 1: Define the Decision You're Making

Before collecting a single task, write down what you're trying to decide. "Which model should we use for customer support reply drafts?" is a decision. "Which model is best?" is not, and it has no answer.

Pin down the constraints

A real decision comes with constraints. Note your hard limits up front: maximum acceptable latency, cost ceiling per request, whether the model must run in a specific region, and any compliance requirements. These will eliminate otherwise strong candidates before you waste time testing them.
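
One way to apply hard limits before any testing is a simple filter over your candidate list. This is a minimal sketch: the model names, numbers, and field names are all made up for illustration, and you would substitute your own limits and measurements.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    p95_latency_s: float         # hypothetical measured or quoted latency
    cost_per_request_usd: float  # hypothetical cost per request
    regions: frozenset           # regions where the model can run


# Hard limits from the decision. Every value here is illustrative.
MAX_LATENCY_S = 3.0
MAX_COST_USD = 0.01
REQUIRED_REGION = "eu"


def passes_hard_limits(c: Candidate) -> bool:
    """A candidate stays in only if it meets every hard limit."""
    return (
        c.p95_latency_s <= MAX_LATENCY_S
        and c.cost_per_request_usd <= MAX_COST_USD
        and REQUIRED_REGION in c.regions
    )


candidates = [
    Candidate("model-a", 1.8, 0.004, frozenset({"eu", "us"})),
    Candidate("model-b", 4.5, 0.002, frozenset({"eu"})),  # too slow
    Candidate("model-c", 2.1, 0.030, frozenset({"us"})),  # too costly, wrong region
]

shortlist = [c.name for c in candidates if passes_hard_limits(c)]
print(shortlist)  # ['model-a']
```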

Name the success criteria

Decide what "good" means in concrete terms. For support drafts that might be: factually correct, on-brand tone, no unsafe promises, under 150 words. Write these down now. If you wait until you're staring at outputs, you'll rationalize whatever the models produced.
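
Writing the criteria down can be as literal as putting them in a file. In the sketch below the criterion names are illustrative labels for the support-draft example, and only the word limit is checked mechanically; the other criteria still need a human or a validated judge.

```python
# Success criteria for support drafts, recorded before any outputs exist.
# The names are illustrative labels, not a standard.
CRITERIA = (
    "factually_correct",
    "on_brand_tone",
    "no_unsafe_promises",
    "under_150_words",
)
MAX_WORDS = 150


def under_word_limit(draft: str, limit: int = MAX_WORDS) -> bool:
    """True when the draft has fewer than `limit` whitespace-separated words."""
    return len(draft.split()) < limit


short_draft = "Thanks for reaching out. Your refund was issued today."
long_draft = "word " * 150  # exactly 150 words, so not under the limit

print(under_word_limit(short_draft), under_word_limit(long_draft))  # True False
```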

Step 2: Assemble a Representative Task Set

Your evaluation is only as good as the tasks in it. The goal is a set that mirrors the real distribution of work the model will face.

Source from real data

Pull tasks from your actual logs, tickets, or documents, not from your imagination. Real inputs contain the messiness, edge cases, and ambiguity that synthetic examples miss. Aim for 50 to 200 tasks. With fewer than 50, results are too noisy to separate close models; beyond 200, manual scoring becomes a chore without adding much confidence.
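
To get a feel for how task count affects noise, you can estimate the margin of error on a pass rate. The sketch below uses the standard normal approximation for a proportion; the 80% pass rate is an assumed example value, and the approximation is rough for very small sets or rates near 0 or 1.

```python
import math


def pass_rate_margin(pass_rate: float, n_tasks: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate (normal approximation)."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_tasks)


# Assumed example: a model that passes about 80% of tasks.
for n in (20, 50, 200):
    print(n, round(pass_rate_margin(0.8, n), 3))
# 20 0.175
# 50 0.111
# 200 0.055
```

Under these assumptions, a 50-task set cannot reliably separate two models that are a few percentage points apart, and quadrupling the set to 200 only halves the margin.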

Cover the distribution deliberately

Make sure the set includes the easy common cases and the hard rare ones in roughly the proportion they occur. Then add a handful of known-tricky cases on purpose, the ones that have burned you before. You want the evaluation to surface failures, not hide them.
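
A rough sketch of proportional sampling plus hand-picked tricky cases follows. It assumes each logged task is a dict with a "category" field, which is a hypothetical shape; adapt it to however your logs are labeled.

```python
import random
from collections import defaultdict


def build_task_set(logged_tasks, target_size, tricky_tasks, seed=0):
    """Sample each category in proportion to its share of the logs,
    then append the known-tricky cases chosen by hand."""
    rng = random.Random(seed)  # fixed seed so the set is reproducible
    by_category = defaultdict(list)
    for task in logged_tasks:
        by_category[task["category"]].append(task)

    sample = []
    for tasks in by_category.values():
        share = len(tasks) / len(logged_tasks)
        k = max(1, round(share * target_size))  # keep at least one rare case
        sample.extend(rng.sample(tasks, min(k, len(tasks))))
    return sample + list(tricky_tasks)


# Illustrative logs: 80 billing, 15 shipping, 5 legal tickets.
logs = (
    [{"id": f"b{i}", "category": "billing"} for i in range(80)]
    + [{"id": f"s{i}", "category": "shipping"} for i in range(15)]
    + [{"id": f"l{i}", "category": "legal"} for i in range(5)]
)
tricky = [{"id": "x1", "category": "billing"}, {"id": "x2", "category": "legal"}]

task_set = build_task_set(logs, target_size=20, tricky_tasks=tricky)
print(len(task_set))  # 22: 16 billing + 3 shipping + 1 legal + 2 tricky
```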

Step 3: Write the Scoring Rubric First

This is the step people skip, and it's the one that separates a real evaluation from a vibe check. Write your rubric before you generate any outputs.

Choose a scoring method

For tasks with a clear right answer, like extraction or classification, score exact match automatically. For open-ended tasks, like writing, use a rubric with 3 to 5 criteria each scored on a small scale, say 0 to 2. Keep the scale small; finer scales add disagreement without adding signal.
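
Both scoring methods fit in a few lines. In this sketch the criterion names are illustrative, and normalizing case and whitespace before the exact-match comparison is an assumption you may not want for every task.

```python
def exact_match(output: str, expected: str) -> int:
    """1 if the answers match after trimming and lowercasing, else 0.
    The normalization is an assumption; drop it if case matters."""
    return int(output.strip().lower() == expected.strip().lower())


# Illustrative rubric: four criteria, each scored 0, 1, or 2.
RUBRIC = ("factually_correct", "on_brand_tone", "no_unsafe_promises", "concise")
SCALE = (0, 1, 2)


def rubric_total(scores: dict) -> int:
    """Sum the per-criterion scores, rejecting incomplete or off-scale input."""
    missing = [c for c in RUBRIC if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    off_scale = [c for c in RUBRIC if scores[c] not in SCALE]
    if off_scale:
        raise ValueError(f"scores outside {SCALE}: {off_scale}")
    return sum(scores[c] for c in RUBRIC)


print(exact_match(" Refund ", "refund"))  # 1
print(rubric_total({
    "factually_correct": 2,
    "on_brand_tone": 1,
    "no_unsafe_promises": 2,
    "concise": 1,
}))  # 6 out of a possible 8
```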

Decide who or what scores

You have three options: score by hand, use a strong model as a judge, or both. Manual scoring is most trustworthy and slowest. Model-as-judge scales but needs validation against human scores on a sample before you trust it. For a first pass, score by hand. It forces you to actually read the outputs.
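
Validating a judge can start with a plain agreement check on a hand-scored sample. The sketch below computes raw agreement and Cohen's kappa, a common chance-corrected agreement statistic; the scores are invented, and what counts as "good enough" agreement is a call you have to make for your task.

```python
from collections import Counter


def agreement_and_kappa(human, judge):
    """Return (raw agreement, Cohen's kappa) for two lists of labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa


# Invented 0-2 scores for the same ten outputs.
human_scores = [2, 2, 1, 0, 2, 1, 0, 2, 1, 2]
judge_scores = [2, 2, 1, 0, 1, 1, 0, 2, 2, 2]

observed, kappa = agreement_and_kappa(human_scores, judge_scores)
print(round(observed, 2), round(kappa, 2))  # 0.8 0.68
```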

Step 4: Run the Models Under Identical Conditions

Fairness lives in the details. The whole evaluation is invalid if you give one model an advantage you didn't give the others.

Hold everything constant

Use the same prompt, the same temperature, the same number of attempts, and the same tool access for every model. Record these settings. If you later change the prompt to help one model, you must rerun all of them. Our piece on 7 Common Mistakes with AI Model Benchmarks covers how subtly unfair setups creep in.
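
One way to make "hold everything constant" hard to violate is to define the settings once, freeze them, and record them with every output. In this sketch the models are stand-in functions; a real setup would wrap each vendor's client in a callable with the same hypothetical signature.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: settings cannot be changed mid-run
class RunSettings:
    prompt_template: str
    temperature: float
    attempts: int
    tools: tuple


SETTINGS = RunSettings(
    prompt_template="Draft a reply to this ticket:\n{ticket}",
    temperature=0.0,
    attempts=1,
    tools=(),
)


def run_all(models, tasks, settings):
    """models maps a name to a callable(prompt, temperature, tools) -> str.
    That callable is a hypothetical wrapper around each model's client."""
    records = []
    for name, call in models.items():
        for task in tasks:
            prompt = settings.prompt_template.format(ticket=task["input"])
            for attempt in range(settings.attempts):
                records.append({
                    "model": name,
                    "task_id": task["id"],
                    "attempt": attempt,
                    "settings": asdict(settings),  # recorded with every output
                    "output": call(prompt, settings.temperature, settings.tools),
                })
    return records


# Stand-in models so the sketch runs without any API.
models = {
    "model-a": lambda prompt, temperature, tools: "Reply from A",
    "model-b": lambda prompt, temperature, tools: "Reply from B",
}
tasks = [{"id": "t1", "input": "Where is my order?"},
         {"id": "t2", "input": "I was charged twice."}]

records = run_all(models, tasks, SETTINGS)
print(len(records))  # 4: two models x two tasks x one attempt
```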

Capture the raw outputs

Save every model's full output for every task before you score anything. You'll want to re-read them when results are surprising, and you'll want a record if a stakeholder questions the conclusion.
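
A spreadsheet works for this, and so does an append-only JSON Lines file, with one record per line. The sketch below assumes that format and made-up field names, and writes to a temporary directory only so the example cleans up after itself.

```python
import json
import tempfile
from pathlib import Path


def save_raw_outputs(records, path):
    """Append one JSON object per line; earlier records are never rewritten."""
    with Path(path).open("a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_raw_outputs(path):
    """Read every saved record back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo in a temporary directory; in practice, use a path you keep.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "raw_outputs.jsonl"
    save_raw_outputs([{"model": "model-a", "task_id": "t1", "output": "Hello"}], path)
    save_raw_outputs([{"model": "model-b", "task_id": "t1", "output": "Hi there"}], path)
    restored = load_raw_outputs(path)

print(len(restored), restored[0]["output"])  # 2 Hello
```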

Step 5: Score, Then Look at the Distribution

Now score every output against the rubric from Step 3. The average, though, is not the whole story.

Go beyond the mean

A model can have a great average and still fail catastrophically on 5% of tasks. For many production uses, that tail matters more than the mean, especially if those failures are unsafe or visible to customers. Always look at the worst outputs, not just the average score.
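
A small report that puts the mean next to the failure rate and the worst outputs makes the tail hard to ignore. The scores and the failure threshold below are invented for illustration.

```python
from statistics import mean


def tail_report(scores, fail_below, worst_n=3):
    """scores maps task id -> score. Returns the mean, the share of tasks
    scoring below the threshold, and the lowest-scoring tasks to re-read."""
    ordered = sorted(scores.items(), key=lambda item: item[1])
    failures = [task_id for task_id, score in ordered if score < fail_below]
    return {
        "mean": mean(scores.values()),
        "failure_rate": len(failures) / len(scores),
        "worst": ordered[:worst_n],
    }


# Invented example: 19 solid outputs and one catastrophic one.
scores = {f"t{i}": 8 for i in range(19)}
scores["t19"] = 0

report = tail_report(scores, fail_below=4)
print(report["mean"], report["failure_rate"], report["worst"][0])
# 7.6 0.05 ('t19', 0)
```

The mean of 7.6 looks healthy here; the 5% failure rate and the single zero are what you would want to read before shipping.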

Segment the results

Break scores down by task type. A model that wins overall might lose badly on your hardest segment, which could be the segment you care most about. The aggregate can hide exactly the information that should drive the decision.
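
Segmenting is a group-by. In this sketch the rows, segment names, and scores are invented to show a model that wins overall while losing on the hard segment.

```python
from collections import defaultdict
from statistics import mean


def mean_by(rows, key):
    """rows are (model, segment, score) tuples; key picks the grouping."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[key(row)].append(row[2])
    return {group: mean(values) for group, values in buckets.items()}


# Invented scores: model-a is strong on easy tasks, weak on hard ones.
rows = (
    [("model-a", "easy", 10)] * 4 + [("model-a", "hard", 2)] * 2
    + [("model-b", "easy", 7)] * 4 + [("model-b", "hard", 6)] * 2
)

overall = mean_by(rows, key=lambda r: r[0])
by_segment = mean_by(rows, key=lambda r: (r[0], r[1]))

print(overall["model-a"] > overall["model-b"])  # True
print(by_segment[("model-a", "hard")], by_segment[("model-b", "hard")])  # 2 6
```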

Step 6: Account for Cost, Speed, and Variance

The highest-scoring model isn't automatically the right choice. Now you fold in the constraints from Step 1.

Weigh the trade-offs

A model that scores two points higher on your rubric but costs three times as much and responds twice as slowly is often the wrong pick. Put quality, cost, and latency side by side and decide explicitly which you're optimizing. Make the trade-off a choice, not an accident.
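
Making the trade-off explicit can mean stating a quality floor and the one quantity you minimize above it. That rule is just one possible policy, shown here as a sketch, and every number and model name is invented.

```python
# Invented results: model-b scores two points higher,
# costs three times as much, and is twice as slow.
results = [
    {"model": "model-a", "quality": 82, "cost_usd": 0.003, "latency_s": 1.5},
    {"model": "model-b", "quality": 84, "cost_usd": 0.009, "latency_s": 3.0},
]


def pick(results, min_quality, minimize):
    """Among models at or above the quality floor, choose the one with the
    lowest value of `minimize`. Returns None if nothing clears the floor."""
    eligible = [r for r in results if r["quality"] >= min_quality]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r[minimize])["model"]


print(pick(results, min_quality=80, minimize="cost_usd"))  # model-a
print(pick(results, min_quality=83, minimize="cost_usd"))  # model-b
print(pick(results, min_quality=90, minimize="cost_usd"))  # None
```

Raising the floor from 80 to 83 changes the answer, which is the point: the choice follows from a threshold you set on purpose.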

Check run-to-run variance

Run your top candidates two or three times. If a model's score swings several points between runs, that instability is itself a finding, and it should lower your confidence in a narrow lead.
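
A crude stability check compares the gap between two models' mean scores with how far each model's own score moves between runs. The rule below, that a lead only counts if it exceeds the larger run-to-run range, is a rough heuristic assumed for this sketch and not a statistical test; the scores are invented.

```python
from statistics import mean

# Invented overall scores from three runs of each top candidate.
runs = {
    "model-a": [81.0, 82.0, 81.5],
    "model-b": [84.0, 79.0, 83.0],
}


def lead_is_stable(runs, leader, runner_up):
    """Heuristic: the lead is stable only if the gap in mean scores is larger
    than the biggest min-to-max swing either model shows across runs."""
    gap = mean(runs[leader]) - mean(runs[runner_up])
    swing = max(max(runs[m]) - min(runs[m]) for m in (leader, runner_up))
    return gap > swing


print(mean(runs["model-b"]), mean(runs["model-a"]))  # 82.0 81.5
print(lead_is_stable(runs, leader="model-b", runner_up="model-a"))  # False
```

Here model-b leads by half a point on average but swings five points between runs, so the lead should not drive the decision.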

Step 7: Document and Re-Run Over Time

Finish by writing down the decision, the evidence, and the date. Models update, and a choice that was right three months ago may not be right now.

Keep your task set and rubric as a reusable asset. When a new model ships or a vendor pushes an update, rerun the same evaluation. This turns a one-time decision into an ongoing capability. For a formal structure to make this repeatable, see A Framework for AI Model Benchmarks, and to put it on a working checklist, use The AI Model Benchmarks Checklist for 2026.
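
The written record can be a small structured file stored next to the task set. The field names and values in this sketch are illustrative, including the example date; when no date is passed, the record is stamped with the current day.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class DecisionRecord:
    decision: str
    chosen_model: str
    evidence: dict
    task_set_version: str
    # Defaults to today's date in ISO format when not given explicitly.
    decided_on: str = field(default_factory=lambda: date.today().isoformat())


# All values below are placeholders for illustration.
record = DecisionRecord(
    decision="Which model should draft customer support replies?",
    chosen_model="model-a",
    evidence={"mean_score": 7.3, "failure_rate": 0.05, "runs": 3},
    task_set_version="v1",
    decided_on="2025-01-15",
)

print(json.dumps(asdict(record), indent=2))
```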

Frequently Asked Questions

How many tasks do I need for a reliable evaluation?

Between 50 and 200 for most cases. Fewer than 50 and your scores are too noisy to trust narrow differences. More than 200 rarely changes the conclusion and makes manual scoring painful. Start at the low end and add tasks if results are close.

Should I use a model to score the outputs?

Model-as-judge scales well but only after you've validated it against human scores on a sample. For a first evaluation, score by hand so you actually read the outputs and understand the failure modes. Once you trust the judge on your task, you can automate.

What if two models score nearly the same?

Then quality isn't the deciding factor, and you should let cost, speed, or reliability break the tie. Also rerun both a few times to check whether the small gap is even stable. A two-point difference that flips between runs is not a difference.

How often should I redo the evaluation?

Whenever a model you use gets an update, when a promising new model ships, or at a regular cadence like quarterly. Models change behavior between versions, so a decision has a shelf life. Keeping your task set reusable makes re-running cheap.

Can I skip public benchmarks and just test myself?

You can, but public benchmarks are a useful free filter for deciding which models are worth your testing time. Use them to build a shortlist, then run your own evaluation on the shortlist. That's faster than testing every model on the market.

Key Takeaways

Start by defining the actual decision and its constraints, not a vague "which is best."
Build a task set of 50 to 200 real, representative examples including known-hard cases.
Write the scoring rubric before generating any outputs to avoid rationalizing results.
Run every model under identical conditions, then study the distribution and worst cases, not just the average.
Fold in cost, speed, and variance, document the decision, and rerun as models change.



Agency Script Editorial

