A model will happily hand you twenty hypotheses for almost any question you pose. The hard part is not getting hypotheses out of it; the hard part is knowing whether those hypotheses are worth a researcher's afternoon. Without measurement, teams default to vibes: the list looks impressive, someone picks the idea that sounds clever, and nobody can say afterward whether the model helped or merely entertained.
Measuring hypothesis generation is unusual because the artifact is a candidate idea, not a finished answer. You cannot grade a hypothesis as right or wrong on the spot. You can only judge whether it is plausible, testable, and distinct from what you already had, then track which generated ideas survive contact with real data. That two-stage reality, immediate quality signals plus delayed outcome signals, shapes every metric in this article.
The goal here is a practical scorecard. You want a handful of numbers that tell you when a prompt is producing usable raw material and when it is producing filler dressed up as insight.
Why Output Quality Resists Simple Scoring
Hypothesis generation sits in an awkward spot. A summarization task has a reference answer. A classification task has a label. A generated hypothesis has neither, because the entire point is to propose something you do not yet know to be true.
The two-layer evaluation problem
You are measuring two different things that arrive on different timelines:
- Immediate quality: Is this hypothesis well-formed, testable, and grounded in the context you provided? You can assess this the moment the model responds.
- Eventual value: Did the hypothesis, once tested, turn out to explain something real? This signal can take weeks and depends on experiments you may never run for most candidates.
Treating these as one number is the most common mistake. A prompt can score beautifully on immediate quality and still produce hypotheses that never pan out. Another prompt can generate awkward, oddly phrased candidates that nonetheless point at real effects.
Volume is not value
The seductive metric is raw count: this prompt produced thirty hypotheses, that one produced eight. Count tells you almost nothing on its own. A prompt that emits thirty near-duplicate restatements of the same idea is worse than one that emits eight genuinely different angles. Always pair volume with both a diversity filter and a quality filter before you read anything into it.
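One crude way to apply a diversity filter before counting is to drop candidates that are near-copies of ones already kept. The sketch below uses surface string similarity from Python's standard `difflib` module. The function name `dedupe` and the 0.8 threshold are assumptions made for illustration, and character-level similarity will miss paraphrases that share an idea but not wording, so treat it as a first pass rather than a real measure of diversity.

```python
from difflib import SequenceMatcher


def dedupe(hypotheses, threshold=0.8):
    """Keep a hypothesis only if it is not too similar to one already kept.

    threshold is an assumed cutoff on SequenceMatcher.ratio() (0.0 to 1.0);
    tune it against a sample your reviewers have labeled by hand.
    """
    kept = []  # list of (original_text, normalized_text) pairs
    for text in hypotheses:
        # Lowercase and collapse whitespace so trivial differences do not count.
        normalized = " ".join(text.lower().split())
        is_distinct = all(
            SequenceMatcher(None, normalized, seen).ratio() < threshold
            for _, seen in kept
        )
        if is_distinct:
            kept.append((text, normalized))
    return [original for original, _ in kept]
```

Counting `len(dedupe(outputs))` instead of `len(outputs)` gives a yield figure that is harder to inflate with restatements.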
The Core Metrics Worth Tracking
These are the measures that survive scrutiny. None is sufficient alone; together they form a usable picture.
Yield and usable yield
Raw yield is the number of distinct hypotheses generated per prompt. Usable yield is the subset that passes a basic gate: well-formed, on-topic, and testable with resources you actually have. The ratio of usable to raw is your signal-to-noise indicator. A prompt with high raw yield but low usable yield is wasting reviewer attention.
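The gate and the ratio described above can be sketched in a few lines. The `Candidate` fields and the `usable_ratio` name are hypothetical. The three flags are assumed to come from a human reviewer or a triage step, not from anything computed here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    well_formed: bool
    on_topic: bool
    testable_with_resources: bool  # testable with resources you actually have


def usable_ratio(candidates):
    """Return (raw_yield, usable_yield, usable / raw) for one prompt run."""
    raw = len(candidates)
    usable = sum(
        c.well_formed and c.on_topic and c.testable_with_resources
        for c in candidates
    )
    ratio = usable / raw if raw else 0.0
    return raw, usable, ratio
```

Reporting all three values together keeps a rising raw count from hiding a falling ratio.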
Novelty against a known baseline
Novelty asks how many generated hypotheses were not already on your team's list. Capture the human-generated baseline before you run the model, then count how many model outputs are genuinely new rather than rephrasings of existing ideas. This is where models often earn their keep, surfacing angles a domain expert overlooked because they were too close to the problem.
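A minimal sketch of the novelty count, assuming a reviewer has already decided, for each model hypothesis, which baseline idea it restates (if any). The function name and the ID scheme are illustrative assumptions. The judgment of what counts as a rephrasing stays with the reviewer.

```python
def novelty_rate(matches):
    """Share of model hypotheses that restate nothing on the baseline list.

    matches maps each model hypothesis ID to the baseline ID it restates,
    or to None when the reviewer judged it genuinely new.
    """
    if not matches:
        return 0.0
    novel = [hyp_id for hyp_id, baseline_id in matches.items() if baseline_id is None]
    return len(novel) / len(matches)
```

Recording the matched baseline ID, and not just a yes/no flag, also shows which existing ideas the model keeps rediscovering.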
Testability rate
A hypothesis you cannot test is a guess. Score each candidate on whether it specifies a measurable relationship, names the variables involved, and implies a clear way to confirm or refute it. The testability rate is the share of outputs that clear this bar. Low testability usually traces back to a vague prompt that never asked for falsifiable statements.
Downstream hit rate
This is the metric that ultimately matters and the one teams skip because it is slow. Of the hypotheses you actually tested, what fraction held up? Tracking this closes the loop between prompt quality and real-world value. Even a small, lagging sample is more honest than any immediate proxy.
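Because the tested sample is usually small, it can help to report the hit rate with an uncertainty range so that 3 of 10 is not read as a precise 30 percent. The sketch below uses the Wilson score interval, which is a choice made for this illustration and not something the metric requires. The function name and the default `z=1.96` (roughly a 95 percent interval) are assumptions.

```python
import math


def hit_rate_with_interval(hits, tested, z=1.96):
    """Return (hit_rate, lower, upper) using the Wilson score interval.

    Returns None when nothing has been tested yet, since no rate exists.
    """
    if tested == 0:
        return None
    p = hits / tested
    denom = 1 + z**2 / tested
    center = (p + z**2 / (2 * tested)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / tested + z**2 / (4 * tested**2)) / denom
    )
    return p, max(0.0, center - half_width), min(1.0, center + half_width)
```

A wide interval is not a reason to ignore the number. It is a reminder to keep adding tested hypotheses to the log before comparing two prompts on hit rate.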
Instrumenting the Pipeline
Metrics are only as good as the data collection behind them. The instrumentation does not need to be elaborate, but it does need to be consistent.
Log the inputs, not just the outputs
Capture the exact prompt, the context documents supplied, the model and settings used, and the full generated list. Without the inputs, you cannot tell whether a quality change came from a prompt change, a context change, or model drift. The same discipline shows up in "How to Run a Prompt Experiment Without Fooling Yourself," where attribution failures quietly corrupt conclusions.
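One way to keep this logging consistent is a single record type written as one JSON line per run. The sketch below is an assumption about shape, not a required schema: the `RunRecord` fields, the `to_jsonl_line` helper, and the short prompt hash are all hypothetical names chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class RunRecord:
    prompt: str          # the exact prompt text, not a template name
    context_docs: list   # identifiers or paths of the documents supplied
    model: str           # model name as reported by your provider
    settings: dict       # e.g. temperature, max tokens
    hypotheses: list     # the full generated list, unfiltered
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def prompt_hash(self):
        # A short hash makes it easy to group runs that used the same prompt.
        return hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()[:12]


def to_jsonl_line(record):
    """Serialize one run as a single JSON line suitable for an append-only log."""
    payload = asdict(record)
    payload["prompt_hash"] = record.prompt_hash()
    return json.dumps(payload, sort_keys=True)
```

Appending each line to a file gives a log you can later group by `prompt_hash`, `model`, or context set when a metric moves.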
Use a rubric, not gut feel
Define a short rubric for human scoring: well-formed (yes/no), testable (yes/no), novel against baseline (yes/no), plausibility (1 to 5). A two-rater check on a sample keeps scoring honest. Rubrics are tedious but they make your novelty and testability rates comparable across weeks.
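The rubric above maps directly onto a small record plus a summary function. This is a sketch under assumed names (`RubricScore`, `summarize`). The only rule it enforces is the 1 to 5 plausibility range from the rubric.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricScore:
    hypothesis_id: str
    well_formed: bool
    testable: bool
    novel: bool          # novel against the documented baseline
    plausibility: int    # 1 (implausible) to 5 (highly plausible)

    def __post_init__(self):
        if not 1 <= self.plausibility <= 5:
            raise ValueError("plausibility must be between 1 and 5")


def summarize(scores):
    """Aggregate one reviewer's rubric scores into comparable rates."""
    n = len(scores)
    if n == 0:
        return {}
    return {
        "well_formed_rate": sum(s.well_formed for s in scores) / n,
        "testability_rate": sum(s.testable for s in scores) / n,
        "novelty_rate": sum(s.novel for s in scores) / n,
        "mean_plausibility": mean(s.plausibility for s in scores),
    }
```

Storing the per-hypothesis scores, not only the summary, lets a second rater score the same IDs later.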
Separate the test set from the working set
Hold out a fixed set of representative questions you re-run whenever you change a prompt. This is your benchmark. The questions you generate hypotheses for in daily work are your working set and will drift; the benchmark stays stable so you can see whether a change actually improved things.
Reading the Signal
Numbers without interpretation create false confidence. A few patterns recur.
When high yield is a warning
If yield jumps but usable yield and novelty stay flat, the model is padding. This often follows a prompt edit that loosened constraints. The fix is to tighten the gate, not celebrate the bigger list.
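The padding pattern can be turned into a simple check that compares two benchmark runs. The dictionary keys, the function names, and both thresholds below are assumptions for illustration. What counts as a "jump" or as "flat" depends on how noisy your own benchmark is.

```python
def relative_change(old, new):
    """Fractional change from old to new, guarding against a zero baseline."""
    if old == 0:
        return float("inf") if new > 0 else 0.0
    return (new - old) / old


def looks_like_padding(before, after, min_yield_gain=0.25, flat_tolerance=0.05):
    """Flag a run where raw yield rose while usable yield and novelty did not.

    before and after are dicts with the keys raw_yield, usable_yield and
    novel_count, measured on the same benchmark questions.
    """
    yield_gain = relative_change(before["raw_yield"], after["raw_yield"])
    usable_gain = relative_change(before["usable_yield"], after["usable_yield"])
    novelty_gain = relative_change(before["novel_count"], after["novel_count"])
    return (
        yield_gain >= min_yield_gain
        and usable_gain <= flat_tolerance
        and novelty_gain <= flat_tolerance
    )
```

A flag from this check is a prompt to read the outputs, not a verdict on its own.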
When novelty drops over time
Falling novelty against a baseline can mean your team has caught up to the model, which is fine, or that the prompt has collapsed onto a narrow region of ideas. Inspect the actual outputs before deciding. The same diagnostic instinct appears in "The Numbers Behind a Prompting Investment," where a moving metric is only useful once you know why it moved.
Trusting the lagging metric
Immediate metrics are proxies. When immediate quality and downstream hit rate disagree over a meaningful sample, believe the hit rate. It is the only measure tied to reality, and it should anchor any decision about whether the practice earns its place. For teams formalizing this, "Standards That Keep a Team's Hypothesis Work Honest" covers how to make hit-rate tracking a shared habit rather than one analyst's spreadsheet.
Frequently Asked Questions
How many hypotheses should a good prompt produce?
There is no universal number, and chasing one is a trap. The right target is the count at which usable yield plateaus. If asking for more candidates produces only near-duplicates, you have passed the useful point. As a rough starting range, many teams land somewhere between eight and fifteen distinct candidates per question, but confirm that against your own plateau before treating it as a target.
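Finding the plateau means running the same question at several requested counts and watching where usable yield stops growing. The sketch below assumes you have already collected those pairs. The function name, the `min_gain` default, and the sample numbers are hypothetical.

```python
def plateau_point(curve, min_gain=1):
    """Return the smallest requested count after which usable yield stops growing.

    curve is a list of (requested_count, usable_yield) pairs sorted by
    requested_count. min_gain is the assumed minimum increase in usable
    yield that still counts as progress between two adjacent points.
    """
    for (count, usable), (_, next_usable) in zip(curve, curve[1:]):
        if next_usable - usable < min_gain:
            return count
    # No flattening observed: the plateau, if any, lies beyond the last point.
    return curve[-1][0] if curve else None


# Illustrative numbers only, not measurements.
example_curve = [(5, 4), (10, 7), (15, 9), (20, 9), (30, 9)]
print(plateau_point(example_curve))
```

With these made-up numbers the function returns 15, because asking for 20 or 30 candidates adds no usable hypotheses beyond the 9 already obtained.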
Can I automate the quality scoring with another model?
Partially. A second model can flag malformed or off-topic hypotheses reliably and can estimate testability reasonably well. It is much weaker at judging novelty and plausibility in a specialized domain, because it lacks your team's baseline and context. Use it to triage, then have a human score the survivors.
What is the single most important metric if I can only track one?
Downstream hit rate, despite being the slowest to collect. Every immediate metric is a proxy for it. If you can only afford one number, track the fraction of tested hypotheses that held up, even on a tiny sample.
How do I measure novelty without a documented baseline?
Build one before your first run. Have the relevant experts brainstorm the question for fifteen minutes and write down their hypotheses. That list becomes your baseline. Without it, novelty is unmeasurable and you are left guessing whether the model contributed anything new.
Should plausibility be scored by the prompt author?
No, or at least not only. Authors are biased toward seeing value in their own prompt's output. Use an independent reviewer, ideally a domain expert who did not write the prompt, and check inter-rater agreement on a sample to catch drift.
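One common way to quantify inter-rater agreement on yes/no or categorical rubric items is Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. The statistic is a choice made for this sketch, not something the rubric prescribes, and the function name is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled at random with their own frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance. A falling kappa across weeks suggests the rubric wording needs tightening.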
How often should I re-run the benchmark?
Whenever you change the prompt, the model, or the context strategy, and on a regular cadence even when nothing changed, to catch model drift from provider updates. A monthly baseline run plus per-change runs is a reasonable rhythm for most teams.
Key Takeaways
- Hypothesis generation has two evaluation layers: immediate quality you can score now, and downstream value that arrives slowly. Do not collapse them into one number.
- Track usable yield, novelty against a documented baseline, testability rate, and downstream hit rate together; any one alone misleads.
- Volume is a vanity metric until you pair it with a quality and diversity filter.
- Instrument the inputs, not just the outputs, or you cannot attribute changes to anything.
- When immediate metrics and downstream hit rate disagree, trust the hit rate; it is the only measure tied to reality.