Most people's first attempt at hypothesis generation with a model goes like this: type "give me some hypotheses about why our signups dropped," get back a tidy list of generic guesses, and walk away vaguely unimpressed. The technique works, but a cold one-line prompt extracts almost none of its value. The difference between a useless list and a genuinely helpful one comes down to a few preparation steps and a more deliberate structure.
This walkthrough takes you from nothing to a first real result: a filtered set of testable hypotheses you would actually be willing to act on. It assumes no special tooling, just a capable model and a clear problem. The point is to reach one credible outcome, not to master every refinement.
After your first run, the pattern repeats for almost any investigative question, from a data anomaly to a product metric to a research puzzle.
Before You Prompt Anything
The quality of your hypotheses is decided mostly before you write the prompt. Skipping this stage is why first attempts disappoint.
State the question sharply
A vague question yields vague hypotheses. "Why are signups down" is too loose. "Why did mobile signups drop 18 percent in the two weeks after the April 3 onboarding redesign" gives the model variables, a timeframe, and a likely trigger to reason about. Spend real effort here; it pays back the most.
Gather the context the model will need
Write down what you already know: relevant metrics and their movements, recent changes, things you have already ruled out, and constraints. This becomes the context you feed the model. Grounding the prompt in your actual situation is the single biggest upgrade over a cold prompt, a theme that recurs across Hypothesis Generation Is Shifting From Brainstorm to Pipeline.
Record your own baseline
Before the model sees anything, jot down the hypotheses you already have. This serves two purposes: it tells the model what not to repeat, and it lets you judge later whether the model added anything new.
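One way to keep these three preparation steps honest is to write them into a small structure before touching the model. The sketch below is only an illustration: the class name `InvestigationBrief`, its fields, and the `missing_parts` helper are hypothetical, and the baseline entry is an invented placeholder for whatever you wrote down yourself.

```python
from dataclasses import dataclass, field


@dataclass
class InvestigationBrief:
    """Preparation notes gathered before any prompt is written (hypothetical structure)."""
    question: str
    context: list = field(default_factory=list)    # metrics, recent changes, constraints
    ruled_out: list = field(default_factory=list)  # explanations already excluded
    baseline: list = field(default_factory=list)   # hypotheses you already hold

    def missing_parts(self):
        """Return the names of preparation steps that are still empty."""
        parts = {
            "question": self.question.strip(),
            "context": self.context,
            "baseline": self.baseline,
        }
        return [name for name, value in parts.items() if not value]


brief = InvestigationBrief(
    question=(
        "Why did mobile signups drop 18 percent in the two weeks "
        "after the April 3 onboarding redesign?"
    ),
    context=[
        "Desktop signups held steady",
        "The redesign added a new email-verification step",
    ],
    ruled_out=["Tracking error (two independent analytics sources agree)"],
    # Illustrative baseline entry; yours will differ.
    baseline=["The new verification step is causing drop-off"],
)

print(brief.missing_parts())
```

If `missing_parts()` returns anything, that is a hint to go back and finish preparing before prompting.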
Structuring the First Prompt
With preparation done, the prompt itself becomes straightforward.
Give role, context, and a clear ask
Tell the model the situation, supply your context notes, list what you have already considered, and ask specifically for testable hypotheses, not explanations or recommendations. Request that each hypothesis name the variables involved and suggest how it could be confirmed or refuted. That single instruction transforms the output from guesses into candidates you can act on.
Ask for variety on purpose
Add an instruction to span different categories of cause (technical, behavioral, external, measurement-related) so the model does not cluster around one obvious area. Explicitly asking for diversity prevents the common failure of ten restatements of the same idea.
Request more than you need
Ask for a dozen or so candidates even though you will keep far fewer. Generating wide and filtering down beats asking for a small polished list, because the filtering is where your judgment adds value.
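The pieces above can be assembled mechanically. The following is a minimal sketch, not a recommended template: the function name `build_prompt`, the wording of the role line, and the default of twelve candidates are all assumptions chosen to mirror the instructions in this section.

```python
def build_prompt(question, context, already_considered, n_candidates=12):
    """Assemble a hypothesis-generation prompt from preparation notes.

    The phrasing is illustrative; adapt it to your own situation.
    """
    categories = ["technical", "behavioral", "external", "measurement-related"]
    lines = [
        # Role and situation (assumed wording).
        "You are helping investigate a change in a product metric.",
        f"Question: {question}",
        "Context:",
        *[f"- {item}" for item in context],
        "Already considered (do not repeat these):",
        *[f"- {item}" for item in already_considered],
        # The clear ask: testable hypotheses, with variables and a test.
        f"Propose {n_candidates} testable hypotheses, not explanations or recommendations.",
        "For each one, name the variables involved and say how it could be confirmed or refuted.",
        # Variety on purpose.
        "Spread the hypotheses across these categories of cause: " + ", ".join(categories) + ".",
    ]
    return "\n".join(lines)


prompt = build_prompt(
    question="Why did mobile signups drop 18 percent in the two weeks after the April 3 onboarding redesign?",
    context=["Desktop signups held steady", "The redesign added a new email-verification step"],
    already_considered=["The new verification step is causing drop-off"],
)
print(prompt)
```

Keeping the prompt in a function rather than retyping it makes later refinements easy to compare, since each change is a small, visible edit.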
Filtering Down to What Matters
A raw list is not a result. The filtering step is what makes the exercise worthwhile.
Gate on testability and plausibility
Drop anything you cannot test with resources you have, and anything that is implausible given what you already know. Be ruthless; a shorter list of strong candidates beats a long list padded with filler. This is the same usable-yield discipline detailed in Which Numbers Tell You a Hypothesis Prompt Is Working.
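As a rough sketch of this gate, assume you have read each candidate and marked it yourself as testable and plausible. The `ReviewedCandidate` class and the `gate` function are hypothetical names, and the judgments in the example are illustrative; the code only applies decisions a human has already made.

```python
from dataclasses import dataclass


@dataclass
class ReviewedCandidate:
    text: str
    testable: bool   # can you test it with resources you have?
    plausible: bool  # is it consistent with what you already know?


def gate(candidates):
    """Keep only candidates that pass both the testability and plausibility checks."""
    return [c for c in candidates if c.testable and c.plausible]


reviewed = [
    ReviewedCandidate("New verification step causes drop-off", testable=True, plausible=True),
    # Marked implausible here because two analytics sources already agree.
    ReviewedCandidate("Mobile attribution changed with the release", testable=True, plausible=False),
    ReviewedCandidate("Redesign renders poorly on certain mobile devices", testable=True, plausible=True),
]

survivors = gate(reviewed)
print([c.text for c in survivors])
```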
Flag what is genuinely new
Compare the survivors against the baseline you wrote earlier. The hypotheses that are both new and testable are the model's real contribution. These are usually where the surprise value lives, the angle a domain expert was too close to notice.
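A crude first pass at this comparison can be automated, with the caveat that string matching only catches near-verbatim repeats. The sketch below, with the hypothetical helpers `normalize` and `flag_new`, marks a survivor as new when its normalized text is absent from the baseline; paraphrases of a baseline idea still need a human read.

```python
def normalize(text):
    """Lowercase, collapse whitespace, and drop a trailing period."""
    return " ".join(text.lower().split()).rstrip(".")


def flag_new(survivors, baseline):
    """Pair each surviving hypothesis with True if it is not in the baseline."""
    known = {normalize(item) for item in baseline}
    return [(text, normalize(text) not in known) for text in survivors]


baseline = ["The new verification step is causing drop-off"]
survivors = [
    "The redesign renders poorly on certain mobile devices",
    "the new verification step is causing drop-off.",
]

flags = flag_new(survivors, baseline)
for text, is_new in flags:
    print("NEW " if is_new else "KNOWN", text)
```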
Prioritize by test cost and impact
Rank the survivors by how cheap they are to test and how much they would explain if true. Start with cheap, high-explanatory candidates. This turns your filtered list into an ordered plan rather than an undifferentiated pile.
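One simple way to turn this ranking into an ordered list is to score each survivor on two rough scales and sort by impact per unit of cost. The 1-to-5 scales, the `RankedCandidate` class, and the scores below are all assumptions made for illustration; any consistent scoring you trust would serve.

```python
from dataclasses import dataclass


@dataclass
class RankedCandidate:
    text: str
    test_cost: int  # 1 = cheap to test, 5 = expensive (hypothetical scale)
    impact: int     # 1 = explains little, 5 = explains most of the change


def prioritize(candidates):
    """Order by impact per unit of cost, highest first; break ties by lower cost."""
    return sorted(candidates, key=lambda c: (-c.impact / c.test_cost, c.test_cost))


# Scores are invented for the example, not measured.
plan = prioritize([
    RankedCandidate("Verification step adds friction", test_cost=3, impact=5),
    RankedCandidate("Marketing change shifted traffic mix", test_cost=2, impact=3),
    RankedCandidate("Redesign renders poorly on some devices", test_cost=1, impact=4),
])

for rank, candidate in enumerate(plan, start=1):
    print(rank, candidate.text)
```

The ratio is deliberately blunt. Its job is to force an explicit ordering you can argue with, not to produce a precise number.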
Closing the Loop on Your First Run
One result is a start, but the habit that makes this compound is recording what happened.
Note which hypotheses you test
Keep a simple record of which candidates you actually tested and what you found. Even a single spreadsheet row per test starts to accumulate the outcome data that, over time, tells you which prompts produce ideas that survive. Skipping this is the most common reason teams never learn whether the practice helps.
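If a spreadsheet feels like too much ceremony, a CSV log does the same job. This is a minimal sketch using Python's standard `csv` module; the column names and the `append_outcome` helper are hypothetical, and the example row is invented to show the shape of a record, not a real finding. An in-memory buffer stands in for an open file.

```python
import csv
import io

# Hypothetical columns; keep whatever you will actually fill in.
FIELDS = ["hypothesis", "prompt_version", "test", "outcome"]


def append_outcome(stream, row, write_header=False):
    """Write one tested hypothesis to an open text stream as a CSV row."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerow(row)


log = io.StringIO()  # stands in for a file opened with newline=""
append_outcome(
    log,
    {
        "hypothesis": "Redesign renders poorly on some devices",
        "prompt_version": "v1",
        "test": "Signup completion by device type",
        "outcome": "illustrative placeholder",
    },
    write_header=True,
)
print(log.getvalue())
```

Recording the prompt version alongside the outcome is what later lets you connect surviving ideas back to the prompt that produced them.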
Refine the prompt from what you learned
If the output skewed generic, your context was probably thin; add more. If it clustered narrowly, push harder on the diversity instruction. Small adjustments compound. When you are ready to go deeper, Pushing Hypothesis Prompts Past the Obvious covers multi-pass techniques that build on this foundation.
A Worked Example to Anchor the Pattern
Abstract steps are easier to apply against a concrete case. Here is a typical one.
The situation
Suppose mobile signups dropped 18 percent in the two weeks after an onboarding redesign. Your sharpened question names the metric, the magnitude, the platform, and the likely trigger. Your context notes record that desktop signups held steady, that the redesign added a new email-verification step, and that you have already ruled out a tracking error because two independent analytics sources agree.
What the prompt should surface
A well-constructed prompt, given that context and your existing baseline, should return candidates spanning categories: a friction hypothesis (the new verification step is causing drop-off), a measurement hypothesis (mobile attribution changed with the release), an external hypothesis (a coincident marketing change shifted traffic mix), and a technical hypothesis (the redesign renders poorly on certain mobile devices). The value is in getting that spread, not in any single guess.
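A quick way to check for that spread is to tag each returned candidate with a category and look for gaps. The sketch below assumes you assign the tags yourself; the `category_gaps` helper is hypothetical, and filing the friction hypothesis under "behavioral" is a judgment call made for the example.

```python
from collections import Counter

EXPECTED = {"technical", "behavioral", "external", "measurement"}


def category_gaps(tagged):
    """Return (counts per category, sorted list of categories with no candidate)."""
    counts = Counter(category for category, _ in tagged)
    return counts, sorted(EXPECTED - set(counts))


tagged = [
    ("behavioral", "The new verification step is causing drop-off"),
    ("measurement", "Mobile attribution changed with the release"),
    ("external", "A coincident marketing change shifted traffic mix"),
    ("technical", "The redesign renders poorly on certain mobile devices"),
]

counts, gaps = category_gaps(tagged)
print(dict(counts), gaps)
```

A non-empty gap list is a cue to push harder on the diversity instruction in the next run.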
Filtering it down
You drop the measurement hypothesis because your two-source check already ruled it out, gate the rest on testability, and notice that the device-rendering hypothesis was not on your baseline list. That genuinely new, cheaply testable candidate (check signup completion by device type) becomes your first experiment. This is the divergence-and-filter rhythm that the more advanced techniques in Pushing Hypothesis Prompts Past the Obvious build on directly.
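The first experiment itself can be very small. As a sketch, assume you can export signup attempts as pairs of device type and whether the signup completed; the `completion_by_device` helper, the device labels, and the data below are all invented for illustration.

```python
from collections import defaultdict


def completion_by_device(attempts):
    """Compute the signup completion rate for each device type."""
    totals = defaultdict(lambda: [0, 0])  # device -> [completed, started]
    for device, completed in attempts:
        totals[device][1] += 1
        if completed:
            totals[device][0] += 1
    return {device: done / started for device, (done, started) in totals.items()}


# Made-up attempts: (device type, signup completed?)
attempts = [
    ("phone_a", True), ("phone_a", True), ("phone_a", False), ("phone_a", True),
    ("phone_b", False), ("phone_b", False), ("phone_b", True), ("phone_b", False),
]

rates = completion_by_device(attempts)
for device, rate in sorted(rates.items()):
    print(f"{device}: {rate:.0%}")
```

A large gap between device types would support the rendering hypothesis; similar rates across devices would count against it. Either way, the result is worth recording.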
Frequently Asked Questions
Do I need a specialized or expensive model for this?
No. Any reasonably capable general model handles hypothesis generation well. Your preparation (a sharp question, real context, an explicit baseline) matters far more than model choice for a first result. Upgrade the model only after you have exhausted the gains from better prompting.
How much context is enough to include?
Enough to make the question specific and to rule out the obvious, but not a data dump. Summarize the relevant metrics, recent changes, and what you have already excluded. A focused half-page of context beats pasting raw logs the model has to wade through.
What if all the hypotheses look obvious?
That is usually a sign your context was too thin or your diversity instruction too weak. Add what you have already considered to the prompt so the model is pushed past it, and explicitly ask for non-obvious or contrarian angles. Obvious output is a prompt problem more than a model limitation.
Should I ask for explanations along with the hypotheses?
Ask for a brief justification and a suggested test for each, but not long essays. You want enough reasoning to judge plausibility and testability, not a wall of text. Concise candidates are easier to filter and rank.
How do I avoid being misled by a confident but wrong hypothesis?
Treat every candidate as a guess to be tested, never as a finding. The model's confidence carries no information about truth. Your gate on testability exists precisely so that truth is decided by experiment, not by how persuasive the wording sounds.
How long should a first run take?
Once your context is prepared, the prompt and review take well under an hour for most questions. The preparation (sharpening the question and gathering context) is where the time goes, and it is time well spent.
Key Takeaways
- Quality is decided before you prompt: a sharp question, real context, and a written baseline do most of the work.
- Ask explicitly for testable hypotheses that name variables and suggest how to confirm or refute them, and ask for deliberate variety.
- Generate wide, then filter ruthlessly on testability, plausibility, and novelty; the filtering is where your judgment adds value.
- Record which candidates you tested and what happened, even minimally; that outcome data is what makes the practice compound.
- Model choice matters far less than preparation for a first credible result.