Once you accept that vector search means finding nearest neighbors, the interesting decisions are about how. There is no single best way to index vectors; there are competing approaches, each trading away something to gain something else. Flat search is exact but slow at scale. Graph-based indexes are fast but memory-hungry. Inverted-file methods balance differently again. Picking among them is a real engineering choice, not a default.
This article lays out the main approaches, names the axes that genuinely separate them, and offers a decision rule you can apply to your own workload. The goal is not to crown a winner but to let you reason from your constraints to the right pick, and to recognize when your constraints have shifted enough to revisit it.
The reason this deserves careful thought is that the wrong default is easy to adopt and expensive to unwind. Pick an exotic index for a tiny corpus and you pay in complexity forever; pick exact search for a huge one and you pay in latency on every query. Neither error announces itself loudly. Both simply make the system a little worse than it should be, day after day, until someone asks why. We start with the approaches, then the axes, then the rule that ties them together.
The Competing Approaches
Flat (Exact) Search
Flat search compares your query against every stored vector and returns the true nearest neighbors. It is perfectly accurate, has no index to build or tune, and stays simple. Its weakness is speed: cost grows linearly with the number of vectors, so it becomes slow once you reach large corpora. For tens of thousands of vectors, though, it is often the right and simplest choice.
Graph-Based Indexes
Graph methods, the family that includes HNSW, build a navigable network of vectors so a query can hop toward its neighbors without scanning everything. They deliver excellent speed and recall together, which is why they dominate at scale. The cost is memory: the graph itself consumes significant RAM, and building it takes time. They also handle frequent updates less gracefully.
The appeal of graph indexes is that they break the usual rule that more accuracy means more time. By navigating intelligently rather than scanning, they keep recall high while staying fast even across millions of vectors. That is why they have become the default for many production systems. The catch worth internalizing is that the whole graph generally needs to sit in memory to perform, so as your corpus grows, your RAM bill grows with it, sometimes faster than you expect.
Inverted-File and Quantization Methods
These approaches cluster vectors and search only the relevant clusters, often compressing vectors to save memory. They scale to enormous corpora with modest memory by accepting a larger hit to recall. Tuning how many clusters you probe lets you slide along the speed-recall curve. They suit massive datasets where memory is the binding constraint.
The compression is the clever part and the risky part both. By storing each vector in a compact, lossy form, these methods fit far more vectors in the same memory, which is what makes hundred-million-vector corpora affordable. But compression discards information, so two vectors that were distinct can look identical after the fact, nudging recall down. The number of clusters you probe at query time is your recovery dial: probe more and you reclaim recall at the cost of speed. Tuning that dial against a real recall target is what separates a workable massive-scale index from one that silently returns mediocre matches.
The Axes That Separate Them
Recall Versus Speed
The central trade-off is how much accuracy you sacrifice for latency. Flat search sits at maximum recall and lowest speed at scale; approximate methods buy speed by accepting some missed neighbors. Knowing your required recall, as stressed in Twelve Items to Verify Before You Trust a Vector Index, tells you how far along this curve you can afford to go.
Memory Versus Scale
Graph indexes trade memory for speed; quantization trades recall for memory. Your available RAM and corpus size jointly decide which trade is tolerable. A method that is ideal at a million vectors may be unaffordable at a hundred million.
Freshness Versus Build Cost
Some indexes update cheaply as content changes; others are expensive to rebuild and awkward to update incrementally. If your corpus churns constantly, weight update cost heavily, a concern the freshness practices in Opinionated Rules for Running Embeddings in Production make concrete. For static corpora, build cost barely matters.
A Decision Rule
Start From Scale and Recall Needs
Below a few hundred thousand vectors, default to flat search: it is exact, simple, and fast enough. Above that, if you need high recall and have the memory, choose a graph index. If memory is the binding constraint at very large scale, accept a quantization method and tune cluster probing to recover recall. This single rule covers most cases.
Revisit When Constraints Shift
The right choice is a function of scale, recall needs, memory, and churn, and all of those change. Re-run the rule when your corpus grows an order of magnitude, when latency requirements tighten, or when memory pressure appears. Treating the index choice as revisitable, rather than permanent, keeps the system matched to reality, much as the staged thinking in The RECALL Model: Five Stages for Embedding Pipelines recommends.
The trap to avoid is premature optimization in the other direction: reaching for a sophisticated graph or quantization index while your corpus is still small, because an article praised its benchmarks. At small scale, flat search beats them on simplicity and matches them on user-perceived speed, while the exotic index adds tuning, memory cost, and a recall sacrifice you did not need. Let scale earn the complexity. The index you can reason about easily is worth a great deal until the day your numbers genuinely demand more.
Beyond the Index: Other Trade-Offs
Semantic Versus Keyword Retrieval
A higher-level trade-off sits above the index: whether to match by meaning, by exact tokens, or both. Vectors handle paraphrase; keywords handle precise terms. Many strong systems blend them, paying complexity to gain robustness across query types, a pattern visible in Inside Five Products Powered by Nearest-Neighbor Lookup.
Precision Versus Recall in Results
Even with retrieval settled, you trade between showing fewer, surer results and showing more, riskier ones. Re-ranking and filtering shift this balance. The right point depends on whether a missed relevant item or an irrelevant shown item costs you more.
Frequently Asked Questions
When is exact flat search actually the right choice?
When your corpus is small, up to a few hundred thousand vectors, flat search is exact, requires no tuning, and runs fast enough. Reaching for an approximate index at that scale adds complexity and a recall sacrifice you do not need. Simplicity is a feature.
Why are graph indexes so memory-hungry?
They store a navigable network of connections between vectors so queries can hop toward neighbors quickly. That graph structure lives in memory in addition to the vectors themselves, which is what buys the speed. The memory cost is the price of avoiding a full scan.
How do quantization methods scale so large?
They compress vectors and cluster them, so the index searches only relevant clusters and stores vectors more compactly. This slashes memory use, letting enormous corpora fit, at the cost of some recall. Tuning how many clusters you probe recovers recall against added latency.
What if my content changes constantly?
Weight freshness and update cost heavily. Some indexes update incrementally and cheaply; others require expensive rebuilds. For high-churn corpora, prefer methods that absorb updates gracefully, or design a rebuild cadence that keeps the index acceptably fresh.
Should I combine semantic and keyword search?
Often yes, when query traffic mixes precise terms with conceptual questions. Blending the two adds engineering complexity but improves robustness, since each covers the other's weakness. If your queries are uniformly one type, a single approach may suffice.
How often should I revisit my index choice?
Whenever a binding constraint shifts: an order-of-magnitude growth in corpus, tighter latency requirements, new memory pressure, or a change in churn. The right index is a function of those variables, so re-run the decision rule when they move rather than treating the choice as permanent.
Key Takeaways
- Vector indexing offers competing approaches: exact flat search, fast memory-hungry graph indexes, and scalable quantization methods.
- The axes that separate them are recall versus speed, memory versus scale, and freshness versus build cost.
- The decision rule: flat below a few hundred thousand vectors, graph for high recall with memory headroom, quantization when memory binds at huge scale.
- Index choice is a function of shifting constraints, so revisit it when scale, latency, memory, or churn changes.
- Above the index sits the semantic-versus-keyword trade-off, often best resolved by blending both.
- Re-ranking and filtering let you tune the final precision-versus-recall balance to match what a mistake costs you.