The tooling landscape for model version control is confusing because the tools solve different problems while using overlapping vocabulary. A "model registry" and a "data versioning tool" both claim to do version control, but they cover different parts of the reproducible tuple. Pick the wrong category for your actual bottleneck and you'll end up bolting on three more tools to cover the gaps.
This survey organizes the landscape into four functional categories, gives you selection criteria that actually discriminate between options, and ends with a decision approach. We name categories and representative tools, but the goal is to teach you to evaluate, not to crown a winner: your right answer depends on your bottleneck.
The Four Categories
Map any tool to the part of the version tuple it owns. Most teams need coverage in two or three categories, sometimes from one platform, sometimes stitched together.
Data versioning
These tools pin immutable snapshots of training data, which is the most commonly skipped and most expensive gap to leave open. Representative tools include DVC, which versions data alongside Git, and data lake versioning layers like lakeFS that give you Git-like branching over object storage. If your pain is "I can't tell whether the data changed between two models," this is your category.
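To make "pinning a snapshot" concrete, here is a minimal sketch of the underlying idea: a content fingerprint over a data directory. It is not how DVC or lakeFS work internally; the function name `snapshot_fingerprint` is hypothetical, and reading whole files into memory is a simplification that would not suit very large datasets.

```python
import hashlib
from pathlib import Path


def snapshot_fingerprint(root: Path) -> str:
    """Hash relative paths and file contents under root in a stable order.

    Two models trained on directories with the same fingerprint saw the
    same bytes; a different fingerprint means the data changed.
    """
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renames change the fingerprint too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Recording this value next to each trained model answers the "did the data change?" question with a string comparison. A real data versioning tool adds storage, deduplication, and retrieval of the snapshot on top of that.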
Experiment tracking
These capture runs, metrics, parameters, and artifacts during development. MLflow Tracking and Weights & Biases are the common choices. They excel at the Capture layer but are not, by themselves, a governed production system of record, a distinction the CARE framework draws clearly.
Model registry
Registries make versions immutable and addressable, handle promotion stages, and govern what reaches production. MLflow's registry and SageMaker's model registry are typical. This is the Anchor and Release layer. If your problem is "two models named final" or "what's actually in production," start here.
General artifact and large-file storage
Underneath everything sits object storage plus large-file handling: Git-LFS at minimum, or plain S3/GCS with disciplined keys. This is not glamorous, but it's the substrate that keeps gigabyte checkpoints out of your Git history, the failure detailed in 7 Common Mistakes with AI Model Version Control.
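Mapping tools to categories can be done on paper, but a small sketch shows the shape of the exercise. The coverage assignments below are illustrative assumptions based on the descriptions above, not an authoritative feature list, and `coverage_report` is a hypothetical helper.

```python
from enum import Enum


class Category(Enum):
    DATA_VERSIONING = "data versioning"
    EXPERIMENT_TRACKING = "experiment tracking"
    MODEL_REGISTRY = "model registry"
    ARTIFACT_STORAGE = "artifact storage"


# Illustrative mapping only: which categories each tool is assumed to own.
TOOL_COVERAGE = {
    "DVC": {Category.DATA_VERSIONING},
    "MLflow": {Category.EXPERIMENT_TRACKING, Category.MODEL_REGISTRY},
    "Weights & Biases": {Category.EXPERIMENT_TRACKING},
    "S3 + Git-LFS": {Category.ARTIFACT_STORAGE},
}


def coverage_report(tools):
    """Return (uncovered categories, categories claimed by more than one tool)."""
    owners = {category: [] for category in Category}
    for tool in tools:
        for category in TOOL_COVERAGE[tool]:
            owners[category].append(tool)
    gaps = {c for c, names in owners.items() if not names}
    overlaps = {c: sorted(names) for c, names in owners.items() if len(names) > 1}
    return gaps, overlaps


gaps, overlaps = coverage_report(["MLflow", "S3 + Git-LFS"])
print(sorted(c.value for c in gaps))  # ['data versioning']
```

The same report surfaces redundancy: passing both "MLflow" and "Weights & Biases" lists experiment tracking under overlaps, which is the cue to ask whether you need both.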
Selection Criteria That Actually Discriminate
Most feature lists look identical. These are the criteria that separate tools in practice.
- Lineage completeness: Does it link experiment run to released version to data snapshot, end to end? Many tools cover two of the three and leave a gap you discover during an incident.
- Immutability enforcement: Can a version ID be silently overwritten? If yes, it's a naming convention, not version control.
- Data snapshot support: Does it pin the actual training data, or only the model? Tools that ignore data leave the largest gap unaddressed.
- Promotion governance: Are promotions gated, logged, and tied to an approver, or is "production" just a tag anyone can move?
- Rollback ergonomics: How many steps to serve the previous version? If it requires a rebuild, rollback will fail under pressure.
- Storage model: Does it keep the recipe in Git and the binary in storage, or does it tempt you to commit large files?
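Three of these criteria (immutability enforcement, promotion governance, rollback ergonomics) are easier to evaluate once you have seen what enforcement looks like. The toy in-memory registry below is a sketch under stated assumptions: `ToyRegistry`, `VersionExistsError`, and the model name in the usage lines are all hypothetical, and no real registry's API is being reproduced.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


class VersionExistsError(Exception):
    """Raised when a version ID would be rebound to different bytes."""


@dataclass(frozen=True)
class PromotionEvent:
    name: str
    version: int
    stage: str
    approver: str
    at: datetime


class ToyRegistry:
    """In-memory illustration only; not any real registry's API."""

    def __init__(self):
        self._versions = {}    # (name, version) -> sha256 of the artifact
        self._promotions = []  # append-only promotion log

    def register(self, name, version, artifact: bytes) -> str:
        key = (name, version)
        digest = hashlib.sha256(artifact).hexdigest()
        # Immutability: an existing ID may never point at different content.
        if key in self._versions and self._versions[key] != digest:
            raise VersionExistsError(f"{name} v{version} is already registered")
        self._versions[key] = digest
        return digest

    def promote(self, name, version, stage, approver) -> PromotionEvent:
        if (name, version) not in self._versions:
            raise KeyError(f"{name} v{version} is not registered")
        # Governance: every promotion is tied to a named approver and logged.
        if not approver:
            raise ValueError("promotion requires a named approver")
        event = PromotionEvent(name, version, stage, approver,
                               datetime.now(timezone.utc))
        self._promotions.append(event)
        return event

    def current(self, name, stage):
        """Return the most recently promoted version for a stage, or None."""
        for event in reversed(self._promotions):
            if event.name == name and event.stage == stage:
                return event.version
        return None


registry = ToyRegistry()
registry.register("churn", 1, b"weights-v1")
registry.register("churn", 2, b"weights-v2")
registry.promote("churn", 2, "production", approver="alice")
# Rollback is one logged step: promote the previous version again.
registry.promote("churn", 1, "production", approver="alice")
```

When you evaluate a real tool, the questions are whether it rejects the equivalent of a second `register` with different bytes, and whether rollback is as short as that last line.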
Criteria that matter less than vendors claim
Slick dashboards and breadth of integrations sell well but rarely decide outcomes. A beautiful UI over incomplete lineage still fails the audit. Weigh the six criteria above first.
Trade-offs Between Build, Buy, and Stitch
There are three viable postures, and the right one depends on your scale and stakes.
Stitch open-source components. DVC for data, MLflow for tracking and registry, object storage underneath. Cheap, flexible, and you own the integration burden. Good for small-to-mid teams who want control and can absorb the glue work. The seams between tools become your responsibility.
Buy an integrated platform. A managed MLOps platform covers most categories in one system with enforced governance. Faster to a complete CARE implementation, less integration risk, but more lock-in and cost. Good for teams whose stakes justify the spend and who don't want to maintain glue. The case study team chose a hybrid: registry plus data versioning, not a full platform.
Build in-house. Almost never the right call unless your scale or constraints are genuinely unusual. You'll reinvent immutability, lineage, and promotion, and get them subtly wrong in ways that surface during an audit.
How to Choose
Start from your bottleneck, not from a feature comparison. Diagnose which part of the tuple you're failing to control, then adopt the category that owns it.
A practical decision path
- Can't reproduce data? Adopt a data versioning tool first; it's the biggest gap and the cheapest fix.
- Can't tell what's in production? Adopt a model registry with immutability and promotion governance.
- Losing experiment history? Add experiment tracking and link it to your registry.
- Bloated Git repo? Move binaries to object storage with proper large-file handling.
Resist the urge to buy the biggest platform first. Match the tool to the bottleneck, verify it against the six criteria, and only consolidate onto a platform when the integration burden of stitching exceeds the lock-in cost of buying. The trade-offs piece walks through that build-versus-buy decision rule in more depth.
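The decision path above is simple enough to write down as a lookup. This is only a sketch: the symptom keys are hypothetical labels invented for the example, and the ordering simply mirrors the order of the list.

```python
# Ordered as in the decision path: symptom label -> category to adopt.
DECISION_PATH = [
    ("cannot_reproduce_data", "data versioning"),
    ("unknown_production_model", "model registry"),
    ("losing_experiment_history", "experiment tracking"),
    ("bloated_git_repo", "artifact storage"),
]


def recommend(symptoms):
    """Return the categories to adopt, in decision-path order."""
    return [category for symptom, category in DECISION_PATH
            if symptom in symptoms]


print(recommend({"bloated_git_repo", "cannot_reproduce_data"}))
# ['data versioning', 'artifact storage']
```

An empty result is informative too: if none of the symptoms apply, there is no bottleneck that justifies a new tool yet.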
A Reference Stack That Works
To make this concrete, here is a stack that covers all four categories without overspending, and which many mid-sized teams converge on independently. It's not the only good answer, but it illustrates how the categories fit together.
The components
- Data versioning: DVC or an immutable, date-partitioned table layer to pin training snapshots
- Experiment tracking: MLflow Tracking or Weights & Biases to capture runs, metrics, and parameters
- Model registry: MLflow's registry or a cloud-native equivalent for immutable versions and promotion stages
- Artifact storage: S3/GCS underneath, with Git-LFS or pointer files keeping binaries out of Git
The reason this combination works is that each tool owns its category cleanly and the seams are well-trodden. The experiment tracker links to the registry, the registry references the data snapshot, and the storage layer holds the binaries. You get full CARE coverage (Capture, Anchor, Release, and the foundation for Evidence) without a heavyweight platform.
The cost of this stack is the integration glue and the responsibility for keeping the links intact. That's a fair trade for small-to-mid teams. The signal that you've outgrown it is when maintaining the seams between tools starts consuming real engineering time, or when an audit requirement demands enforced, immutable promotion events that a stitched stack provides less rigorously. At that point, consolidating onto an integrated platform stops being premature and starts being justified.
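The links in a stitched stack are the part you own, so it helps to make them explicit. Below is a minimal sketch of a lineage record and a completeness check. The field names, the identifiers, and the bucket path are all hypothetical placeholders, not values any of the named tools produce in this form.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class LineageRecord:
    """One released model version and the links that make it reproducible."""
    model_version: str
    run_id: Optional[str]         # experiment tracker run
    data_snapshot: Optional[str]  # data versioning pin
    artifact_uri: Optional[str]   # binary in object storage
    git_commit: Optional[str]     # training code revision


def missing_links(record: LineageRecord):
    """Return the names of any links that were never recorded."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Hypothetical values for illustration only.
record = LineageRecord(
    model_version="churn:7",
    run_id="run-0042",
    data_snapshot=None,  # the commonly skipped link
    artifact_uri="s3://example-bucket/models/churn/7/model.bin",
    git_commit="abc1234",
)
print(missing_links(record))  # ['data_snapshot']
```

Running a check like this as a gate before promotion is one way to keep the seams intact without a platform. When maintaining such gates starts to consume real engineering time, that is the outgrowing signal described above.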
Frequently Asked Questions
Is MLflow enough on its own?
MLflow covers experiment tracking and a model registry well, so it handles Capture, Anchor, and basic Release. Its gap is data versioning: it does not pin your training data, so you typically pair it with DVC or an immutable data layer. For many mid-sized teams, MLflow plus DVC plus object storage is a complete and affordable stack.
Do I need a registry if I already have experiment tracking?
Usually yes. Experiment tracking captures development runs, but it is not designed to be a governed production system of record with immutable promotion and rollback. The registry adds the Anchor and Release governance that keeps production unambiguous. Some platforms bundle both, which is why the categories overlap in marketing.
When is building in-house justified?
Rarely. It's justified only when your scale, latency, or compliance constraints are so unusual that no existing tool fits, and even then, you'll spend real effort reinventing immutability, lineage, and promotion. For nearly everyone, stitching open-source or buying a platform is faster and safer.
How do I avoid buying three overlapping tools?
Map every prospective tool to the part of the version tuple it owns before you buy. Two tools that both claim "version control" may cover the same category and leave another uncovered. Choosing by bottleneck and by the six discriminating criteria prevents redundant purchases.
Key Takeaways
- The landscape splits into four categories: data versioning, experiment tracking, model registry, and artifact storage. Map each tool to the tuple part it owns.
- Discriminate on lineage completeness, immutability enforcement, data snapshot support, promotion governance, rollback ergonomics, and storage model, not on UI polish.
- Data versioning is the most commonly skipped and most expensive gap; address it first if you can't reproduce training data.
- Stitch open-source for control, buy a platform when integration burden exceeds lock-in cost, and build in-house almost never.
- Choose from your bottleneck, not from a feature matrix, and consolidate onto a platform only when stitching costs more than it saves.