Ask a vendor how their AI coding tool is performing and you will hear about acceptance rate: the percentage of suggestions developers accept. It is a comforting number because it goes up and to the right. It is also nearly useless on its own. A developer can accept a suggestion, then spend ten minutes fixing it, then ship a bug. That counts as acceptance. The metric measured a keystroke, not value.
If you want to understand how well AI code generation is working in your organization, you have to measure it as you would any other system that produces output of uncertain quality. That means instrumenting the full lifecycle: not just what gets accepted, but what survives review, what survives production, and what it cost in time and tokens to get there. This article defines the KPIs that matter, shows how to instrument them without a research team, and explains how to read the signal once you have it.
The framing here pairs well with the ROI analysis, which turns these operational metrics into a financial case. Start with measurement; the money follows.
Why Acceptance Rate Misleads
Acceptance rate conflates three different things: the model's quality, the developer's standards, and the difficulty of the task. A 40 percent acceptance rate could mean a great model used on hard problems or a mediocre model used on trivial ones. The number cannot distinguish them.
Worse, optimizing for acceptance rate creates perverse incentives. The easiest way to raise it is to suggest safe, obvious completions that nobody would reject, which is precisely the code that delivers the least leverage. The suggestions you most want, the ambitious ones that save real time, are the ones most likely to be rejected or edited. A tool that maximizes acceptance is often a tool that has stopped trying.
The Metrics That Actually Predict Value
Retention through review
Track what fraction of AI-generated code survives code review unchanged, or with only trivial edits. This is the single best proxy for genuine quality, because review is where bad code is supposed to die. If most generated code gets heavily rewritten in review, the tool is creating work, not saving it.
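As a minimal sketch of how retention might be computed, assume you can capture the lines a change added when review opened and the lines it contains at merge. The function name and the line-level definition of "survived" are assumptions made for illustration; "trivial edits" is approximated here by ignoring whitespace differences only.

```python
from collections import Counter


def review_retention(lines_at_review_open: list[str], lines_at_merge: list[str]) -> float:
    """Fraction of originally proposed lines that are still present at merge.

    Lines are compared after stripping whitespace, so re-indentation counts
    as a trivial edit. Blank lines are ignored.
    """
    proposed = Counter(line.strip() for line in lines_at_review_open if line.strip())
    merged = Counter(line.strip() for line in lines_at_merge if line.strip())
    total = sum(proposed.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count per line, so a line that
    # was proposed twice but merged once is only credited once.
    survived = sum((proposed & merged).values())
    return survived / total
```

A stricter or looser notion of "trivial" (renamed variables, reordered imports) would need a different comparison; the point is only that retention is a ratio you can compute from two snapshots of the same change.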
Survival in production
Go one layer deeper: of the AI-influenced changes that shipped, what fraction were reverted or hot-fixed within thirty days, and how does that compare with your baseline? If AI-heavy changes have a higher revert rate, you have found a quality leak that no acceptance metric would ever show.
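The comparison can be sketched roughly as follows. The `ShippedChange` record and its field names are hypothetical; in practice the dates would come from your version control platform and incident tracker.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ShippedChange:
    ai_influenced: bool
    shipped_on: date
    reverted_on: Optional[date] = None  # date of revert or hot-fix, if any


def revert_rate(changes: list[ShippedChange], ai_influenced: bool,
                window_days: int = 30) -> Optional[float]:
    """Share of changes in one group reverted or hot-fixed within the window."""
    group = [c for c in changes if c.ai_influenced == ai_influenced]
    if not group:
        return None  # no data for this group, so no rate to report
    window = timedelta(days=window_days)
    hits = sum(
        1 for c in group
        if c.reverted_on is not None and c.reverted_on - c.shipped_on <= window
    )
    return hits / len(group)
```

Calling `revert_rate(changes, True)` and `revert_rate(changes, False)` gives the two numbers to compare. Changes that shipped less than thirty days ago have not had a full window to fail, so it is probably safer to exclude them before computing either rate.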
Time-to-merge delta
Measure the cycle time of changes where AI contributed meaningfully versus comparable changes where it did not. This captures the real productivity story. Acceptance can be high while time-to-merge gets worse, because developers spend the saved typing time wrestling with subtly wrong suggestions.
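One simple way to express the delta, as a sketch: compare median cycle times for the two groups. Using the median rather than the mean is a choice made here, not something the metric requires; it keeps a handful of long-lived PRs from dominating the result. The function name is illustrative.

```python
from statistics import median


def time_to_merge_delta(ai_hours: list[float], baseline_hours: list[float]) -> float:
    """Median AI-assisted cycle time minus median baseline cycle time, in hours.

    A negative result means AI-assisted changes merged faster.
    """
    if not ai_hours or not baseline_hours:
        raise ValueError("both groups need at least one observation")
    return median(ai_hours) - median(baseline_hours)
```

The comparison is only as good as the word "comparable": if the AI-assisted group skews toward small changes, the delta will look better than it is.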
Cost per merged change
Combine token spend, tool licensing, and the human time spent reviewing and fixing AI output, then divide by the number of merged changes. This is the figure that the trade-offs comparison hinges on, and the one vendors never report.
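The arithmetic is simple enough to write down directly. In this sketch, the parameter names and the idea of pricing human time at a single hourly rate are assumptions, and the figures in the usage note are invented purely to show the calculation.

```python
def cost_per_merged_change(token_spend: float, licensing: float,
                           review_fix_hours: float, hourly_rate: float,
                           merged_changes: int) -> float:
    """Total AI-related cost for a period divided by changes merged in it."""
    if merged_changes <= 0:
        raise ValueError("merged_changes must be positive")
    human_cost = review_fix_hours * hourly_rate
    return (token_spend + licensing + human_cost) / merged_changes
```

For example, with 1,200 in token spend, 3,800 in licensing, 100 hours of review and fixing at 90 per hour, and 200 merged changes, the result is (1,200 + 3,800 + 9,000) / 200 = 70 per merged change. The human-time term is usually the largest and the hardest to estimate.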
How to Instrument Without a Research Team
You do not need a data science org. You need a few hooks and a willingness to tag changes.
- Tag at the commit or PR level. Add a lightweight label indicating whether AI tooling contributed materially. A commit trailer or a PR template checkbox is enough.
- Pull from systems you already run. Your version control platform, CI, and incident tracker already hold merge times, revert events, and review activity. Join them on the AI tag.
- Sample; do not attempt a census. You do not need to measure every change. A representative sample of a few hundred PRs per quarter is usually enough for a stable signal.
- Hold a control group. Keep a slice of work AI-free, or at least untagged, so you have a baseline to compare against. Without a baseline, every number is unanchored.
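The tagging and sampling steps above can be sketched in a few lines. The trailer name `AI-Assisted` is a hypothetical convention, not a standard; any consistent key would do. The sketch parses raw commit message text so it does not depend on a particular platform API.

```python
import random
import re
from typing import Optional

# Hypothetical trailer convention: a line such as "AI-Assisted: yes".
TRAILER = re.compile(r"^AI-Assisted:[ \t]*(yes|no)[ \t]*$", re.IGNORECASE | re.MULTILINE)


def ai_tag(commit_message: str) -> Optional[bool]:
    """Return True/False from the trailer, or None if the commit is untagged."""
    matches = TRAILER.findall(commit_message)
    if not matches:
        return None
    # If the trailer appears more than once, the last occurrence wins.
    return matches[-1].lower() == "yes"


def sample_prs(pr_ids: list[int], k: int, seed: int = 0) -> list[int]:
    """Draw a repeatable random sample of PR ids for manual or automated review."""
    rng = random.Random(seed)
    return rng.sample(pr_ids, min(k, len(pr_ids)))
```

Returning `None` for untagged commits keeps "not tagged" distinct from "tagged as no AI", which matters when you later decide what counts as the baseline.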
The getting-started guide covers the tooling setup; the measurement layer sits on top of whatever you already have.
Leading Versus Lagging Indicators
Most teams measure only lagging indicators, the outcomes that appear after work ships. Those are essential, but they tell you about the past. To steer in real time, pair them with leading indicators that move earlier.
Lagging indicators
- Production revert rate on AI-influenced changes versus baseline. The clearest signal of quality leakage, but it arrives weeks after the code shipped.
- Cost per merged change. A true outcome metric, but only computable once changes have merged.
Leading indicators
- Edit distance after acceptance. How much a developer changes a suggestion after accepting it. Large post-acceptance edits predict low retention through review before review even happens.
- Re-prompt frequency. How often developers regenerate before getting usable output. Rising re-prompts signal that the tool is poorly grounded for the current work, an early warning that quality will suffer downstream.
- Time spent in review of AI changes. If reviewers are spending disproportionately long on AI-assisted PRs, the output is creating hidden cost that has not yet shown up in reverts.
The discipline is to use leading indicators to catch problems early and lagging indicators to confirm them. A spike in edit distance this week predicts a dip in retention next month; watching both lets you intervene before the lagging number turns bad.
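Edit distance after acceptance can be approximated without special tooling. The sketch below uses the similarity ratio from Python's standard `difflib` rather than a true Levenshtein distance, which is a deliberate simplification; the function name is illustrative, and capturing the accepted and final text is assumed to happen elsewhere, for example in an editor plugin.

```python
from difflib import SequenceMatcher


def post_acceptance_edit_fraction(accepted: str, final: str) -> float:
    """Rough share of a suggestion that changed after acceptance.

    0.0 means the accepted text was kept as-is; 1.0 means nothing in common.
    """
    similarity = SequenceMatcher(None, accepted, final, autojunk=False).ratio()
    return 1.0 - similarity
```

Tracked as a weekly average, a rising value is the kind of early movement the leading indicators are meant to surface.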
Avoid the Vanity Metric Trap
Every metric can become a vanity metric if you optimize the number instead of the outcome it represents. Acceptance rate is the obvious example, but retention through review can be gamed too, by reviewers who wave AI code through to keep the number high. The protection is to never let a single metric stand alone. Triangulate: if retention is high but production reverts are climbing, the retention number is being gamed, not earned. The team rollout guide covers building the review culture that keeps these metrics honest. A metric you trust blindly is a metric someone will eventually optimize against you.
Reading the Signal
Numbers only matter against a baseline and over time. A 25 percent retention-through-review rate sounds bad until you learn it was 12 percent last quarter. Watch trends, not snapshots.
Be suspicious of metrics that move in isolation. If acceptance climbs but time-to-merge does not improve, the tool is probably being used for low-value completions. If retention is high but production reverts climb, your reviewers are likely rubber-stamping AI output, a governance problem the risks article covers in detail. The metrics are most useful as a system: each one is a check on the others, and the story emerges from how they move together.
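The two cross-checks described above can be written as simple rules. This is a sketch only: the record type, field names, and the 0.7 threshold for "high" retention are all assumptions you would tune to your own baseline.

```python
from dataclasses import dataclass


@dataclass
class PeriodSummary:
    acceptance_change: float     # percentage points versus previous period
    time_to_merge_change: float  # hours versus previous period; negative is faster
    retention: float             # current retention through review, 0..1
    revert_rate_change: float    # percentage points versus previous period


def triangulation_flags(p: PeriodSummary, high_retention: float = 0.7) -> list[str]:
    """Return warnings for metric movements that contradict each other."""
    flags = []
    if p.acceptance_change > 0 and p.time_to_merge_change >= 0:
        flags.append("acceptance up without faster merges: possible low-value completions")
    if p.retention >= high_retention and p.revert_rate_change > 0:
        flags.append("high retention with rising reverts: possible rubber-stamped reviews")
    return flags
```

A flag is a prompt to look at the underlying changes, not a verdict; either pattern can have other causes.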
Frequently Asked Questions
What is the single most important metric to start with?
Retention through code review: the fraction of AI-generated code that survives review with only trivial edits. It is the cleanest proxy for real quality because review is where bad code is supposed to be caught and removed.
Is a high acceptance rate ever a good sign?
It is a weak signal at best. High acceptance with improving time-to-merge and stable production reverts is genuinely good. High acceptance alone often just means the tool is suggesting safe, low-value completions that nobody bothers to reject.
How do I tag which changes used AI?
The lightest approach is a checkbox in your pull request template or a commit trailer. You do not need perfect coverage; a consistent, honest tag on a representative sample is enough to produce stable trends.
Do I need a control group?
You need a baseline of some kind, whether a true AI-free control group or simply your historical metrics from before adoption. Without something to compare against, every number floats free and tells you nothing about impact.
How often should I review these metrics?
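Quarterly is usually right for the lagging metrics. Monthly samples are often too small to separate a trend from noise, and annual review is too slow to catch a regression before it becomes a habit. Leading indicators such as edit distance and re-prompt frequency move faster and are worth watching more often.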
Key Takeaways
- Acceptance rate measures keystrokes, not value, and optimizing for it rewards safe, low-leverage suggestions.
- The metrics that predict value are retention through review, survival in production, time-to-merge delta, and cost per merged change.
- Instrument with lightweight tags on commits or PRs, joined to data you already collect in version control, CI, and incident tracking.
- Always read metrics against a baseline and over time; snapshots lie.
- Pair leading indicators (edit distance, re-prompt frequency, review time) with lagging ones (production revert rate, cost per merged change) to steer early and confirm later.
- Never trust a single metric alone; triangulate, because any metric in isolation will eventually be gamed.
- Treat the metrics as a system, where each is a check on the others, and the real story emerges from how they move together.