AI Test Prioritization: The 12% of Code That Holds 80% of Bugs

6/22/2026 · 11 views · 9 min read

TL;DR

  • A small fraction of your codebase — typically around 10-15% — produces the majority of production bugs. Spreading tests evenly across the rest is wasted effort.
  • AI clustering of git history, error logs, and incident records surfaces that fraction objectively, instead of letting senior intuition guess.
  • The Plan → Fact → Gap framing fits cleanly: you planned for uniform quality, the fact is that bugs concentrate, and the gap is the prioritization rubric your test strategy is missing.

After watching 30+ engineering orgs try to "improve test coverage," my conclusion is that the headline coverage number is the wrong metric to chase. The real question is which slice of your codebase produces the bugs — and almost every team has a small, identifiable slice where the answer concentrates.

Why does test coverage as a single number mislead?

Because it averages quality across code that isn't equally risky. A repo at 75% line coverage might have 95% coverage on safe utility code and 30% coverage on the gnarly billing-state-machine module that accounts for 60% of production incidents. The headline number says "good," the bug rate says "no."

Definition: Risk-weighted coverage — coverage measured against the probability that the code under test will cause a customer-visible incident if it breaks. Not the same as line coverage.

The pattern across engineering orgs at companies of 30 to 500 employees is consistent: incidents cluster in modules with high commit frequency, high recent churn, and high cross-service surface area. Coverage numbers don't see any of that. A flat coverage target tells the team to add tests to whatever is easiest, which is rarely where the risk lives.
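Here is a minimal sketch of the difference between the two numbers. The module names, line counts, and incident shares are made up for illustration. Using each module's share of past incidents as the risk weight is an assumption: one simple proxy for the probability in the definition above, not a standard formula.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    lines: int             # executable lines in the module
    covered: int           # lines exercised by tests
    incident_share: float  # share of past incidents traced here (assumed risk proxy)


def line_coverage(modules):
    """Plain line coverage: covered lines over total lines."""
    return sum(m.covered for m in modules) / sum(m.lines for m in modules)


def risk_weighted_coverage(modules):
    """Per-module coverage averaged with each module's risk weight."""
    total_risk = sum(m.incident_share for m in modules)
    weighted = sum((m.covered / m.lines) * m.incident_share for m in modules)
    return weighted / total_risk


# Hypothetical repo; every number here is illustrative.
repo = [
    Module("utils", lines=7000, covered=6650, incident_share=0.10),         # 95% covered
    Module("billing_state", lines=3000, covered=900, incident_share=0.60),  # 30% covered
    Module("api", lines=2000, covered=1600, incident_share=0.30),           # 80% covered
]

print(f"line coverage:          {line_coverage(repo):.1%}")
print(f"risk-weighted coverage: {risk_weighted_coverage(repo):.1%}")
```

On this made-up repo the flat number comes out near 76% while the risk-weighted number is about 52%, which is the gap the headline figure hides.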

How does AI find the 12% that matters?

By clustering three data sources nobody can hold in their head at once. First, git history: which files change most often, who changes them, how big the changes are, and how often the changes get reverted. Second, error logs and incident records: which modules show up in postmortems, which routes throw the most errors, which services drive customer-visible incidents. Third, the codebase structure: which functions have the highest cyclomatic complexity, which sit on a critical path, which lack tests entirely.
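The git-history input is the easiest of the three to collect. As a rough sketch, the function below aggregates per-file commit counts and changed lines, assuming the text comes from something like `git log --since="90 days ago" --numstat --pretty=format:`. The function name and the sample paths are hypothetical, and it leaves out revert counts and rename handling.

```python
from collections import defaultdict


def parse_numstat(text):
    """Aggregate per-file commit counts and changed lines from numstat-style text.

    Each useful line is tab-separated: added, deleted, path. Other lines are skipped.
    """
    stats = defaultdict(lambda: {"commits": 0, "lines_changed": 0})
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank separators between commits, or unrelated lines
        added, deleted, path = parts
        entry = stats[path]
        entry["commits"] += 1  # assumes a path is listed at most once per commit
        if added.isdigit() and deleted.isdigit():  # binary files report "-" instead of counts
            entry["lines_changed"] += int(added) + int(deleted)
    return dict(stats)


# Hypothetical two-commit sample; the paths are invented.
sample = (
    "12\t3\tbilling/state.py\n"
    "1\t1\tutils.py\n"
    "\n"
    "40\t22\tbilling/state.py\n"
    "-\t-\tassets/logo.png\n"
)

churn = parse_numstat(sample)
for path, s in sorted(churn.items(), key=lambda kv: kv[1]["commits"], reverse=True):
    print(path, s["commits"], s["lines_changed"])
```

Parsing text rather than shelling out to git keeps the function easy to test; in practice you would feed it the captured command output.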

Definition: Bug-density cluster — a set of files or modules with statistically elevated rates of post-merge defects, identified by overlap of git churn, incident history, and complexity signals.

The AI's job is the clustering and ranking — not the judgment. The output is a ranked list of modules with the signals attached: "Module X has the highest cluster score because (a) it's in the top 5% of git churn, (b) it appears in 4 of the last 8 incidents, (c) its average function complexity is 2.5x the repo median, (d) test coverage on the changed lines in the last 90 days is 18%."

That ranked list is the answer to "where do we add tests first." Senior engineers can override the ranking with context the AI doesn't have — but they're starting from a real prior, not a guess.
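One way to produce a ranked list like that is sketched below. To be clear about what it is: a plain weighted score over percentile ranks, not a true clustering algorithm, and the equal weights, field names, and sample modules are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    module: str
    churn: int                    # commits touching the module in the window
    incidents: int                # incidents implicating the module
    complexity: float             # average cyclomatic complexity per function
    changed_line_coverage: float  # 0..1, coverage on recently changed lines


def percentile_rank(value, population):
    """Fraction of the population that is less than or equal to value."""
    return sum(1 for v in population if v <= value) / len(population)


def rank_modules(rows, weights=(0.25, 0.25, 0.25, 0.25)):
    """Score each module from its four signals and return (score, row) pairs, highest first."""
    churns = [r.churn for r in rows]
    incidents = [r.incidents for r in rows]
    complexities = [r.complexity for r in rows]
    gaps = [1 - r.changed_line_coverage for r in rows]  # low coverage means a large gap
    scored = []
    for r in rows:
        parts = (
            percentile_rank(r.churn, churns),
            percentile_rank(r.incidents, incidents),
            percentile_rank(r.complexity, complexities),
            percentile_rank(1 - r.changed_line_coverage, gaps),
        )
        scored.append((sum(w * p for w, p in zip(weights, parts)), r))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical inputs; a real run would cover every module in the repo.
rows = [
    Signals("billing_state", churn=48, incidents=4, complexity=14.0, changed_line_coverage=0.18),
    Signals("webhook_adapter", churn=30, incidents=2, complexity=9.0, changed_line_coverage=0.35),
    Signals("utils", churn=60, incidents=0, complexity=2.0, changed_line_coverage=0.95),
    Signals("reports", churn=5, incidents=0, complexity=4.0, changed_line_coverage=0.70),
]

ranked = rank_modules(rows)
for score, r in ranked:
    print(f"{r.module:16} score={score:.2f} churn={r.churn} incidents={r.incidents}")
```

Keeping the raw signals next to the score matters more than the exact weighting: it lets a senior engineer see why a module ranked where it did before overriding it.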

What the 12% usually looks like

Three categories repeat across SMB engineering orgs:

State machines and workflow logic. Billing, subscription lifecycle, onboarding stages, payment retries. High branch count, high incident history, easy to miss an edge case.

Integration boundaries. Webhook handlers, payment provider adapters, third-party API wrappers. Bugs concentrate where external behavior is unpredictable and the code has to compensate.

Recently rewritten or recently acquired code. Anything churned in the last 90 days, anything inherited from an acquisition or a contractor. Recency and unfamiliarity each predict defects on their own.

The thing you'll notice: this list is not what coverage-percentage-driven teams test first. They test the easy utility functions because they're cheap to cover. The risky modules are skipped because they're hard.

Copy/paste prioritization template

## Test prioritization — [QUARTER]

### Data sources fed to AI clustering
- Git history (90 days): commit count, file churn, revert count
- Incident records (last 4 quarters): files implicated in SEV1/SEV2
- Error logs (30 days): top error rates by module
- Code complexity: cyclomatic per function, repo-median baseline
- Current test coverage by file

### Top 10 modules to invest in (this quarter)
| # | Module           | Cluster score | Signals                           | Coverage now | Target |
|---|------------------|---------------|-----------------------------------|--------------|--------|
| 1 | [path]           | [N]           | churn=high, inc=4, complex=high  | [N%]         | [N%]   |
| 2 | ...              |               |                                   |              |        |

### Senior engineer overrides
- [Module added: why human knows it matters that AI doesn't see]
- [Module removed: why human knows it doesn't matter that AI flagged]

### Plan → Fact → Gap (review next quarter)
- Plan: invest tests in the top 10 modules above
- Fact: at quarter end, where did bugs actually concentrate?
- Gap: which modules surprised us; what did the clustering miss?

### Owner
- [Name] — accountable for the test-investment rubric this quarter

The senior-override section is the safety valve. The AI surfaces the cluster; the team's most experienced engineers add or remove based on knowledge the clustering can't see. The Plan → Fact → Gap section at the bottom is what makes the rubric improve quarter over quarter.
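The quarter-end review in the template can be as simple as a set comparison. The sketch below assumes you have the list of planned modules and one module name per incident; the function name, the output keys, and the sample data are hypothetical.

```python
from collections import Counter


def plan_fact_gap(planned, incident_modules):
    """Compare planned test investment with where incidents actually landed."""
    fact = Counter(incident_modules)  # incidents per module this quarter
    plan = set(planned)
    in_plan = sum(count for module, count in fact.items() if module in plan)
    return {
        "hit_as_planned": sorted(m for m in fact if m in plan),
        "surprises": sorted(m for m in fact if m not in plan),  # the clustering missed these
        "quiet_planned": sorted(plan - set(fact)),              # planned, but no incidents
        "share_of_incidents_in_plan": in_plan / len(incident_modules) if incident_modules else 0.0,
    }


# Hypothetical quarter: three planned modules, five incidents.
planned = ["billing_state", "webhook_adapter", "cron_orchestrator"]
incidents = ["billing_state", "billing_state", "user_merge", "webhook_adapter", "receipt_generator"]

review = plan_fact_gap(planned, incidents)
for key, value in review.items():
    print(key, value)
```

The "surprises" list is the gap: those modules are candidates for new inputs or a senior override next quarter.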

Tool tip (AIAdvisoryBoard.me): Test prioritization is one slice of a bigger Plan → Fact → Gap pattern that runs across every function in a company. Plan was uniform coverage; fact is concentrated bugs; gap is the missing prioritization rubric. The same pattern shows up in sales pipeline, ops bottlenecks, finance variance — every function has its own version of the 80/20 hidden in the data. Our daily-management OS surfaces those gaps automatically. https://aiadvisoryboard.me/?lang=en.

Manager scan (2-minute digest example)

  • Plan: 80% line coverage target across the repo, applied uniformly
  • Fact: of last quarter's 14 production incidents, 11 touched 6 modules — those 6 had a median of 41% coverage
  • Gap: the coverage target is the wrong rubric; we need risk-weighted prioritization
  • Top 10 cluster-scored modules identified this quarter — written down with owner
  • Senior engineers reviewed and overrode 2 entries (1 added, 1 removed) — captured in template
  • Test investment for next quarter routed to the top 10, not spread evenly
  • "Repeat-incident in same module" rate — track quarter over quarter as outcome metric
  • Coverage headline number ignored at leadership level, kept at engineering level
  • Quarterly review of the rubric — what the AI clustering missed, what to add to inputs
  • Incident-to-test follow-up loop: every SEV1/SEV2 retro adds the module to the next quarter's input
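The "repeat-incident in same module" rate from the digest needs a definition before it can be tracked. A minimal sketch, assuming it means the share of this quarter's incidents that landed in a module that also had an incident last quarter (the article does not pin the definition down, and the sample data is invented):

```python
def repeat_incident_rate(previous_quarter, current_quarter):
    """Share of current-quarter incidents in modules that also had an incident last quarter."""
    if not current_quarter:
        return 0.0
    seen = set(previous_quarter)
    repeats = sum(1 for module in current_quarter if module in seen)
    return repeats / len(current_quarter)


# Hypothetical incident lists, one module name per incident.
q1 = ["billing_state", "webhook_adapter", "billing_state"]
q2 = ["billing_state", "user_merge", "webhook_adapter", "cron_orchestrator"]

rate = repeat_incident_rate(q1, q2)
print(f"repeat-incident rate: {rate:.0%}")
```

If test investment is landing in the right modules, this number should fall quarter over quarter even when the total incident count is noisy.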

Micro-case (what changes after 7-14 days)

A 120-engineer payments platform was running ~14 SEV2-or-higher incidents per quarter and had a 78% coverage number on the dashboard. Leadership read 78% as "fine," engineering knew bugs were concentrating in the billing state machine and the webhook adapter, and nobody had the rubric to redirect investment.

They ran AI clustering across the prior four quarters of incidents, the last 90 days of git history, and the production error logs. The ranked top 10 included billing-state, webhook-adapter, and four modules nobody had flagged: the cron-job orchestrator, the legacy receipt generator, the user-merge utility, and an obscure subscription-pause path. Senior engineers added two more from context (a third-party integration about to renew, a deprecated path scheduled for removal but not yet removed) and removed one (a flagged module that was being deleted next sprint anyway).

Test investment for the quarter went to those 11 modules. The next quarter's SEV2-or-higher rate dropped from ~14 to ~5, and the headline coverage number was lower than before because they deleted tests on safe utility code to free up budget.

Note on this case: This example is illustrative — based on typical patterns we observe with companies of 30-500 employees, not a single named client. Specific numbers are rounded approximations of common ranges, not guarantees.

Tool tip (AIAdvisoryBoard.me): The deeper move here isn't "use AI to pick tests" — it's "build a Plan → Fact → Gap loop where last quarter's bugs feed next quarter's test budget." That loop is the same shape as the one we run for sales, ops, and finance: plan, measure fact, name the gap, redirect investment. The 7-day diagnostic shows you where those gaps already exist in your data. https://aiadvisoryboard.me/?lang=en.

FAQ

Doesn't this just rediscover what senior engineers already know? Partly. Seniors know "billing state machine is risky" — but they don't know the relative ranking across 200 modules and they're missing the modules with quiet churn nobody talks about. The AI's value is in catching the second category.

What if our codebase is small enough that we can hold it in our heads? Then you don't need this. For 5-engineer repos, senior intuition wins. The clustering value kicks in around 8-12 engineers and 100+ modules.

Won't this just generate noise from popular files like utils.ts? A good clustering pipeline weights for complexity and incident history, not raw commit count. utils.ts has high churn but low complexity and rarely shows in postmortems — it falls off the ranked list automatically.

How do we handle the "we know we'll rewrite this module" case? Senior override removes it. Don't invest test budget in code scheduled for deletion. Capture the reason in the template so next quarter's review can verify the rewrite actually happened.

Should this drive deletion of low-value tests too? Yes. Most teams have 20-40% of their tests on code that doesn't break and isn't on the critical path. Cutting those frees engineer time without raising risk. AI can rank low-value tests with the same clustering.
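As a sketch of how that ranking could work, the function below flags tests whose riskiest covered module still scores below a threshold. The test-to-module mapping, the risk scores, and the 0.2 threshold are all hypothetical inputs, and modules without a score are treated as high risk so their tests are never flagged by accident.

```python
def low_value_tests(test_to_modules, risk_scores, threshold=0.2):
    """Return tests whose riskiest covered module scores below threshold, lowest risk first."""
    candidates = []
    for test, modules in test_to_modules.items():
        # Unscored modules, or tests with no mapping, default to 1.0 (keep the test).
        top_risk = max((risk_scores.get(m, 1.0) for m in modules), default=1.0)
        if top_risk < threshold:
            candidates.append((top_risk, test))
    return [test for _, test in sorted(candidates)]


# Hypothetical mapping of tests to the modules they exercise.
tests = {
    "test_slugify": ["utils"],
    "test_payment_retry": ["billing_state", "utils"],
    "test_banner_copy": ["marketing_banner"],
    "test_new_feature": ["unscored_module"],
}
risk = {"utils": 0.10, "billing_state": 0.90, "marketing_banner": 0.05}

print(low_value_tests(tests, risk))
```

The output is a list of deletion candidates for human review, not an automatic delete list.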

Conclusion

Test coverage as a single number tells you nothing useful about where bugs hide. AI clustering of git history, incident records, error logs, and complexity signals does. Plan → Fact → Gap is the loop that turns the ranking into a quarterly investment rubric instead of a one-off audit.

Pull the data sources together this week. Run the clustering once. Get a ranked top 10 by Friday. Override with senior context. Reroute next quarter's test budget.

If you want a system that surfaces the Plan → Fact → Gap automatically — every day, across engineering and the rest of the company — see how the 7-day diagnostic works at https://aiadvisoryboard.me/?lang=en.
