Teams shouldn’t ask “Will this module have a defect?”
They should ask “Which modules are riskier than others?”
Why This Analysis Exists
This analysis does not claim to perfectly predict defects. Instead, it evaluates whether common software metrics — lines of code (LOC), cyclomatic complexity (v(g)), essential complexity (ev(g)), design complexity (iv(g)), and effort — provide actionable prioritization signals.
The goal is to determine whether these metrics meaningfully concentrate testing effort on higher‑risk modules, even when overall prediction accuracy is limited.
Dataset Overview
CM1 is a publicly available NASA software dataset frequently used to study defect patterns in real systems. It comes from a spacecraft instrumentation project, meaning the code was developed in a safety‑critical environment rather than a toy or academic example.
Each row in CM1 represents a single software module. For every module, the dataset records whether at least one defect was found, along with a set of static code metrics collected during development. There is no time series data and no defect counts — just a binary outcome per module: was this code ever defective or not?
The metrics fall into three broad categories:
- Size metrics, such as lines of code, which capture how large a module is.
- Complexity metrics, including cyclomatic, essential, and design complexity, which describe how tangled the control flow and structure are.
- Effort‑related metrics, which reflect how much work went into building the module.
Evidence from Simple Regression (CM1)
Each metric was first analyzed independently using simple regression, with the defect flag as the outcome variable.
Individually, these metrics all show a consistent directional relationship with defects: as size, effort, or complexity increase, defect likelihood increases.
| Metric | Coefficient sign | p‑value | Interpretation |
| --- | --- | --- | --- |
| LOC / Effort (t) | Positive | Significant | Larger modules → more defects |
| v(g) | Positive | Highly significant | More branches → more defects |
| ev(g) | Positive | Significant | More unstructured logic → more defects |
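As a sketch of the per‑metric step, the snippet below fits a simple regression of a binary defect flag on a single metric. The data is synthetic, generated only to illustrate the mechanics (the numbers are not from CM1), and a linear probability model via ordinary least squares stands in for whichever regression form the study used:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for CM1: defect probability rises with module size.
n = 500
loc = rng.exponential(scale=100, size=n)          # lines of code per module
p_defect = np.clip(0.05 + 0.002 * loc, 0, 0.9)    # weak but real upward trend
defect = rng.binomial(1, p_defect)                # binary defect flag

def simple_slope(x, y):
    """OLS slope of y on x (a linear probability model for a binary y)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

slope = simple_slope(loc, defect.astype(float))
print(f"slope of defect flag on LOC: {slope:.5f}")  # positive, matching the table
```

Running the same one‑variable fit for each metric in turn is what produces a table of coefficient signs and p‑values like the one above.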
Evidence from Descriptive Statistics and Bucketing
Bucketed views of the CM1 data reveal clear structural patterns:
- High‑LOC modules almost always fall into:
- Higher v(g) buckets
- Higher ev(g) buckets
- Low‑LOC modules rarely exhibit high v(g) or ev(g)
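The bucketing step can be sketched the same way: split modules into size ranges and compare defect rates per bucket. Again the data below is synthetic and only illustrative; terciles are an assumed bucket scheme, since the document does not specify the exact cut points:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic CM1-like modules: larger modules are more often defective.
n = 600
loc = rng.exponential(scale=80, size=n)
defect = rng.binomial(1, np.clip(0.04 + 0.003 * loc, 0, 0.9))

# Bucket modules into LOC terciles and compare defect rates per bucket.
edges = np.quantile(loc, [0, 1 / 3, 2 / 3, 1.0])
bucket = np.digitize(loc, edges[1:-1])  # 0 = small, 1 = medium, 2 = large

rates = [defect[bucket == b].mean() for b in range(3)]
for name, rate in zip(["small", "medium", "large"], rates):
    print(f"{name:>6} modules: defect rate {rate:.2f}")
```

The pattern to look for is exactly the one the bucketed CM1 views show: defect rates climbing from the small bucket to the large one.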
These patterns explain what happens next in multivariate models. When metrics are combined, the regression is effectively saying:
“Once I already know v(g), LOC and ev(g) don’t explain much additional variation.”
This is not an opinion; it is encoded directly in the statistics:
- Rising p‑values
- Inflated standard errors
- A relatively stable overall R²
Evidence from Multiple Regression (CM1)
When LOC/effort, v(g), ev(g), and iv(g) are included together in a single model:
| Metric | What changed |
| --- | --- |
| LOC / Effort | p‑value increased sharply → became insignificant |
| ev(g) | Coefficient shrank and/or crossed the significance threshold |
| v(g) | Remained significant |
| iv(g) | Weak or marginal throughout |
The takeaway is not that LOC or ev(g) are “bad” metrics, but that they overlap heavily with other measures. Once core complexity is accounted for, they add little independent explanatory power.
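The standard‑error inflation behind that table is easy to demonstrate. The sketch below uses synthetic data in which a v(g)-like metric closely tracks LOC, then compares the standard error of the LOC coefficient alone versus alongside its correlated twin; the setup is illustrative, not a reproduction of the CM1 fit:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two strongly correlated predictors, mimicking how v(g) tracks LOC in CM1.
n = 400
loc = rng.exponential(scale=100, size=n)
vg = 0.05 * loc + rng.normal(0, 1.0, size=n)      # complexity tracks size
y = rng.binomial(1, np.clip(0.05 + 0.002 * loc, 0, 0.9)).astype(float)

def ols_se(X, y):
    """OLS coefficient standard errors (intercept prepended, then dropped)."""
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return np.sqrt(np.diag(cov))[1:]

se_alone = ols_se(loc.reshape(-1, 1), y)[0]
se_joint = ols_se(np.column_stack([loc, vg]), y)[0]
print(f"SE of LOC coefficient, alone:     {se_alone:.5f}")
print(f"SE of LOC coefficient, with v(g): {se_joint:.5f}")  # inflated by overlap
```

The inflated standard error is what pushes the p‑value up: the data cannot tell the two overlapping predictors apart, so neither gets credit individually.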
Data Analysis Methodology
This study analyzes the NASA CM1 module‑level dataset using three complementary techniques:
- Bucket analysis, grouping modules by size and complexity ranges to observe how defect rates change as metric values increase.
- Simple regression, quantifying the individual relationship between each metric and the binary defect indicator.
- Multiple regression, combining metrics in a single model to examine overlap and multicollinearity, and to identify which metrics retain explanatory power when others are controlled for.
What the Data Revealed
The bucketed views show defect rates climbing steadily as modules move into higher size, effort, and complexity ranges. The pattern is not perfectly smooth, but the direction is consistent: larger and harder‑worked modules carry more risk.
The regression results reinforce the same story from a different angle. Instead of comparing buckets, the models quantify the trend across all modules at once, showing the same positive relationships — even when each individual effect is weak. In other words, the buckets show where risk concentrates, and the regression confirms that the concentration is not random.
In CM1, individual metrics like LOC and effort have low R² values (≈ 6%), meaning they explain only a small portion of the variation in defects. This makes them unsuitable for exact defect prediction. However, both metrics are statistically significant (p < 0.001), indicating a consistent directional signal: as module size and effort increase, defect likelihood increases.
When all metrics are combined, effort remains significant while LOC loses significance, indicating that size‑related metrics are capturing the same underlying risk. This does not make them useless — it tells us where the risk concentrates.
Using CM1, prioritizing the highest‑LOC and highest‑effort modules yields better risk coverage than random or uniform testing, even with imperfect prediction.
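That coverage claim can be sanity‑checked in a few lines. On synthetic data with the same qualitative shape (defect odds rising with LOC), testing only the top 20% of modules by LOC catches well over the ~20% of defects a uniform or random strategy would catch in expectation; the numbers are illustrative, not CM1's:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic CM1-like data: defect odds rise with LOC, as in the bucketed views.
n = 1000
loc = rng.exponential(scale=90, size=n)
defect = rng.binomial(1, np.clip(0.03 + 0.0025 * loc, 0, 0.9))

# Spend the testing budget on the top 20% of modules by LOC.
budget = int(0.2 * n)
top_by_loc = np.argsort(loc)[::-1][:budget]
coverage = defect[top_by_loc].sum() / defect.sum()
print(f"defects caught testing top 20% by LOC: {coverage:.0%}")
# A uniform/random strategy would catch ~20% in expectation.
```

This is the practical payoff of a weak but real signal: even imperfect ranking beats spreading effort evenly.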
Why Teams Rely on LOC and Complexity
Teams default to LOC and complexity because they are objective, inexpensive to collect, and consistently highlight large or tangled modules. In CM1, modules in the top LOC and v(g) buckets account for a disproportionate share of observed defects, making these metrics practical early warning signals — even before testing begins.
Why Defect Prediction Is Tempting
Defect prediction promises a shortcut: focus limited testing effort on the riskiest code instead of spreading resources evenly. Given that only a minority of CM1 modules are defective, it is natural for managers to gravitate toward models that claim to identify high‑risk modules upfront.
Size and Effort (LOC and Effort)
What we expected
Larger modules and higher effort were expected to show higher defect rates, based on the intuition that more code and more work create more opportunities for mistakes.
What actually happened
In CM1, both LOC and effort show a clear upward trend: modules with more code and higher recorded effort are more likely to be defective. Bucketed views reveal steadily increasing defect proportions, and simple regressions confirm a positive relationship with the defect flag.
What surprised us
When LOC and effort are analyzed together, much of their explanatory power overlaps. In multiple regression, LOC weakens substantially once effort is included, suggesting that how much work a module required captures risk better than raw size alone.
Complexity Metrics (v(g), ev(g), iv(g))
What we expected
Higher control‑flow and structural complexity were expected to correlate strongly with defects, potentially even more than size.
What actually happened
In CM1, all three complexity metrics trend upward with defects, but their effects are weaker and less consistent than those of size and effort. v(g) shows the clearest relationship, while ev(g) and iv(g) display flatter patterns and higher variance across modules.
What surprised us
When combined in a multiple regression, the complexity metrics compete with one another. Only v(g) retains a meaningful signal, while ev(g) and iv(g) lose significance — indicating substantial overlap in what they measure and limited additional explanatory value beyond basic control‑flow complexity.
Key Takeaway
For CM1, size and effort dominate defect risk, while complexity metrics refine — but do not replace — those signals. Complexity is most useful as a secondary prioritization lens rather than a primary predictor.
Interpreting the Results (Why the Charts Aren’t the Point)
This analysis requires only a handful of charts, because the charts are not the story — the interpretation is.
One of the most striking results is the consistently low R² values across the CM1 regressions. At first glance, this can feel disappointing. But for defect data, it is exactly what we should expect. Defects are binary, sparse, and influenced by many factors not captured here: developer experience, requirement churn, schedule pressure, and review quality. No simple metric model will explain most of that variation.
What matters more is statistical significance. Even with low R², several metrics show consistently low p‑values. That tells us the relationship is not random. In plain terms: the signal is weak, but it is real.
These metrics explain risk, not certainty. A high‑LOC or high‑effort module is not guaranteed to be defective, and a small, simple module is not guaranteed to be clean. What the metrics provide is a way to say: if something is going to break, it is more likely to be over here than over there.
That framing is critical — and it is why these models remain useful despite their limitations.
What This Means for Testing and QA Teams
First, you do not use these metrics to predict defects. You use them to prioritize attention.
A QA team working with CM1‑style data would start by ranking modules using a small set of signals: size, effort, and basic complexity. Modules that are large and required high effort become obvious candidates for deeper testing, more reviews, or additional static analysis.
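One minimal way to turn that ranking step into something repeatable is a composite score: z‑score each metric and sort by the average. The function name, the equal weighting, and the toy numbers below are all assumptions for illustration, not something the document prescribes:

```python
import numpy as np

def priority_rank(metrics: dict) -> np.ndarray:
    """Rank modules by the mean of z-scored metrics, riskiest first.

    Hypothetical helper: equal weights across metrics are an assumption.
    """
    z = [(v - v.mean()) / v.std() for v in metrics.values()]
    score = np.mean(z, axis=0)
    return np.argsort(score)[::-1]

# Four toy modules; module 0 is large, high-effort, and complex.
modules = {
    "loc":    np.array([900.0, 120.0, 60.0, 300.0]),
    "effort": np.array([40.0,   5.0,   2.0,  12.0]),
    "v(g)":   np.array([35.0,   4.0,   2.0,  10.0]),
}
order = priority_rank(modules)
print("test in this order:", order.tolist())  # module 0 comes first
```

The point is not the particular formula but the discipline: the same ranking rule applied to every module, every release.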
Second, this approach makes trade‑offs explicit. When testing time is limited — and it always is — these metrics provide a defensible reason to test Module A before Module B. That is far better than relying on gut feel or whoever shouts the loudest.
Third, metrics help teams scale judgment. Experienced engineers already make these assessments intuitively. Metrics allow that intuition to be applied consistently across hundreds of modules, even when the system is too large for any one person to fully understand.
Most importantly, this approach is not about replacing engineers. It is about supporting decisions with evidence, even when that evidence is imperfect.
A Note on Regression and Prediction
Two points from the original Q&A are worth keeping:
Low R² vs. low p‑value
A low R² does not mean the model is useless. It means defects are hard to explain with a small set of metrics. A low p‑value tells us the relationship still exists and is not random.
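This distinction is easy to see numerically. The sketch below builds a weak but genuine effect on synthetic data: R² comes out small (most variation unexplained), yet the slope's p‑value, computed here with a normal approximation to the t‑statistic, is vanishingly small:

```python
import math
import numpy as np

rng = np.random.default_rng(4)

# Weak but real effect: x explains little variance, yet the slope is non-random.
n = 5000
x = rng.normal(size=n)
y = 0.25 * x + rng.normal(size=n)   # true slope 0.25, lots of noise

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
se = math.sqrt((resid @ resid / (n - 2)) * np.linalg.inv(X.T @ X)[1, 1])
t = beta[1] / se
p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))  # normal approx

print(f"R^2 = {r2:.3f}  (low: most variation unexplained)")
print(f"p   = {p:.2e}  (tiny: the slope is not random)")
```

That is the CM1 situation in miniature: a low R² and a low p‑value are answers to two different questions.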
Regression ≠ prediction
Regression helps us understand which factors matter and how they relate. It does not tell us exactly where the next defect will occur. Treating it as a prediction tool is how teams get disappointed.
Why This Matters
What is most striking about CM1 is not that the models are weak, but that they remain consistently informative despite the noise.
That is the real takeaway: even simple metrics, used carefully, can improve how teams decide where to focus. Not perfectly. Not magically. But better than guessing.
And in testing and QA, better than guessing is already a meaningful win.