Abstract: In the past 20 years, defect prediction studies have generally acknowledged the effect of class size on software prediction performance. To quantify the relationship between object-oriented (OO) metrics and defects, modelling has to take into account the direct, and potentially indirect, effects of class size on defects. However, some studies have shown that size cannot simply be controlled for or ignored when building prediction models. As such, the question remains whether, and when, to control for class size. This study provides a new in-depth examination of the impact of class size on the relationship between OO metrics and software defects or defect-proneness. We assess the impact of class size on the number of defects and on defect-proneness in software systems by employing regression-based mediation (with bootstrapping) and moderation analysis to investigate the direct and indirect effects of class size in count and binary defect prediction. Our results show that the size effect is not significant for every metric. Of the seven OO metrics we investigated, size consistently has a significant mediation impact only on the relationship between Coupling Between Objects (CBO) and defects/defect-proneness, and a potential moderation impact on the relationship between Fan-out and defects/defect-proneness. Other metrics show mixed results: the size effect is significant for some systems but not for others. Based on our results, we make three recommendations. One, we encourage researchers and practitioners to examine the impact of class size for the specific data they have in hand, using the proposed statistical mediation/moderation procedures. Two, we encourage empirical studies to investigate the indirect effect of possible additional variables in their models when relevant. Three, the statistical procedures adopted in this study could be used in other empirical software engineering research to investigate the influence of potential mediators/moderators.
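
The regression-based mediation (with bootstrapping) and moderation analysis named in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: it assumes ordinary least squares in place of the count and binary regression models, uses synthetic data, and all function and variable names (`indirect_effect`, `bootstrap_ci`, `interaction_coef`) are hypothetical. Class size plays the role of the mediator (or moderator) between an OO metric such as CBO and the defect outcome.

```python
import numpy as np

def ols_coefs(X, y):
    """Least-squares coefficients for y ~ 1 + X (intercept first)."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

def indirect_effect(metric, size, defects):
    """Mediation: product of path a (size ~ metric) and path b (defects ~ metric + size)."""
    a = ols_coefs(metric, size)[1]
    b = ols_coefs(np.column_stack([metric, size]), defects)[2]
    return a * b

def bootstrap_ci(metric, size, defects, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the indirect effect."""
    rng = np.random.default_rng(seed)
    n = len(defects)
    estimates = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample classes with replacement
        estimates[i] = indirect_effect(metric[idx], size[idx], defects[idx])
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def interaction_coef(metric, size, defects):
    """Moderation: coefficient of metric*size in defects ~ metric + size + metric*size."""
    X = np.column_stack([metric, size, metric * size])
    return ols_coefs(X, defects)[3]

# Synthetic example (assumed data, not from the study): size mediates the metric's effect.
rng = np.random.default_rng(42)
metric = rng.normal(size=500)                  # stand-in for an OO metric such as CBO
size = 2.0 * metric + rng.normal(size=500)     # stand-in for class size
defects = 1.5 * size + 0.1 * metric + rng.normal(size=500)

low, high = bootstrap_ci(metric, size, defects)
print("indirect effect:", indirect_effect(metric, size, defects))
print("95% bootstrap CI:", (low, high))
print("interaction coefficient:", interaction_coef(metric, size, defects))
```

In this sketch, a bootstrap interval that excludes zero suggests a mediation effect of size, and an interaction coefficient clearly different from zero would suggest moderation. A faithful replication would swap the OLS fits for the count and binary regression models appropriate to the defect data in hand.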

Is Bigger Data Better for Defect Prediction - Examining the Impact of Data Size on Supervised and Unsupervised Defect Prediction

Unifying Defect Prediction, Categorization, and Repair by Multi-Task Deep Learning

An Improved Semi-Supervised Learning Method for Software Defect Prediction

Deep Learning for Just-In-Time Defect Prediction

A systematic review of unsupervised learning techniques for software defect prediction

Combined Classifier for Cross-Project Defect Prediction: an Extended Empirical Study

Revisiting Unsupervised Learning for Defect Prediction

Exploring better alternatives to size metrics for explainable software defect prediction

Towards an Understanding of Intra-Defect Associations: Implications for Defect Prediction

Cross-Project and Within-Project Semi-Supervised Software Defect Prediction Problems Study Using a Unified Solution

Is Deep Learning Good Enough for Software Defect Prediction?

Research on Software Defect Prediction and Analysis Based on Machine Learning

UDA-DP: Unsupervised Domain Adaptation for Software Defect Prediction

Does class size matter? An in-depth assessment of the effect of class size in software defect prediction

Evaluating the Effect of Dataset Size on Predictive Model Using Supervised Learning Technique

An Empirical Study on Heterogeneous Defect Prediction Approaches

A New Improved Prediction of Software Defects Using Machine Learning-based Boosting Techniques with NASA Dataset

Software-defect prediction within and across projects based on improved self-organizing data mining

Understanding machine learning software defect predictions

Effort-aware Just-in-time Defect Prediction: Simple Unsupervised Models Could Be Better Than Supervised Models

Cross-Project Defect Prediction Considering Multiple Data Distribution Simultaneously