Learning from Imbalanced Data for Predicting the Number of Software Defects

Xiao Yu, Jin Liu, Zijiang Yang, Xiangyang Jia, Qi Ling, Sizhe Ye
DOI: https://doi.org/10.1109/issre.2017.18
2017-01-01
Abstract: Predicting the number of defects in software modules can be more helpful when testing resources are limited. The highly imbalanced distribution of the target variable (i.e., the number of defects) degrades the performance of models that predict the number of defects. As a first effort toward an in-depth study, this paper explores the potential of resampling and ensemble learning techniques for learning from imbalanced defect data to predict the number of defects. We study two resampling strategies extended to regression problems (i.e., SMOTE and RUS) and an ensemble learning technique (i.e., the AdaBoost.R2 algorithm) for handling imbalanced defect data. We refer to the extensions of SMOTE and RUS for predicting the Number of Defects as SmoteND and RusND, respectively. Experimental results on six datasets with two performance measures show that these approaches are effective in handling imbalanced defect data. To further improve their performance, we propose two novel hybrid resampling/boosting algorithms, called SmoteNDBoost and RusNDBoost, which introduce SmoteND and RusND into the AdaBoost.R2 algorithm, respectively. Experimental results show that SmoteNDBoost and RusNDBoost both outperform their individual components (i.e., SmoteND, RusND, and AdaBoost.R2).
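The idea of random undersampling for a count-valued target can be illustrated with a minimal sketch. This is not the paper's RusND algorithm, whose details the abstract does not give. It assumes that the majority of modules have zero defects and that undersampling means randomly discarding some of those zero-defect modules while keeping every defective module. The function name `undersample_zero_defect`, the `ratio` parameter, and the `(features, defect_count)` tuple layout are all hypothetical choices made for illustration.

```python
import random


def undersample_zero_defect(modules, ratio=1.0, seed=0):
    """Randomly drop zero-defect modules from a defect-count dataset.

    modules: list of (features, defect_count) tuples (hypothetical layout).
    ratio:   desired number of zero-defect modules per defective module
             (an assumed knob, not a parameter taken from the paper).
    Returns a new shuffled list; the input list is not modified.
    """
    rng = random.Random(seed)

    # Split into the assumed minority (defective) and majority (zero-defect).
    defective = [m for m in modules if m[1] > 0]
    clean = [m for m in modules if m[1] == 0]

    # Keep every defective module; keep only a random subset of clean ones.
    n_keep = min(len(clean), int(round(ratio * len(defective))))
    kept_clean = rng.sample(clean, n_keep)

    balanced = defective + kept_clean
    rng.shuffle(balanced)
    return balanced
```

With, say, 90 zero-defect modules and 10 defective ones, calling `undersample_zero_defect(data, ratio=1.0)` would return 20 modules: all 10 defective ones plus 10 randomly chosen zero-defect ones. A regression model trained on the reduced set sees a less skewed distribution of defect counts, at the cost of discarding some data.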