Software Defect Prediction Method Based on Hybrid Sampling

Xiaozhi Du,Hehe Yue,Honglei Dong
DOI: https://doi.org/10.1145/3474198.3478215
2021-01-01
Abstract:Software defect prediction is an essential technology to provide guidance and assistance for software testers and developers. However, the problem of imbalanced data sets limits the effect and application of the software defect prediction. To address this issue, this paper proposes a software defect prediction method based on hybrid sampling, which combines the strategies of over-sampling with under-sampling. For minority class, over-sampling uses k-means to cluster samples, then adopts SMOTE to generate artificial data based on safe areas of the clustering outcome. For majority class, under-sampling uses logistic regression classifier to get the misclassification probability of each sample and its instance hardness value. Then the samples, whose instance hardness values are lower than the threshold, are removed from the datasets. The experimental results show that our method is superior to the previous methods. Compared with SMOTE-kNN, SMOTE-Tomek, SMOTE and DBSMOTE, the accuracy of our method is improved by 17.60%, 6.99%, 8.66% and 26.18% on average respectively.
What problem does this paper attempt to address?