Tackling Class Imbalance Problem In Software Defect Prediction Through Cluster-Based Over-Sampling With Filtering

Lina Gong, Shujuan Jiang, Li Jiang
DOI: https://doi.org/10.1109/ACCESS.2019.2945858
IF: 3.9
2019-01-01
IEEE Access
Abstract: In practice, Software Defect Prediction (SDP) models often suffer from highly imbalanced data, which makes it difficult for classifiers to identify defective instances. Many techniques have recently been proposed to tackle this problem, and over-sampling is one of the best known. It balances the number of defective and non-defective instances by generating new defective instances. However, existing over-sampling approaches tend to generate non-diverse synthetic instances along with many unnecessary noise instances. Motivated by this, we propose a cluster-based over-sampling with noise filtering (KMFOS) approach to tackle the class imbalance problem in SDP. KMFOS first divides the defective instances into K clusters and then generates new defective instances by interpolating between instances of each pair of clusters, so that the new defective instances spread diversely across the space of the defective data. We then extend this cluster-based over-sampling with Closest List Noise Identification (CLNI) to clean out noise instances. We conduct extensive experiments on 24 projects, comparing KMFOS with over-sampling approaches such as SMOTE, Borderline-SMOTE, ADASYN, random over-sampling (ROS), K-means SMOTE, SMOTE + IPF, SMOTE + ENN and SMOTE + Tomek Links using five prediction classifiers. We also compare KMFOS with other state-of-the-art class-imbalance methods, including the balanced bagging classifier, the RUSBoost classifier, Instance Hardness Threshold and cost-sensitive methods. Experimental results indicate that KMFOS obtains better Recall and bal values than the other over-sampling methods and the other class-imbalance methods compared. Hence, KMFOS is an efficient approach to generating balanced data for SDP and improves the performance of prediction models.
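The cluster-then-interpolate step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not say how cluster pairs or instances are selected, how the interpolation weight is drawn, or how K is chosen, so the uniform random choices below, the plain Lloyd's k-means, and the names `kmeans` and `cluster_pair_oversample` are all assumptions made for illustration. The CLNI noise-filtering stage is omitted because the abstract gives no details of it.

```python
import random
from itertools import combinations


def kmeans(points, k, iters=20, rng=None):
    """Plain Lloyd's k-means on tuples; returns the non-empty clusters."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)  # assumption: random initial centers
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Recompute centers; an empty cluster keeps its previous center.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return [cl for cl in clusters if cl]


def cluster_pair_oversample(defective, n_new, k=3, seed=0):
    """Generate n_new synthetic defective instances by interpolating
    between instances drawn from two different clusters."""
    rng = random.Random(seed)
    clusters = kmeans(defective, k, rng=rng)
    if len(clusters) < 2:
        raise ValueError("need at least two non-empty clusters")
    pairs = list(combinations(range(len(clusters)), 2))
    synthetic = []
    for _ in range(n_new):
        i, j = rng.choice(pairs)      # assumption: uniform choice of cluster pair
        a = rng.choice(clusters[i])   # one instance from each cluster
        b = rng.choice(clusters[j])
        t = rng.random()              # assumption: weight drawn from [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Toy defective instances with two features each (made-up values).
defective = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0),
             (10.0, 11.0), (20.0, 0.0), (21.0, 0.0)]
new_instances = cluster_pair_oversample(defective, n_new=8, k=3)
```

Because each synthetic instance lies on the segment between members of two different clusters, the new points fall in the regions between clusters rather than around a single neighbourhood, which is the diversity argument the abstract makes. In a full pipeline the synthetic instances would be added to the training set and then passed through a noise filter.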