Alleviating Class Imbalance Issue in Software Fault Prediction Using DBSCAN-Based Induced Graph Under-Sampling Method

Kirti Bhandari, Kuldeep Kumar, Amrit Lal Sangal
DOI: https://doi.org/10.1007/s13369-024-08740-0
IF: 2.807
2024-02-18
Arabian Journal for Science and Engineering
Abstract: Software fault prediction aims to improve software quality by anticipating faults early in the software development process. Machine learning techniques can be used to anticipate such faults. However, conventional learning techniques may struggle with the uneven class distribution of fault datasets: because the data are skewed toward the non-faulty class, the resulting predictions can be inaccurate. The problem becomes harder when the imbalanced datasets also contain noise. Fixing these data quality concerns is crucial for improving the classifier's output. Existing under-sampling algorithms do not eliminate noisy instances before under-sampling, which may lead to the loss of valuable data. To remove noisy instances from the majority class while preventing the loss of important majority instances, clustering is combined with a graph-based under-sampling technique. This paper proposes an improved DBSCAN-based induced graph under-sampling method (IDBIG-US), which uses Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to filter noisy instances and ShapeGraph to deal with imbalanced classes. Nineteen benchmark datasets from the PROMISE repository are used in the experiment to demonstrate the effectiveness of the proposed method. The proposed method effectively eliminates noise and improves the classifier's performance. The experimental results and statistical analysis indicate that the proposed method outperforms state-of-the-art methods with respect to Area Under the Curve (AUC), G-mean, Recall (PD), Probability of False Alarms (PF), Point of Predictive Task (Popt), and Area Under the Cumulative Count Curve (ACC).
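The two-stage idea in the abstract (filter noisy majority instances first, then under-sample) can be illustrated with a minimal sketch. This is not the authors' IDBIG-US implementation: the noise test follows the standard DBSCAN definition (a point is noise if it is neither a core point nor within eps of one), and the graph-based ShapeGraph step is replaced here by plain random under-sampling as a stand-in. The function names, the label convention (0 = non-faulty majority, 1 = faulty minority), and the default eps and min_samples values are all assumptions made for illustration.

```python
import math
import random


def dbscan_noise_mask(points, eps, min_samples):
    """Return a list of booleans, True where DBSCAN would label a point as noise.

    A point is a core point if at least `min_samples` points (itself included)
    lie within distance `eps`. A non-core point with no core point within
    `eps` is noise.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    return [
        not is_core[i] and not any(is_core[j] for j in neighbors[i])
        for i in range(n)
    ]


def filter_then_undersample(X, y, eps=0.5, min_samples=3, seed=0):
    """Drop noisy majority instances, then balance the classes.

    Assumption: label 0 is the majority (non-faulty) class and label 1 is the
    minority (faulty) class. The under-sampling step is random selection,
    a stand-in for the paper's graph-based step.
    """
    majority = [x for x, label in zip(X, y) if label == 0]
    minority = [x for x, label in zip(X, y) if label == 1]

    # Stage 1: remove majority instances that DBSCAN would flag as noise.
    noise = dbscan_noise_mask(majority, eps, min_samples)
    kept = [x for x, is_noise in zip(majority, noise) if not is_noise]

    # Stage 2: randomly under-sample the cleaned majority to the minority size.
    rng = random.Random(seed)
    sampled = rng.sample(kept, min(len(kept), len(minority)))

    X_balanced = sampled + minority
    y_balanced = [0] * len(sampled) + [1] * len(minority)
    return X_balanced, y_balanced
```

Filtering before sampling matters for the reason the abstract gives: if noise were still present, an under-sampler could keep an outlier and discard a representative majority instance in its place.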
multidisciplinary sciences