Diversity based multi-cluster over sampling approach to alleviate the class imbalance problem in software defect prediction

C. Arun, C. Lakshmi
DOI: https://doi.org/10.1007/s13198-023-02031-x
2023-07-27
International Journal of System Assurance Engineering and Management
Abstract: Advances in Artificial Intelligence and Machine Learning have paved the way to improve software quality through advanced testing tools and enhanced Software Defect Prediction (SDP) models. The growing demand for software components across domains poses significant challenges for the complexity and reliability of those components. In response, software practitioners try to create advanced SDP models that find defect-prone modules effectively. The performance of a prediction model is correlated with the quality and quantity of the dataset used. Over the years, researchers have contributed numerous works to counter the class-imbalance issue in SDP models using data sampling, ensemble learning, and cost-sensitive learning. However, small disjuncts are another factor that affects the performance of SDP models. To counter both class imbalance and small disjuncts, we propose a multipatch cluster-based oversampling approach that generates synthetic samples to balance the classes, ensures that the samples reside within the class boundary, and eliminates the possibility of minority samples evading the decision boundary. The initial population is divided into two groups: majority samples and minority samples. Mahalanobis distance is used to calculate the diversity of individual samples of the minority cluster from the population. The samples are then placed into various clusters, and synthetic samples are introduced into each cluster according to its density. Five different machine learning models were used to test the performance of the proposed approach. The experimental findings demonstrate the effectiveness of the algorithm: the proposed approach offers better performance in terms of a reduced false alarm rate.
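The pipeline outlined in the abstract (Mahalanobis-based diversity, clustering, density-aware synthesis) can be illustrated with a minimal Python sketch. This is not the authors' implementation; the abstract does not specify the clustering rule, the density weighting, or the generator, so the sketch makes several assumptions. Diversity is measured against the minority group's own mean and covariance. Clusters are equal-width bands of Mahalanobis distance. Denser clusters receive proportionally more synthetic samples. New points are linear interpolations between two members of the same cluster. The names `mahalanobis_distances`, `oversample_minority`, `n_new`, and `n_clusters` are hypothetical.

```python
import numpy as np


def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the mean of X."""
    mu = X.mean(axis=0)
    # Pseudo-inverse keeps the sketch usable when the covariance is singular.
    cov_inv = np.linalg.pinv(np.atleast_2d(np.cov(X, rowvar=False)))
    diff = X - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))


def oversample_minority(X_min, n_new, n_clusters=3, seed=0):
    """Return (synthetic samples, cluster label of each minority sample)."""
    rng = np.random.default_rng(seed)
    dist = mahalanobis_distances(X_min)

    # Assumption: clusters are equal-width bands of Mahalanobis distance.
    edges = np.linspace(dist.min(), dist.max(), n_clusters + 1)[1:-1]
    labels = np.searchsorted(edges, dist, side="right")

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        # Picking the seed point uniformly means a cluster is chosen with
        # probability proportional to its size (assumed density weighting).
        a = rng.integers(len(X_min))
        members = np.flatnonzero(labels == labels[a])
        b = rng.choice(members)
        # Interpolating between two real minority points keeps the new
        # sample inside the convex hull of the minority class.
        u = rng.random()
        synthetic[i] = X_min[a] + u * (X_min[b] - X_min[a])
    return synthetic, labels
```

Calling `oversample_minority(X_min, n_new=len(X_maj) - len(X_min))` on the minority feature matrix would yield enough synthetic rows to balance the two classes, where `X_maj` is a hypothetical majority feature matrix. Because every synthetic point lies on a segment between two observed minority samples, none can fall outside the region spanned by the minority class, which loosely mirrors the abstract's goal of keeping samples within the class boundary.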