Genetic algorithm-based oversampling approach to prune the class imbalance issue in software defect prediction

C. Arun,C. Lakshmi
DOI: https://doi.org/10.1007/s00500-021-06112-6
IF: 3.732
2021-08-29
Soft Computing
Abstract:Class imbalance is the potential problem that has been existent in machine learning, which hinders the performance of the classification algorithm when applied in real-world applications such as electricity pilferage, fraudulent transactions, anomaly detection, and prediction of rare diseases. Class imbalance refers to the problem where the distribution of the sample is skewed or biased toward one particular class. Due to its intrinsic nature the software fault prediction dataset falls into the same category where the software modules contain fewer defective modules compared to the non-defective modules. The majority of the oversampling techniques that has been proposed is to address the issue by generating synthetic samples of minority class to balance the dataset. But the synthetic samples generated are near duplicates that also results in over-generalization issue. We thus propose a novel oversampling approach to introduce synthetic samples using genetic algorithm (GA). GA is a form of evolutionary algorithm that employs biologically inspired techniques such as inheritance, mutation, selection, and crossover. The proposed algorithm generates synthetic sample of minority class based on the distribution measure and ensures that the samples are diverse within the class and are efficient. The proposed oversampling algorithm has been compared with SMOTE, BSMOTE, ADASYN, random oversampling, MAHAKIL, and no sampling approach with 20 defect prediction datasets from the promise repository and five prediction models. The results indicate that the genetic algorithm oversampling approach improves the fault prediction performance and reduced false alarm rate.
computer science, artificial intelligence, interdisciplinary applications
What problem does this paper attempt to address?