Abstract:<h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Context:</h3><p>Datasets used for software defect prediction (SDP) generally contain more non-defective instances than defective ones, which is known as the class imbalance problem. Oversampling techniques are frequently adopted to alleviate the problem by generating new synthetic defective instances. Existing techniques generate either near-duplicate instances, which result in overgeneralization (high probability of false alarm, <span class="math"><math>pf</math></span>), or overly diverse instances, which hurt the prediction model's ability to find defects (low probability of detection, <span class="math"><math>pd</math></span>). Furthermore, when existing oversampling techniques are applied in SDP, the effort needed to inspect instances of differing complexity is not taken into consideration.</p><h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Objective:</h3><p>In this study, we introduce the Complexity-based OverSampling TEchnique (COSTE), a novel oversampling technique that can achieve low <span class="math"><math>pf</math></span> and high <span class="math"><math>pd</math></span> simultaneously. COSTE also performs better in terms of <span class="math"><math>Norm(popt)</math></span> and <span class="math"><math>ACC</math></span>, two effort-aware measures that consider the testing effort.</p><h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Method:</h3><p>COSTE combines pairs of defective instances with similar complexity to generate synthetic instances, which improves the diversity within the data, maintains the ability of prediction models to find defects, and takes into consideration the different testing effort needed for different instances.
We conduct experiments to compare COSTE with the Synthetic Minority Oversampling TEchnique (SMOTE), Borderline-SMOTE, the Majority Weighted Minority Oversampling TEchnique (MWMOTE), and MAHAKIL.</p><h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Results:</h3><p>The experimental results on 23 releases of 10 projects show that COSTE greatly improves the diversity of the synthetic instances without compromising the ability of prediction models to find defects. In addition, COSTE outperforms the other oversampling techniques under the same testing effort. The statistical analysis, based on the Wilcoxon rank-sum test and Cliff's delta effect size, indicates that COSTE's improvement over the other oversampling techniques is significant.</p><h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Conclusion:</h3><p>COSTE is recommended as an efficient alternative to address the class imbalance problem in SDP.</p>
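
The pairing idea described under Method can be illustrated with a minimal sketch. This is not the authors' exact procedure: the abstract does not specify how complexity is measured or how pairs are mixed, so the sketch assumes a hypothetical complexity proxy (the sum of min-max scaled metric values), pairs instances that are adjacent in the complexity ranking, and blends each pair with a random weight. The function names `complexity_scores` and `oversample_by_complexity` are invented for illustration.

```python
import random


def complexity_scores(rows):
    """Hypothetical complexity proxy: sum of min-max scaled metric values."""
    n_features = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_features)]
    highs = [max(r[j] for r in rows) for j in range(n_features)]
    scores = []
    for r in rows:
        total = 0.0
        for j in range(n_features):
            span = highs[j] - lows[j]
            # A constant feature carries no ranking information.
            total += (r[j] - lows[j]) / span if span > 0 else 0.0
        scores.append(total)
    return scores


def oversample_by_complexity(defective, n_new, seed=0):
    """Create n_new synthetic rows by mixing complexity-adjacent defective rows.

    Assumption: "similar complexity" is approximated by closeness in the
    ranking produced by complexity_scores.
    """
    if len(defective) < 2:
        raise ValueError("need at least two defective instances")
    rng = random.Random(seed)
    scores = complexity_scores(defective)
    # Rank defective instances from least to most complex.
    ranked = [row for _, row in sorted(zip(scores, defective), key=lambda p: p[0])]

    synthetic = []
    gap = 1  # distance between paired instances in the ranking
    while len(synthetic) < n_new:
        for i in range(len(ranked) - gap):
            if len(synthetic) == n_new:
                break
            a, b = ranked[i], ranked[i + gap]
            w = rng.random()  # random mixing weight in [0, 1)
            synthetic.append([w * x + (1 - w) * y for x, y in zip(a, b)])
        # Widen the pairing distance once the closest pairs are used up.
        gap = gap + 1 if gap < len(ranked) - 1 else 1
    return synthetic


# Toy defective instances: [lines of code, cyclomatic complexity]
defective = [[10, 2], [200, 30], [12, 3], [180, 25]]
synthetic = oversample_by_complexity(defective, n_new=5)
```

Because each synthetic row is a convex combination of two real defective rows, its metric values stay within the range spanned by that pair. Pairing neighbours in the ranking keeps small modules from being blended with very large ones, which is one plausible reading of how the technique accounts for testing effort.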

Genetic algorithm-based oversampling approach to prune the class imbalance issue in software defect prediction

Diversity based multi-cluster over sampling approach to alleviate the class imbalance problem in software defect prediction

Support Vector based Oversampling Technique for Handling Class Imbalance in Software Defect Prediction

Over-sampling method for tackling class imbalance in software defect prediction based on generative adversarial networks

Alleviating Class Imbalance Issue in Software Fault Prediction Using DBSCAN-Based Induced Graph Under-Sampling Method

GenSample: A Genetic Algorithm for Oversampling in Imbalanced Datasets

Adaptive Centre-Weighted Oversampling for Class Imbalance in Software Defect Prediction

Tackling Class Imbalance Problem In Software Defect Prediction Through Cluster-Based Over-Sampling With Filtering

COSTE: Complexity-based OverSampling TEchnique to alleviate the class imbalance problem in software defect prediction

A Software Defect Prediction Method That Simultaneously Addresses Class Overlap and Noise Issues after Oversampling

The Integrity of Machine Learning Algorithms against Software Defect Prediction

KCO: Balancing class distribution in just-in-time software defect prediction using kernel crossover oversampling

A Novel Imbalanced Data Classification Method Based on Weakly Supervised Learning for Fault Diagnosis

An Improved Semi-Supervised Learning Method for Software Defect Prediction

Data Augmentation Classifier for Imbalanced Fault Classification

A Survey of Different Approaches for the Class Imbalance Problem in Software Defect Prediction

Imbalanced Data Sets Classification Method Based on Over-Sampling Technique

A New Improved Prediction of Software Defects Using Machine Learning-based Boosting Techniques with NASA Dataset

A hybrid‐ensemble model for software defect prediction for balanced and imbalanced datasets using AI‐based techniques with feature preservation: SMERKP‐XGB

Investigation on the stability of SMOTE-based oversampling techniques in software defect prediction

A Novel Adaptive Undersampling Framework for Class-Imbalance Fault Detection