The impact of the distance metric and measure on SMOTE-based techniques in software defect prediction

Shuo Feng, Jacky Keung, Peichang Zhang, Yan Xiao, Miao Zhang
DOI: https://doi.org/10.1016/j.infsof.2021.106742
IF: 3.9
2022-02-01
Information and Software Technology
Abstract: Context: In software defect prediction, SMOTE-based techniques are widely adopted to alleviate the class imbalance problem. SMOTE-based techniques select instances that are close to one another to synthesize minority class instances, ensuring that few noise instances are generated. Objective: However, recent studies show that selecting instances that are far apart effectively increases diversity and alleviates the overgeneralization caused by SMOTE-based techniques. We carry out this study to investigate the relationship between the distance of the selected instances and the performance of SMOTE-based techniques. Method: We first conduct experiments to empirically investigate the impact of the distance between instances on the performance of three common SMOTE-based techniques. Based on the experimental results, we improve a recently proposed oversampling technique, SMOTUNED. Results: The experimental results on five common classifiers across 30 imbalanced datasets from the PROMISE repository show that (1) the choice of distance metric has little impact on the performance of SMOTE-based techniques, (2) as long as the number of synthesized noise instances does not exceed the noise-resistant ability of the classifiers, the overall performance of SMOTE-based techniques, measured by AUC and balance, is not significantly affected by the distance between instances, and (3) the probability of detection (pd) and the probability of false alarm (pf) values of SMOTE-based techniques are significantly affected by the distance between the selected instances. The larger the distance between the selected instances, the lower the pd and pf values that SMOTE-based techniques obtain. The performance of the improved SMOTUNED is similar to that of the original SMOTUNED, but the improved SMOTUNED dramatically reduces the execution time. Conclusion: By controlling the distance, different pd and pf values can be obtained. The diversity of SMOTE-based techniques can be improved, and overgeneralization can be avoided.
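To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation and not SMOTUNED. It assumes a generic SMOTE-style interpolation in which the neighbor is drawn from a window of nearness ranks, so that "near" and "far" selection can be compared, and it assumes the definition of balance commonly used in the defect prediction literature (normalized distance from the ideal point pd = 1, pf = 0). The function names and the `rank_lo`, `rank_hi`, and `p` parameters are hypothetical and chosen for illustration only.

```python
import numpy as np


def minkowski_distances(x, others, p=2.0):
    """Minkowski distance from x to each row of others (p=1 Manhattan, p=2 Euclidean)."""
    return np.sum(np.abs(others - x) ** p, axis=1) ** (1.0 / p)


def synthesize(minority, n_new, rank_lo, rank_hi, p=2.0, seed=0):
    """SMOTE-style oversampling with a controllable neighbor distance.

    Each synthetic instance is interpolated between a random minority
    instance and one of its neighbors whose nearness rank lies in
    [rank_lo, rank_hi] (rank 1 = nearest other minority instance).
    The rank window is an illustrative assumption, not the paper's scheme.
    """
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if not (1 <= rank_lo <= rank_hi <= n - 1):
        raise ValueError("need 1 <= rank_lo <= rank_hi <= n - 1")
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base_idx = rng.integers(n)
        base = minority[base_idx]
        # Exclude the base instance itself before ranking the neighbors.
        others = np.delete(minority, base_idx, axis=0)
        order = np.argsort(minkowski_distances(base, others, p))
        rank = rng.integers(rank_lo, rank_hi + 1)
        neighbor = others[order[rank - 1]]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic[i] = base + gap * (neighbor - base)
    return synthetic


def pd_pf_balance(y_true, y_pred):
    """Return (pd, pf, balance) for binary labels where 1 = defective.

    Assumes both classes occur in y_true, so neither denominator is zero.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    pd = tp / (tp + fn)  # probability of detection (recall)
    pf = fp / (fp + tn)  # probability of false alarm
    balance = 1.0 - np.sqrt((0.0 - pf) ** 2 + (1.0 - pd) ** 2) / np.sqrt(2.0)
    return pd, pf, balance
```

Calling `synthesize` with a low rank window (for example `rank_lo=1, rank_hi=5`) mimics the usual nearest-neighbor selection, while a higher window selects more distant partners; changing `p` swaps the distance metric. Comparing `pd_pf_balance` on a classifier trained with each setting is one way to observe the trade-off the abstract describes.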
computer science, information systems, software engineering