Investigation on the stability of SMOTE-based oversampling techniques in software defect prediction
Shuo Feng,Jacky Keung,Xiao Yu,Yan Xiao,Miao Zhang
DOI: https://doi.org/10.1016/j.infsof.2021.106662
IF: 3.9
2021-11-01
Information and Software Technology
Abstract:<h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Context:</h3><p>In practice, software datasets tend to have more non-defective instances than defective ones, which is referred to as the class imbalance problem in software defect prediction (SDP). The Synthetic Minority Oversampling TEchnique (SMOTE) and its variants alleviate the class imbalance problem by generating synthetic defective instances. SMOTE-based oversampling techniques have been widely adopted as baselines against which newly proposed oversampling techniques in SDP are compared. However, SMOTE-based oversampling techniques introduce randomness into their procedure. If their performance is highly unstable, conclusions drawn from comparing them with newly proposed techniques may be misleading and unconvincing.</p><h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Objective:</h3><p>This paper investigates the stability of SMOTE-based oversampling techniques and proposes a series of stable SMOTE-based oversampling techniques to improve that stability.</p><h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Method:</h3><p>Stable SMOTE-based oversampling techniques reduce the randomness in each step of SMOTE-based oversampling by selecting defective instances in turn, selecting the <span class="math"><math>K</math></span> neighbor instances by distance, and interpolating in an evenly distributed manner. 
We also mathematically prove and empirically investigate the stability of SMOTE-based and stable SMOTE-based oversampling techniques on four common classifiers across 26 datasets in terms of AUC, <span class="math"><math>balance</math></span>, and MCC.</p><h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Results:</h3><p>The analysis shows that stable SMOTE-based oversampling techniques perform better and more stably than SMOTE-based oversampling techniques. The difference between the worst and best performances of SMOTE-based oversampling techniques is up to 23.3%, 32.6%, and 204.2% in terms of AUC, <span class="math"><math>balance</math></span>, and MCC, respectively.</p><h3 class="u-h4 u-margin-m-top u-margin-xs-bottom">Conclusion:</h3><p>Stable SMOTE-based oversampling techniques should be considered as a drop-in replacement for SMOTE-based oversampling techniques.</p>
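Illustrative sketch (not the authors' implementation): the abstract names three sources of randomness in SMOTE and a deterministic counterpart for each. The Python below shows one way those three ideas could fit together. The function name stable_smote_sketch, the Euclidean distance, the index-based tie-breaking, and the gap formula t / (m + 1) are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def stable_smote_sketch(minority, n_synthetic, k=5):
    """Deterministic SMOTE-like oversampling (hypothetical sketch).

    minority    : array-like of shape (n, d), the defective instances
    n_synthetic : number of synthetic instances to generate
    k           : number of nearest neighbors considered per instance
    """
    X = np.asarray(minority, dtype=float)
    n = X.shape[0]
    if n < 2:
        raise ValueError("need at least two minority instances")
    k = min(k, n - 1)

    # Distance-based neighbor selection: rank the other minority
    # instances by Euclidean distance (assumption), nearest first.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbor
    neighbors = np.argsort(dist, axis=1, kind="stable")[:, :k]

    synthetic = []
    for i in range(n):
        # Selecting instances in turn: every instance gets the same
        # quota, with the remainder going to the first instances.
        quota = n_synthetic // n + (1 if i < n_synthetic % n else 0)
        for r in range(k):
            # Spread the quota over the k neighbors, nearest first.
            m = quota // k + (1 if r < quota % k else 0)
            # Evenly distributed interpolation: m points placed at
            # equal spacing strictly between the two endpoints.
            for t in range(1, m + 1):
                gap = t / (m + 1)
                synthetic.append(X[i] + gap * (X[neighbors[i, r]] - X[i]))

    return np.array(synthetic).reshape(-1, X.shape[1])
```

Because no random number generator is involved, repeated calls on the same input return identical synthetic instances, which is the property the abstract refers to as stability.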
computer science, information systems, software engineering