An Empirical Study on the Impact of Class Overlap in Just-in-Time Software Defect Prediction (S).

Minyang Yi, Guisheng Fan, Huiqun Yu, Xingguang Yang
DOI: https://doi.org/10.18293/seke2021-076
2021-01-01
Abstract: Just-in-time software defect prediction (JIT-SDP) is an active research topic in software engineering that aims to identify defect-inducing code changes. Most current JIT-SDP work focuses on model construction, and it is often overlooked that classifier performance depends on high-quality data. In this paper, we first investigate the impact of the class overlap problem on the performance of classifiers in JIT-SDP, and propose a new effective preprocessing method (IKMCCA-TL) that combines an improved K-Means clustering cleaning approach with the Tomek-link method. To objectively estimate the impact of class overlap on classifiers in JIT-SDP, we conduct a large-scale empirical study on the data sets of six open source projects and compare the performance of LR, RF and KNN classifiers when the data are cleaned with IKMCCA, KMCCA or NCL and when they are not cleaned. Experimental results show that after removing overlapping instances, the performance of the classifiers is significantly improved in terms of balance, recall and AUC, and that our proposed method achieves the best performance.
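
For readers unfamiliar with the Tomek-link method named in the abstract, the following is a minimal sketch of the generic technique, not the authors' IKMCCA-TL implementation. A Tomek link is a pair of instances from different classes that are each other's nearest neighbor; such pairs sit in the overlapping region between classes. The function names, the toy data, and the choice to drop only the majority-class member of each link (label 0, assumed here to mean "clean") are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist(points[i], points[j]))


def tomek_links(points, labels):
    """Return pairs (i, j) with i < j that are mutual nearest neighbors
    and carry different class labels."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def remove_tomek_links(points, labels, majority_label=0):
    """Drop the majority-class member of every Tomek link.
    Assumption: label 0 is the majority (clean) class."""
    drop = {k for pair in tomek_links(points, labels)
            for k in pair if labels[k] == majority_label}
    kept = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in kept], [labels[k] for k in kept]


# Toy data (hypothetical): index 2 is a clean change lying right next to
# a defect-inducing change at index 3, so the two form a Tomek link.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
labels = [0, 0, 0, 1, 1]

links = tomek_links(points, labels)
clean_points, clean_labels = remove_tomek_links(points, labels)
```

On this toy data the only link found is the pair of indices 2 and 3, and cleaning removes the clean instance at index 2. For real data sets, a library implementation such as `TomekLinks` in imbalanced-learn is likely a better choice than this quadratic-time sketch.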