The Impact of Duplicate Changes on Just-in-Time Defect Prediction

Ruifeng Duan, Haitao Xu, Yuanrui Fan, Meng Yan
DOI: https://doi.org/10.1109/tr.2021.3061618
IF: 5.883
2022-01-01
IEEE Transactions on Reliability
Abstract: Recently, just-in-time (JIT) defect prediction has attracted a lot of attention. JIT defect prediction should consider all branches, because omitting changes outside the main branch can significantly affect its performance. However, there are many duplicate changes across branches, that is, pairs of changes with identical implementations in different branches. Such changes can influence the calculation of developer experience metrics and are considered noisy data for JIT defect prediction. In this article, we explore the impact of duplicate changes on JIT defect prediction through an empirical study of 105 828 changes from eight Apache open-source projects. We find that 13% of changes from different branches are duplicates among the studied projects. The duplicate changes have a great influence on the model metrics for JIT defect prediction: for 50% of the changes, removing duplicate changes decreases the experience metrics by an average of 6–55. In addition, duplicate changes have a significant impact on the evaluation and interpretation of JIT defect prediction models. Removing duplicate changes from the studied projects can significantly improve the performance of JIT defect prediction models, by 1% to 125% across various performance measures (i.e., area under the curve, Matthews correlation coefficient, and F1). Given this impact, we suggest that researchers remove duplicate changes from the original historical changes of a software repository when evaluating the performance of JIT defect prediction models in future work.