A Procedure to Continuously Evaluate Predictive Performance of Just-In-Time Software Defect Prediction Models During Software Development

Liyan Song,Leandro L. Minku
DOI: https://doi.org/10.1109/TSE.2022.3158831
IF: 7.4
2023-02-01
IEEE Transactions on Software Engineering
Abstract:Just-In-Time Software Defect Prediction (JIT-SDP) uses machine learning to predict whether software changes are defect-inducing or clean. When adopting JIT-SDP, changes in the underlying defect generating process may significantly affect the predictive performance of JIT-SDP models over time. Therefore, being able to continuously track the predictive performance of JIT-SDP models during the software development process is of utmost importance for software companies to decide whether or not to trust the predictions provided by such models over time. However, there has been little discussion on how to continuously evaluate predictive performance in practice, and such evaluation is not straightforward. In particular, labeled software changes that can be used for evaluation arrive over time with a delay, which in part corresponds to the time we have to wait to label software changes as ‘clean’ (waiting time). A clean label assigned based on a given waiting time may not correspond to the true label of the software changes. This can potentially hinder the validity of any continuous predictive performance evaluation procedure for JIT-SDP models. This paper provides the first discussion of how to continuously evaluate predictive performance of JIT-SDP models over time during the software development process, and the first investigation of whether and to what extent waiting time affects the validity of such continuous performance evaluation procedure in JIT-SDP. Based on 13 GitHub projects, we found that waiting time had a significant impact on the validity. Though typically small, the differences in estimated predicted performance were sometimes large, and thus inappropriate choices of waiting time can lead to misleading estimations of predictive performance over time. Such impact did not normally change the ranking between JIT-SDP models, and thus conclusions in terms of which JIT-SDP model performs better are likely reliable independent of the choice of waiting time, especially when considered across projects.
Computer Science
What problem does this paper attempt to address?