Be Careful of When: an Empirical Study on Time-Related Misuse of Issue Tracking Data.

Feifei Tu,Jiaxin Zhu,Qimu Zheng,Minghui Zhou
DOI: https://doi.org/10.1145/3236024.3236054
2018-01-01
Abstract:Issue tracking data have been used extensively to aid in predicting or recommending software development practices. Issue attributes typically change over time, but users may use data from a separate time of data collection rather than the time of their application scenarios. We, therefore, investigate data leakage, which results from ignoring the chronological order in which the data were produced. Information leaked from the "future" makes prediction models misleadingly optimistic. We examine existing literature to confirm the existence of data leakage and reproduce three typical studies (detecting duplicate issues, localizing issues, and predicting issue-fix time) adjusted for appropriate data to quantify the impact of the data leakage. We confirm that 11 out of 58 studies have leakage problem, while 44 are suspected. We observe biased results caused by data leakage while the extent is not striking. Attributes of summary, component, and assignee have the largest impact on the results. Our findings suggest that data users are often unaware of the context of the data being produced. We recommend researchers and practitioners who attempt to utilize issue tracking data to address software development problems to have a full understanding of their application scenarios, the origin and change of the data, and the influential issue attributes to manage data leakage risks.
What problem does this paper attempt to address?