A Method To Identify And Correct Problematic Software Activity Data: Exploiting Capacity Constraints And Data Redundancies

Qimu Zheng,Audris Mockus,Minghui Zhou
DOI: https://doi.org/10.1145/2786805.2786866
2015-01-01
Abstract:Mining software repositories to understand and improve software development is a common approach in research and practice. The operational data obtained from these repositories often do not faithfully represent the intended aspects of software development and, therefore, may jeopardize the conclusions derived from it. We propose an approach to identify problematic values based on the constraints of software development and to correct such values using data redundancies. We investigate the approach using issue and commit data of Mozilla project. In particular, we identified problematic data in four types of events and found the fraction of problematic values to exceed 10% and rapidly rising. We found the corrected values to be 50% closer to the most accurate estimate of task completion time. Finally, we found that the models of time until fix changed substantially when data were corrected, with the corrected data providing a 20% better fit. We discuss how the approach may be generalized to other types of operational data to increase fidelity of software measurement in practice and in research.
What problem does this paper attempt to address?