Impact of methodological choices on the analysis of code metrics and maintenance
Syed Ishtiaque Ahmad, Shaiful Chowdhury, Reid Holmes
DOI: https://doi.org/10.1016/j.jss.2024.112263
IF: 3.5
2024-11-02
Journal of Systems and Software
Abstract: Many statistical analyses and prediction models rely on past data about how a system evolves to learn and anticipate the number of changes and bugs it will have in the future. As software engineers or data scientists create these models, they need to make several methodological choices, such as which size measurement to use, whether size should be controlled for, and from what time range metrics should be obtained. In this work, we demonstrate how different methodological decisions can cause practitioners to reach conclusions that are significantly and meaningfully different. For example, when measuring SLOC from the evolving source code of a method, one could decide to use the initial, median, average, final, or a per-change measure of method size. These decisions matter; for instance, some prior studies observed better performance of code metrics for bug prediction in general, while other studies found negative results when performance was evaluated through a time-based approach. Understanding the impact of these different methodological decisions is especially important given the increasing significance of approaches that use these large datasets for software analysis tasks. This paper can impact both practitioners and researchers by helping them understand which of the methodological choices underpinning their analyses are important and which are not; this can lead to more consistency among research studies and improved decision-making for deployed analyses.
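To make the SLOC example concrete, the following is a minimal Python sketch of how the alternative size measures named in the abstract could be computed from one method's revision history. The function name `size_summaries`, the dictionary keys, and the sample history are hypothetical and chosen for illustration; they are not taken from the paper, and the paper's own tooling may differ.

```python
from statistics import mean, median


def size_summaries(sloc_history):
    """Summarize one method's SLOC across its revisions (oldest first).

    Hypothetical helper: each key corresponds to one of the size
    measures listed in the abstract.
    """
    if not sloc_history:
        raise ValueError("sloc_history must contain at least one revision")
    return {
        "initial": sloc_history[0],        # size when the method first appears
        "median": median(sloc_history),    # middle value over all revisions
        "average": mean(sloc_history),     # arithmetic mean over all revisions
        "final": sloc_history[-1],         # size at the latest revision
        "per_change": list(sloc_history),  # one size value per revision
    }


# Made-up history: SLOC of a single method after each of five changes.
history = [8, 12, 40, 22, 30]
summary = size_summaries(history)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Even for this single invented method, the initial, median, average, and final values all differ, which illustrates why the choice of size measure can change what an analysis concludes.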
computer science, theory & methods, software engineering