Source code expert identification: Models and application

Otávio Cury, Guilherme Avelino, Pedro Santos Neto, Marco Túlio Valente, Ricardo Britto
DOI: https://doi.org/10.1016/j.infsof.2024.107445
IF: 3.9
2024-03-19
Information and Software Technology
Abstract: Context: Identifying source code expertise is useful in several situations. Activities like bug fixing and helping newcomers are best performed by knowledgeable developers. Some studies have proposed repository-mining techniques to identify source code experts. However, there is a gap in understanding which variables are most related to code knowledge and how they can be used for identifying expertise. Objective: This study explores models of expertise identification and how these models can be used to improve a Truck Factor algorithm. Methods: First, we built an oracle with the knowledge of developers from software projects. Then, we used this oracle to analyze the correlation between measures from the development history and source code knowledge. We investigated the use of linear and machine-learning models to identify file experts. Finally, we used the proposed models to improve a Truck Factor algorithm and analyzed their performance using data from public and private repositories. Results: First Authorship and Recency of Modification have the highest positive and negative correlations with source code knowledge, respectively. Machine learning classifiers outperformed the linear techniques (F-Score = 71% to 73%) in the largest analyzed dataset, but this advantage is unclear in the smallest one. The Truck Factor algorithm using the proposed models could handle developers missed by the previous expertise model, with a best average F-Score of 74%. It was also perceived as more accurate in computing the Truck Factor of an industrial project. Conclusion: In terms of F-Score, the studied models have similar performance. However, machine learning classifiers achieved higher Precision, while linear models obtained the highest Recall. Therefore, choosing the best technique depends on the user's tolerance for false positives and false negatives. Additionally, the proposed models significantly improved the accuracy of a Truck Factor algorithm, affirming their effectiveness in precisely identifying the key developers within software projects.
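The trade-off noted in the conclusion (similar F-Score, but different Precision and Recall) can be illustrated with a minimal Python sketch. The confusion counts below are invented for illustration only and are not results from the paper; the names Confusion, precise_model, and recall_model are hypothetical. The formulas are the standard definitions of Precision, Recall, and the F-beta score.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for a binary "is this developer a file expert?" classifier."""
    tp: int  # true experts correctly flagged
    fp: int  # non-experts wrongly flagged as experts
    fn: int  # true experts that were missed


def precision(c: Confusion) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Confusion) -> float:
    actual = c.tp + c.fn
    return c.tp / actual if actual else 0.0


def f_score(c: Confusion, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights Recall more, beta < 1 weights Precision more."""
    p, r = precision(c), recall(c)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical counts (NOT from the paper), both over 60 true experts.
precise_model = Confusion(tp=36, fp=9, fn=24)   # fewer false positives
recall_model = Confusion(tp=48, fp=32, fn=12)   # fewer false negatives

for name, c in [("precise", precise_model), ("recall-oriented", recall_model)]:
    print(f"{name}: P={precision(c):.2f} R={recall(c):.2f} "
          f"F1={f_score(c):.3f} F2={f_score(c, beta=2.0):.3f}")
```

With these made-up counts the two classifiers tie on F1, while a Recall-weighted score (beta = 2) favors the second one. A user who is more averse to false positives than to missed experts could instead pick beta < 1, which would favor the first.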
computer science, information systems, software engineering