Abstract:
Context: Identifying source code expertise is useful in several situations. Activities like bug fixing and helping newcomers are best performed by knowledgeable developers. Some studies have proposed repository-mining techniques to identify source code experts. However, there is a gap in understanding which variables are most related to code knowledge and how they can be used for identifying expertise.
Objective: This study explores models of expertise identification and how these models can be used to improve a Truck Factor algorithm.
Methods: First, we build an oracle with the knowledge of developers from software projects. Then, we use this oracle to analyze the correlation between measures from the development history and source code knowledge. We investigate the use of linear and machine-learning models to identify file experts. Finally, we use the proposed models to improve a Truck Factor algorithm and analyze their performance using data from public and private repositories.
Results: First Authorship and Recency of Modification have the highest positive and negative correlations with source code knowledge, respectively. Machine learning classifiers outperformed the linear techniques (F-Score = 71% to 73%) in the largest analyzed dataset, but this advantage is unclear in the smallest one. The Truck Factor algorithm using the proposed models could handle developers missed by the previous expertise model, with a best average F-Score of 74%. It was also perceived as more accurate in computing the Truck Factor of an industrial project.
Conclusion: In terms of F-Score, the studied models have similar performance. However, machine learning classifiers achieved higher Precision, while linear models achieved the highest Recall. Therefore, choosing the best technique depends on the user's tolerance for false positives and false negatives. Additionally, the proposed models significantly improved the accuracy of a Truck Factor algorithm, affirming their effectiveness in precisely identifying the key developers within software projects.
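
The trade-off described in the conclusion can be illustrated with a minimal sketch. It assumes expertise is represented as a set of (developer, file) pairs; the developer names, file names, and the two hypothetical models ("cautious" and "liberal") are invented for illustration and are not data or models from the study. The sketch shows how two models can reach the same F-Score while differing sharply in Precision and Recall.

```python
def precision_recall_f1(predicted, actual):
    """Compute Precision, Recall, and F-Score (F1) for predicted expert pairs.

    Both arguments are iterables of (developer, file) pairs. This pair-based
    representation is an assumption made for illustration only.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical oracle: who actually knows which file.
actual = {("alice", "a.py"), ("bob", "b.py"), ("carol", "c.py"), ("dave", "d.py")}

# Hypothetical "cautious" model: few predictions, all correct.
cautious = {("alice", "a.py"), ("bob", "b.py")}

# Hypothetical "liberal" model: finds every expert but adds false positives.
liberal = actual | {("erin", "a.py"), ("erin", "b.py"),
                    ("frank", "c.py"), ("frank", "d.py")}

for name, model in (("cautious", cautious), ("liberal", liberal)):
    p, r, f = precision_recall_f1(model, actual)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy setup both models have an F-Score of about 0.67, yet the cautious one never names a non-expert while the liberal one never misses an expert. Which behavior is preferable depends on whether a false positive or a false negative is more costly for the task at hand.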

GEMiner: Mining Social and Programming Behaviors to Identify Experts in Github

Mining the Network of the Programmers

SCSMiner: Mining Social Coding Sites for Software Developer Recommendation with Relevance Propagation.

Automatically Deriving Developers’ Technical Expertise from the GitHub Social Network

A Large Scale Study of Long-Time Contributor Prediction for GitHub Projects

Developer Identity Linkage and Behavior Mining Across GitHub and StackOverflow.

Profiling Developer Expertise Across Software Communities with Heterogeneous Information Network Analysis.

A Collaboration-Aware Approach to Profiling Developer Expertise with Cross-Community Data

An Exploratory Research of GitHub Based on Graph Model

DevRank: Mining Influential Developers In Github

Recommending relevant projects via user behaviour: an exploratory study on github.

Automatic Detection of Public Development Projects in Large Open Source Ecosystems: an Exploratory Study on GitHub

g-Miner: Interactive Visual Group Mining on Multivariate Graphs.

User Influence Analysis for Github Developer Social Networks

Exploring the Patterns of Social Behavior in GitHub

Source code expert identification: Models and application

Mining DEV for social and technical insights about software development

Automatically Modeling Developer Programming Ability and Interest Across Software Communities

Investigating Cross-Repository Socially Connected Teams on GitHub

On GitHub's Programming Languages

Influence Analysis of Github Repositories