Abstract: Context: Identifying source code expertise is useful in several situations. Activities like bug fixing and helping newcomers are best performed by knowledgeable developers. Some studies have proposed repository-mining techniques to identify source code experts. However, there is a gap in understanding which variables are most related to code knowledge and how they can be used for identifying expertise. Objective: This study explores models of expertise identification and how these models can be used to improve a Truck Factor algorithm. Methods: First, we built an oracle with the knowledge of developers from software projects. Then, we used this oracle to analyze the correlation between measures from the development history and source code knowledge. We investigated the use of linear and machine-learning models to identify file experts. Finally, we used the proposed models to improve a Truck Factor algorithm and analyzed their performance using data from public and private repositories. Results: First Authorship and Recency of Modification have the highest positive and negative correlations with source code knowledge, respectively. Machine learning classifiers outperformed the linear techniques (F-Score = 71% to 73%) in the largest analyzed dataset, but this advantage is unclear in the smallest one. The Truck Factor algorithm using the proposed models could handle developers missed by the previous expertise model, with the best average F-Score of 74%. It was also perceived as more accurate in computing the Truck Factor of an industrial project. Conclusion: In terms of F-Score, the studied models have similar performance. However, machine learning classifiers achieved higher Precision, while linear models obtained the highest Recall. Therefore, choosing the best technique depends on the user's tolerance for false positives and false negatives. Additionally, the proposed models significantly improved the accuracy of a Truck Factor algorithm, affirming their effectiveness in precisely identifying the key developers within software projects.
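The abstract does not describe the Truck Factor algorithm itself. As a rough illustration of how per-file expert sets could feed such a computation, the following Python sketch uses a common greedy heuristic: repeatedly remove the developer who is an expert on the most files, and stop once more than half of the files have no expert left. The function name `truck_factor`, the 50% threshold, the alphabetical tie-break, and the sample data are all assumptions made for this example, not details taken from the study.

```python
from collections import Counter


def truck_factor(file_experts, orphan_threshold=0.5):
    """Greedy Truck Factor estimate (illustrative heuristic, not the paper's code).

    file_experts maps each file path to the set of developers considered
    experts on it (for example, the output of an expertise model).
    Returns (truck_factor, removed_developers).
    """
    # Copy the expert sets so the caller's data is not mutated.
    remaining = {path: set(devs) for path, devs in file_experts.items()}
    total = len(remaining)
    removed = []

    while total > 0:
        orphaned = sum(1 for devs in remaining.values() if not devs)
        # Stop once more than the threshold share of files has no expert.
        if orphaned / total > orphan_threshold:
            break
        # Count how many files each remaining developer is an expert on.
        counts = Counter(dev for devs in remaining.values() for dev in devs)
        # Pick the developer covering the most files; ties break alphabetically.
        top_dev = max(sorted(counts), key=counts.get)
        removed.append(top_dev)
        for devs in remaining.values():
            devs.discard(top_dev)

    return len(removed), removed


# Hypothetical expert assignments for a four-file project.
file_experts = {
    "a.py": {"alice"},
    "b.py": {"alice", "bob"},
    "c.py": {"bob"},
    "d.py": {"carol"},
}

print(truck_factor(file_experts))  # (2, ['alice', 'bob'])
```

In this toy project, removing alice orphans one file and removing bob orphans two more, so three of four files lose all experts after two departures. A more precise expertise model changes the input sets, which is how it can change the resulting Truck Factor.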

Authorship Identification Of Source Codes

Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering

Authorship attribution of source code by using back propagation neural network based on particle swarm optimization

A Practical Black-Box Attack on Source Code Authorship Identification Classifiers

Recognizing and Imitating Programmer Style: Adversaries in Program Authorship Attribution

A3Ident: A Two-phased Approach to Identify the Leading Authors of Android Apps

Git Blame Who?: Stylistic Authorship Attribution of Small, Incomplete Source Code Fragments

Code stylometry vs formatting and minification

Misleading Authorship Attribution of Source Code using Adversarial Learning

Authorship Identification Based on Semantic Analysis

AuthAttLyzer-V2: Unveiling Code Authorship Attribution using Enhanced Ensemble Learning Models & Generating Benchmark Dataset

SHIELD: Thwarting Code Authorship Attribution

Authorship Identification of Source Code Segments Written by Multiple Authors Using Stacking Ensemble Method

Adversarial Binaries for Authorship Identification

Source code expert identification: Models and application

I still know it's you! On Challenges in Anonymizing Source Code

Who Made This Copy? An Empirical Analysis of Code Clone Authorship

OCEAN: Open-World Contrastive Authorship Identification

Android Authorship Attribution Using Source Code-Based Features

Assessing Code Authorship: The Case of the Linux Kernel