Reducing the Impact of Time Evolution on Source Code Authorship Attribution via Domain Adaptation

Zhen Li, Shasha Zhao, Chen Chen, Qian Chen
DOI: https://doi.org/10.1145/3652151
IF: 3.685
2024-03-11
ACM Transactions on Software Engineering and Methodology
Abstract: Source code authorship attribution is an important problem in practical applications such as plagiarism detection, software forensics, and copyright disputes. Recent studies show that existing methods for source code authorship attribution can be significantly affected by time evolution, leading to a year-by-year decrease in attribution accuracy. To alleviate the accuracy degradation of Deep Learning (DL)-based source code authorship attribution caused by time evolution, we propose a new framework called Time Domain Adaptation (TimeDA), which adds new feature extractors to the original DL-based code attribution framework to enhance the original model's ability to learn source domain features without requiring new or more source data. Moreover, we employ a centroid-based pseudo-labeling strategy using neighborhood clustering entropy for adaptive learning to improve the robustness of DL-based code authorship attribution. Experimental results show that TimeDA can significantly enhance the robustness of DL-based source code authorship attribution to time evolution, with an average improvement of 8.7% on the Java dataset and 5.2% on the C++ dataset. In addition, TimeDA benefits from the centroid-based pseudo-labeling strategy, which reduces model training time by 87.3% compared to traditional unsupervised domain adaptation methods.
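The abstract names a centroid-based pseudo-labeling strategy but does not spell out its steps. The following is only a generic sketch of how centroid-based pseudo-labeling is commonly done in unsupervised domain adaptation, not the paper's actual procedure: it omits the neighborhood clustering entropy component entirely, and the function name `centroid_pseudo_labels`, its parameters, and the use of cosine similarity are all assumptions made for illustration.

```python
import numpy as np

def centroid_pseudo_labels(features, probs, n_iter=2):
    """Assign pseudo-labels to unlabeled target-domain samples.

    Hypothetical helper, not taken from the paper.
    features: (n, d) array of feature vectors for the target samples.
    probs:    (n, k) array of softmax outputs of a source-trained model.
    Returns an (n,) array of integer pseudo-labels (author indices).
    """
    features = np.asarray(features, dtype=float)
    probs = np.asarray(probs, dtype=float)
    eps = 1e-12

    # L2-normalise features so a dot product is a cosine similarity.
    feats = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)

    weights = probs  # soft assignments on the first pass
    labels = probs.argmax(axis=1)
    for _ in range(n_iter):
        # Weighted class centroids, shape (k, d).
        centroids = weights.T @ feats
        centroids = centroids / (weights.sum(axis=0)[:, None] + eps)
        # Normalise centroids and pick the most similar one per sample.
        cn = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + eps)
        labels = (feats @ cn.T).argmax(axis=1)
        # Later passes use hard (one-hot) assignments.
        weights = np.eye(probs.shape[1])[labels]
    return labels
```

In this sketch, a sample whose classifier prediction is wrong can still receive the label of the centroid its features lie closest to, which is the usual motivation for centroid-based relabeling under distribution shift.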
computer science, software engineering