Automated Commit Intelligence by Pre-training
Shangqing Liu,Yanzhou Li,Xiaofei Xie,Wei Ma,Guozhu Meng,Yang Liu
DOI: https://doi.org/10.1145/3674731
IF: 3.685
2024-07-01
ACM Transactions on Software Engineering and Methodology
Abstract:GitHub commits, which record the code changes with natural language messages for description, play a critical role in software developers’ comprehension of software evolution. Due to their importance in software development, several learning-based works are conducted for GitHub commits, such as commit message generation and security patch identification. However, most existing works focus on customizing specialized neural networks for different tasks. Inspired by the superiority of code pre-trained models, which has confirmed their effectiveness across different downstream tasks, to promote the development of open-source software community, we first collect a large-scale commit benchmark including over 7.99 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a pre-trained encoder-decoder Transformer model for GitHub commits. The model is pre-trained by three categories (i.e., denoising objectives, cross-modal generation, and contrastive learning) for six pre-training tasks to learn commit fragment representations. Our model is evaluated on one understanding task and three generation tasks for commits. The comprehensive experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained works for code. Further analysis also reveals that each pre-training task enhances the model performance.
computer science, software engineering