Combining Code Context and Fine-grained Code Difference for Commit Message Generation

Shengbin Xu, Yuan Yao, Feng Xu, Tianxiao Gu, Hanghang Tong
DOI: https://doi.org/10.1145/3545258.3545274
2022-01-01
Abstract: Generating natural language messages for source code changes is an essential task in software development and maintenance. Existing solutions mainly treat a piece of code difference as natural language and adopt seq2seq learning to translate it into a commit message. Such solutions rest on the naturalness hypothesis, i.e., that source code written in programming languages is to some extent similar to natural language text. However, unlike natural language, source code also exhibits syntactic regularities. In this paper, we propose to simultaneously model the naturalness and the syntactic regularities of source code changes for commit message generation. Specifically, to model syntactic regularities, we first enlarge the input with additional context information, i.e., the code statements that have dependencies on the variables in the code difference, and then extract the paths in the corresponding ASTs. Moreover, to better model the code difference, we align the two versions of the code, before and after the committed change, at the token level and annotate their differences with fine-grained edit operations. The context and the difference are simultaneously encoded in a learning framework to generate commit messages. We collected from GitHub a large dataset containing 480 Java projects with over 160k commits, and the experimental results demonstrate the effectiveness of the proposed approach.
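The token-level alignment and edit annotation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes whitespace-separated tokens, swaps in Python's standard `difflib.SequenceMatcher` as the alignment algorithm, and uses hypothetical names (`annotate_edits`, `PAD`) and the label set `equal`/`replace`/`insert`/`delete` purely for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical placeholder for positions that exist in only one version.
PAD = "<pad>"


def annotate_edits(old_tokens, new_tokens):
    """Align two token lists and label each aligned pair with an edit operation.

    Returns a list of (old_token, new_token, operation) triples, where
    operation is one of "equal", "replace", "insert", or "delete".
    """
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    aligned = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_span = old_tokens[i1:i2]
        new_span = new_tokens[j1:j2]
        # Pad the shorter span so both versions stay position-aligned.
        for k in range(max(len(old_span), len(new_span))):
            old_tok = old_span[k] if k < len(old_span) else PAD
            new_tok = new_span[k] if k < len(new_span) else PAD
            if old_tok == PAD:
                op = "insert"
            elif new_tok == PAD:
                op = "delete"
            else:
                op = tag  # "equal" or "replace" for paired tokens
            aligned.append((old_tok, new_tok, op))
    return aligned


# Example: a type change and a literal change in a Java declaration.
old = "int count = 0 ;".split()
new = "long count = 0L ;".split()
for old_tok, new_tok, op in annotate_edits(old, new):
    print(f"{old_tok:>6} {new_tok:>6} {op}")
```

In a setup like the one the abstract describes, such triples would be fed to an encoder so the model sees both versions plus an explicit edit label per position; how the paper actually encodes them is not specified in this abstract.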