Abstract: Commit message generation (CMG) is a challenging task in automated software engineering that aims to generate natural language descriptions of code changes for commits. Previous methods all start from the modified code snippets, outputting commit messages through template-based, retrieval-based, or learning-based models. While these methods can summarize what is modified from the perspective of code, they struggle to provide reasons for the commit. The correlation between commits and issues, which could be a critical factor for generating rational commit messages, is still unexplored.
In this work, we delve into the correlation between commits and issues from the perspective of dataset and methodology. We construct the first dataset anchored on combining correlated commits and issues. The dataset consists of an unlabeled commit-issue parallel part and a labeled part in which each example is provided with human-annotated rational information in the issue. Furthermore, we propose ExGroFi (Extraction, Grounding, Fine-tuning), a novel paradigm that can introduce the correlation between commits and issues into the training phase of models. To evaluate whether it is effective, we perform comprehensive experiments with various state-of-the-art CMG models. The results show that compared with the original models, the performance of ExGroFi-enhanced models is significantly improved.
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve
This paper attempts to address a significant issue in the task of Commit Message Generation (CMG): existing methods can only extract information from the code itself to generate commit messages but cannot provide the reasons for the modifications. Specifically, current CMG methods mainly rely on code snippet modifications to generate natural language descriptions, but these methods often overlook issues (bug reports or feature requests) related to the commits, which usually contain specific reasons and background information for the modifications.
### Background and Motivation
In the software development process, the role of commit messages is to summarize and explain the purpose of the commits. A high-quality commit message can help code reviewers quickly understand the content and purpose of the commit without delving into complex code. However, manually writing high-quality commit messages is time-consuming and prone to errors. Therefore, researchers have proposed various techniques for automatically generating commit messages, which can convert code modifications into natural language descriptions. Although these methods have improved the generation quality to some extent, they still face the problem of generating meaningless or unreasonable commit messages.
### Main Contributions
1. **High-Quality Commit-Issue Parallel Dataset**: This is the first dataset that can be used to explore the correlation between commits and issues, containing both annotated and unannotated data.
2. **ExGroFi Paradigm**: A new training paradigm called ExGroFi (Extraction, Grounding, Fine-tuning) is proposed, which improves pre-trained CMG models by incorporating rationale information from issue reports.
3. **Comprehensive Evaluation**: Extensive experimental evaluations of ExGroFi were conducted, including comparative studies, performance validation at each stage, and human evaluations. The results show that ExGroFi significantly enhances the performance of CMG models in multiple aspects.
### Method Overview
1. **Dataset Construction**:
- **Collection**: Collect commit and related issue data from Java projects on GitHub.
- **Text Processing**: Preprocess the collected text data, including replacing URLs, code snippets, etc.
- **Filtering**: Remove automatically generated commit messages and excessively long data.
2. **Annotation**:
- **Definition**: Define two types of fine-grained information: issue types (bug reports, feature requests, enhancements) and status information (actual status and expected status).
- **Annotation Process**: Use the Label Studio platform for annotation, with each issue being classified and status information extracted by five annotators.
3. **ExGroFi Paradigm**:
- **Extraction Phase**: Extract status information from issue reports and use issue types as feature inputs.
- **Grounding Phase**: Use the extracted status information as target outputs to train the pre-trained CMG model, resulting in a refined model $F_{\mathrm{grounded}}$.
- **Fine-tuning Phase**: Fine-tune $F_{\mathrm{grounded}}$ on code modifications so that it generates more reasonable commit messages.
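
The steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The placeholder tokens (`<URL>`, `<CODE>`), the auto-generated-message prefixes, the 512-token cap, the `Example` fields, and the way the grounding input is assembled from the diff and the issue type are all assumptions made for illustration; the summary specifies only that URLs and code snippets are replaced, that auto-generated and overly long data are removed, and that status information serves as the grounding target.

```python
import re
from dataclasses import dataclass

# Hypothetical placeholder tokens; the actual tokens are not stated in the summary.
URL_TOKEN = "<URL>"
CODE_TOKEN = "<CODE>"

FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)  # multi-line code blocks
INLINE_CODE = re.compile(r"`[^`\n]+`")             # inline code spans
URL = re.compile(r"https?://\S+")

# Assumed heuristics for spotting auto-generated commit messages.
AUTO_PREFIXES = ("merge branch", "merge pull request", "bump version")
MAX_TOKENS = 512  # assumed length cap, counted in whitespace-separated tokens


def normalize_text(text: str) -> str:
    """Replace code snippets and URLs with placeholders, then collapse whitespace."""
    text = FENCED_CODE.sub(CODE_TOKEN, text)  # fenced blocks first
    text = INLINE_CODE.sub(CODE_TOKEN, text)
    text = URL.sub(URL_TOKEN, text)
    return " ".join(text.split())


def keep_example(commit_message: str, issue_text: str) -> bool:
    """Drop auto-generated commit messages and excessively long examples."""
    if commit_message.strip().lower().startswith(AUTO_PREFIXES):
        return False
    return (len(commit_message.split()) <= MAX_TOKENS
            and len(issue_text.split()) <= MAX_TOKENS)


@dataclass
class Example:
    diff: str
    commit_message: str
    issue_type: str        # bug report, feature request, or enhancement
    actual_status: str     # status information extracted from the issue
    expected_status: str


def grounding_pair(ex: Example) -> tuple[str, str]:
    """Grounding stage: the target is the status information.

    Assumption: the model input is the diff prefixed with the issue type.
    """
    source = f"[{ex.issue_type}] {ex.diff}"
    target = f"actual: {ex.actual_status} expected: {ex.expected_status}"
    return source, target


def finetuning_pair(ex: Example) -> tuple[str, str]:
    """Fine-tuning stage: code modification in, commit message out."""
    return ex.diff, ex.commit_message
```

In this sketch, a model would first be trained on the `grounding_pair` outputs and then on the `finetuning_pair` outputs, mirroring the order of the grounding and fine-tuning phases.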
### Experimental Results
Experimental results show that the ExGroFi paradigm significantly improves the performance of existing CMG models, especially in generating commit messages that include reasons and background information for the modifications. Additionally, human evaluations confirm that the commit messages generated by ExGroFi are improved in terms of reasonableness, comprehensiveness, conciseness, and expressiveness.
### Conclusion
By constructing a high-quality commit-issue parallel dataset and proposing the ExGroFi paradigm, this paper effectively addresses the problem of existing CMG methods being unable to generate reasonable commit messages, providing an important foundation and direction for future related research.