KADEL: Knowledge-Aware Denoising Learning for Commit Message Generation

Wei Tao, Yucheng Zhou, Yanlin Wang, Hongyu Zhang, Haofen Wang, Wenqiang Zhang
2024-01-16
Abstract: Commit messages are natural language descriptions of code changes, which are important for software evolution such as code understanding and maintenance. However, previous methods are trained on the entire dataset without considering the fact that a portion of commit messages adhere to good practice (i.e., good-practice commits), while the rest do not. On the basis of our empirical study, we discover that training on good-practice commits significantly contributes to the commit message generation. Motivated by this finding, we propose a novel knowledge-aware denoising learning method called KADEL. Considering that good-practice commits constitute only a small proportion of the dataset, we align the remaining training samples with these good-practice commits. To achieve this, we propose a model that learns the commit knowledge by training on good-practice commits. This knowledge model enables supplementing more information for training samples that do not conform to good practice. However, since the supplementary information may contain noise or prediction errors, we propose a dynamic denoising training method. This method composes a distribution-aware confidence function and a dynamic distribution list, which enhances the effectiveness of the training process. Experimental results on the whole MCMD dataset demonstrate that our method overall achieves state-of-the-art performance compared with previous methods. Our source code and data are available at
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to improve the quality of automatically generated commit messages, specifically by making better use of high-quality commit messages (those that follow good practices). The authors observe that existing methods train commit message generation models on the entire dataset, without distinguishing commit messages that follow good practices (such as the AngularJS rules) from those that do not. As a result, the quality of the generated commit messages is uneven.

### Main contributions

1. **Empirical study**: The authors found empirically that training on commit messages that follow good practices significantly improves the quality of generated commit messages.
2. **The KADEL method**: Based on this finding, the authors proposed a Knowledge-Aware Denoising Learning method (KADEL). It improves generation through two steps:
   - **Constructing a knowledge model**: Train a knowledge model on good-practice data so that it learns to predict the type and scope of commit messages.
   - **Dynamic denoising training**: Apply a dynamic denoising training method that combines a distribution-aware confidence function with a dynamic distribution list to reduce the impact of noise on training.
3. **Experimental verification**: The experiments show that KADEL achieves state-of-the-art performance on test sets covering multiple programming languages, and that each component is effective.

### Key technologies in the solution

- **Knowledge model**: Trained on data that follows good practices, the knowledge model supplements the type and scope information missing from the original commit messages.
- **Dynamic denoising training**: To handle the noise that the knowledge model may introduce, the authors proposed a dynamic denoising training method. It uses the Expectation-Maximization (EM) algorithm to infer the distributions of clean and noisy data, then re-weights the training samples according to these distributions.

### Experimental results

KADEL outperforms other strong baselines in overall performance and performs well on the test sets of the different programming languages. The authors further verified the method's effectiveness through extensive analysis and human evaluation.

### Conclusion

By introducing knowledge learned from commit messages that follow good practices and combining it with dynamic denoising training, KADEL significantly improves the quality of generated commit messages, providing better support for software development and maintenance.
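
To make the notion of a good-practice commit more concrete, the sketch below checks whether a message header follows the AngularJS-style shape `<type>(<scope>): <subject>` and splits a dataset accordingly. It is only an illustration: the names `parse_header` and `split_by_practice` are hypothetical, the type list is taken from the AngularJS commit guidelines, and treating the scope as optional is an assumption that may differ from the filter the authors actually used.

```python
import re

# Commit types listed in the AngularJS commit message guidelines.
ANGULAR_TYPES = ("feat", "fix", "docs", "style", "refactor", "perf", "test", "chore")

# Header shape: <type>(<scope>): <subject>. The scope is optional here (assumption).
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(ANGULAR_TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r": (?P<subject>\S.*)$"
)


def parse_header(message):
    """Return (type, scope, subject) if the first line matches, else None."""
    lines = message.splitlines()
    first_line = lines[0].strip() if lines else ""
    match = HEADER_RE.match(first_line)
    if match is None:
        return None
    return match.group("type"), match.group("scope"), match.group("subject")


def split_by_practice(messages):
    """Partition messages into (good_practice, other) lists."""
    good, other = [], []
    for message in messages:
        (good if parse_header(message) is not None else other).append(message)
    return good, other


if __name__ == "__main__":
    sample = ["fix(parser): handle empty input", "Update README", "docs: fix typo"]
    good, other = split_by_practice(sample)
    print(good)   # messages whose header matches the pattern
    print(other)  # everything else
```

Under this reading, the first list would be the data a knowledge model is trained on, and the second list holds the samples whose type and scope would need to be supplemented.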
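
The EM-based re-weighting described under dynamic denoising training can be illustrated with a generic stand-in. The sketch below fits a two-component one-dimensional Gaussian mixture to per-sample training losses and uses the posterior probability of the low-loss component as a sample weight. This summary does not define KADEL's distribution-aware confidence function or its dynamic distribution list, so the Gaussian mixture, the function name `clean_weights`, and its parameters are all assumptions rather than the authors' formulation.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def clean_weights(losses, iterations=50, min_var=1e-6):
    """Fit a 2-component Gaussian mixture with EM; return P(clean | loss) per sample.

    The component with the lower mean loss is treated as the "clean" one
    (an assumption of this sketch).
    """
    if not losses:
        return []
    lo, hi = min(losses), max(losses)
    means = [lo, hi]
    spread = max((hi - lo) ** 2 / 4.0, min_var)
    variances = [spread, spread]
    priors = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in losses]

    for _ in range(iterations):
        # E-step: posterior responsibility of each component for each loss.
        resp = []
        for x in losses:
            joint = [priors[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint] if total > 0.0 else [0.5, 0.5])
        # M-step: re-estimate priors, means, and variances from responsibilities.
        for k in range(2):
            mass = sum(r[k] for r in resp)
            if mass < 1e-12:
                continue  # component has collapsed; keep its old parameters
            priors[k] = mass / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / mass
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / mass
            variances[k] = max(var, min_var)

    clean = 0 if means[0] <= means[1] else 1
    return [r[clean] for r in resp]


if __name__ == "__main__":
    losses = [0.1, 0.2, 0.15, 0.12, 0.18, 2.9, 3.1, 3.0]
    weights = clean_weights(losses)
    # Weighted average loss: likely-noisy samples contribute less.
    weighted = sum(w * l for w, l in zip(weights, losses)) / sum(weights)
    print([round(w, 3) for w in weights], round(weighted, 3))
```

In a training loop, weights like these would be recomputed as the loss distribution shifts, which is one plausible reading of the "dynamic" aspect; the paper's actual update rule may differ.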