A Data Augmentation Method for Source Code Summarization
Zixuan Song, Hui Zeng, Xiuwei Shang, Guanxi Li, Hui Li, Shikai Guo
DOI: https://doi.org/10.1016/j.neucom.2023.126385
IF: 6
2023-06-22
Neurocomputing
Abstract: Code comments improve the readability and intelligibility of code. Unfortunately, code comments are often missing or outdated in software projects, which makes it harder for developers to infer functionality from source code and reduces the efficiency of software maintenance and evolution. To solve this problem, many source code summarization algorithms have been proposed, which automatically generate code comments from source code. These methods usually rely on a large dataset that maps source code to code comments in order to train models. However, such training sets have two limitations: the insufficient data collection limitation (i.e., the difficulty of obtaining a large amount of noise-free training data) and the data distribution bias limitation (i.e., the scarcity of training data for infrequently used methods). To address these issues, we propose a data augmentation method for code comments, named CDA-CS. When trained on the augmented dataset, state-of-the-art algorithms achieve a further 1.37% to 2.24% improvement across different evaluation metrics (i.e., BLEU-4, METEOR, ROUGE-L).
computer science, artificial intelligence