Automatic Code Summarization Using Abbreviation Expansion and Subword Segmentation
Y. T. Liang, Guisheng Fan, Mingchen Li, Zijie Huang
DOI: https://doi.org/10.2139/ssrn.4694357
2024-01-01
Abstract: Automatic code summarization, the task of generating concise natural language descriptions for code snippets, is critical to making program understanding more efficient for software developers and maintainers. Despite the impressive strides made by deep learning-based methods, which draw on neural machine translation (NMT) research in natural language processing (NLP), these methods remain limited in their ability to understand and model semantic information because of the unique nature of programming languages. In response, we propose two methods to boost the performance of code summarization models: context-based code abbreviation expansion and unigram language model-based subword segmentation. We employ a series of heuristics to expand abbreviations within identifiers, thereby eliminating the semantic ambiguity associated with these abbreviations and enhancing the language alignment capabilities of code summarization models. Furthermore, we use the subword segmentation algorithm to tokenize code into finer-grained subword sequences, which infuses more semantic information into the training and inference stages of the models and thereby strengthens their program understanding ability. The proposed methods are model-agnostic and can be readily integrated into existing automatic code summarization approaches. Experiments on two widely used Java code summarization datasets demonstrate the effectiveness of these methods. Specifically, by fusing representations of both the original and the modified code into the prevailing Transformer model, our Semantic Enhanced Transformer for Code Summarization (SETCS) can serve as a robust baseline at the semantic level. Notably, by simply modifying the datasets, our methods achieve performance improvements of up to 7.3%, 10.0%, and 6.7% for representative code summarization models in terms of BLEU-4, METEOR, and ROUGE-L, respectively.
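The two preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not spell out the heuristics, so the matching rule used here (an abbreviation expands to the shortest context word that starts with the same letter and contains the abbreviation as a subsequence), the function names, and the toy vocabulary with made-up log-probabilities are all assumptions chosen for illustration. The segmentation function shows only the inference step of a unigram language model (a Viterbi search over candidate pieces), not how such a model is trained.

```python
import math
import re


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase parts."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def is_subsequence(abbr, word):
    """True if the characters of abbr appear in word in the same order."""
    it = iter(word)
    return all(ch in it for ch in abbr)


def expand_identifier(name, context_words):
    """Hypothetical heuristic: expand each part using words from the context.

    context_words stands in for words gathered from nearby comments or
    other identifiers; parts with no matching word are kept unchanged.
    """
    expanded = []
    for part in split_identifier(name):
        candidates = [
            w for w in context_words
            if len(w) > len(part) and w[0] == part[0] and is_subsequence(part, w)
        ]
        # Prefer the shortest candidate; break ties alphabetically.
        expanded.append(min(candidates, key=lambda w: (len(w), w)) if candidates else part)
    return expanded


def unigram_segment(token, log_probs):
    """Viterbi search for the most probable split under a unigram model."""
    n = len(token)
    best = [(-math.inf, 0)] * (n + 1)  # (best score, start of last piece)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return [token]  # no valid split: keep the token whole
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(token[start:end])
        end = start
    return pieces[::-1]


# Toy inputs (assumed, not taken from the paper's datasets).
CONTEXT = ["count", "messages", "queue", "returns"]
TOY_LOG_PROBS = {"message": -3.0, "mess": -4.0, "age": -4.0, "s": -2.0, "count": -3.0}

words = expand_identifier("getMsgCnt", CONTEXT)
print(words)  # ['get', 'messages', 'count']
print([unigram_segment(w, TOY_LOG_PROBS) for w in words])
# [['get'], ['message', 's'], ['count']]
```

In this sketch the expansion runs first, so the segmenter sees full words rather than abbreviations. Whether the paper applies the two steps in that order is not stated in the abstract.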