Abstract: Automatic code summarization, the task of generating concise natural language descriptions for code snippets, is critical for helping software developers and maintainers understand programs efficiently. Deep learning-based methods have made impressive strides by drawing on neural machine translation (NMT) research from natural language processing (NLP), but their ability to understand and model semantic information remains limited by the unique nature of programming languages. In response, we propose two methods to boost the performance of code summarization models: context-based code abbreviation expansion and unigram language model-based subword segmentation. We employ a series of heuristics to expand abbreviations within identifiers, thereby eliminating the semantic ambiguity associated with these abbreviations and enhancing the language alignment capabilities of code summarization models. Furthermore, we use the subword segmentation algorithm to tokenize code into finer-grained subword sequences, which injects more semantic information into the training and inference stages of the models and thereby improves their program understanding ability. The proposed methods are model-agnostic and can be readily integrated into existing automatic code summarization approaches. Experiments conducted on two widely used Java code summarization datasets demonstrated the effectiveness of these methods. Specifically, by fusing representations of both the original and the modified code into the prevailing Transformer model, our Semantic Enhanced Transformer for Code Summarization (SETCS) can serve as a robust baseline at the semantic level. Notably, by simply modifying the datasets, our methods achieved performance improvements of up to 7.3%, 10.0%, and 6.7% for representative code summarization models in terms of BLEU-4, METEOR, and ROUGE-L, respectively.
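The two preprocessing ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the heuristics, so the subsequence-matching rule in `is_abbreviation_of`, the helper names, the `unknown_log_prob` fallback, and the toy log-probabilities in `TOY_LOG_PROBS` are all assumptions made for illustration. The segmentation function performs Viterbi decoding over a fixed unigram subword vocabulary, which is the inference step of a unigram language model tokenizer; training the vocabulary is not shown.

```python
import math
import re


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]


def is_abbreviation_of(abbr, word):
    """Assumed heuristic: abbr is a shorter subsequence of word with the same first letter."""
    if len(abbr) >= len(word) or abbr[0] != word[0]:
        return False
    remaining = iter(word)
    return all(ch in remaining for ch in abbr)


def expand_identifier(identifier, context_words):
    """Expand each identifier part using the first matching word found in the context."""
    expanded = []
    for part in split_identifier(identifier):
        candidates = [w for w in context_words if is_abbreviation_of(part, w)]
        expanded.append(candidates[0] if candidates else part)
    return expanded


def unigram_segment(word, log_probs, unknown_log_prob=-20.0):
    """Return the most probable segmentation of word under a unigram subword model."""
    n = len(word)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of word[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last piece in word[:i]
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = log_probs[piece]
            elif len(piece) == 1:
                score = unknown_log_prob  # fallback so every string can be segmented
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    pieces = []
    end = n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Toy vocabulary with made-up log-probabilities (illustration only).
TOY_LOG_PROBS = {"summar": -4.0, "ization": -3.5, "sum": -3.0, "mar": -5.0, "ize": -4.0}

if __name__ == "__main__":
    # Context words would come from, e.g., comments or nearby code (an assumption).
    print(expand_identifier("msgCnt", ["message", "count", "returns"]))  # ['message', 'count']
    print(unigram_segment("summarization", TOY_LOG_PROBS))  # ['summar', 'ization']
```

In this sketch, expansion runs before segmentation, so the tokenizer sees `message` and `count` rather than `msg` and `cnt`. Whether the paper applies the two steps in that order is not stated in the abstract.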

Project-Specific Code Summarization with In-Context Learning

Iipcs: Intent-Based In-Context Learning for Project-Specific Code Summarization

Source Code Summarization in the Era of Large Language Models

A Prompt Learning Framework for Source Code Summarization

Why My Code Summarization Model Does Not Work

Context-aware Code Summary Generation

Why My Code Summarization Model Does Not Work: Code Comment Improvement with Category Prediction

Leveraging In-and-Cross Project Pseudo-Summaries for Project-Specific Code Summarization

Contextual Information Enhanced Source Code Summarization

ESALE: Enhancing Code-Summary Alignment Learning for Source Code Summarization

Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization)

Low-Resources Project-Specific Code Summarization

Learning to Generate Structured Code Summaries From Hybrid Code Context

Can Large Language Models Serve as Evaluators for Code Summarization?

Automatic Code Summarization Using Abbreviation Expansion and Subword Segmentation

Towards Retrieval-Based Neural Code Summarization: A Meta-Learning Approach

Learning a Holistic and Comprehensive Code Representation for Code Summarization

Active Learning for Low-Resource Project-Specific Code Summarization

ClassSum: a Deep Learning Model for Class-Level Code Summarization

Context-based Transfer Learning for Low Resource Code Summarization