Abstract: (Source) code summarization aims to automatically generate succinct natural language summaries for given code snippets. Such summaries play a significant role in helping developers understand and maintain code. Inspired by neural machine translation, deep learning-based code summarization techniques widely adopt an encoder-decoder framework, where the encoder transforms given code snippets into context vectors and the decoder decodes these context vectors into summaries. Recently, large-scale pre-trained models for source code (e.g., CodeBERT and UniXcoder) have been equipped with encoders capable of producing general context vectors and have achieved substantial improvements on the code summarization task. However, although they are usually trained mainly on code-focused tasks and can capture general code features, they still fall short in capturing the specific features that need to be summarized. In short, they fail to learn the alignment between code snippets and summaries (hereafter, code-summary alignment). In this paper, we propose a novel approach to improve code summarization based on summary-focused tasks. Specifically, we exploit a multi-task learning paradigm to train the encoder on three summary-focused tasks, namely unidirectional language modeling (ULM), masked language modeling (MLM), and action word prediction (AWP), to enhance its ability to learn code-summary alignment. Unlike pre-trained models that mainly predict masked tokens in code snippets, we design ULM and MLM to predict masked words in summaries. Intuitively, predicting summary words based on given code snippets helps the encoder learn the code-summary alignment. In addition, existing work shows that AWP affects the prediction of the entire summary. Therefore, we further introduce the domain-specific task AWP to enhance the ability of the encoder to learn the alignment between action words and code snippets.
We evaluate the effectiveness of our approach, called Esale, by conducting extensive experiments on four datasets: two widely used datasets (JCSD and PCSD), a cross-project Java dataset (CPJD), and a multilingual dataset (CodeSearchNet). Experimental results show that Esale significantly outperforms state-of-the-art baselines on all three widely used metrics: BLEU, METEOR, and ROUGE-L. Moreover, the human evaluation shows that the summaries generated by Esale are more informative and closer to the ground-truth summaries.
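To make the three summary-focused tasks more concrete, the following is a minimal sketch of how training examples for ULM, MLM, and AWP might be derived from a single (code, summary) pair. It is not the Esale implementation: the function names, the `[MASK]` token, the masking probability of 0.15, the choice of the first summary word as the action word, and the equally weighted loss sum are all assumptions made for illustration only.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration


def action_word(summary_tokens):
    """AWP target. Assumption: the first summary word is the action word."""
    return summary_tokens[0].lower()


def mlm_example(code_tokens, summary_tokens, mask_prob=0.15, rng=None):
    """MLM: mask words in the summary only; code tokens stay visible."""
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, tok in enumerate(summary_tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # key is the position inside the summary
        else:
            masked.append(tok)
    if not targets:  # make sure there is at least one word to predict
        i = rng.randrange(len(summary_tokens))
        targets[i] = summary_tokens[i]
        masked[i] = MASK
    return code_tokens + masked, targets


def ulm_examples(code_tokens, summary_tokens):
    """ULM: predict each summary word from the code plus the earlier words."""
    return [
        (code_tokens + summary_tokens[:i], summary_tokens[i])
        for i in range(len(summary_tokens))
    ]


def multitask_loss(losses, weights=None):
    """Weighted sum of per-task losses; equal weights are an assumption."""
    weights = weights or {name: 1.0 for name in losses}
    return sum(weights[name] * value for name, value in losses.items())


code = "public int add ( int a , int b ) { return a + b ; }".split()
summary = "Returns the sum of two integers".split()

mlm_input, mlm_targets = mlm_example(code, summary)
ulm_pairs = ulm_examples(code, summary)
awp_target = action_word(summary)
```

In all three cases the prediction target is a summary word and the code snippet is the conditioning context, which is the intuition the abstract gives for why these tasks could help an encoder learn code-summary alignment.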

Contrastive Aligned Joint Learning for Multilingual Summarization

MultiSumm: Towards a Unified Model for Multi-Lingual Abstractive Summarization

ConVerSum: A Contrastive Learning based Approach for Data-Scarce Solution of Cross-Lingual Summarization Beyond Direct Equivalents

Align and Attend: Multimodal Summarization with Dual Contrastive Losses

SimCSum: Joint Learning of Simplification and Cross-lingual Summarization for Cross-lingual Science Journalism

Jointly Learning to Align and Summarize for Neural Cross-Lingual Summarization

Towards Unifying Multi-Lingual and Cross-Lingual Summarization

Sentence salience contrastive learning for abstractive text summarization

NCLS: Neural Cross-Lingual Summarization

An End-to-End Speech Summarization Using Large Language Model

Controllable Multi-document Summarization: Coverage & Coherence Intuitive Policy with Large Language Model Based Rewards

Sequence Level Contrastive Learning for Text Summarization

Esale: Enhancing Code-Summary Alignment Learning for Source Code Summarization

Align vision-language semantics by multi-task learning for multi-modal summarization

Large Scale Multi-Lingual Multi-Modal Summarization Dataset

Revisiting Cross-Lingual Summarization: A Corpus-based Study and A New Benchmark with Improved Annotation

Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization

On Learning to Summarize with Large Language Models as References

CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization

MLSUM: The Multilingual Summarization Corpus