Abstract: Code summarization aims to provide a high-level comment for a code snippet, typically describing the function and intent of the given code. Recent years have seen the successful application of data-driven approaches to code summarization. To improve model performance, numerous approaches use abstract syntax trees (ASTs) to represent the structural information of the code, which most researchers consider the main factor that distinguishes code from natural language. Such data-driven methods are then trained on large-scale labeled datasets to obtain a model with strong generalization capabilities that can be applied to new examples. Nevertheless, we argue that state-of-the-art approaches suffer from two key weaknesses: (1) inefficient encoding of ASTs; (2) reliance on a large labeled corpus for model training. These drawbacks lead to (1) oversized models, slow training, information loss, and instability; (2) an inability to handle programming languages for which only a small amount of labeled data is available. In light of these weaknesses, we propose PassSum, a code summarization approach that addresses them via (1) a novel input representation that contains an efficient AST encoding method; (2) three pretraining objectives, with which we pretrain our model on a large amount of (easy-to-obtain) unlabeled data in a self-supervised manner. Experimental results on code summarization for Java, Python, and Ruby methods demonstrate the superiority of PassSum over state-of-the-art methods. Further experiments demonstrate that the input representation we use has advantages in both time and space, in addition to leading performance. Pretraining is also shown to make the model more generalizable with less labeled data and to speed up convergence during training. Our contributions are as follows.
We propose a novel input representation containing an efficient AST encoding method that captures syntactic information, together with an additional input part that carries lexical information. To improve the robustness of PassSum, we pretrain it with three pretraining objectives on a large amount of unlabeled data. We empirically show that PassSum outperforms state-of-the-art code summarizers and demonstrate the time and space efficiency of the input representation used by PassSum.
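To make the idea of encoding an AST through its paths more concrete, the following is a minimal sketch that lists the root-to-leaf paths of a Python snippet's AST using the standard `ast` module. It is an illustration only: the abstract does not specify how PassSum selects or encodes paths, so the function name `root_to_leaf_paths` and the choice of root-to-leaf paths over node type names are assumptions, not the authors' method.

```python
import ast


def root_to_leaf_paths(source):
    """Return all root-to-leaf paths of the AST of `source`.

    Each path is a list of AST node type names. This is a hypothetical
    helper for illustration, not PassSum's actual encoding.
    """
    tree = ast.parse(source)
    paths = []

    def walk(node, prefix):
        current = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:
            # A node without child nodes ends one path.
            paths.append(current)
            return
        for child in children:
            walk(child, current)

    walk(tree, [])
    return paths


snippet = "def add(a, b):\n    return a + b\n"
paths = root_to_leaf_paths(snippet)
for path in paths:
    print(" > ".join(path))
```

Each printed line is one path from the `Module` root down to a leaf node. A flat sequence of such paths is one possible way to expose syntactic structure to a sequence model without feeding it the tree itself.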

Achieving High-Level Software Component Summarization via Hierarchical Chain-of-Thought Prompting and Static Code Analysis

Interactive Abstract Interpretation with Demanded Summarization

A review of automatic source code summarization

Source Code Summarization in the Era of Large Language Models

Improved Automatic Summarization of Subroutines via Attention to File Context

Icing on the Cake: Automatic Code Summarization at Ericsson

A Survey of Automatic Source Code Summarization

CodeSum: Translate Program Language to Natural Language

Automatic Code Summarization: A Systematic Literature Review

When Do Program-of-Thought Works for Reasoning?

Toward Human-Like Summaries Generated from Heterogeneous Software Artefacts

Context-aware Code Summary Generation

An Extractive-and-Abstractive Framework for Source Code Summarization

A Prompt Learning Framework for Source Code Summarization

Learning to Generate Structured Code Summaries From Hybrid Code Context

Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization)

PassSum: Leveraging Paths of Abstract Syntax Trees and Self‐supervision for Code Summarization

SUMMIT: Scaffolding OSS Issue Discussion Through Summarization

COMCAT: Leveraging Human Judgment to Improve Automatic Documentation and Summarization

Human-Like Code Quality Evaluation through LLM-based Recursive Semantic Comprehension

Improving Automatic Source Code Summarization Via Deep Reinforcement Learning