Abstract: Code summarization produces a high-level comment for a code snippet, typically describing the function and intent of the code. Recent years have seen the successful application of data-driven approaches to code summarization. To improve model performance, many approaches use abstract syntax trees (ASTs) to represent the structural information of code, which most researchers consider the main factor that distinguishes code from natural language. Such data-driven methods are then trained on large-scale labeled datasets to obtain a model that generalizes well to new examples. Nevertheless, we argue that state-of-the-art approaches suffer from two key weaknesses: (1) inefficient encoding of ASTs; (2) reliance on a large labeled corpus for model training. These drawbacks lead to (1) oversized models, slow training, information loss, and instability; (2) an inability to handle programming languages with only a small amount of labeled data. In light of these weaknesses, we propose PassSum, a code summarization approach that addresses them via (1) a novel input representation that contains an efficient AST encoding method; (2) three pretraining objectives, used to pretrain our model on a large amount of easy-to-obtain unlabeled data in a self-supervised manner. Experimental results on code summarization for Java, Python, and Ruby methods demonstrate that PassSum outperforms state-of-the-art methods. Further experiments demonstrate that the input representation we use offers time and space advantages in addition to better performance. Pretraining is also shown to make the model generalize better with less labeled data and to speed up convergence during training. Our contributions are as follows.
(1) We propose a novel input representation that combines an efficient AST encoding method carrying syntactic information with an additional input part carrying lexical information. (2) To improve the robustness of PassSum, we pretrain it with three pretraining objectives on a large amount of unlabeled data. (3) We empirically show that PassSum outperforms state-of-the-art code summarizers and demonstrate the time and space efficiency of its input representation.
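To make the idea of AST paths concrete, the following minimal sketch lists the root-to-leaf paths of a small Python function using the standard `ast` module. It is only an illustration of what "paths of an AST" can look like. The abstract does not specify PassSum's actual encoding, so the function name `root_to_leaf_paths` and the choice of node-type names as path elements are assumptions made here, not the authors' method.

```python
import ast
from typing import List


def root_to_leaf_paths(source: str) -> List[List[str]]:
    """Return the node-type names along every root-to-leaf path of the AST.

    Hypothetical helper for illustration; not PassSum's encoding.
    """
    tree = ast.parse(source)
    paths: List[List[str]] = []

    def walk(node: ast.AST, prefix: List[str]) -> None:
        current = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:
            # A node without child nodes is a leaf: record the full path.
            paths.append(current)
            return
        for child in children:
            walk(child, current)

    walk(tree, [])
    return paths


snippet = "def add(a, b):\n    return a + b\n"
paths = root_to_leaf_paths(snippet)
for path in paths:
    print(" > ".join(path))
```

Each printed line is one path from the `Module` root down to a leaf node. A path-based representation of this kind turns the tree into a flat collection of sequences, which is one plausible reason such encodings can be cheaper to process than the full tree structure.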

PassSum: Leveraging Paths of Abstract Syntax Trees and Self‐supervision for Code Summarization
