Abstract: Programming language understanding and representation (a.k.a. code representation learning) has long been an active and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of source code features while preserving their semantics. These representations can then be used to facilitate subsequent code-related tasks. The abstract syntax tree (AST), a fundamental code feature, captures the syntactic information of the source code and has been widely used in code representation learning. However, there is still a lack of systematic and quantitative evaluation of how well AST-based code representation facilitates subsequent code-related tasks. In this paper, we first conduct a comprehensive empirical study to explore the effectiveness of AST-based code representation in facilitating follow-up code-related tasks. To do so, we compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks. Surprisingly, the overall quantitative statistical results demonstrate that models trained with AST-based code representation consistently perform worse across all three tasks than models trained with Token-based code representation. Our further quantitative analysis reveals that, on certain subsets of samples in all three tasks, models trained with AST-based code representation outperform models trained with Token-based code representation. We also conduct comprehensive experiments to evaluate and reveal the impact of the choice of AST parsing/preprocessing/encoding methods on AST-based code representation and subsequent code-related tasks. Our study provides future researchers with detailed guidance on how to select solutions at each stage to fully exploit AST.

What problem does this paper attempt to address?

The paper addresses three main problems:

1. **Evaluating the effectiveness of abstract syntax trees (ASTs) in code representation learning**: Although ASTs are widely regarded as an important component of code representation learning, there has been no systematic, quantitative evaluation of whether AST-based code representations actually help subsequent code-related tasks. The paper therefore compares the performance of Token-based (i.e., code token sequence) and AST-based code representations on three popular code-related tasks.
2. **Revealing how the choices made at each AST processing stage affect code representation and subsequent tasks**: Using an AST involves three core, intertwined stages: AST parsing, AST preprocessing, and AST encoding. Each stage offers several alternative methods, but how these choices affect the final code representation and its performance on subsequent tasks has not been fully studied. The paper experimentally analyzes the impact of different AST parsing, preprocessing, and encoding methods on AST-based code representation and on its performance in code clone detection, code search, and code summarization.
3. **Providing guidance on how to use ASTs effectively**: Based on these findings, the paper aims to give future researchers detailed guidance on selecting an appropriate method at each processing stage, so that they can fully exploit ASTs, improve the quality of code representation, and improve performance on code-related tasks.

Through this series of experiments, the paper evaluates the overall performance of ASTs in code representation learning, examines the subsets of samples on which AST-based models outperform Token-based models, and measures the specific impact of different AST processing methods. The results clarify where AST-based representations currently stand in software engineering and how AST-based code representation learning might be further optimized.
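To make the Token-versus-AST distinction and the three AST stages concrete, here is a minimal sketch using only Python's standard `tokenize` and `ast` modules. It is an illustration under stated assumptions, not the paper's pipeline: the parsers, preprocessing steps, and neural encoders the study actually evaluates are not described in this summary. The function names (`token_sequence`, `parse_ast`, `preprocess`, `encode`), the choice to drop expression-context nodes as the "preprocessing" step, and the pre-order linearization standing in for an "encoder" are all hypothetical choices made for this example.

```python
import ast
import io
import tokenize
from typing import List

SOURCE = "def add(a, b):\n    return a + b\n"

# Layout-only token types that carry no lexical content of their own.
_SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
         tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT}


def token_sequence(source: str) -> List[str]:
    """Token-based view: the flat sequence of lexical tokens."""
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type not in _SKIP]


def parse_ast(source: str) -> ast.AST:
    """Stage 1 (parsing): turn source text into a syntax tree."""
    return ast.parse(source)


def preprocess(node: ast.AST) -> List:
    """Stage 2 (preprocessing, hypothetical): keep node type names only and
    drop Load/Store/Del context nodes. Returns [type_name, [children...]]."""
    children = [preprocess(child) for child in ast.iter_child_nodes(node)
                if not isinstance(child, ast.expr_context)]
    return [type(node).__name__, children]


def encode(tree: List) -> List[str]:
    """Stage 3 (encoding, stand-in): pre-order linearization of the tree.
    A real system would feed the tree to a neural encoder instead."""
    name, children = tree
    flat = [name]
    for child in children:
        flat.extend(encode(child))
    return flat


if __name__ == "__main__":
    print("Token:", token_sequence(SOURCE))
    print("AST:  ", encode(preprocess(parse_ast(SOURCE))))
```

The two outputs describe the same function differently. The token sequence keeps identifiers and punctuation in source order, while the linearized AST keeps syntactic categories such as `FunctionDef` and `BinOp` and discards surface text. Changing any one of the three stage functions changes what a downstream model sees, which is the kind of design choice the paper's second research question examines.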

Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?
