Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?

Weisong Sun, Chunrong Fang, Yun Miao, Yudu You, Mengzhe Yuan, Yuchen Chen, Quanjun Zhang, An Guo, Xiang Chen, Yang Liu, Zhenyu Chen
2023-12-01
Abstract: Programming language understanding and representation (a.k.a. code representation learning) has always been a hot and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of source code features while preserving their semantics. These representations can be used to facilitate subsequent code-related tasks. The abstract syntax tree (AST), a fundamental code feature, captures the syntactic information of the source code and has been widely used in code representation learning. However, there is still a lack of systematic and quantitative evaluation of how well AST-based code representation facilitates subsequent code-related tasks. In this paper, we first conduct a comprehensive empirical study to explore the effectiveness of AST-based code representation in facilitating follow-up code-related tasks. To do so, we compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks. Surprisingly, the overall quantitative statistical results demonstrate that models trained with AST-based code representation consistently perform worse across all three tasks compared to models trained with Token-based code representation. Our further quantitative analysis reveals that models trained with AST-based code representation outperform models trained with Token-based code representation in certain subsets of samples across all three tasks. We also conduct comprehensive experiments to evaluate and reveal the impact of the choice of AST parsing/preprocessing/encoding methods on AST-based code representation and subsequent code-related tasks. Our study provides future researchers with detailed guidance on how to select solutions at each stage to fully exploit AST.
Software Engineering, Artificial Intelligence, Computation and Language, Programming Languages
What problem does this paper attempt to address?
The paper addresses three main problems:

1. **Evaluating the effectiveness of abstract syntax trees (ASTs) in code representation learning**: ASTs are widely regarded as an important component of code representation learning, yet no systematic, quantitative evaluation has established whether AST-based code representations actually help subsequent code-related tasks. The paper therefore compares Token-based (code token sequence) and AST-based code representations on three popular code-related tasks.
2. **Revealing how choices at each AST processing stage affect code representation and subsequent tasks**: Using an AST involves three core, intertwined stages: AST parsing, AST preprocessing, and AST encoding. Each stage offers several methods to choose from, but how these choices affect the final code representation and its performance on subsequent tasks has not been fully studied. The paper experimentally analyzes the impact of different parsing, preprocessing, and encoding methods on AST-based code representation and on its performance in code clone detection, code search, and code summarization.
3. **Providing guidance on how to use ASTs effectively**: Based on these findings, the paper aims to give future researchers detailed guidance for selecting an appropriate method at each processing stage, so that they can fully exploit ASTs, improve the quality of code representations, and improve performance on code-related tasks.
Through a series of experiments, the paper evaluates the overall performance of AST-based code representation, examines the subsets of samples on which AST-based models outperform Token-based models, and measures the specific impact of different AST processing methods. The results clarify the current state of AST use in software engineering and indicate how AST-based code representation learning can be further optimized.
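To make the Token versus AST distinction concrete, the sketch below builds both kinds of model input for one tiny Python function, using only the standard library. It is an illustration, not the paper's pipeline: the parser (Python's built-in `ast` module), the pre-order linearization used as the preprocessing step, and the helper names `token_sequence` and `ast_preorder` are all assumptions chosen for brevity. The encoding stage (feeding either sequence to a neural model) is omitted.

```python
import ast
import io
import tokenize

# Hypothetical sample input, not taken from the paper's datasets.
SOURCE = "def add(a, b):\n    return a + b\n"

# Layout-only token types that carry no lexical content of their own.
_SKIP = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def token_sequence(source: str) -> list[str]:
    """Token-based input: the flat sequence of lexical tokens."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in _SKIP]


def ast_preorder(source: str) -> list[str]:
    """AST-based input: parse, then linearize node types in pre-order.

    Pre-order traversal is just one possible preprocessing choice,
    assumed here for illustration.
    """
    tree = ast.parse(source)  # stage 1: AST parsing
    out: list[str] = []

    def visit(node: ast.AST) -> None:  # stage 2: AST preprocessing
        out.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return out


if __name__ == "__main__":
    print(token_sequence(SOURCE))
    print(ast_preorder(SOURCE))
```

The token sequence keeps identifiers and operators exactly as written, while the traversal keeps only syntactic node types such as `FunctionDef` and `BinOp`. Swapping the parser or the traversal changes the AST-side input without touching the source, which is the kind of stage-level choice the study varies.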