Abstract: As researchers and practitioners apply Machine Learning to a growing number of software engineering problems, the approaches they use become more sophisticated. Many modern approaches utilize internal code structure in the form of an abstract syntax tree (AST) or its extensions: a path-based representation or a complex graph that combines the AST with additional edges. Even though ASTs can be extracted from code with different parsers, the impact of the choice of parser on the final model quality remains unstudied. Moreover, researchers often omit the exact details of how particular code representations are extracted. In this work, we evaluate two models, namely Code2Seq and TreeLSTM, on the method name prediction task, backed by eight different parsers for the Java language. To unify data preparation across parsers, we develop SuperParser, a multi-language parser-agnostic library based on PathMiner. SuperParser facilitates the end-to-end creation of datasets suitable for training and evaluation of ML models that work with structural information from source code. Our results demonstrate that trees built by different parsers vary in their structure and content. We then analyze how this diversity affects the models' quality and show that the quality gap between the most and least suitable parsers is significant for both models. Finally, we discuss other features of the parsers that researchers and practitioners should take into account, in addition to the impact on the models' quality, when selecting a parser. The code of SuperParser is publicly available at https://doi.org/10.5281/zenodo.6366591. We also publish Java-norm, the dataset we use to evaluate the models: https://doi.org/10.5281/zenodo.6366599.

Statistical Decision-Tree Models for Parsing

When Are Tree Structures Necessary for Deep Learning of Representations?

An Empirical Comparison of Probability Models for Dependency Grammar

Exploiting limited data for parsing

Evaluation of an Algorithmic‐Level Left‐Corner Parsing Account of Surprisal Effects

Evaluating the Impact of Source Code Parsers on ML4SE Models

Parsing Models for Identifying Multiword Expressions

A Fast Unified Model for Parsing and Sentence Understanding

A Survey of Semantic Parsing Techniques

Global Reasoning over Database Structures for Text-to-SQL Parsing

Monte Carlo Syntax Marginals for Exploring and Using Dependency Parses

Non-Fuchsian extension to the Painlevé test

Researches on Large Scale Corpus-Based Syntactic Pattern Matching

Exploiting Heterogeneous Treebanks for Parsing

Neural Probabilistic Model for Non-projective MST Parsing

Dynamic Syntax Mapping: A New Approach to Unsupervised Syntax Parsing

A Neural Probabilistic Structured-Prediction Model for Transition-Based Dependency Parsing

A maximum entropy approach to adaptive statistical language modelling

A Statistical Parsing Framework for Sentiment Classification

AS-Parser: Log Parsing Based on Adaptive Segmentation

A Generative Parser with a Discriminative Recognition Algorithm