PSIMiner: A Tool for Mining Rich Abstract Syntax Trees from Code

Egor Spirin, Egor Bogomolov, Vladimir Kovalenko, Timofey Bryksin
DOI: https://doi.org/10.48550/arXiv.2103.12778
2021-03-24
Abstract: The application of machine learning algorithms to source code has grown in the past years. Since these algorithms are quite sensitive to input data, it is not surprising that researchers experiment with input representations. Nowadays, a popular starting point to represent code is abstract syntax trees (ASTs). Abstract syntax trees have been used for a long time in various software engineering domains, and in particular in IDEs. The API of modern IDEs makes it possible to manipulate and traverse ASTs, resolve references between code elements, etc. Such algorithms can enrich ASTs with new data and therefore may be useful in ML-based code analysis. In this work, we present PSIMiner, a tool for processing PSI trees from the IntelliJ Platform. PSI trees contain code syntax trees as well as functions to work with them, and therefore can be used to enrich code representation using static analysis algorithms of modern IDEs. To showcase this idea, we use our tool to infer types of identifiers in Java ASTs and extend the code2seq model for the method name prediction problem.
Software Engineering, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to use the static analysis algorithms of modern integrated development environments (IDEs) to enrich the abstract syntax tree (AST) representation of code, and thereby improve the performance of machine learning (ML) models on software engineering tasks. Specifically, it introduces PSIMiner, a tool that extracts Program Structure Interface (PSI) trees from the IntelliJ Platform and uses them to enhance code representation. This lets researchers use the IDE's analysis capabilities in their ML pipelines without in-depth knowledge of complex source code processing mechanisms. To demonstrate PSIMiner, the authors use it to infer the types of identifiers in Java ASTs and extend the code2seq model for the method name prediction problem. The experimental results show that adding type information improves the model's performance on method name prediction, notably in F1 score. The authors also found a data leakage problem in the original dataset and cleaned the dataset, which further verified the effectiveness of the model.
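For readers unfamiliar with how F1 is typically computed for method name prediction, the following is a minimal sketch. It assumes the subtoken-level convention commonly used when evaluating code2seq-style models, where names are split into subtokens and compared as multisets. The summary above does not specify the exact evaluation procedure, so the helper names `split_subtokens` and `subtoken_f1` and the per-example scoring are illustrative assumptions, not the authors' implementation.

```python
import re
from collections import Counter


def split_subtokens(name: str) -> list[str]:
    """Split a camelCase or snake_case identifier into lowercase subtokens.

    Hypothetical helper for illustration only.
    """
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def subtoken_f1(predicted: str, actual: str) -> float:
    """Compute F1 over subtokens for a single predicted/actual name pair.

    Assumption: subtokens are compared as multisets, ignoring order.
    """
    pred = Counter(split_subtokens(predicted))
    gold = Counter(split_subtokens(actual))
    # Multiset intersection counts subtokens present in both names.
    true_positives = sum((pred & gold).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(pred.values())
    recall = true_positives / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # One extra predicted subtoken lowers precision but not recall.
    print(subtoken_f1("getFileName", "getName"))
```

This sketch scores one example at a time; dataset-level evaluations often aggregate true positives, false positives, and false negatives across all examples before computing F1, which can give a different number than averaging per-example scores.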