BEQAIN: An Effective and Efficient Identifier Normalization Approach With BERT and the Question Answering System
Jingxuan Zhang, Siyuan Liu, Lina Gong, Haoxiang Zhang, Zhiqiu Huang, He Jiang
DOI: https://doi.org/10.1109/tse.2022.3227559
IF: 7.4
2023-04-21
IEEE Transactions on Software Engineering
Abstract: Identifiers are one of the most important resources for expressing the semantics of source code, but they are usually composed of several common or domain-specific terms and abbreviations, which hinders developers from analyzing and comprehending source code. Hence, it is necessary to normalize identifiers, that is, to align the vocabulary found in identifiers with the natural language words found in other software artifacts. Although researchers have proposed several identifier normalization approaches in the literature, these approaches rely only on the lexical information in identifiers and related source code entities, and therefore lack a deep semantic understanding of identifiers. In this paper, we propose BEQAIN, an effective and efficient identifier normalization approach that splits identifiers into their composing words and expands the enclosed abbreviations. Specifically, BEQAIN employs a deep learning model, mainly composed of a Bidirectional Encoder Representations from Transformers (BERT) layer and a Conditional Random Fields (CRF) layer, to embed identifiers into low-level vectors and learn the identifier splitting patterns. The BERT-CRF network is combined with a pre-processing component and a post-processing component that resolve over-splitting and under-splitting and thereby improve identifier splitting performance. Furthermore, BEQAIN employs a Question Answering (Q&A) system to learn the abbreviation expansion mappings, and it leverages the current programming context to select the correct expansion when an abbreviation has multiple candidate expansions. Once BEQAIN is fully trained, it can be used to normalize identifiers. We conduct extensive experiments to validate the effectiveness and efficiency of BEQAIN on two publicly available datasets covering nine projects.
Experimental results show that BEQAIN achieves an overall average Accuracy of 80.20% and outperforms the existing state-of-the-art approach by 9.88% in normalizing identifiers. The pre-processing and post-processing components improve the Accuracy of BEQAIN in identifier splitting by 11.70%. Employing the programming context information improves the Accuracy of BEQAIN in abbreviation expansion by 11.15% on average. In addition, the average normalization time of BEQAIN is less than one second. Finally, we discuss some observations on the road ahead for identifier normalization to inspire other researchers.
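To make the task concrete, the two steps of identifier normalization described in the abstract (splitting, then context-aware abbreviation expansion) can be illustrated with a minimal rule-based sketch. This is not BEQAIN: a regular expression stands in for the BERT-CRF splitter, and a small hand-written dictionary stands in for the mappings the Q&A system learns. The names `EXPANSIONS`, `split_identifier`, `expand`, and `normalize` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical abbreviation dictionary (an assumption for this sketch);
# BEQAIN learns such mappings with a Q&A system instead.
EXPANSIONS = {
    "str": ["string", "stream"],
    "len": ["length"],
    "cnt": ["count"],
    "buf": ["buffer"],
}


def split_identifier(identifier):
    """Split on underscores, digits, and case changes.

    A simple rule-based stand-in for the learned BERT-CRF splitter.
    """
    parts = []
    for chunk in re.split(r"[_\d]+", identifier):
        # Runs of capitals (acronyms) or an optionally capitalized word.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return [p.lower() for p in parts]


def expand(terms, context_words=()):
    """Expand known abbreviations, preferring a candidate seen in the context."""
    context = {w.lower() for w in context_words}
    result = []
    for term in terms:
        candidates = EXPANSIONS.get(term, [term])
        # Pick the first candidate that occurs in the surrounding context;
        # fall back to the first listed candidate otherwise.
        best = next((c for c in candidates if c in context), candidates[0])
        result.append(best)
    return result


def normalize(identifier, context_words=()):
    """Split an identifier and expand its abbreviations."""
    return expand(split_identifier(identifier), context_words)
```

For example, `normalize("strLen", ["read", "the", "stream"])` resolves `str` to `stream` because that word occurs in the supplied context, whereas `normalize("strLen")` falls back to the first listed candidate, `string`. The sketch only shows why context matters when an abbreviation has several expansions; it makes no claim about how BEQAIN scores candidates.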
engineering, electrical & electronic; computer science, software engineering