Code semantic enrichment for deep code search
Zhongyang Deng, Ling Xu, Chao Liu, Luwen Huangfu, Meng Yan
DOI: https://doi.org/10.1016/j.jss.2023.111856
IF: 3.5
2023-10-01
Journal of Systems and Software
Abstract: Code search aims to retrieve, from a large-scale codebase, code snippets whose semantics match the developer's query intent. Code is a low-level implementation of a programming intent, whereas a query typically expresses that intent in clear, high-level terms, which makes it difficult for deep learning (DL)-based approaches to learn the semantic relationship between the two. Through a large-scale empirical analysis of more than 2.2 million pairs of Java code and descriptions, we found that the semantics of code and query can be aligned by enriching code with the descriptions of other code that has a similar implementation. Based on this finding, we propose a code semantic enrichment approach for deep code search, named SemEnr. Specifically, we first enrich the semantics of all code snippets in the training and testing data: for each snippet, we estimate its syntactic similarity to the code snippets in the training data and retrieve the most similar one. The semantics of a code snippet are then represented by its code tokens together with the description of the retrieved most similar code. During model training, we use an attention mechanism to embed pairs of enriched code and query into a shared high-dimensional vector space. To improve the quality of the learned representations, we integrate a multi-perspective co-attention mechanism that employs Convolutional Neural Networks (CNNs) to capture local correlations between code and query. Finally, we evaluate the effectiveness of our approach through experiments on two widely used Java datasets. The results show that SemEnr achieves an MRR of 0.698 and 0.631 on the two datasets, outperforming the best baseline, CAT (a state-of-the-art DL-based model), by 19.93% and 18.83%, respectively. In addition, we conducted a user study with 50 real-world queries to assess SemEnr's performance; the findings suggest that SemEnr outperforms the baseline models by returning more relevant code snippets.
computer science, theory & methods, software engineering
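
The enrichment step and the MRR metric mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how syntactic similarity is computed, so Jaccard overlap of token sets is used here purely as an assumed stand-in, and the names `jaccard`, `enrich`, and `mean_reciprocal_rank` are hypothetical. The attention and CNN components of the model are not covered.

```python
from typing import List, Optional, Sequence, Tuple


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Token-set overlap, an ASSUMED stand-in for the paper's syntactic similarity."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def enrich(
    code_tokens: Sequence[str],
    corpus: Sequence[Tuple[Sequence[str], Sequence[str]]],
    skip: Optional[int] = None,
) -> List[str]:
    """Append the description of the most similar corpus snippet to the code tokens.

    `corpus` holds (code_tokens, description_tokens) pairs from the training data.
    `skip` is the index of the snippet itself when it is part of the corpus,
    so that a training snippet is not matched with its own description.
    """
    best_description: Sequence[str] = []
    best_similarity = -1.0
    for index, (tokens, description) in enumerate(corpus):
        if index == skip:
            continue
        similarity = jaccard(code_tokens, tokens)
        if similarity > best_similarity:
            best_similarity, best_description = similarity, description
    return list(code_tokens) + list(best_description)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """MRR over queries; each entry is the 1-based rank of the first correct
    result, or None when no correct result was returned (contributes 0)."""
    if not ranks:
        return 0.0
    return sum(1.0 / rank for rank in ranks if rank) / len(ranks)
```

For example, with a corpus of `(["open", "file", "read"], ["read", "a", "file"])` and `(["sort", "list"], ["sort", "a", "list"])`, calling `enrich(["open", "file", "close"], corpus)` appends the first description, because that snippet shares more tokens with the input. Under this reading, the enriched token sequence, rather than the raw code tokens, would be paired with the query for embedding.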