Two-Stage Attention-Based Model for Code Search with Textual and Structural Features
Ling Xu, Huanhuan Yang, Chao Liu, Jianhang Shuai, Meng Yan, Yan Lei, Zhou Xu
DOI: https://doi.org/10.1109/saner50967.2021.00039
2021-01-01
Abstract: Searching and reusing existing code from a large-scale codebase can greatly improve developers’ programming efficiency. To support code reuse, early code search models leverage information retrieval (IR) techniques to index a large-scale code corpus and return relevant code according to a developer’s search query. However, IR-based models fail to capture the semantics of code and query. To tackle this issue, developers applied deep learning (DL) techniques to code search models. However, these models are either too complex to determine an effective method efficiently or learn the semantic correlation between code and query inadequately. To bridge the semantic gap between code and query effectively and efficiently, we propose a code search model, TabCS (Two-stage Attention-Based model for Code Search), in this study. TabCS extracts code and query information from the code textual features (i.e., method name, API sequence, and tokens), the code structural feature (i.e., abstract syntax tree), and the query feature (i.e., tokens). TabCS uses a two-stage attention network structure. The first stage leverages attention mechanisms to extract semantics from code and query, taking their semantic gap into account. The second stage leverages a co-attention mechanism to capture their semantic correlation and learn better code and query representations. We evaluate the performance of TabCS on two existing large-scale datasets with 485k and 542k code snippets, respectively. Experimental results show that TabCS achieves an MRR of 0.57 on Hu et al.’s dataset, outperforming three state-of-the-art models, CARLCS-CNN, DeepCS, and UNIF, by 18%, 70%, and 12%, respectively. Meanwhile, TabCS achieves an MRR of 0.54 on Husain et al.’s dataset, outperforming CARLCS-CNN, DeepCS, and UNIF by 32%, 76%, and 29%, respectively.
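The results above are reported as MRR (mean reciprocal rank). As a minimal sketch of how this metric is commonly computed, the Python snippet below averages the reciprocal of the rank at which the ground-truth code snippet appears for each query. The function name `mean_reciprocal_rank`, the use of `None` for a query whose ground truth is not retrieved, and the sample ranks are illustrative assumptions. The abstract does not state the evaluation protocol (for example, candidate pool size or rank cut-off), so this is not necessarily the authors' exact procedure.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Standard MRR over a set of queries.

    ranks[i] is the 1-based position of the ground-truth snippet in the
    result list for query i, or None if it was not retrieved at all
    (assumption: a miss contributes a reciprocal rank of 0).
    """
    if not ranks:
        return 0.0
    total = 0.0
    for rank in ranks:
        if rank is None:
            continue  # miss: adds 0 to the sum
        if rank < 1:
            raise ValueError("ranks are 1-based and must be >= 1")
        total += 1.0 / rank
    return total / len(ranks)


# Hypothetical ranks for four queries (not data from the paper):
# hits at positions 1, 2, and 4, plus one miss.
example_ranks = [1, 2, 4, None]
print(mean_reciprocal_rank(example_ranks))  # (1 + 0.5 + 0.25 + 0) / 4 = 0.4375
```

Under this definition, an MRR of 0.57 means the mean reciprocal rank over all test queries is 0.57; a higher value indicates that correct snippets tend to appear nearer the top of the result list.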