How to Better Utilize Code Graphs in Semantic Code Search?
Yucen Shi,Ying Yin,Zhengkui Wang,David Lo,Tao Zhang,Xin Xia,Yuhai Zhao,Bowen Xu
DOI: https://doi.org/10.1145/3540250.3549087
2022-01-01
Abstract:Semantic code search greatly facilitates software reuse, which enables users to find code snippets highly matching user-specified natural language queries. Due to the rich expressive power of code graphs (e.g., control-flow graph and program dependency graph), both of the two mainstream research works (i.e., multi-modal models and pre-trained models) have attempted to incorporate code graphs for code modelling. However, they still have some limitations: First, there is still much room for improvement in terms of search effectiveness. Second, they have not fully considered the unique features of code graphs. In this paper, we propose a Graph-to-Sequence Converter, namely G 2 SC . Through converting the code graphs into lossless sequences, G 2 SC enables to address the problem of small graph learning using sequence feature learning and capture both the edges and nodes attribute information of code graphs. Thus, the effectiveness of code search can be greatly improved. In particular, G 2 SC first converts the code graph into a unique corresponding node sequence by a specific graph traversal strategy. Then, it gets a statement sequence by replacing each node with its corresponding statement. A set of carefully designed graph traversal strategies guarantee that the process is one-to-one and reversible. G 2 SC enables capturing rich semantic relationships (i.e., control flow, data flow, node/relationship properties) and provides learning model-friendly data transformation. It can be flexibly integrated with existing models to better utilize the code graphs. As a proof-of-concept application, we present two G 2 SC enabled models: GSMM ( G 2 SC enabled multi-modal model) and GSCodeBERT ( G 2 SC enabled CodeBERT model). Extensive experiment results on two real large-scale datasets demonstrate that GSMM and GSCodeBERT can greatly improve the state-of-the-art models MMAN and GraphCodeBERT by 92% and 22% on R@1, and 63% and 11.5% on MRR, respectively.