Learning Graph-based Code Representations for Source-level Functional Similarity Detection.

Jiahao Liu, Jun Zeng, Xiang Wang, Zhenkai Liang
DOI: https://doi.org/10.1109/icse48619.2023.00040
2023-01-01
Abstract: Detecting code functional similarity forms the basis of various software engineering tasks. However, the detection is challenging as functionally similar code fragments can be implemented differently, e.g., with irrelevant syntax. Recent studies incorporate program dependencies as semantics to identify syntactically different yet semantically similar programs, but they often focus only on local neighborhoods (e.g., one-hop dependencies), limiting the expressiveness of program semantics in modeling functionalities. In this paper, we present TAILOR, which explicitly exploits deep graph-structured code features for functional similarity detection. Given source-level programs, TAILOR first represents them as code property graphs (CPGs), which combine abstract syntax trees, control flow graphs, and data flow graphs, to collectively reason about program syntax and semantics. Then, TAILOR learns representations of CPGs by applying a CPG-based neural network (CPGNN) that iteratively propagates information over them. It improves over prior work on code representation learning through a new graph neural network (GNN) tailored to CPG structures instead of the off-the-shelf GNNs used previously. We systematically evaluate TAILOR on C and Java programs using two public benchmarks. Experimental results show that TAILOR outperforms the state-of-the-art approaches, achieving 99.8% and 99.9% F-scores in code clone detection and 98.3% accuracy in source code classification.
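To make the contrast between one-hop and multi-hop propagation concrete, the following is a minimal sketch of iterative message passing over a toy graph with typed edges. It is not TAILOR's CPGNN: the three-node graph, the fixed per-edge-type weights, the mean aggregation, and all function names are assumptions chosen only for illustration.

```python
from collections import defaultdict
import math


def propagate(features, edges, rounds):
    """Run `rounds` steps of mean-aggregation message passing.

    features: dict mapping node -> list of floats
    edges: list of (src, dst, edge_type) with edge_type in {"ast", "cfg", "dfg"}
    """
    # Hypothetical fixed weights per edge type; a learned model would train these.
    weight = {"ast": 0.5, "cfg": 1.0, "dfg": 1.0}

    incoming = defaultdict(list)
    for src, dst, etype in edges:
        incoming[dst].append((src, etype))

    state = {node: list(vec) for node, vec in features.items()}
    for _ in range(rounds):
        new_state = {}
        for node, vec in state.items():
            agg = [0.0] * len(vec)
            for src, etype in incoming[node]:
                for i, x in enumerate(state[src]):
                    agg[i] += weight[etype] * x
            n_in = max(len(incoming[node]), 1)  # avoid division by zero
            # Residual update: keep the node's own state, add the averaged messages.
            new_state[node] = [v + a / n_in for v, a in zip(vec, agg)]
        state = new_state
    return state


def graph_embedding(state):
    """Mean-pool node vectors into a single graph-level vector."""
    vectors = list(state.values())
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy graph: a --dfg--> b --cfg--> c, with a signal only on node "a".
features = {"a": [1.0, 0.0], "b": [0.0, 0.0], "c": [0.0, 0.0]}
edges = [("a", "b", "dfg"), ("b", "c", "cfg")]

one_hop = propagate(features, edges, rounds=1)
two_hop = propagate(features, edges, rounds=2)
```

After one round, node `c` has received nothing from `a`, since the two are two hops apart; after two rounds, the signal from `a` reaches `c` through `b`. Pooled graph vectors from `graph_embedding` can then be compared with `cosine` as a simple stand-in for a similarity score.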