TreeCen: Building Tree Graph for Scalable Semantic Code Clone Detection
Yutao Hu, Deqing Zou, Junru Peng, Yueming Wu, Junjie Shan, Hai Jin
DOI: https://doi.org/10.1145/3551349.3556927
2022-01-01
Abstract: Code clone detection is an important research problem that has attracted wide attention in software engineering. Many methods have been proposed for detecting code clones. Among them, text-based and token-based approaches are scalable but do not consider code semantics, and therefore cannot detect semantic code clones. Methods based on intermediate representations of code can detect semantic clones. However, graph-based methods are impractical because they require code compilation, and existing tree-based approaches are limited in scalability by the size of the trees. In this paper, we propose TreeCen, a scalable tree-based code clone detector that detects semantic clones effectively. Given the source code of a method, we first extract its abstract syntax tree (AST) through static analysis and transform it into a simple graph representation (i.e., a tree graph) according to node type, rather than using traditional heavyweight tree matching. We then treat the tree graph as a social network and apply centrality analysis to each node to preserve the tree details. In this way, the original complex tree is converted into a 72-dimensional vector that still contains comprehensive structural information about the AST. Finally, these vectors are fed into a machine learning model to train a detector, which is then used to find code clones. We conduct comparative evaluations of effectiveness and scalability. The experimental results show that TreeCen performs at least as well as the best of six other state-of-the-art methods (i.e., SourcererCC, RtvNN, DeepSim, SCDetector, Deckard, and ASTNN), with F1 scores of 0.99 and 0.95 on the BigCloneBench and Google Code Jam datasets, respectively.
In terms of scalability, TreeCen is about 79 times faster than another state-of-the-art tree-based semantic code clone detector (ASTNN), about 13 times faster than the fastest graph-based approach (SCDetector), and about 22 times faster than the one-time trained token-based detector (RtvNN).
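
The pipeline described in the abstract (AST, then tree graph keyed by node type, then per-type centrality, then a fixed-length vector) can be illustrated with a minimal sketch. This is not TreeCen's implementation: it assumes Python's standard `ast` module in place of the authors' parser, a single hand-computed degree centrality rather than whichever centrality measures the paper uses, and a small hypothetical `NODE_TYPES` vocabulary instead of the 72 dimensions reported above. The function names `tree_graph` and `centrality_vector` are likewise illustrative.

```python
import ast
from collections import defaultdict

# Hypothetical fixed vocabulary of AST node types (an assumption for this sketch;
# the abstract reports a 72-dimensional vector, not these eight entries).
NODE_TYPES = ["FunctionDef", "Return", "BinOp", "Name", "If", "For", "Call", "Assign"]


def tree_graph(source):
    """Collapse an AST into a graph whose vertices are node *types*.

    Returns a dict mapping (parent_type, child_type) to the number of
    parent-child pairs of those types found in the AST.
    """
    edges = defaultdict(int)
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            edges[(type(parent).__name__, type(child).__name__)] += 1
    return edges


def centrality_vector(source):
    """Return one degree-centrality value per entry of NODE_TYPES."""
    edges = tree_graph(source)
    vertices = {t for edge in edges for t in edge}

    # Build an undirected neighbour set per type, ignoring self-loops.
    neighbours = defaultdict(set)
    for parent_type, child_type in edges:
        if parent_type != child_type:
            neighbours[parent_type].add(child_type)
            neighbours[child_type].add(parent_type)

    # Degree centrality: number of neighbours divided by (|V| - 1).
    denom = max(len(vertices) - 1, 1)
    return [len(neighbours[t]) / denom for t in NODE_TYPES]


if __name__ == "__main__":
    a = "def f(x):\n    return x + 1\n"
    b = "def g(y):\n    return y + 2\n"
    # Renamed identifiers and changed constants leave the type-level graph unchanged.
    print(centrality_vector(a) == centrality_vector(b))
```

Because the graph is keyed by node type rather than by individual node, its size is bounded by the vocabulary no matter how large the method is, which is the property the abstract relies on for scalability. In a full detector, the vectors of two methods would be combined and passed to a trained classifier; that step is omitted here.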