Goner: Building Tree-Based N-Gram-Like Model for Semantic Code Clone Detection
Yueming Wu, Siyue Feng, Wenqi Suo, Deqing Zou, Hai Jin
DOI: https://doi.org/10.1109/tr.2023.3312294
IF: 5.883
2023-01-01
IEEE Transactions on Reliability
Abstract: Code clone detection refers to the detection of code fragments that are functionally similar. As software engineering progresses, the significance of code clone detection continues to grow. A number of code clone detection techniques have been designed. Among these methods, tree-based code clone detection approaches can discover semantic code clones. However, given the intricate nature of tree structures, they spend considerable time on tree analysis and thus cannot scale to large-scale code scanning. In this paper, we propose a novel tree-based scalable semantic code clone detection method by transforming the heavyweight tree processing into efficient N-gram-like subtree analysis. Specifically, we build a variant of the N-gram model to partition the original complex tree into small subtrees. After collecting all subtrees, we divide them into different groups according to the positions of the subtree nodes, and then calculate the similarity between the two functions group by group. The similarity scores of all groups make up a feature vector. Given these feature vectors, we train a machine learning model for semantic code clone detection. We implement Goner and conduct evaluations on two widely used datasets, namely BigCloneBench and Google Code Jam. The experimental results indicate that Goner outperforms our comparative systems (i.e., SourcererCC, RtvNN, Deckard, ASTNN, TBCNN, CDLH, Amain, FCCA, DeepSim, and SCDetector). Additionally, in terms of scalability, Goner is approximately 56 times faster than ASTNN, another advanced tree-based tool, at identifying semantic code clones.
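The pipeline described in the abstract (partition the tree into small subtrees, group them by position, compare group by group, collect the scores into a feature vector) can be illustrated with a minimal sketch. The abstract does not specify how subtrees are formed, how positions define groups, or which similarity measure is used, so everything below is an assumption made for illustration: Python's standard `ast` module stands in for the parser, a "subtree" is a parent-to-child chain of `n` node types, groups are buckets of tree depth, and similarity is multiset Jaccard. The function names `ngram_paths`, `grouped`, `jaccard`, and `feature_vector` are hypothetical and are not taken from Goner. The final step, training a machine learning model on the vectors, is omitted.

```python
import ast
from collections import Counter, defaultdict


def ngram_paths(tree, n=3):
    """Yield (depth, chain) for every parent-to-child chain of n node types."""
    def walk(node, depth, trail):
        # Keep only the last n node-type names on the path from the root.
        trail = (trail + (type(node).__name__,))[-n:]
        if len(trail) == n:
            # Report the depth of the first node in the chain.
            yield depth - n + 1, trail
        for child in ast.iter_child_nodes(node):
            yield from walk(child, depth + 1, trail)
    yield from walk(tree, 0, ())


def grouped(source, n=3, groups=4):
    """Bucket the chains by depth (assumed grouping); deep chains share the last bucket."""
    buckets = defaultdict(Counter)
    for depth, chain in ngram_paths(ast.parse(source), n):
        buckets[min(depth, groups - 1)][chain] += 1
    return buckets


def jaccard(a, b):
    """Multiset Jaccard similarity; two empty groups are treated as identical."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


def feature_vector(src_a, src_b, n=3, groups=4):
    """One similarity score per group, in group order."""
    ga, gb = grouped(src_a, n, groups), grouped(src_b, n, groups)
    return [jaccard(ga[g], gb[g]) for g in range(groups)]


if __name__ == "__main__":
    a = "def f(x):\n    return x + 1"
    b = "def g(y):\n    return y + 1"
    c = "def h(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t"
    print(feature_vector(a, b))  # identifiers differ, node types do not
    print(feature_vector(a, c))  # structurally different functions
```

Because the chains record only node types, renaming identifiers does not change the vector in this sketch. Whether Goner abstracts identifiers in the same way is not stated in the abstract.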