Improving Large-Gap Clone Detection Recall Using Multiple Features

Peng Dai, Qianjin Zhang, Yawen Wang, Dahai Jin, Yunzhan Gong
DOI: https://doi.org/10.1142/S0218194022500413
IF: 1.007
2022-01-01
International Journal of Software Engineering and Knowledge Engineering
Abstract: A code clone is a set of two or more identical or similar source code fragments. Research on code clone detection has lasted for decades. Investigation and evaluation of existing clone detection techniques indicate that they perform well at function-level clone detection, but that there is still room for further research in block-level clone detection. In particular, type-3 clones that include large gaps remain an ongoing challenge. To address these problems, we propose a clone detection method based on multiple code features. It aims to improve the recall of code block clone detection and to handle large-gap, hard-to-detect type-3 clones. The method first splits the source code files into code blocks based on the program's structural features and context features. The collection of code blocks obtained in this way is complete, and the large gaps in clone pairs are removed. In addition, we only need to compute the similarity between code blocks with the same structural features, which significantly saves time and resources. The similarity is obtained by calculating the proportion of identical tokens between two code blocks. Moreover, since different types of tokens carry different weights in the similarity calculation, we use supervised learning to obtain a classifier model relating token features to code clones. We divide the tokens into 13 types and train the machine learning model on manually confirmed clone and non-clone pairs. Finally, we develop a prototype system and compare our tool with existing tools under the Mutation Framework and on several real C projects. The experimental results demonstrate the advantages and practicality of our prototype.
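The two core steps described in the abstract, comparing only blocks that share structural features and scoring a pair by the proportion of identical tokens, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regex tokenizer, the string `structural_key`, the function names, and the choice of the larger block's token count as the denominator are all assumptions made for illustration. The learned per-type token weights and the 13 token types are omitted.

```python
from collections import Counter, defaultdict
from itertools import combinations
import re

# Assumption: a crude lexer (identifiers, integers, single punctuation marks).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(code):
    """Split a code block into a flat list of tokens."""
    return TOKEN_RE.findall(code)


def token_overlap(block_a, block_b):
    """Proportion of identical tokens between two blocks.

    Assumption: shared tokens are counted as a multiset intersection and
    divided by the token count of the larger block.
    """
    count_a, count_b = Counter(tokenize(block_a)), Counter(tokenize(block_b))
    shared = sum((count_a & count_b).values())
    larger = max(sum(count_a.values()), sum(count_b.values()))
    return shared / larger if larger else 0.0


def candidate_pairs(blocks):
    """Yield (key, a, b, similarity) only for blocks with the same key.

    `blocks` is a list of (structural_key, code) tuples; the key stands in
    for whatever structural features the splitter assigns (hypothetical).
    """
    groups = defaultdict(list)
    for key, code in blocks:
        groups[key].append(code)
    for key, codes in groups.items():
        for a, b in combinations(codes, 2):
            yield key, a, b, token_overlap(a, b)
```

Grouping by key before pairing is what keeps the number of comparisons down: blocks with different structural features are never compared at all.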