Code Clone Detection Method for Large-Scale Source Code

Ying GUO, Fenghong CHEN, Minghui ZHOU
DOI: https://doi.org/10.3778/j.issn.1673-9418.1311018
2014-01-01
Abstract: Detecting code clones helps identify plagiarism and copyright infringement, compact code, detect errors, and find usage patterns, among other benefits. Existing clone detection tools usually rely on complicated algorithms or require large amounts of computing resources, so they cannot be applied to large-scale code data. To enable code clone detection on massive data, this paper proposes a new code clone detection algorithm. The algorithm combines the idea of content-defined chunking (CDC) from data de-duplication with that of the Simhash algorithm used to find duplicate web pages: it first chunks the code and then applies fuzzy matching to the chunks. The algorithm is implemented on a data source containing more than 500 million files, totaling 10 TB, from a variety of open source projects. This paper compares the influence of different chunk lengths on detection rate and detection time. The experimental results show that the new algorithm can be applied not only to detect clones in large-scale code, but also to detect some Type 3 clones, with high detection precision.