IBFET: Index‐based Features Extraction Technique for Scalable Code Clone Detection at File Level Granularity

Junaid Akram,Majid Mumtaz,Ping Luo
DOI: https://doi.org/10.1002/spe.2759
2019-01-01
Software Practice and Experience
Abstract:SummaryMany techniques have been developed over the years to detect code clones in different software systems to maintain security measures. These techniques often require the source code to compare the subject system against a very large data set of big code. This paper presents index‐based features extraction technique (IBFET) to detect code clones at a very large‐scale level to billions of LOC at file level granularity. We performed preprocessing, indexing, and clone detection for more than 324 billion of LOC using a Hadoop distributed environment, which is quite faster and more efficient as compared to existing distributed indexing and clone detection techniques; meanwhile, it detects all three types of clones efficiently. The MapReduce rule of divide and conquer is used for a count and retrieve the similar features between different systems. We evaluated the execution time, scalability, precision, and recall of IBFET by using a well‐known clone detection data set IJaDataset and BigCloneBench; furthermore, we compared the results with other state‐of‐the‐art tools. Our approach is faster, flexible, scalable, and provides accurate results with high authenticity and can be implemented at a large‐scale level.
What problem does this paper attempt to address?