Oreo: Scaling Clone Detection Beyond Near-Miss Clones

Vaibhav Saini, Farima Farmahinifarahani, Hitesh Sajnani, Cristina Lopes
DOI: https://doi.org/10.1007/978-981-16-1927-4_5
2021-01-01
Abstract: Recent advances in code clone detection have made it possible to scale to large datasets. Scalable and accurate clone detection, however, has been limited to Type-1, Type-2, and near-miss Type-3 clones. Most clone detectors fail to detect clones beyond the near-miss Type-3 category because such clones are hard to detect in a scalable manner. There are two main challenges in identifying clones beyond the Type-3 category: (1) syntactic similarity between such complex clones is low, and (2) comparing code snippets pairwise requires a quadratic number of comparisons, which causes candidate explosion and leads to scalability issues. Oreo introduces a novel semantic filter, named the Action filter, which filters out a large number of code pairs that do not share semantic similarities, thereby addressing the candidate explosion issue. Moreover, the candidates that pass this filter have high semantic similarity, which leads to the detection of complex and semantically similar clones. Because many semantically similar candidates may not be clones, Oreo uses a deep learning model to validate the structural similarity between the semantically similar candidates, which leads to greater accuracy in clone detection. Oreo demonstrated a broader range of clone detection, high recall, high precision, speed, and the ability to scale to a large inter-project repository (250 MLOC) using a standard workstation. This chapter describes the design decisions and concepts that enabled Oreo to take scalable and accurate clone detection beyond near-miss clones.
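The filtering idea in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that a method's "actions" are approximated by the names of the methods it calls, and that two methods pass the filter when the multiset overlap of those names reaches a threshold. The function names, the sample data, and the 0.5 threshold are hypothetical and are not taken from the abstract.

```python
from collections import Counter
from itertools import combinations


def overlap_similarity(a: Counter, b: Counter) -> float:
    """Shared action tokens (multiset intersection) over the larger bag size."""
    shared = sum((a & b).values())
    larger = max(sum(a.values()), sum(b.values()))
    return shared / larger if larger else 0.0


def action_filter(methods: dict[str, list[str]], threshold: float = 0.5):
    """Return method pairs whose action-token overlap meets the threshold.

    `methods` maps a method name to the list of action tokens it uses
    (assumed here to be the names of the methods it calls).
    """
    bags = {name: Counter(tokens) for name, tokens in methods.items()}
    return [
        (m1, m2)
        for m1, m2 in combinations(sorted(bags), 2)
        if overlap_similarity(bags[m1], bags[m2]) >= threshold
    ]


# Hypothetical input: two file-copy routines and one unrelated method.
methods = {
    "copy_a": ["open", "read", "write", "close"],
    "copy_b": ["open", "read", "write", "flush", "close"],
    "sum_list": ["size", "get", "add"],
}
candidates = action_filter(methods)
```

Only the pairs in `candidates` would be handed to a more expensive validation step, such as the deep learning model the abstract mentions. This sketch still enumerates every pair, so it shows the filtering criterion only, not how the quadratic comparison cost is avoided at scale.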