Code Clone Detection: A Literature Review
Qiu-Yuan CHEN, Shan-Ping LI, Meng YAN, Xin XIA
DOI: https://doi.org/10.13328/j.cnki.jos.005711
2019-01-01
Journal of Software
Abstract: A code clone is a set of two or more duplicate or similar code fragments in a software system. Code cloning is common in software development; it can speed up development and can benefit a software system. However, research shows that code clones also harm the development and maintenance of software systems, for example by reducing stability, adding redundancy to the source code repository, and propagating software defects. Code clone research is one of the most active areas in software engineering, and many techniques have been proposed to detect clones automatically and thereby help improve software quality. These techniques fall into four categories: text-based, lexical-based, syntax-based, and semantic-based. Current techniques are effective at detecting text-based clones but still face challenges in detecting the other types. More advanced and unified theoretical and technical guidance is needed to improve code clone detection. This paper therefore presents a literature review of code clone detection, with a focus on source code representation. Its contributions are: (1) we summarize and classify current code clone detection techniques from the perspective of code representation; (2) we summarize the model validation methods and performance measures used in model evaluation; and (3) we summarize the key issues of code clone research from three aspects: scientific, practical, and technical difficulties. We also discuss possible solutions to these problems and future directions of the research, focusing on data annotation, characterization methods, model construction, and engineering practice.
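The difference between the text-based and lexical-based categories named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `text_fingerprint` and `token_fingerprint`, the placeholder tokens `ID` and `LIT`, and the two sample fragments are all assumptions made for illustration. A text-based comparison only normalizes whitespace, while a lexical comparison abstracts identifiers and literals and so can also match fragments that differ only in naming.

```python
import io
import keyword
import tokenize


def text_fingerprint(src: str) -> str:
    """Text-based view (illustrative): strip whitespace and blank lines only."""
    return "\n".join(line.strip() for line in src.splitlines() if line.strip())


def token_fingerprint(src: str) -> tuple:
    """Lexical view (illustrative): abstract identifiers and literals."""
    layout = (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
              tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER)
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type in layout:
            continue  # ignore comments and layout tokens
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")   # hypothetical placeholder for any identifier
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")  # hypothetical placeholder for any literal
        else:
            out.append(tok.string)  # keywords and operators are kept as-is
    return tuple(out)


# Two hypothetical fragments that differ only in identifier names.
fragment_a = "total = 0\nfor x in items:\n    total += x\n"
fragment_b = "acc = 0\nfor v in values:\n    acc += v\n"

same_text = text_fingerprint(fragment_a) == text_fingerprint(fragment_b)
same_tokens = token_fingerprint(fragment_a) == token_fingerprint(fragment_b)
```

Here `same_text` is false and `same_tokens` is true: the renamed fragment escapes the textual comparison but is caught once identifiers are abstracted. Syntax-based and semantic-based techniques go further and are not covered by this sketch.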