Abstract: A code clone refers to two or more duplicate or similar code fragments existing in a software system. Code cloning is a common phenomenon in software development that can facilitate development and have positive impacts on a software system. However, research shows that code clones can also harm the development and maintenance of software systems, including but not limited to reduced stability, redundancy in the source code repository, and propagation of software defects. Code clone detection is one of the most active research areas in software engineering, and various detection techniques have been proposed to automatically detect code clones in software systems, which helps improve software quality. There have been many achievements in this area, and these techniques can be categorized as text-based, lexical-based, syntax-based, and semantic-based. Current techniques are effective at text-based clone detection but still face challenges in detecting other types of code clones. More advanced and unified theoretical and technical guidelines are needed to improve code clone detection techniques. Therefore, in this paper, we present a literature review of code clone detection, especially from the perspective of source code representation. In summary, the contributions of this paper are: (1) we summarize and classify current code clone detection techniques from the perspective of code representation; (2) we summarize the model validation methods and performance measures used in model evaluation; and (3) we summarize the key issues of code clone research from three aspects: scientific, practical, and technical difficulties. We elaborate on possible solutions to these problems and on the future development of the research, focusing on data annotation, characterization methods, model construction, and engineering practice.

SimClone: Detecting Tabular Data Clones using Value Similarity

Code Clone Detection: A Literature Review

Assessing and Improving an Evaluation Dataset for Detecting Semantic Code Clones Via Deep Learning

Detecting Differences Across Multiple Instances of Code Clones

A Machine Learning Based Framework for Code Clone Validation

Code Similarity in Clone Detection

Learning to Detect Table Clones in Spreadsheets

DroidCC: A Scalable Clone Detection Approach for Android Applications to Detect Similarity at Source Code Level

Go-clone: Graph-Embedding Based Clone Detector for Golang

An ensemble learning approach for software semantic clone detection

Assessing and Improving Dataset and Evaluation Methodology in Deep Learning for Code Clone Detection

A Survey on the Evaluation of Clone Detection Performance and Benchmarking

Code Clone Detection Method for Large-Scale Source Code

Gitor: Scalable Code Clone Detection by Building Global Sample Graph

GPTCloneBench: A comprehensive benchmark of semantic clones and cross-language clones using GPT-3 model and SemanticCloneBench

Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree

SCDetector

Clone Detection on Large Scala Codebases

EClone: detect semantic clones in Ethereum via symbolic transaction sketch

Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection