Abstract: Software clone detection identifies similar or identical code snippets. It has been an active research topic that has attracted extensive attention over the last two decades. In recent years, machine learning (ML) based detectors, especially deep learning-based ones, have demonstrated impressive capability in clone detection. It seems that this longstanding problem has already been tamed owing to advances in ML techniques. In this work, we challenge the robustness of recent ML-based clone detectors through semantic-preserving code transformations. We first utilize fifteen simple code transformation operators combined with commonly used heuristics (i.e., Random Search, Genetic Algorithm, and Markov Chain Monte Carlo) to perform equivalent program transformation. Furthermore, we propose a deep reinforcement learning-based sequence generation (DRLSG) strategy to effectively guide the search for generated clones that escape detection. We then evaluate the ML-based detectors on pairs of original and generated clones. We realize our method in a framework named CloneGen (short for Clone Generator). In the evaluation, we use CloneGen to challenge three state-of-the-art ML-based detectors and four traditional detectors with code clones obtained through semantic-preserving transformations. Surprisingly, our experiments show that, despite the notable successes achieved by existing clone detectors, the ML models inside these detectors still fail to recognize numerous clones produced by the code transformations in CloneGen. In addition, adversarial training of ML-based clone detectors using clones generated by CloneGen can improve their robustness and accuracy. Meanwhile, compared with the commonly used heuristics, the DRLSG strategy is the most effective at generating code clones that decrease the detection accuracy of the ML-based detectors.
Our investigation reveals an explicable but consistently ignored robustness issue in the latest ML-based detectors. Therefore, we call for more attention to the robustness of these new ML-based detectors.
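
The idea of pairing semantic-preserving operators with a search heuristic can be illustrated with a minimal sketch. It is not CloneGen itself: it uses Python snippets as a stand-in target language, three illustrative operators rather than the fifteen mentioned above, and a hypothetical token-overlap score (`token_jaccard`) in place of a real ML-based detector. All function and variable names are assumptions made for this example, and the operators only handle a snippet that defines a single simple function.

```python
import ast
import random
import re

# Toy input program (illustrative only; requires Python 3.9+ for ast.unparse).
SOURCE = """
def total(values):
    result = 0
    for item in values:
        result = result + item
    return result
"""

class _Renamer(ast.NodeTransformer):
    """Rewrites identifiers according to a fixed old -> new mapping."""
    def __init__(self, mapping):
        self.mapping = mapping
    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node
    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

def rename_locals(source, rng):
    """Operator 1: rename parameters and assigned local variables."""
    tree = ast.parse(source)
    func = tree.body[0]
    names = {a.arg for a in func.args.args}
    names |= {n.id for n in ast.walk(func)
              if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    tag = rng.randrange(1000)
    # The index i keeps the fresh names distinct from one another.
    mapping = {old: f"v{tag}_{i}" for i, old in enumerate(sorted(names))}
    return ast.unparse(_Renamer(mapping).visit(tree))

def insert_dead_branch(source, rng):
    """Operator 2: insert a branch that can never execute."""
    tree = ast.parse(source)
    body = tree.body[0].body
    dead = ast.parse("if False:\n    pass").body[0]
    body.insert(rng.randrange(len(body) + 1), dead)
    return ast.unparse(tree)

def insert_unused_variable(source, rng):
    """Operator 3: assign a constant to a variable nothing reads."""
    tree = ast.parse(source)
    stmt = ast.parse(f"unused_{rng.randrange(10**6)} = 0").body[0]
    tree.body[0].body.insert(0, stmt)
    return ast.unparse(tree)

OPERATORS = [rename_locals, insert_dead_branch, insert_unused_variable]

def token_jaccard(a, b):
    """Hypothetical stand-in detector: Jaccard overlap of token sets."""
    ta = set(re.findall(r"\w+|[^\w\s]", a))
    tb = set(re.findall(r"\w+|[^\w\s]", b))
    return len(ta & tb) / len(ta | tb)

def random_search(source, detector, budget=50, max_len=4, seed=0):
    """Random Search: keep the variant the detector scores as least similar."""
    rng = random.Random(seed)
    best, best_score = source, detector(source, source)
    for _ in range(budget):
        candidate = source
        for _ in range(rng.randint(1, max_len)):
            candidate = rng.choice(OPERATORS)(candidate, rng)
        score = detector(source, candidate)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score

def load_function(source):
    """Execute a snippet and return the single function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace[ast.parse(source).body[0].name]

if __name__ == "__main__":
    variant, score = random_search(SOURCE, token_jaccard)
    print(variant)
    print("similarity reported by the toy detector:", score)
```

Running both the original and the returned variant through `load_function` on a few inputs is a cheap sanity check that the transformations preserved behavior. A Genetic Algorithm, MCMC, or a learned policy such as DRLSG would replace only the `random_search` loop, choosing operator sequences more deliberately.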

Can Neural Clone Detection Generalize to Unseen Functionalities?

Code Clone Detection: A Literature Review

Neural Detection of Semantic Code Clones Via Tree-Based Convolution

Challenging Machine Learning-based Clone Detectors via Semantic-preserving Code Transformations

Focus: Function clone identification on cross-platform

Functional Code Clone Detection with Syntax and Semantics Fusion Learning

Assessing and Improving Dataset and Evaluation Methodology in Deep Learning for Code Clone Detection

An ensemble learning approach for software semantic clone detection

GRRLN: Gated Recurrent Residual Learning Networks for code clone detection

Are our clone detectors good enough? An empirical study of code effects by obfuscation

Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey

Go-clone: Graph-Embedding Based Clone Detector for Golang

Assessing and Improving an Evaluation Dataset for Detecting Semantic Code Clones Via Deep Learning

Pathways to Leverage Transcompiler based Data Augmentation for Cross-Language Clone Detection

A Machine Learning Based Framework for Code Clone Validation

On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations

Low-Complexity Code Clone Detection Using Graph-based Neural Networks

Learn to Align - A Code Alignment Network for Code Clone Detection.

ClonalNet: Classifying Better by Focusing on Confusing Categories

Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree

Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection