Abstract: Code clone detection refers to the discovery of identical or similar code fragments in a code repository. AST-based, PDG-based, and DL-based tools achieve good results on near-miss clones (i.e., clones with small differences or gaps) by using syntactic and semantic information, but their high time complexity makes them difficult to apply to large code repositories. Traditional token-based tools detect clones rapidly by building a low-cost index (i.e., low-frequency or k-line tokens) over sequential source code, but most of them detect near-miss clones poorly because they lack semantic information. In this study, we propose CCSTOKENER, a fast yet accurate code clone detection tool based on the semantic token. The idea behind the semantic token is to enhance the detection capability of token-based tools by complementing the traditional token with semantic information, such as the structural information around the token and its dependencies on other tokens in the form of n-grams. Specifically, we extract the types of the relevant nodes on the AST path of every token and transform these types into a fixed-dimensional vector, then model the token's semantic information by applying n-grams to its related tokens. In addition, our tool adopts and improves the location-filtration-verification process also used in CCALIGNER and LVMAPPER, during which we build a low-cost k-tokens index to quickly locate candidate code blocks and speed up detection. Our experiments show that CCSTOKENER detects near-miss clone pairs accurately: it exhibits the best recall on Moderately Type-3 clones and detects more true positive clones on four Java open-source projects. Moreover, CCSTOKENER attains the best generalization and transferability compared with two DL-based tools (i.e., ASTNN and TBCCD). (c) 2023 Elsevier Inc. All rights reserved.
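
The structural half of the semantic token (node types on a token's AST path, turned into a fixed-dimensional vector) can be illustrated with a minimal sketch. It is not the CCSTOKENER implementation. It assumes Python's built-in `ast` module as a stand-in for a Java parser, treats only identifier occurrences as tokens, and uses a hypothetical node-type vocabulary `NODE_TYPES`; the abstract does not specify the tool's actual vocabulary, and the n-gram step is not shown.

```python
import ast
from collections import Counter

# Hypothetical fixed vocabulary of AST node types (an assumption for this
# sketch; the tool's real vocabulary is not given in the abstract).
NODE_TYPES = [
    "FunctionDef", "For", "While", "If", "Assign",
    "AugAssign", "Return", "Call", "BinOp", "Compare",
]


def semantic_tokens(source):
    """Return (identifier, vector) pairs for every identifier occurrence.

    Each vector counts how often each type in NODE_TYPES appears among
    the ancestors of the identifier in the AST, so its length is fixed
    regardless of how deep the identifier is nested.
    """
    tree = ast.parse(source)
    result = []

    def visit(node, path):
        # `path` holds the type names of all ancestors of `node`.
        if isinstance(node, ast.Name):
            counts = Counter(path)
            result.append((node.id, [counts[t] for t in NODE_TYPES]))
        for child in ast.iter_child_nodes(node):
            visit(child, path + [type(node).__name__])

    visit(tree, [])
    return result


SAMPLE = (
    "def f(xs):\n"
    "    total = 0\n"
    "    for x in xs:\n"
    "        total += x\n"
    "    return total\n"
)

if __name__ == "__main__":
    for name, vector in semantic_tokens(SAMPLE):
        print(name, vector)
```

In this sketch the same identifier gets different vectors depending on where it occurs: `total` inside the loop body carries counts for `For` and `AugAssign`, while `total` in the initial assignment carries a count for `Assign` only. That positional information is what a plain token sequence lacks.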

A Novel Detection Approach for Statement Clones

Code Clone Detection: A Literature Review

SCDetector

Research on code clone analysis approach

Survey on Software Clone Detection Research

Detecting Differences Across Multiple Instances of Code Clones

A novel code representation for detecting Java code clones using high-level and abstract compiled code representations

Clone flaw detection method based on clone code detection

Lsiccds: Large Scale Incremental Code Clone Detection System

A Novel Code Stylometry-based Code Clone Detection Strategy

Detection of clone sequences and classes using AST

CloneAyz: An Approach for Clone Representation and Analysis

Clone Detection on Large Scala Codebases

CCStokener: Fast Yet Accurate Code Clone Detection with Semantic Token

Survey of research on code clone technique

Fine-Grained Code Clone Detection with Block-Based Splitting of Abstract Syntax Tree

CMCD: Count Matrix Based Code Clone Detection

A Large-Gap Clone Detection Approach Using Sequence Alignment Via Dynamic Parameter Optimization

Code Clone Detection Method for Large-Scale Source Code

DroidCC: A Scalable Clone Detection Approach for Android Applications to Detect Similarity at Source Code Level

DCCD: an Efficient and Scalable Distributed Code Clone Detection Technique for Big Code