Abstract: Source code similarity detection has various applications in code plagiarism detection and software intellectual property protection. In computer programming teaching, students may convert source code written in one programming language into another language for their assignment submissions. Existing similarity measures for source code written in the same language are not applicable to cross-language code similarity detection because of the syntactic differences among programming languages. Meanwhile, existing cross-language source code similarity detection approaches are susceptible to complex code obfuscation techniques, such as replacing control structures with equivalent ones and adding redundant statements. To solve this problem, we propose a cross-language code similarity detection (CLCSD) approach based on code flowcharts. In general, two source code fragments written in different programming languages are transformed into standardized code flowcharts (SCFCs), and their similarity is obtained by comparing the corresponding SCFCs. More specifically, we first introduce the SCFC model as a uniform flowchart representation of source code written in different languages. The SCFC is language-independent and can therefore be used as the intermediate structure for source code similarity detection. Transformation techniques are given to convert source code written in a specific programming language into an SCFC. Second, we propose the SCFC-SPGK algorithm, based on the shortest path graph kernel, to measure the similarity between two SCFCs. Thus, the similarity between two pieces of source code in different programming languages is given by the similarity between their SCFCs. Experimental results show that, compared with existing approaches, CLCSD has higher accuracy in cross-language source code similarity detection. Furthermore, CLCSD can not only handle common source code obfuscation techniques used by students in computer programming teaching but also obtain nearly 90% accuracy in dealing with some complex obfuscation techniques.
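The idea of comparing two flowcharts with a shortest path graph kernel can be illustrated with a minimal sketch. This is not the SCFC-SPGK algorithm itself, whose details the abstract does not give; it is a generic shortest path kernel over node-labeled directed graphs. The graph encoding (a label dictionary plus an edge list), the node labels such as "decision" and "process", and the function names are all assumptions made for illustration.

```python
from collections import Counter, deque
from math import sqrt


def shortest_path_features(labels, edges):
    """Count (source label, target label, distance) triples over all
    ordered node pairs connected by a directed path (hypothetical encoding)."""
    adj = {node: [] for node in labels}
    for u, v in edges:
        adj[u].append(v)
    feats = Counter()
    for src in labels:
        # Breadth-first search gives shortest path lengths on unweighted edges.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for dst, d in dist.items():
            if dst != src:
                feats[(labels[src], labels[dst], d)] += 1
    return feats


def kernel(fa, fb):
    """Dot product of two feature count vectors."""
    return sum(count * fb[key] for key, count in fa.items())


def similarity(graph_a, graph_b):
    """Kernel value normalized to [0, 1] (cosine normalization)."""
    fa = shortest_path_features(*graph_a)
    fb = shortest_path_features(*graph_b)
    denom = sqrt(kernel(fa, fa) * kernel(fb, fb))
    return kernel(fa, fb) / denom if denom else 0.0


# Two flowcharts of the same loop, as they might look after transformation
# from two different languages (node identifiers differ, structure does not).
loop_a = (
    {0: "start", 1: "process", 2: "decision", 3: "process", 4: "end"},
    [(0, 1), (1, 2), (2, 3), (3, 2), (2, 4)],
)
loop_b = (
    {"s": "start", "i": "process", "c": "decision", "b": "process", "e": "end"},
    [("s", "i"), ("i", "c"), ("c", "b"), ("b", "c"), ("c", "e")],
)
# The same loop with one redundant statement inserted before the loop test.
loop_padded = (
    {0: "start", 1: "process", 5: "process", 2: "decision", 3: "process", 4: "end"},
    [(0, 1), (1, 5), (5, 2), (2, 3), (3, 2), (2, 4)],
)

same = similarity(loop_a, loop_b)
padded = similarity(loop_a, loop_padded)
```

In this sketch, structurally identical flowcharts score 1 regardless of node identifiers, while the padded variant scores strictly between 0 and 1 because the extra node shifts some path lengths. How the proposed approach weights or normalizes such differences is not stated in the abstract.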
