Abstract: Source code similarity detection has applications in code plagiarism detection and software intellectual property protection. In programming courses, students may translate source code written in one programming language into another language and submit the result for their code assignments. Existing similarity measures for source code written in a single language do not apply to cross-language detection because of the syntactic differences among programming languages. Meanwhile, existing cross-language similarity detection approaches are susceptible to complex code obfuscation techniques, such as replacing control structures with equivalent ones and adding redundant statements. To solve this problem, we propose a cross-language code similarity detection (CLCSD) approach based on code flowcharts. In general, two source code fragments written in different programming languages are transformed into standardized code flowcharts (SCFCs), and their similarity is obtained by comparing the corresponding SCFCs. More specifically, we first introduce the SCFC model as a uniform flowchart representation of source code written in different languages. Because the SCFC is language-independent, it can serve as the intermediate structure for source code similarity detection. We also give transformation techniques that convert source code written in a specific programming language into an SCFC. Second, we propose the SCFC-SPGK algorithm, based on the shortest path graph kernel, to measure the similarity between two SCFCs. Thus, the similarity between two pieces of source code in different programming languages is given by the similarity between their SCFCs. Experimental results show that, compared with existing approaches, CLCSD has higher accuracy in cross-language source code similarity detection.
Furthermore, CLCSD not only handles the common source code obfuscation techniques used by students in programming courses but also achieves nearly 90% accuracy against some complex obfuscation techniques.
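
The shortest path graph kernel named in the abstract can be illustrated with a minimal sketch. This is not the SCFC-SPGK algorithm itself, whose details the abstract does not give. It is a generic, normalized shortest path kernel over small labeled directed graphs. The node labels (`start`, `process`, `decision`, `end`), the graph encoding as a label dictionary plus an edge list, and all function names are assumptions made for illustration.

```python
from collections import Counter, deque
from math import sqrt


def shortest_path_features(labels, edges):
    """Count (source label, path length, target label) triples over all
    reachable ordered node pairs of a directed, unweighted graph."""
    adjacency = {node: [] for node in labels}
    for src, dst in edges:
        adjacency[src].append(dst)

    features = Counter()
    for start in labels:
        # Breadth-first search yields shortest path lengths from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for end, length in dist.items():
            if end != start:
                features[(labels[start], length, labels[end])] += 1
    return features


def kernel(f1, f2):
    """Delta kernel: count pairs of shortest paths that agree on both
    endpoint labels and on path length."""
    return sum(count * f2[key] for key, count in f1.items())


def similarity(g1, g2):
    """Kernel value normalized to the range [0, 1]."""
    f1 = shortest_path_features(*g1)
    f2 = shortest_path_features(*g2)
    denom = sqrt(kernel(f1, f1) * kernel(f2, f2))
    return kernel(f1, f2) / denom if denom else 0.0


# Hypothetical flowchart of a loop: start -> process -> decision, with the
# decision either entering the loop body (and returning) or reaching the end.
loop_a = (
    {0: "start", 1: "process", 2: "decision", 3: "process", 4: "end"},
    [(0, 1), (1, 2), (2, 3), (3, 2), (2, 4)],
)
# Same shape with different node ids, as if produced from another language.
loop_b = (
    {"s": "start", "i": "process", "c": "decision", "b": "process", "e": "end"},
    [("s", "i"), ("i", "c"), ("c", "b"), ("b", "c"), ("c", "e")],
)
# Straight-line program with no loop.
straight = ({0: "start", 1: "process", 2: "end"}, [(0, 1), (1, 2)])

print(similarity(loop_a, loop_b))
print(similarity(loop_a, straight))
```

In this sketch, the two loop flowcharts score as identical because the features depend only on node labels and path lengths, not on node identifiers, while the straight-line program shares only a small part of its shortest paths with the loop and scores much lower.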

Similar Code Retrieval Based on the Clustering of Structural Features

Detection of Semantically Similar Code

STVsm: Similar Structural Code Detection Based on AST and VSM

Code Clone Detection: A Literature Review

Structural Function Based Code Clone Detection Using a New Hybrid Technique

A New Method for Code Similarity Detection

FSD-CLCD: Functional Semantic Distillation Graph Learning for Cross-Language Code Clone Detection

Flowchart-Based Cross-Language Source Code Similarity Detection

Detect Functionally Equivalent Code Fragments Via K-Nearest Neighbor Algorithm

Program similarity detection approach based on static lexical tree

Code Similarity Detection by Program Dependence Graph

A Code Similarity Detection Tool and Its Case Study

Fine-Grained Code Clone Detection with Block-Based Splitting of Abstract Syntax Tree

Code Clone Restructuring of C Programs Via K-nearest Neighbor Algorithm

Hierarchical Attention Graph Embedding Networks for Binary Code Similarity against Compilation Diversity

Code2Img: Tree-Based Image Transformation for Scalable Code Clone Detection

SCDetector

Java Code Clone Detection by Exploiting Semantic and Syntax Information from Intermediate Code-Based Graph

Code Similarity in Clone Detection

TreeCen: Building Tree Graph for Scalable Semantic Code Clone Detection

Detection and Elimination of Similar Web Pages Based on Text Structure and String of Feature Code