Abstract: Source code similarity detection has various applications in code plagiarism detection and software intellectual property protection. In programming courses, students may convert source code written in one programming language into another language for their assignment submissions. Existing similarity measures for source code written in the same language are not applicable to cross-language code similarity detection because of syntactic differences among programming languages. Meanwhile, existing cross-language source code similarity detection approaches are susceptible to complex code obfuscation techniques, such as replacing control structures with equivalent ones and adding redundant statements. To solve this problem, we propose a cross-language code similarity detection (CLCSD) approach based on code flowcharts. In general, two source code fragments written in different programming languages are transformed into standardized code flowcharts (SCFCs), and their similarity is obtained by measuring the similarity of their corresponding SCFCs. More specifically, we first introduce the SCFC model as a uniform flowchart representation of source code written in different languages. An SCFC is language-independent and can therefore be used as the intermediate structure for source code similarity detection. We also give transformation techniques that convert source code written in a specific programming language into an SCFC. Second, we propose the SCFC-SPGK algorithm, based on the shortest path graph kernel, to measure the similarity between two SCFCs. Thus, the similarity between two pieces of source code in different programming languages is given by the similarity between their SCFCs. Experimental results show that, compared with existing approaches, CLCSD has higher accuracy in cross-language source code similarity detection. Furthermore, CLCSD can not only handle common source code obfuscation techniques used by students in programming courses but also achieve nearly 90% accuracy in dealing with some complex obfuscation techniques.
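
To make the idea of comparing flowcharts with a shortest path graph kernel more concrete, the following is a minimal sketch of a generic shortest-path kernel over small labeled directed graphs. It is not the SCFC-SPGK algorithm itself, whose details are not given in the abstract. The node labels ("start", "process", "decision", "end"), the graph encoding, and the cosine-style normalization are all assumptions made for illustration.

```python
from collections import Counter, deque
from math import sqrt


def shortest_path_features(labels, edges):
    """Count (source label, target label, distance) triples over all
    reachable ordered node pairs of a directed, unweighted graph."""
    adj = {node: [] for node in labels}
    for u, v in edges:
        adj[u].append(v)
    feats = Counter()
    for src in labels:
        # Breadth-first search gives shortest path lengths from src.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for dst, d in dist.items():
            if dst != src:
                feats[(labels[src], labels[dst], d)] += 1
    return feats


def sp_kernel(f1, f2):
    """Unnormalized kernel: number of matching shortest-path pairs."""
    return sum(count * f2[key] for key, count in f1.items())


def sp_similarity(g1, g2):
    """Kernel value normalized to [0, 1] (assumed normalization)."""
    f1 = shortest_path_features(*g1)
    f2 = shortest_path_features(*g2)
    denom = sqrt(sp_kernel(f1, f1) * sp_kernel(f2, f2))
    return sp_kernel(f1, f2) / denom if denom else 0.0


# Hypothetical flowchart of a simple loop, as it might be derived from
# a program in one language: init, test, body, exit.
loop_a = (
    {0: "start", 1: "process", 2: "decision", 3: "process", 4: "end"},
    [(0, 1), (1, 2), (2, 3), (3, 2), (2, 4)],
)
# The same structure with different node identifiers, standing in for
# the same loop written in another language.
loop_b = (
    {"s": "start", "i": "process", "c": "decision", "b": "process", "e": "end"},
    [("s", "i"), ("i", "c"), ("c", "b"), ("b", "c"), ("c", "e")],
)
# Variant with one redundant statement inserted into the loop body.
loop_c = (
    {0: "start", 1: "process", 2: "decision", 3: "process", 4: "end", 5: "process"},
    [(0, 1), (1, 2), (2, 3), (3, 5), (5, 2), (2, 4)],
)

same = sp_similarity(loop_a, loop_b)
obfuscated = sp_similarity(loop_a, loop_c)
print(f"identical structure: {same:.3f}")
print(f"with redundant statement: {obfuscated:.3f}")
```

In this sketch, structurally identical flowcharts score 1 regardless of node naming, while the variant with an added redundant statement scores lower but remains clearly related. That is the behavior one would hope for from a language-independent, obfuscation-tolerant measure.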

An Empirical Study to Evaluate Structural Similarity for Source Code Translation

Finding Plagiarism Based on Common Semantic Sequence Model

Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors

SemMT: A Semantic-based Testing Approach for Machine Translation Systems

Code Comments: A Way of Identifying Similarities in the Source Code

Flowchart-Based Cross-Language Source Code Similarity Detection

A Proposed Model for Source Code Reuse Detection in Computer Programs

Code Plagiarism Detection Method Based on Code Similarity and Student Behavior Characteristics

Evaluating Code Summarization with Improved Correlation with Human Assessment

Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs

Source-code Similarity Detection and Detection Tools Used in Academia

Quality Estimation & Interpretability for Code Translation

Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection

Use of Source Code Similarity Metrics in Software Defect Prediction

Academic Source Code Plagiarism Detection by Measuring Program Behavioural Similarity

SSMT: A Machine Translation Evaluation View to Paragraph-to-Sentence Semantic Similarity

A Code Similarity Detection Tool and Its Case Study

Code Search based on Context-aware Code Translation

Does BLEU Score Work for Code Migration?

hmCodeTrans: Human-Machine Interactive Code Translation