Abstract: The diversity of programming languages is growing, making the language extensibility of code clone detectors crucial. However, extending most existing clone detectors is challenging because the source code handler must be modified, which requires specialist-level knowledge of the target language and is time-consuming. Multilingual code clone detectors make it easier to add support for a new language because they require only syntax information about the target language. To address the shortcomings of existing multilingual detectors in language scalability and detection performance, we propose a multilingual code block extraction method based on ANTLR parser generation and implement a multilingual code clone detector (MSCCD), which supports the largest number of languages currently available and can detect Type-3 code clones. We follow the methodology of previous studies to evaluate detection performance on Java. Compared to ten state-of-the-art detectors, MSCCD performs at an average level while supporting a significantly larger number of languages. Furthermore, we propose the first multilingual syntactic code clone evaluation benchmark, based on the CodeNet database. Our results reveal that even when the same detection approach is applied, performance can vary markedly depending on the language of the source code under investigation. Overall, MSCCD is the most balanced of the evaluated tools when considering both detection performance and language extensibility.

What problem does this paper attempt to address?

The main problem this paper addresses is how to develop and evaluate a multilingual code clone detection technique that overcomes the deficiencies of existing code clone detection tools in language extensibility and detection performance. Specifically:

1. **Language Extensibility Problem**:
   - Existing code clone detection tools usually support only a limited number of programming languages, and it is difficult to quickly add support for new languages or new versions of existing languages.
   - Adding support for a new programming language requires modifying the tool's source code handler, which demands expert-level knowledge and is time-consuming.
2. **Detection Performance Problem**:
   - Existing multilingual tools have limited ability to detect Type-3 code clones (i.e., code fragments that are syntactically similar but differ at the statement level).
   - Code clone detection performance varies significantly across languages, and there is no general method for predicting how a given technique will perform on a different language.

To solve these problems, the authors proposed and implemented the Multilingual Syntactic Code Clone Detector (MSCCD), whose main features are as follows:

- **Based on ANTLR Parser Generation**: MSCCD is built on the ANTLR parser generator. Support for a new language can be added by providing the ANTLR grammar definition file of the target language, without modifying the code of the tool itself.
- **Supports Multiple Languages**: MSCCD supports the largest number of programming languages currently available and can detect Type-3 code clones.
- **Performance Evaluation**: The authors constructed two multilingual code clone detection evaluation benchmarks, used respectively to evaluate recall and precision for four languages: Java, Python, C, and C++.
- **Balance**: Compared with other multilingual code clone detection tools, MSCCD achieves the best balance between detection performance and language extensibility.

Through these improvements, MSCCD makes code clone detection easier to extend to new languages while keeping detection performance at the level of existing tools, and provides a reference point and direction for future research.
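To make the notion of a Type-3 clone concrete, the following is a minimal sketch of token-bag overlap similarity, a technique commonly used by token-based clone detectors. It is not taken from the paper and should not be read as MSCCD's actual algorithm. The regex tokenizer (a stand-in for a lexer generated from an ANTLR grammar), the function names, and the 0.7 threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a crude regex tokenizer standing in for a real, grammar-driven lexer.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def token_bag(code: str) -> Counter:
    """Return the multiset (bag) of tokens found in a code fragment."""
    return Counter(TOKEN_RE.findall(code))


def overlap_similarity(a: str, b: str) -> float:
    """Shared tokens divided by the size of the larger token bag."""
    bag_a, bag_b = token_bag(a), token_bag(b)
    shared = sum((bag_a & bag_b).values())  # multiset intersection
    larger = max(sum(bag_a.values()), sum(bag_b.values()))
    return shared / larger if larger else 0.0


def is_clone_candidate(a: str, b: str, threshold: float = 0.7) -> bool:
    """Hypothetical decision rule: report a pair whose similarity meets the threshold."""
    return overlap_similarity(a, b) >= threshold


# A fragment, a copy with one added statement (a Type-3 style edit),
# and an unrelated fragment.
original = "int sum = 0; for (int i = 0; i < n; i++) { sum += a[i]; }"
modified = "int sum = 0; for (int i = 0; i < n; i++) { sum += a[i]; } log(sum);"
unrelated = "return name.trim().toLowerCase();"

if __name__ == "__main__":
    print(is_clone_candidate(original, modified))
    print(is_clone_candidate(original, unrelated))
```

Because the comparison works on bags of tokens rather than exact sequences, a pair that differs by an inserted or deleted statement can still score above the threshold, which is the property a Type-3 detector needs. A real detector would compare code blocks extracted from the parse tree rather than raw strings.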

Development and Benchmarking of Multilingual Code Clone Detector
