Development and Benchmarking of Multilingual Code Clone Detector

Wenqing Zhu, Norihiro Yoshida, Toshihiro Kamiya, Eunjong Choi, Hiroaki Takada
2024-09-17
Abstract: The diversity of programming languages is growing, making the language extensibility of code clone detectors crucial. However, this is challenging for most existing clone detectors because the source code handler needs modifications, which requires specialist-level knowledge of the targeted language and is time-consuming. Multilingual code clone detectors make it easier to add support for a new language because they require only the syntax information of the target language. To address the shortcomings of existing multilingual detectors in language scalability and detection performance, we propose a multilingual code block extraction method based on ANTLR parser generation and implement a multilingual code clone detector (MSCCD), which supports the largest number of languages among currently available detectors and can detect Type-3 code clones. We follow the methodology of previous studies to evaluate detection performance on the Java language. Compared to ten state-of-the-art detectors, MSCCD performs at an average level while supporting a significantly larger number of languages. Furthermore, we propose the first multilingual syntactic code clone evaluation benchmark based on the CodeNet database. Our results reveal that even when applying the same detection approach, performance can vary markedly depending on the language of the source code under investigation. Overall, MSCCD is the most balanced of the evaluated tools when considering both detection performance and language extensibility.
Software Engineering
What problem does this paper attempt to address?
The main problem this paper addresses is how to develop and evaluate a multilingual code clone detection technique that overcomes the deficiencies of existing code clone detection tools in language extensibility and detection performance. Specifically:

1. **Language Extensibility Problem**:
   - Existing code clone detection tools usually support only a limited number of programming languages, and it is difficult to quickly add support for new languages or for new versions of existing languages.
   - Adding support for a new programming language requires modifying the source code handler, which demands expert-level knowledge of that language and is time-consuming.
2. **Detection Performance Problem**:
   - Existing multilingual tools perform poorly in detecting Type-3 code clones (i.e., code fragments that are syntactically similar but differ at the statement level).
   - Code clone detection performance varies significantly across languages, and there is no general method for predicting how a given technique will perform on a different language.

To solve these problems, the authors propose and implement the Multilingual Syntactic Code Clone Detector (MSCCD), whose main features are as follows:

- **Based on ANTLR Parser Generation**: MSCCD is built on the ANTLR parser generator. Support for a new language can be added by providing the ANTLR grammar definition file of the target language, without modifying the code of the tool itself.
- **Supports Multiple Languages**: MSCCD supports the largest number of programming languages among currently available detectors and can detect Type-3 code clones.
- **Performance Evaluation**: The authors constructed two multilingual code clone detection evaluation benchmarks, which are used to evaluate recall and precision, respectively, for four languages: Java, Python, C, and C++.
- **Balance**: Compared with the other evaluated code clone detection tools, MSCCD achieves the best balance between detection performance and language extensibility.

With these contributions, MSCCD improves the language extensibility of code clone detection while keeping detection performance on Java at an average level among the compared detectors, and its benchmark provides a reference point for future research.
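
To make the notion of a Type-3 clone concrete, the following is a minimal sketch of a token-bag overlap check, a technique commonly used by token-based clone detectors. It is not taken from MSCCD: the regular-expression tokenizer stands in for a grammar-generated lexer, and the function names (`token_bag`, `overlap_similarity`, `is_clone_candidate`) and the `0.7` threshold are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for a grammar-generated lexer: identifiers,
# integer literals, and any other single non-whitespace character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def token_bag(code: str) -> Counter:
    """Turn a code fragment into a multiset (bag) of lexical tokens."""
    return Counter(TOKEN_RE.findall(code))


def overlap_similarity(a: str, b: str) -> float:
    """Number of shared tokens divided by the size of the larger bag."""
    bag_a, bag_b = token_bag(a), token_bag(b)
    shared = sum((bag_a & bag_b).values())  # multiset intersection
    larger = max(sum(bag_a.values()), sum(bag_b.values()))
    return shared / larger if larger else 0.0


def is_clone_candidate(a: str, b: str, threshold: float = 0.7) -> bool:
    """Flag a pair whose similarity reaches the (assumed) threshold."""
    return overlap_similarity(a, b) >= threshold


# A Type-3 pair: the second fragment has one statement inserted.
original = "int sum(int a, int b) { int r = a + b; return r; }"
modified = "int sum(int a, int b) { int r = a + b; log(r); return r; }"
unrelated = "while (x) { y(); }"

print(is_clone_candidate(original, modified))   # inserted statement tolerated
print(is_clone_candidate(original, unrelated))  # little token overlap
```

Because the comparison works on bags of tokens rather than exact sequences, an added or removed statement lowers the score only in proportion to its size, which is why this style of similarity measure can tolerate statement-level differences.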