Abstract: Cross-lingual plagiarism (CLP) occurs when texts written in one language are translated into a different language and used without acknowledging the original sources. One of the most common methods for detecting CLP requires online machine translators (such as Google Translate or Microsoft Translator), which are not always available. Because plagiarism detection typically involves comparing large documents, the number of translations required would overwhelm an online machine translator, especially when detecting plagiarism over the web. In addition, when words in the translated text are replaced with their synonyms, using online machine translators to detect CLP results in poor performance. This paper addresses the problem of cross-lingual plagiarism detection (CLPD) by proposing a model that uses simulated word embeddings to reproduce the predictions of an online machine translator (Google Translate) when detecting CLP. The simulated embeddings comprise translated words in different languages mapped into a common space and replicated to increase the prediction probability of retrieving the translations of a word (and their synonyms) from the model. Unlike most existing models, the proposed model does not require parallel corpora and accommodates multiple languages (it is multilingual). We demonstrate the effectiveness of the proposed model in detecting CLP on standard datasets that contain CLP cases, and evaluate its performance against a state-of-the-art baseline that relies on an online machine translator (the T+MA model). Evaluation results show that the proposed model is not only effective in detecting CLP but also outperforms the baseline. The results indicate that CLP can be detected with state-of-the-art performance by combining the prediction accuracy of an internet translator with word embeddings, without relying on internet translators at detection time.
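The idea of comparing texts through a common embedding space can be illustrated with a minimal sketch. This is not the authors' model: the vectors in `EMBEDDINGS` are hand-written toy values, and the names `cosine` and `sentence_similarity` are hypothetical, chosen only for illustration. The sketch assumes that a word, its translations, and its synonyms lie close together in one shared space, so a suspicious sentence can be scored against a source sentence in another language without calling a translator.

```python
import math

# Toy, hand-written vectors (an assumption for illustration only).
# Words with the same meaning in English and Spanish are placed close together.
EMBEDDINGS = {
    "house": (0.90, 0.10, 0.00),
    "home": (0.85, 0.15, 0.05),   # synonym of "house"
    "casa": (0.88, 0.12, 0.02),   # Spanish translation of "house"
    "dog": (0.00, 0.90, 0.10),
    "perro": (0.05, 0.88, 0.12),  # Spanish translation of "dog"
    "red": (0.10, 0.00, 0.90),
    "rojo": (0.12, 0.03, 0.88),   # Spanish translation of "red"
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(source_tokens, suspect_tokens):
    """Average best-match cosine similarity of suspect tokens against source tokens.

    Tokens missing from EMBEDDINGS are skipped; returns 0.0 if nothing can be scored.
    """
    source_vectors = [EMBEDDINGS[t] for t in source_tokens if t in EMBEDDINGS]
    scores = []
    for token in suspect_tokens:
        if token not in EMBEDDINGS:
            continue
        best = max((cosine(EMBEDDINGS[token], s) for s in source_vectors), default=0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # A Spanish source compared with an English text that also uses a synonym.
    print(sentence_similarity(["casa", "rojo"], ["red", "home"]))
    # An unrelated pair scores much lower.
    print(sentence_similarity(["perro"], ["red"]))
```

In this toy setting the synonym "home" still matches "casa" because both sit near "house" in the shared space, which is the property the abstract attributes to the simulated embeddings. A real system would learn the vectors rather than write them by hand.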

Finding Plagiarism Based on Common Semantic Sequence Model

Semantic Sequence Kin: A Method of Document Copy Detection

Semantic Measure of Plagiarism Using a Hierarchical Graph Model

Automatic Detection of Plagiarism in Writing

Analyzing Non-Textual Content Elements to Detect Academic Plagiarism

Methods for Detecting Paraphrase Plagiarism

Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations

Identifying Machine-Paraphrased Plagiarism

Plagiarism Detection using ROUGE and WordNet

An effective text plagiarism detection system based on feature selection and SVM techniques

Features Based Text Similarity Detection

Beyond Black Box AI-Generated Plagiarism Detection: From Sentence to Document Level

An Intelligent Approach for Semantic Plagiarism Detection in Scientific Papers

Text Similarity from Image Contents using Statistical and Semantic Analysis Techniques

Taxonomy of Mathematical Plagiarism

Plagiarism Detection Using Machine Learning

Detecting Cross-Lingual Plagiarism Using Simulated Word Embeddings

Music Plagiarism Detection via Bipartite Graph Matching

Plagiarism Judgment Based on Language Model and Feature Classification

Methods for identifying versioned and plagiarized documents

A Proposed Model for Source Code Reuse Detection in Computer Programs