Abstract: If we are to use electronic plagiarism detectors on student work, it would be useful to know how much similarity should be expected in independently written documents on a similar topic. If our measure is coarse, the answer should be zero, but a finer-grained analysis (such as would be needed to detect inadequate paraphrasing) is likely to detect some background noise. How much background noise should there be? We would like to determine this, but it is hard to publish research based on analysis of student work, because we cannot know whether any particular pair of students worked completely independently, and in any case the results might attract unwelcome publicity. To estimate an appropriate level of this background noise, we analysed submissions to an international conference using the Ferret plagiarism detector developed by Lyon et al. (2001). Ferret provides very fast and fine-grained similarity detection in moderately large collections of documents. This was an exercise in intra-corpus (collusion) detection rather than comparison to Web sources, a purpose for which the Ferret algorithm is well suited. There were 483 files; scanning the files took about 50 seconds, and calculating the similarity statistics took about 10 seconds. There were 116403 file pairs. Of these pairs, only 116 (0.1%) had more than 99 common triples, and of these only 19 pairs (0.016% of the total) had over 200 matching triples (200 is about 10% of the typical size of the smaller files). There should be no plagiarism here, as these are published conference papers, but in fact the top few are all pairs of papers with common authors who have re-used text. A simple MS Word file compare on one of the top-ranking pairs is sufficient to make the similarities obvious (though Word by no means highlights all of them).
Nevertheless, as expected, the vast majority of document pairs showed very low similarity. As noted, there was a surprisingly large degree of similarity in just a few cases, and we accordingly investigated these pairs more carefully. The worst case was that of an author who had submitted two papers. Each paper reported the results of a single experiment, but the background material for both experiments was very much the same, and he had simply reproduced the same text in both papers. We also present the other cases where similarity was high, and consider the implications for routine scanning of student work.
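
As a rough illustration of the counts reported above, the following Python sketch counts the word triples shared by each pair of documents in a small collection and checks the pair arithmetic for 483 files. It is not the Ferret implementation: the tokenisation rule, the function names (`word_triples`, `common_triple_counts`, `pair_count`) and the three toy documents are assumptions made purely for illustration.

```python
from itertools import combinations
import re


def word_triples(text):
    """Return the set of distinct overlapping word triples in text.

    Assumption: words are lower-cased runs of letters, digits and
    apostrophes; the tokenisation Ferret actually uses may differ.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def common_triple_counts(docs):
    """Map each unordered pair of document names to its shared-triple count."""
    triples = {name: word_triples(text) for name, text in docs.items()}
    return {
        (a, b): len(triples[a] & triples[b])
        for a, b in combinations(sorted(triples), 2)
    }


def pair_count(n):
    """Number of unordered pairs among n files: n(n-1)/2."""
    return n * (n - 1) // 2


# Hypothetical toy documents, not taken from the conference corpus.
docs = {
    "a.txt": "the cat sat on the mat by the door",
    "b.txt": "a dog sat on the mat all day",
    "c.txt": "completely unrelated words appear here",
}
counts = common_triple_counts(docs)

# Pairs sharing at least one triple, most similar first.
flagged = sorted(
    ((pair, n) for pair, n in counts.items() if n > 0),
    key=lambda item: item[1],
    reverse=True,
)
```

With 483 files, `pair_count(483)` gives the 116403 pairs quoted in the abstract. In a real collection the cut-off for `flagged` would be a threshold such as the 99 or 200 common triples used above rather than zero.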
