Abstract: If we are to use electronic plagiarism detectors on student work, it would be interesting to know how much similarity should be expected between independently written documents on a similar topic. If our measure is coarse, the answer should be zero, but a finer-grained analysis (such as would be needed to detect inadequate paraphrasing) is likely to detect some background noise. How much background noise should there be? We would like to determine this, but it is hard to publish research based on analysis of student work, because we cannot know whether any particular pair of students worked completely independently, and in any case the results might attract unwelcome publicity. To estimate an appropriate level of this background noise, we analysed submissions to an international conference using the Ferret plagiarism detector developed by Lyon et al. (2001). Ferret provides very fast, fine-grained similarity detection in moderately large collections of documents. This was an exercise in intra-corpus (collusion) detection rather than comparison with Web sources, a purpose for which the Ferret algorithm is well suited. There were 483 files; scanning the files took about 50 seconds, and calculating the similarity statistics took about 10 seconds. There were 116403 file pairs. Of these pairs, only 116 (0.1%) had more than 99 common triples, and of these only 19 pairs (0.016% of the total) had over 200 matching triples (200 is about 10% of the typical size of the smaller files). There should be no plagiarism here, as these are published conference papers, but in fact the top few are all pairs of papers with common authors who have re-used text. A simple MS Word file comparison of one of the top-ranking pairs is sufficient to make the similarities obvious (though Word by no means highlights all of them).
Nevertheless, as expected, most document pairs showed very low similarity. In just a few cases, however, the degree of similarity was surprisingly large, and we investigated these pairs more carefully. The worst case was an author who had submitted two papers. Each paper reported the results of a single experiment, but the background material for both experiments was very much the same, and he had simply reproduced the same text in both papers. We also present the other cases where similarity was high, and consider the implications for routine scanning of student work.
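
The pair counts and the idea of "common triples" in the abstract can be illustrated with a short sketch. It is not the Ferret implementation: it assumes that a triple is an overlapping three-word sequence, and the tokenisation rule, the function names `word_triples` and `common_triple_counts`, and the toy documents are all hypothetical choices made for illustration. The last lines only re-derive the pair count and percentages quoted in the abstract.

```python
import re
from itertools import combinations
from math import comb


def word_triples(text):
    """Return the set of overlapping three-word sequences in text.

    Assumption: lower-cased alphanumeric tokens; Ferret's own
    tokenisation may differ.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def common_triple_counts(docs):
    """Map each unordered pair of document names to its shared-triple count."""
    triples = {name: word_triples(text) for name, text in docs.items()}
    return {
        (a, b): len(triples[a] & triples[b])
        for a, b in combinations(sorted(triples), 2)
    }


# Hypothetical toy corpus, for illustration only.
docs = {
    "a.txt": "the cat sat on the mat today",
    "b.txt": "yesterday the cat sat on the sofa",
    "c.txt": "an entirely different sentence about plagiarism",
}
counts = common_triple_counts(docs)

# Figures from the abstract: 483 files give C(483, 2) unordered pairs.
total_pairs = comb(483, 2)
share_over_99 = 116 / total_pairs   # pairs with more than 99 common triples
share_over_200 = 19 / total_pairs   # pairs with over 200 matching triples
```

Comparing every pair directly, as here, is quadratic in the number of documents. That is manageable for a few hundred files, but a detector built for larger collections would more plausibly index triples once and look up shared entries.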

Text similarity in academic conference papers
