Finding Plagiarism Based on Common Semantic Sequence Model

Jun-peng Bao, Jun-yi Shen, Xiao-dong Liu, Hai-yan Liu, Xiao-di Zhang
DOI: https://doi.org/10.1007/978-3-540-27772-9_66
2004-01-01
Abstract: Finding document features is one of the key problems in Text Mining. The string matching model and the global word frequency model are two common models, but the former can hardly resist rewording noise, whereas the latter cannot capture document details. We present the Common Semantic Sequence Model (CSSM) and apply it to Document Copy Detection. CSSM combines the ideas of the two models above and makes a trade-off between a document's global features and its local features. CSSM calculates the proportion of common words between two documents' semantic sequences to produce a plagiarism score. A semantic sequence is a continual word sequence that remains after the low-density words are omitted. With the collection of the two documents' semantic sequences, we can detect plagiarism at a fine granularity. We test CSSM with several common copy types. The results show that CSSM is excellent for detecting non-rewording plagiarism and remains valid even if documents are reworded to some extent.
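The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, since the abstract does not define word density or the exact scoring formula. The sketch assumes that a word is "low-density" when it occurs fewer than `min_count` times in its document, that omitted words split the text into separate sequences, and that the score is the best common-word proportion over all sequence pairs. The names `tokenize`, `semantic_sequences`, `plagiarism_score`, `min_count`, and `min_len` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def semantic_sequences(words, min_count=2, min_len=2):
    """Return runs of consecutive 'dense' words.

    Assumption (not from the paper): a word is dense if it occurs at
    least min_count times in the document; any other word is omitted
    and ends the current run. Runs shorter than min_len are dropped.
    """
    counts = Counter(words)
    sequences, current = [], []
    for word in words:
        if counts[word] >= min_count:
            current.append(word)
        else:
            if len(current) >= min_len:
                sequences.append(current)
            current = []
    if len(current) >= min_len:
        sequences.append(current)
    return sequences


def plagiarism_score(doc_a, doc_b, min_count=2, min_len=2):
    """Best common-word proportion over all pairs of semantic sequences.

    Assumption (not from the paper): the proportion for one pair is
    2 * |common words| / (len(seq_a) + len(seq_b)), counting repeats.
    """
    seqs_a = semantic_sequences(tokenize(doc_a), min_count, min_len)
    seqs_b = semantic_sequences(tokenize(doc_b), min_count, min_len)
    best = 0.0
    for seq_a in seqs_a:
        count_a = Counter(seq_a)
        for seq_b in seqs_b:
            # Multiset intersection keeps the smaller count of each word.
            common = sum((count_a & Counter(seq_b)).values())
            best = max(best, 2.0 * common / (len(seq_a) + len(seq_b)))
    return best


if __name__ == "__main__":
    original = "the cat sat on the mat the cat sat on the mat"
    unrelated = "quick brown fox quick brown fox"
    print(plagiarism_score(original, original))
    print(plagiarism_score(original, unrelated))
```

Comparing sequence pairs rather than whole documents is what gives the sketch its local sensitivity, while the frequency filter brings in a global, word-frequency view of each document. Under these assumptions, the thresholds would need tuning on real data.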