Utilizing Text Similarity Measurement for Data Compression to Detect Plagiarism in Czech

Hussein Soori, Michal Prilepok, Jan Platos, Václav Snášel
DOI: https://doi.org/10.1007/978-3-319-13572-4_13
2015-01-01
Abstract: This paper applies a data-compression-based similarity method to plagiarism detection. The method was used previously for plagiarism detection in Arabic and English. Here it is applied to Czech text from a local multi-domain Czech corpus with 50 original documents with non-plagiarized parts and 100 suspicious documents. The documents were generated so that each contains from 1 to 5 paragraphs, and the suspicion rate of each document was chosen at random between 0.2 and 0.8. The findings show that similarity measurement based on Lempel-Ziv comparison algorithms is effective at detecting the plagiarized parts of Czech text documents, with a success rate of 82.60%. Future studies may improve the efficiency of the algorithms by including combined and more sophisticated methods.
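To make the idea of a Lempel-Ziv based similarity concrete, the following is a minimal sketch and not the authors' implementation. It assumes an LZ78-style parse that splits each text into a dictionary of phrases, and it scores a suspicious text by the fraction of its phrases that also appear in the original's dictionary. The abstract does not state the exact Lempel-Ziv variant or scoring formula, so the function names `lz78_phrases` and `lz_similarity` and the scoring rule are assumptions chosen for illustration.

```python
def lz78_phrases(text: str) -> set:
    """Return the set of phrases from an LZ78-style parse of `text`.

    Each phrase is the shortest prefix of the remaining input that is
    not yet in the dictionary.
    """
    phrases = set()
    current = ""
    for ch in text:
        current += ch
        if current not in phrases:
            phrases.add(current)
            current = ""
    # A trailing partial phrase is already in the dictionary, so it is dropped.
    return phrases


def lz_similarity(original: str, suspicious: str) -> float:
    """Assumed score: share of the suspicious text's phrases that also
    occur in the original text's phrase dictionary (0.0 to 1.0)."""
    original_phrases = lz78_phrases(original)
    suspicious_phrases = lz78_phrases(suspicious)
    if not suspicious_phrases:
        return 0.0
    shared = original_phrases & suspicious_phrases
    return len(shared) / len(suspicious_phrases)


if __name__ == "__main__":
    source = "Praha je hlavní město České republiky."
    copied = "Praha je hlavní město České republiky."
    unrelated = "xyzxyzxyz"
    print(lz_similarity(source, copied))     # identical texts share every phrase
    print(lz_similarity(source, unrelated))  # no characters in common
```

In a setup like the one the abstract describes, a score of this kind would be computed between each suspicious document and each original, and pairs above some threshold would be flagged. The threshold and any per-paragraph splitting are likewise not specified in the abstract.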