Method for Checking Duplicate Text of Network Piracy Based on Phoneme

Zhe-fan JIN, Ding-guo YU, Sheng-you LIN, Zhong-cheng ZHOU
DOI: https://doi.org/10.3969/j.issn.1000-2324.2017.03.029
2017-01-01
Abstract: Traditional duplicate-checking methods segment a text into words to build term vectors, but that practical cost may not be reasonable or necessary for detecting online copyright violations in some specialized apps. This paper therefore proposes a duplicate-checking method based on Chinese phonology. A text is represented by three vectors in the spaces of Chinese initials, finals, and tones, and cosine distance is used as the similarity measure. Two decision models are proposed. The first assumes the three vectors are independent of one another; the second takes a linear combination of the three, whose coefficients are calculated from information entropies estimated by counting over a large corpus. A training corpus was generated using the older term-vector/SimHash method as the reference standard, and threshold values were calculated from it. Test results showed that the proposed method achieves good precision and very good recall, and that its computational cost is lower than that of traditional term-vector methods, making it suitable for filtering out large numbers of true-negative (TN) documents.
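The representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the text has already been converted to pinyin syllables with trailing tone digits (for example "zhong1"), and the function names, the placeholder thresholds, the equal weights, and the reading of the "independent" model as three separate threshold checks are all assumptions made for illustration. In the paper the weights come from information entropies and the thresholds from a training corpus; neither is reproduced here.

```python
from collections import Counter
from math import sqrt

# The 21 standard pinyin initials; two-letter initials come first so that
# "zh" is matched before "z". Syllables such as "an4" have no initial.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")


def split_syllable(syl):
    """Split a tone-numbered syllable, e.g. 'zhong1' -> ('zh', 'ong', '1')."""
    has_tone = syl[-1].isdigit()
    tone = syl[-1] if has_tone else "5"   # assumption: "5" marks neutral tone
    body = syl[:-1] if has_tone else syl
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone


def phoneme_vectors(syllables):
    """Return three count vectors: initials, finals, tones."""
    parts = [split_syllable(s) for s in syllables]
    return tuple(Counter(p[i] for p in parts) for i in range(3))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarities(x, y):
    """Cosine similarity in each of the three phoneme spaces."""
    return tuple(cosine(u, v) for u, v in zip(phoneme_vectors(x), phoneme_vectors(y)))


def is_duplicate_independent(sims, thresholds=(0.9, 0.9, 0.9)):
    """Assumed reading of the first model: each space passes its own threshold."""
    return all(s >= t for s, t in zip(sims, thresholds))


def combined_score(sims, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Second model: linear combination; the weights here are placeholders."""
    return sum(w * s for w, s in zip(weights, sims))
```

Because each vector is a bag of counts, two texts containing the same syllables in a different order score as identical, which fits the coarse pre-filtering role the abstract describes rather than a final duplicate decision.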