Research on method to detect reduplicative Chinese short texts

Xiang GAO, Bing LI
DOI: https://doi.org/10.3778/j.issn.1002-8331.1309-0424
2014-01-01
Abstract: This article presents an effective algorithmic framework for text de-duplication, targeting the redundancy problem in Chinese short texts. Because short texts are brief and arrive in huge volumes, the framework draws on the Bloom filter, the Trie tree, and the SimHash algorithm. In the first stage, a Bloom filter or Trie tree removes exact duplicates; in the second stage, the SimHash algorithm detects near-duplicates. The article also designs the parameters used in the framework and verifies its feasibility and rationality.
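The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bit-array size, the number of hash functions, the character-bigram features, the 64-bit fingerprint, and the Hamming-distance threshold `max_distance` are all assumptions chosen for illustration, and the names `BloomFilter`, `simhash`, `hamming`, and `deduplicate` are hypothetical. The sketch uses a Bloom filter for stage one; a Trie tree could fill the same role.

```python
import hashlib


class BloomFilter:
    """Stage 1 helper: probabilistic set for exact-duplicate lookups."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        # Both sizes are illustrative assumptions, not the paper's values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, text):
        # Double hashing: derive num_hashes bit positions from one digest.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, text):
        for pos in self._positions(text):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, text):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(text))


def simhash(text, ngram=2, bits=64):
    """Stage 2 helper: SimHash fingerprint over character n-grams (bits <= 64)."""
    features = [text[i:i + ngram] for i in range(max(1, len(text) - ngram + 1))]
    weights = [0] * bits
    for feature in features:
        h = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest()[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    # A fingerprint bit is set when the summed votes for it are positive.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(texts, max_distance=3):
    """Drop exact duplicates first, then texts whose fingerprint is close to a kept one."""
    seen = BloomFilter()
    kept = []  # (text, fingerprint) pairs that survived both stages
    for text in texts:
        if text in seen:          # stage 1: exact duplicate
            continue
        seen.add(text)
        fp = simhash(text)
        if any(hamming(fp, kept_fp) <= max_distance for _, kept_fp in kept):
            continue              # stage 2: near-duplicate
        kept.append((text, fp))
    return [text for text, _ in kept]
```

Two caveats apply to this sketch. A Bloom filter can return false positives, so a small fraction of new texts may be dropped as if they were exact duplicates; the rate depends on the bit-array size and the number of hash functions. The linear scan over kept fingerprints in stage two is quadratic overall and is only reasonable for small inputs.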