A binary-tree based algorithm for online duplicate documents detection

Zuoda D. Liu, Jiuling Zhang, Xing Li
2009-01-01
Journal of Information and Computational Science
Abstract: The redundancy of web information is growing rapidly with the development of the Internet. Research on duplicate detection is ongoing, and new methods are strongly needed. In this paper, we propose a novel algorithm for online duplicate document detection that has four main features. First, it greatly reduces computational complexity, especially for very large document collections. Second, it achieves high precision in our practical experiments. Third, it performs dynamic detection, so it can keep working as new documents are added. Last, it is self-adaptive and has few parameters. This approach is suitable for information retrieval as well as other applications that deal with document processing.
1548-7741/ Copyright © 2009 Binary Information Press.