Duplicate Web Page Elimination Based on Bloom Filter

Xu Na, Liu Siwei, Wang Xiang, Ni WeiMing
DOI: https://doi.org/10.3969/j.issn.1007-757X.2011.03.016
2011-01-01
Abstract: The internet contains many duplicated web pages, which make data mining and information retrieval more difficult. In this paper, we analyze the disadvantages of current algorithms and propose a new algorithm, based on Bloom Filter, to eliminate duplicated web pages. We use an existing refining algorithm to pre-process the web pages, and use Bloom Filter to process the duplicates, which reduces both running time and storage space. The paper uses long sentences to represent the features of a web page and turns the elimination process into a search process, so that Bloom Filter lookups reduce the running time.
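The idea of turning duplicate elimination into a membership search can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the filter size, the number of hash functions, how many long sentences are kept per page, or the decision rule, so `num_bits`, `num_hashes`, `top_n`, `threshold`, and the helper names `long_sentences` and `is_duplicate` are all assumptions chosen for illustration. The pre-processing ("refining") step is also omitted, and the input is assumed to be already-cleaned page text.

```python
import hashlib
import re


class BloomFilter:
    """Fixed-size Bloom filter; sizes below are illustrative assumptions."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive num_hashes bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly seen" (false positives can occur);
        # False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def long_sentences(text, top_n=3):
    """Hypothetical feature extractor: the top_n longest sentences of a page."""
    sentences = [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]
    return sorted(sentences, key=len, reverse=True)[:top_n]


def is_duplicate(text, seen, top_n=3, threshold=1.0):
    """Flag a page as duplicate when enough of its long sentences are in `seen`.

    The threshold rule is an assumption; pages judged new are added to `seen`.
    """
    features = long_sentences(text, top_n)
    if not features:
        return False
    hits = sum(1 for f in features if f in seen)
    duplicate = hits / len(features) >= threshold
    if not duplicate:
        for f in features:
            seen.add(f)
    return duplicate
```

In this sketch each page costs a fixed number of hash lookups regardless of how many pages have been seen, and the filter stores only bits rather than the sentences themselves, which is where the savings in running time and storage would come from. The trade-off is that a Bloom filter can return false positives, so a distinct page may occasionally be flagged as a duplicate.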