Improved fuzzy set information retrieval approach on duplicate webpage detection

Yuchen Zhou, Zuoda D. Liu, Beixing Deng, Xing Li
2009-01-01
Journal of Information and Computational Science
Abstract: Similar Web pages are easily found on the Internet. This redundancy of information severely slows down Internet applications such as the crawl module of a search engine, and can waste storage in the indexing procedure. In this paper, we propose a content-based approach for detecting duplicate webpages. The algorithm contains three parts: i) pre-processing, which excludes HTML tags and unrelated information; ii) a query-combined fuzzy set information retrieval approach, which finds the correlation between every two documents; iii) thresholding, in which a threshold is set and duplicate webpages are eliminated. The original duplicate-detection algorithm is revised, with the revision focused mainly on performance optimization. Test results show that performance is greatly improved with an acceptable sacrifice in quality. Copyright ©2009 Binary Information Press.
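The three parts listed in the abstract can be illustrated with a minimal sketch. The abstract does not give the details of the query-combined fuzzy set information retrieval approach, so the sketch below swaps in a plain fuzzy Jaccard similarity (term memberships derived from normalized term frequency) as a stand-in. The function names `preprocess`, `membership`, `fuzzy_similarity`, and `deduplicate`, and the threshold value 0.9, are assumptions made for illustration and are not taken from the paper.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def preprocess(html):
    """Part i: drop HTML tags and return lowercase word tokens."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.parts).lower())


def membership(tokens):
    """Fuzzy membership per term: frequency scaled by the peak frequency.

    This weighting is an assumption, not the paper's membership function.
    """
    counts = Counter(tokens)
    if not counts:
        return {}
    peak = max(counts.values())
    return {term: count / peak for term, count in counts.items()}


def fuzzy_similarity(a, b):
    """Part ii (stand-in): fuzzy Jaccard, sum of minima over sum of maxima."""
    terms = set(a) | set(b)
    if not terms:
        return 0.0
    inter = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    union = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return inter / union


def deduplicate(pages, threshold=0.9):
    """Part iii: keep a page only if it is below the threshold
    against every page already kept. Returns indices of kept pages."""
    kept = []
    for index, html in enumerate(pages):
        m = membership(preprocess(html))
        if all(fuzzy_similarity(m, other) < threshold for _, other in kept):
            kept.append((index, m))
    return [index for index, _ in kept]
```

Comparing each page only against the pages already kept avoids a full pairwise comparison once duplicates start to be discarded. This is one simple way to reduce work and is not necessarily the optimization the paper describes.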