The Study on Detecting Near-Duplicate WebPages

YuJuan Cao, ZhenDong Niu, WeiQiang Wang, Kun Zhao
DOI: https://doi.org/10.1109/cit.2008.4594656
2008-01-01
Abstract: Reprinting information among websites produces a great deal of redundant web pages. To improve search efficiency and user satisfaction, an algorithm to Detect near-Duplicate WebPages (DDW) is proposed. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we consider both syntactic and semantic information to represent documents and compute their similarities. Second, after classifying web pages into different categories, we index the features within each category and then search for near-duplicates only in the same category. From Google search results for 72 queries, we manually select 5,835 near-duplicate web pages and insert them into an existing collection of about 768,763 web pages to form the test data. The experimental results demonstrate that our approach outperforms I-Match algorithms. In large-scale tests, approximately linear time and space complexity is obtained.
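The second contribution, searching for near-duplicates only within a page's own category, can be illustrated with a minimal sketch. This is not the DDW algorithm itself: the abstract does not specify the features or the similarity measure, so the sketch assumes word shingles compared by Jaccard similarity as a purely syntactic stand-in, omits the semantic component, and assumes category labels are supplied by some external classifier. The names `shingles`, `jaccard`, `CategoryIndex`, and the threshold value are hypothetical.

```python
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed syntactic feature)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class CategoryIndex:
    """Hypothetical per-category index: pages are compared only within a category."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold          # assumed similarity cut-off
        self.buckets = defaultdict(list)    # category -> [(doc_id, shingle set)]

    def add(self, doc_id, category, text):
        """Store a page's features under its (externally assigned) category."""
        self.buckets[category].append((doc_id, shingles(text)))

    def near_duplicates(self, category, text):
        """Return ids of indexed pages in the same category above the threshold."""
        query = shingles(text)
        return [
            doc_id
            for doc_id, feats in self.buckets.get(category, [])
            if jaccard(query, feats) >= self.threshold
        ]
```

Restricting the comparison to one bucket means a query never touches pages in other categories, which is the source of the savings the abstract describes. The linear scan inside a bucket is kept only for brevity; a real system would replace it with an index over the features.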