Research and Evaluation of Near-replicas of Web Pages Detection Algorithms

Jian-yong WANG, Zheng-mao XIE, Ming LEI, Xiao-ming LI
DOI: https://doi.org/10.3321/j.issn:0372-2112.2000.Z1.033
2000-01-01
Tien Tzu Hsueh Pao/Acta Electronica Sinica
Abstract: Many documents are replicated across the World Wide Web. Finding near-replicas of web pages efficiently and accurately has become an important topic in search engine research, because it can improve the quality of search services. We propose five near-replica detection algorithms for search engines that rely on keyword matching, and evaluate them using the WebGather search engine system. We also compare our method with one of the most popular copy detection mechanisms. Our method has been successfully adopted to remove near-replicas of web pages in WebGather, and it can also be widely applied to building libraries.