Implementation and Evaluation of Incremental Crawler Based on TianWang Search Engine

Lei Kai (雷凯), Wang Donghai (王东海)
DOI: https://doi.org/10.3969/j.issn.1000-3428.2008.13.029
2008-01-01
Abstract: This paper introduces an implementation of an incremental Web crawler that supports daily updates of a search engine covering millions of Web pages. After analyzing the weaknesses of traditional periodic crawlers and the difficulties of incremental crawling, it presents key strategies for predicting Web evolution, MD5-based algorithms for locating changed Web pages, and URL scheduling and caching. It then describes the implementation and evaluates the crawler system. The incremental crawler has been integrated with the TianWang search engine at Peking University for 6 months. The update cycle is reduced by 20 days, the accuracy of evolution prediction reaches 79.4%, and real-time efficiency, extensibility, and stability are improved.
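The abstract mentions locating changed Web pages based on MD5 but gives no details of the algorithm. As a minimal sketch of the general idea only, and not the authors' actual method, the Python below compares an MD5 digest of each freshly fetched page against the digest stored from the previous crawl. The names `page_digest` and `classify_pages`, the in-memory dictionary used as a digest store, and the example URLs are all assumptions made for illustration.

```python
import hashlib


def page_digest(content: bytes) -> str:
    """Return the MD5 hex digest of a page body."""
    return hashlib.md5(content).hexdigest()


def classify_pages(stored: dict, fetched: dict):
    """Compare fetched page bodies against stored digests.

    stored  -- hypothetical digest store: URL -> MD5 hex digest from the last crawl
    fetched -- URL -> page body (bytes) from the current crawl
    Returns three lists of URLs (changed, unchanged, new) and updates
    `stored` in place with the current digests.
    """
    changed, unchanged, new = [], [], []
    for url, content in fetched.items():
        digest = page_digest(content)
        previous = stored.get(url)
        if previous is None:
            new.append(url)        # never seen before
        elif previous != digest:
            changed.append(url)    # content differs from the last crawl
        else:
            unchanged.append(url)  # identical content
        stored[url] = digest
    return changed, unchanged, new


if __name__ == "__main__":
    # Hypothetical example data, not taken from the paper.
    store = {"http://example.com/a": page_digest(b"old body")}
    pages = {
        "http://example.com/a": b"new body",
        "http://example.com/b": b"hello",
    }
    print(classify_pages(store, pages))
```

A digest comparison of this kind only detects that a page changed after it has been fetched again; deciding which pages to refetch is the job of the evolution prediction and URL scheduling strategies that the abstract lists separately.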