On Distributed Web Crawler: Architecture, Algorithms and Strategy

叶允明,于水,马范援,宋晖,张岭
DOI: https://doi.org/10.3321/j.issn:0372-2112.2002.z1.023
2002-01-01
Abstract: We describe Igloo V1.2, a large-scale distributed Web crawler system. Igloo's distributed architecture is based on our two-tiered hash mapping algorithm, which partitions crawling tasks efficiently while also providing dynamic scalability. Because the quality of crawled Web pages is an important factor in evaluating crawlers, Igloo uses the PageRank value of a page as its evaluation metric to improve crawling efficiency. This paper also discusses the performance bottlenecks of crawler systems in detail, and proposes a new URL repository access method based on a "delayed merging" strategy to enable high-speed crawling. Experiments show that Igloo crawls high-quality Web pages quickly while maintaining high performance.
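The abstract does not spell out the two tiers of the hash mapping, so the following is only an illustrative sketch of one common way such a scheme can work, not the authors' algorithm. It assumes tier 1 hashes a URL's host name into a fixed number of logical buckets, and tier 2 is a mutable table assigning buckets to crawler nodes. All names here (`NUM_BUCKETS`, `bucket_of`, `BucketTable`, the node labels) are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

NUM_BUCKETS = 64  # assumed fixed number of logical buckets (tier 1)


def bucket_of(url: str) -> int:
    """Tier 1 (assumed): hash the URL's host name to a fixed logical bucket."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


class BucketTable:
    """Tier 2 (assumed): a mutable table mapping logical buckets to nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # Initial assignment: spread buckets round-robin over the nodes.
        self.table = [self.nodes[b % len(self.nodes)] for b in range(NUM_BUCKETS)]

    def node_for(self, url: str) -> str:
        """Return the node responsible for crawling this URL."""
        return self.table[bucket_of(url)]

    def add_node(self, node: str) -> None:
        """Give the new node every k-th bucket (k = new node count).

        Only the reassigned buckets change owner; all other URLs keep
        their current node, which is what makes scaling cheap here.
        """
        self.nodes.append(node)
        k = len(self.nodes)
        for b in range(k - 1, NUM_BUCKETS, k):
            self.table[b] = node


table = BucketTable(["node-0", "node-1"])
owner = table.node_for("http://example.com/index.html")
table.add_node("node-2")  # moves roughly a third of the buckets
```

Under these assumptions, all pages of one host land on the same node, which keeps per-host politeness state local, and adding a node only rewrites entries in the bucket table rather than rehashing every URL.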