Inter-crawler communication optimization algorithms of distributed Web crawling system based on the content addressable network

Weizhe Zhang, Xiao Xu
2011-01-01
Abstract: To maintain high coverage and a low repetition rate while balancing the page load across crawlers, a distributed Web crawling system usually exchanges URLs (called "inter-links") between crawlers, which incurs heavy inter-crawler communication loads. To reduce the total number of inter-links, a link coordinate model based on the link relations among different Web hosts is proposed. A novel inter-crawler communication optimization algorithm is then put forward. Experiments on five kinds of link analysis data sets show that the algorithm achieves good overall performance in total inter-link number and routing cost compared with other algorithms.
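To make the quantity being minimized concrete, the following minimal Python sketch counts inter-links for a given assignment of Web hosts to crawlers. It is not the algorithm from the paper: the helper names (`host_of`, `assign_crawler`, `count_inter_links`) and the CRC32 hash partitioning are assumptions chosen for illustration only, and an inter-link is assumed here to be any link whose source and target hosts belong to different crawlers.

```python
import zlib
from urllib.parse import urlsplit


def host_of(url):
    """Return the host part of a URL (empty string if it has none)."""
    return urlsplit(url).hostname or ""


def assign_crawler(host, num_crawlers):
    """Hypothetical baseline: map a host to a crawler by a stable hash.

    This plain hash partitioning is an assumption for illustration; it is
    not the link coordinate model described in the abstract.
    """
    return zlib.crc32(host.encode("utf-8")) % num_crawlers


def count_inter_links(links, owner):
    """Count links whose source and target hosts have different owners.

    links: iterable of (source_url, target_url) pairs.
    owner: callable mapping a host name to a crawler id.
    """
    return sum(
        1 for src, dst in links
        if owner(host_of(src)) != owner(host_of(dst))
    )
```

Under these assumptions, an assignment that places heavily interlinked hosts on the same crawler yields a lower count, and therefore fewer URLs exchanged between crawlers, than one that separates them.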