Design of the Distributed Web Crawler

Xing Chen, Weijiang Li, Tiejun Zhao, Xinghai Piao
DOI: https://doi.org/10.4028/www.scientific.net/amr.204-210.1454
2011-01-01
Advanced Materials Research
Abstract: At the current scale of the Internet, a single web crawler cannot visit the entire web within a practical time frame, so we develop a distributed web crawler system. Our design considers two levels of parallelism: multithreading within each node and distributed parallelism among nodes. We focus on the latter. We address two issues of the distributed web crawler: the crawl strategy and dynamic configuration. Experimental results show that a hash function based on the web site achieves the goal of the distributed web crawler. While pursuing load balance across the system, we should also reduce communication and management overhead as much as possible.
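The site-based hash assignment mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the hash function or how nodes are numbered, so the function name `node_for_url`, the use of MD5, and the modulo mapping below are all assumptions chosen for illustration, not the authors' implementation.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url: str, num_nodes: int) -> int:
    """Map a URL to a crawler node index based on its host name.

    Hypothetical helper: hashing the site (host) rather than the full
    URL keeps every page of one site on the same node.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    # urlsplit(...).hostname is lowercased and excludes the port.
    host = urlsplit(url).hostname or ""
    # MD5 is used only as a stable, well-spread hash (an assumption),
    # not for security. Python's built-in hash() is salted per process.
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


if __name__ == "__main__":
    for u in ("http://example.com/a", "http://example.com/b", "http://example.org/"):
        print(u, "->", node_for_url(u, 4))
```

Because the key is the host, links that stay within one site never need to be forwarded to another node, which is one way such a scheme might reduce communication between nodes. A plain modulo mapping reassigns most sites when the number of nodes changes; a scheme such as consistent hashing could limit that, though the abstract does not say which approach the authors take for dynamic configuration.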