A GNP-Based Scheduling Strategy for Distributed Crawling

Shuang Liu, Xiao Xu, Dong Li, Wei-zhe Zhang, Xin-ran Liu
DOI: https://doi.org/10.1109/wism.2009.136
2009-01-01
Abstract: To address task scheduling and load balancing in distributed search engines, this paper proposes a GNP-based scheduling strategy for distributed crawling, together with a load balancing method. An Internet distance estimation mechanism replaces large-scale network distance measurement, which both improves the system's response speed and reduces the load the system places on the WAN. By deploying crawling nodes across WANs, we built a distributed search engine and implemented several scheduling strategies. Online experiments show a marked improvement in system performance.
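The idea of scheduling by estimated rather than measured distance can be illustrated with a minimal sketch. It assumes that GNP here means a coordinate-based scheme in which every crawler and target host already has a position in a geometric space, so that distance is computed instead of probed. The abstract does not describe the paper's actual algorithm; the names `estimated_distance` and `assign_hosts`, the 2-D coordinates, and the per-node `capacity` limit are all hypothetical choices made for illustration.

```python
import math


def estimated_distance(a, b):
    """Euclidean distance between two coordinate tuples.

    Assumption: this stands in for a measured network distance.
    """
    return math.dist(a, b)


def assign_hosts(crawlers, hosts, capacity):
    """Greedily assign each host to the nearest crawler with spare capacity.

    crawlers, hosts: dicts mapping a name to a coordinate tuple.
    capacity: hypothetical maximum number of hosts per crawler,
              used here as a crude form of load balancing.
    """
    load = {name: 0 for name in crawlers}
    assignment = {}
    for host, host_coord in hosts.items():
        # Only crawlers that are not yet full are eligible.
        candidates = [n for n in crawlers if load[n] < capacity]
        if not candidates:
            raise ValueError("total capacity is smaller than the number of hosts")
        best = min(
            candidates,
            key=lambda n: estimated_distance(crawlers[n], host_coord),
        )
        assignment[host] = best
        load[best] += 1
    return assignment


# Illustrative coordinates only; not data from the paper.
crawlers = {"node-a": (0.0, 0.0), "node-b": (10.0, 0.0)}
hosts = {"h1": (1.0, 1.0), "h2": (9.0, 1.0), "h3": (2.0, 0.0), "h4": (3.0, 0.0)}
assignment = assign_hosts(crawlers, hosts, capacity=2)
```

In this toy run `h4` is closest to `node-a`, but `node-a` is already full, so it falls through to `node-b`. That is the trade-off between proximity and load that a scheduling strategy of this kind has to manage.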