GNP-based Scheduling Strategy for Distributed Crawling

刘爽,姜春祥,张伟哲,李东,张鸿
DOI: https://doi.org/10.3969/j.issn.1001-3695.2010.02.011
2010-01-01
Abstract: To address task scheduling and load balancing in distributed search engines, this paper proposes a GNP-based scheduling strategy for distributed crawling, together with a load balancing method. An Internet distance estimation mechanism replaces large-scale network distance measurement, which both improves the system's response time and reduces the WAN load the system generates. A distributed search engine was built by deploying crawling nodes across WANs, and several scheduling strategies were implemented on it. Online experiments show a substantial improvement in system performance.
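The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea: assign each target host to the crawler node with the smallest estimated network distance, subject to a load cap. It assumes that every node and host already has estimated coordinates in some Euclidean space (as coordinate-based distance estimation schemes provide) and that load is counted as the number of assigned hosts. The names `CrawlerNode`, `estimated_distance`, and `assign_host`, as well as the capacity rule, are hypothetical and not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CrawlerNode:
    """A crawling node with hypothetical estimated network coordinates."""
    name: str
    coords: Sequence[float]  # assumed output of a distance-estimation step
    capacity: int            # assumed cap on hosts assigned to this node
    load: int = 0            # hosts currently assigned


def estimated_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Estimate network distance as Euclidean distance between coordinates."""
    return math.dist(a, b)


def assign_host(host_coords: Sequence[float], nodes: List[CrawlerNode]) -> CrawlerNode:
    """Pick the nearest node that still has spare capacity and record the load."""
    candidates = [n for n in nodes if n.load < n.capacity]
    if not candidates:
        raise RuntimeError("all crawler nodes are at capacity")
    best = min(candidates, key=lambda n: estimated_distance(n.coords, host_coords))
    best.load += 1
    return best


if __name__ == "__main__":
    nodes = [
        CrawlerNode("node-a", (0.0, 0.0), capacity=1),
        CrawlerNode("node-b", (10.0, 0.0), capacity=2),
    ]
    for host in [(1.0, 0.0), (1.0, 0.0), (9.0, 0.0)]:
        print(host, "->", assign_host(host, nodes).name)
```

In this sketch no live measurement is made at scheduling time: the scheduler only compares precomputed coordinates, which is the property the abstract credits with reducing WAN traffic. The capacity check is one simple way to keep a node near many hosts from being overloaded; the paper's actual load balancing method may differ.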