Exploring Web Partition in DHT-Based Distributed Web Crawling.

Xiao Xu, Weizhe Zhang, Hongli Zhang, Binxing Fang
DOI: https://doi.org/10.1587/transinf.e93.d.2907
2010-01-01
IEICE Transactions on Information and Systems
Abstract: The basic requirements of distributed Web crawling systems are short download time, low communication overhead, and balanced load, which largely depend on the system's Web partition strategy. In this paper, we propose a DHT-based distributed Web crawling system and several DHT-based Web partition methods. First, a new system model based on a DHT method called the Content Addressable Network (CAN) is proposed. Second, based on this model, a network-distance-based Web partition is implemented to reduce the crawler-crawlee network distance in a fully distributed manner. Third, by utilizing the locality on the link space, we propose the concept of link-based Web partition to reduce the communication overhead of the system. This method not only reduces the number of inter-links to be exchanged among the crawlers but also reduces the cost of routing on the DHT overlay. In order to combine the benefits of the above two Web partition methods, we then propose two distributed multi-objective Web partition methods. Finally, all the methods we propose in this paper are compared with existing system models in simulated experiments under different datasets and different system scales. In most cases, the new methods show their superiority.
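As a rough illustration of what a Web partition over a CAN-style overlay involves, the sketch below hashes a URL's hostname to a point in a two-dimensional coordinate space and assigns the URL to the crawler whose zone contains that point. This is a generic hash-based baseline, not the network-distance-based or link-based methods the paper proposes; the names `Zone`, `host_to_point`, `owner`, and `ZONES`, the choice of SHA-1, and the fixed four-quadrant layout are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Zone:
    """A rectangular zone of the unit square owned by one crawler (hypothetical)."""
    crawler: str
    x0: float
    x1: float
    y0: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        # Half-open intervals so that adjacent zones never overlap.
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1


def host_to_point(url: str) -> tuple[float, float]:
    """Hash the hostname to a point in [0, 1) x [0, 1).

    Hashing the host rather than the full URL keeps all pages of one
    site on the same crawler (an assumption of this sketch).
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    x = int.from_bytes(digest[:4], "big") / 2**32
    y = int.from_bytes(digest[4:8], "big") / 2**32
    return x, y


def owner(url: str, zones: list[Zone]) -> str:
    """Return the crawler whose zone contains the URL's point."""
    x, y = host_to_point(url)
    for zone in zones:
        if zone.contains(x, y):
            return zone.crawler
    raise LookupError("zones do not cover the coordinate space")


# Four crawlers, each owning one quadrant of the unit square.
ZONES = [
    Zone("c0", 0.0, 0.5, 0.0, 0.5),
    Zone("c1", 0.5, 1.0, 0.0, 0.5),
    Zone("c2", 0.0, 0.5, 0.5, 1.0),
    Zone("c3", 0.5, 1.0, 0.5, 1.0),
]

if __name__ == "__main__":
    for u in ("http://example.org/a", "http://example.org/b", "http://example.com/"):
        print(u, "->", owner(u, ZONES))
```

In a real CAN deployment the zones would be split and merged as crawlers join and leave, and the lookup would be routed hop by hop between neighboring zones rather than scanned from a local list.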