Exploiting Location-aware Mechanism for Distributed Web Crawling over DHTs.

Xiao Xu, Weizhe Zhang, Hongli Zhang, Binxing Fang
DOI: https://doi.org/10.4304/jcp.5.11.1646-1654
2010-01-01
Abstract: Inspired by the concept of internet computing, a DHT-based distributed Web crawling model is proposed to solve the bottlenecks of traditional Web crawling systems. Based on this system model, we propose optimizations that reduce the download time of Web crawling tasks and thereby increase the efficiency of the system. The improvement in download time is achieved by shortening the crawler-crawlee network distance. By utilizing the mapping mechanism of a Content Addressable Network (CAN) over a Network Coordinate System (NC), the issue can be mapped onto the problem of minimizing the distances between peers and resources on the DHT overlay. This paper focuses on reducing such distances, seeking to provide an improved location-aware infrastructure for distributed Web crawling. A new DHT-based distributed Web crawling model is proposed first. Then, under this model, a new method based on CAN's splitting schemes is proposed, which shows a significant decrease in crawler-crawlee distance compared with existing schemes. In addition, the issue of load balancing is solved by combining the new method with existing ones.
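The mapping the abstract describes, in which crawlers and Web sites share one coordinate space and each site is crawled by the peer that owns the surrounding CAN zone, can be illustrated with a minimal Python sketch. It assumes a two-dimensional unit coordinate space and the plain CAN rule of halving a zone along alternating dimensions when a peer joins; it is not the new splitting method proposed in the paper. The names `Zone`, `join`, and `assign`, and all coordinates, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    lo: tuple          # lower corner of the zone (inclusive)
    hi: tuple          # upper corner of the zone (exclusive)
    owner: str         # peer (crawler) responsible for this zone
    depth: int = 0     # number of splits that produced this zone

    def contains(self, point):
        return all(l <= x < h for l, x, h in zip(self.lo, point, self.hi))


def join(zones, name, pos):
    """Plain CAN-style join (an assumption, not the paper's scheme):
    halve the zone containing pos; the newcomer takes the half holding pos."""
    zone = next(z for z in zones if z.contains(pos))
    dim = zone.depth % len(pos)                    # alternate split dimension
    mid = (zone.lo[dim] + zone.hi[dim]) / 2
    lower_hi = zone.hi[:dim] + (mid,) + zone.hi[dim + 1:]
    upper_lo = zone.lo[:dim] + (mid,) + zone.lo[dim + 1:]
    lower = Zone(zone.lo, lower_hi, zone.owner, zone.depth + 1)
    upper = Zone(upper_lo, zone.hi, zone.owner, zone.depth + 1)
    (lower if lower.contains(pos) else upper).owner = name
    zones.remove(zone)
    zones.extend([lower, upper])


def assign(zones, site_pos):
    """Return the peer whose zone contains the site's network coordinate."""
    return next(z for z in zones if z.contains(site_pos)).owner


# Hypothetical example: peer A owns the whole space, then peer B joins.
zones = [Zone((0.0, 0.0), (1.0, 1.0), "A")]
join(zones, "B", (0.8, 0.7))
crawler = assign(zones, (0.9, 0.1))   # a site whose coordinate lies in B's half
```

Under plain halving, the existing owner keeps whichever half the newcomer does not take, even when its own coordinate lies in the other half, so a site can end up assigned to a peer that is not its nearest neighbor. That gap between zone ownership and actual network distance is the kind of crawler-crawlee distance the abstract says the paper aims to reduce.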