Workload-Aware Web Crawling and Server Workload Detection

Shaozhi Ye, Guohan Lu, Xing Li
2004-01-01
Abstract: With the development of search engines, more and more web crawlers are used to gather web pages. The rising crawling traffic has raised the concern that crawlers may impact web sites. On the other hand, a more efficient crawling strategy is required for the coverage and freshness of the search engine index. In this paper, crawlers of several major search engines are analyzed using a six-month access log of a busy web site. Surprisingly, we find that none of these crawlers pays attention to the workload of the web site, which may hurt both server performance and crawling efficiency. Based on this observation, a server workload-aware crawling strategy is proposed. By measuring the web service time with a hybrid back-to-back packet pair, server workload is detected on the client side, so that the crawler can adapt its crawling speed to the web server. The experimental results show the power of our workload detection approach. This paper concludes with a discussion of future work on server workload detection and its applications.
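The abstract does not spell out how the measured service time feeds back into crawling speed, so the following is only a minimal sketch under stated assumptions. It assumes that a lightweight probe and an HTTP request sent back to back see roughly the same network delay, so that their round-trip difference approximates the web service time, and it assumes a simple multiplicative back-off rule. The function names, the `overload_ratio` threshold, and the delay bounds are all hypothetical and are not taken from the paper.

```python
def estimate_service_time(probe_rtt: float, request_rtt: float) -> float:
    """Approximate server-side service time in seconds.

    Assumption: the probe and the HTTP request travel back to back, so the
    network delay cancels out when the two round-trip times are subtracted.
    """
    return max(0.0, request_rtt - probe_rtt)


def next_delay(
    current_delay: float,
    service_time: float,
    baseline: float,
    min_delay: float = 1.0,
    max_delay: float = 60.0,
    overload_ratio: float = 2.0,
) -> float:
    """Return the delay (seconds) to wait before the next request.

    Hypothetical rule: if the service time exceeds `overload_ratio` times the
    baseline (service time observed under light load), double the delay;
    otherwise shrink it by 10%. The result is clamped to [min_delay, max_delay].
    """
    if service_time > overload_ratio * baseline:
        proposed = current_delay * 2.0   # server looks busy: back off
    else:
        proposed = current_delay * 0.9   # server looks idle: speed up gently
    return min(max_delay, max(min_delay, proposed))
```

A crawler loop might call `estimate_service_time` after each fetch and pass the result to `next_delay` before sleeping. Doubling on overload and shrinking slowly otherwise is one conservative choice that favors the server; other adaptation rules would fit the same interface.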