Optimal bandwidth allocation for web crawler systems with time constraints

Weiping Zhu, Yaodong Li, Shu Li, Yi Xu, Xiaohui Cui
DOI: https://doi.org/10.1007/s12652-020-02377-1
IF: 3.662
2020-08-25
Journal of Ambient Intelligence and Humanized Computing
Abstract: A web crawler is an important tool for obtaining information from the Internet in a timely manner. In a typical web crawler system with limited bandwidth, many websites are crawled under different time constraints. Existing studies of web crawler systems do not consider bandwidth allocation in such a complex environment; hence, the time constraints may not be satisfied. In this study, we investigate bandwidth allocation approaches for such a web crawler system. The approaches are designed for two scenarios: the number of websites either exceeds or does not exceed the maximum number of web crawlers that the system can execute simultaneously. For the latter scenario, we propose approaches that control the bandwidth of the web crawlers to minimize either the maximum completion time or the sum of the execution times of all web crawlers, under both sufficient and insufficient bandwidth. For the former scenario, we propose a round-based reallocation approach that schedules both the execution sequence and the bandwidth allocation of the web crawlers. Extensive simulations are conducted to validate the proposed approaches, and the results show that they satisfy the time constraints well and achieve desirable execution performance in various scenarios.
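To make the "minimize the maximum completion time" objective concrete, the following is a minimal sketch under simplifying assumptions, not the algorithm from the paper. It assumes a fluid model in which each crawler receives a constant bandwidth share, a crawler with data volume `s` and bandwidth `b` finishes at time `s / b`, and all crawlers run simultaneously. The function name `allocate_min_makespan` and its parameters are hypothetical, introduced only for illustration.

```python
def allocate_min_makespan(sizes, deadlines, total_bw, iters=60):
    """Illustrative sketch (assumed fluid model, not the paper's method).

    sizes[i]     : data volume crawler i must download
    deadlines[i] : time constraint for crawler i
    total_bw     : shared bandwidth, in size units per time unit

    Returns (makespan, bandwidths), or None when even the minimum
    bandwidth needed to meet every deadline exceeds total_bw.
    """
    # Minimum bandwidth each crawler needs to finish exactly at its deadline.
    required = [s / d for s, d in zip(sizes, deadlines)]
    if sum(required) > total_bw:
        return None  # insufficient bandwidth: deadlines cannot all be met

    def demand(t):
        # Bandwidth needed so that crawler i finishes by min(deadline_i, t).
        return sum(s / min(d, t) for s, d in zip(sizes, deadlines))

    lo = sum(sizes) / total_bw  # makespan if no deadline were binding
    hi = max(deadlines)         # always feasible once the check above passes
    if demand(lo) <= total_bw:
        hi = lo
    else:
        # demand(t) is non-increasing in t, so bisection finds the smallest
        # feasible makespan up to floating-point precision.
        for _ in range(iters):
            mid = (lo + hi) / 2
            if demand(mid) <= total_bw:
                hi = mid
            else:
                lo = mid

    return hi, [s / min(d, hi) for s, d in zip(sizes, deadlines)]
```

In this sketch, when no deadline is binding every crawler finishes at the same instant, with bandwidth proportional to its data volume. A crawler with a tight deadline instead receives exactly the bandwidth it needs to meet that deadline, and the remainder is shared among the others.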
computer science, information systems, telecommunications, artificial intelligence