Design and Implementation of a Scalable Distributed Web Crawler Based on Hadoop

YuLiang Shi, Ti Zhang
DOI: https://doi.org/10.1109/icbda.2017.8078691
2017-01-01
Abstract: This paper designs and implements an efficient and scalable distributed web crawler system based on Hadoop. It first briefly introduces the application of cloud computing to web crawling. Then, in light of the current state of crawler systems, it uses Hadoop's distributed and cloud computing features to present the detailed design of a highly scalable crawler system. Finally, statistics collected from the system are compared, under the same conditions, with those of an existing mature system, and the comparison shows the superiority of the distributed web crawler. This advantage is particularly important in the big data era, when massive amounts of data must be crawled.