Implementation of Distributed Crawler System Based on Spark for Massive Data Mining

Feng Liu, Wang Xin
DOI: https://doi.org/10.1109/icccs49078.2020.9118442
2020-01-01
Abstract: With the rapid development of Internet technology and growing user demand, web crawlers have become a mature technology in major search engines and related search fields. Building on Spark's RDD (Resilient Distributed Dataset) computing architecture and a task assignment algorithm, this paper presents the architecture of a Spark-based distributed crawler system, gives the corresponding framework diagram, and describes the distributed framework in detail. The Spark-based distributed crawler system can resolve the problems of insufficient resource utilization and low collection efficiency, and thereby ease the conflict between the explosive growth of data volume and the speed at which information can be obtained.