Design and Research of Web Crawler Based on Distributed Architecture

Lili Wang, Haoliang Wang
DOI: https://doi.org/10.1145/3495018.3495061
2021-10-23
Abstract: Internet data is abundant in content and diverse in organization. To automate the collection, analysis, and storage of large amounts of data and information on the web, a web crawler technology based on Hadoop distributed clusters is proposed. The design combines the Nutch crawler framework, Hadoop distributed technology, and the Zookeeper distributed coordination service framework on a distributed cluster, with the high-performance key-value database Redis used to store data, and it verifies the feasibility of distributed crawlers. Data collection experiments comparing multiple sets of experimental data from the Nutch-based distributed crawler and a traditional crawler show that the distributed crawler design is superior to the traditional crawler in collection speed and efficiency.