A novel multi-threaded web crawling model

Weijie.Jiang
2024-05-09
Abstract: This paper proposes a novel web crawling model suited to large-scale web data acquisition. The model first divides the web data into several subsets, each corresponding to a thread task. Within each thread task, crawling tasks are executed concurrently, and the crawled data are stored in a buffer queue to await parsing. The parsing process is likewise divided into several threads. After building the model and running repeated crawler tests, the author found that it performs significantly better than a single-threaded approach.
Databases
What problem does this paper attempt to address?
The paper addresses the inefficiency of traditional single-threaded web crawlers in large-scale data scraping tasks. The author proposes a multi-threaded web crawler model that aims to make full use of computing resources and speed up data retrieval by running multiple crawling tasks concurrently. Specifically, the model divides the web data into several subsets, each corresponding to a thread task; within each thread task, crawling tasks are executed in parallel, and the scraped data are stored in a buffer queue to await parsing. The parsing process is also split across multiple threads. In comparative experiments against a single-threaded crawler, the multi-threaded model showed a significant advantage on large-scale datasets, greatly reducing the time required for data scraping and thereby improving overall performance. The experimental results show that, under specific conditions, the new model achieved an 81.11% improvement over the single-threaded crawler, demonstrating its potential and value in practical applications.
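The pipeline described above (partition the data, crawl each partition concurrently, buffer the raw pages in a queue, parse with separate threads) can be illustrated with a minimal Python sketch. This is not the paper's implementation, which is not shown here. The function names (`partition`, `crawl_partition`, `parse_worker`, `run_crawler`), the thread counts, and the injected `fetch` and `parse` callables are all hypothetical choices made for illustration.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel that tells a parser thread to stop


def partition(urls, n_parts):
    """Split the URL list into n_parts sub-lists (round-robin; an assumed scheme)."""
    return [urls[i::n_parts] for i in range(n_parts)]


def crawl_partition(urls, fetch, buffer, workers):
    """One thread task: fetch its URLs concurrently and enqueue the raw pages."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, page in zip(urls, pool.map(fetch, urls)):
            buffer.put((url, page))


def parse_worker(buffer, parse, results, lock):
    """Parser thread: drain the buffer queue until the sentinel arrives."""
    while True:
        item = buffer.get()
        if item is _DONE:
            break
        url, page = item
        parsed = parse(page)
        with lock:  # results is shared between parser threads
            results[url] = parsed


def run_crawler(urls, fetch, parse, n_partitions=4, workers=4, n_parsers=2):
    """Run the whole pipeline and return {url: parsed_result}."""
    buffer = queue.Queue()
    results = {}
    lock = threading.Lock()

    # Start parsers first so they consume pages while crawling is in progress.
    parsers = [
        threading.Thread(target=parse_worker, args=(buffer, parse, results, lock))
        for _ in range(n_parsers)
    ]
    for t in parsers:
        t.start()

    # One crawler thread per non-empty partition.
    crawlers = [
        threading.Thread(target=crawl_partition, args=(part, fetch, buffer, workers))
        for part in partition(urls, n_partitions)
        if part
    ]
    for t in crawlers:
        t.start()
    for t in crawlers:
        t.join()

    # All pages are enqueued; send one sentinel per parser, then wait.
    for _ in parsers:
        buffer.put(_DONE)
    for t in parsers:
        t.join()
    return results


if __name__ == "__main__":
    # Stand-ins for real network and HTML-parsing code (assumptions).
    demo_urls = [f"https://example.test/page/{i}" for i in range(10)]
    pages = run_crawler(demo_urls, lambda u: f"<html>{u}</html>", len)
    print(len(pages), "pages parsed")
```

Passing `fetch` and `parse` in as parameters keeps the sketch independent of any particular HTTP or HTML library. In a real crawler they would wrap network requests and page parsing, and they would need error handling, which this sketch omits.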