Anywhere: A Web Crawler Automation Management Interface

Jinwei Lin
2024-05-10
Abstract: Web crawling is significant in the current information age. A web spider or crawler can automatically search and collect a huge amount of internet information. As one of the most popular web crawler frameworks, Scrapy is rich in functions but not easy to operate. In this paper, we provide a framework, Anywhere, for improving the user experience and efficiency of web crawling management with Scrapy. We analyse the whole workflow of a Scrapy web crawling project and design two main functions in Anywhere: one quickly generates a Scrapy project from preset templates, and the other is a repeatable configuration function for Scrapy project settings. In addition, with Anywhere, users can directly manage multiple Scrapy projects through a file-folder architecture. Compared with normal interactive coding development of Scrapy projects, our experiments show that Anywhere can improve the development efficiency of Scrapy projects to about 200%. For multiple-project management at the code interaction level, the development efficiency is improved to about 300%. We simplify the procedure for quickly generating a simple spider project with Scrapy. Anywhere can assist Scrapy development and is useful for the design of large-batch concurrent projects at the coding level.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper presents a web crawler automation management framework called Anywhere, which aims to improve the user experience and productivity of Scrapy, a popular Python web scraping framework. Despite its powerful features, Scrapy is weak on usability. Anywhere analyzes the workflow of a Scrapy project and provides two main functions: fast generation of pre-configured Scrapy projects and repeatable configuration of Scrapy project settings. Anywhere also lets users manage multiple Scrapy projects through a folder structure. Compared with traditional interactive coding development of Scrapy projects, the paper reports experiments in which Anywhere raises development efficiency to about 200%, and to about 300% for multiple-project management at the code interaction level. Anywhere simplifies the process of creating simple Scrapy projects and, as a tool separate from Scrapy, its design principles and solutions can be applied to other application domains. The paper also reviews related literature and discusses the challenges of using Scrapy, other web scraping frameworks, and methods to enhance Scrapy efficiency. In conclusion, the paper aims to address the limitations of Scrapy in usability and efficiency by providing, through Anywhere, a more convenient and efficient environment for managing and developing web crawler projects.
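The two functions described above, generating a project from a preset and re-applying configuration across a folder of projects, might be sketched as follows. This is a minimal illustration under stated assumptions, not Anywhere's actual implementation: the names `PRESET`, `generate_project`, and `configure_all` are hypothetical, the preset values are arbitrary, and the sketch writes only `scrapy.cfg`, `settings.py`, and an empty `spiders` package instead of calling Scrapy itself. The setting names (`BOT_NAME`, `SPIDER_MODULES`, `NEWSPIDER_MODULE`, `ROBOTSTXT_OBEY`, `DOWNLOAD_DELAY`, `CONCURRENT_REQUESTS`) are standard Scrapy settings.

```python
import tempfile
from pathlib import Path

# Hypothetical preset: real Scrapy setting names, illustrative values.
PRESET = {
    "ROBOTSTXT_OBEY": True,
    "DOWNLOAD_DELAY": 1.0,
    "CONCURRENT_REQUESTS": 8,
}


def render_settings(name, overrides):
    """Render the text of settings.py from the preset plus per-call overrides."""
    merged = {
        "BOT_NAME": name,
        "SPIDER_MODULES": [f"{name}.spiders"],
        "NEWSPIDER_MODULE": f"{name}.spiders",
        **PRESET,
        **overrides,
    }
    return "\n".join(f"{key} = {value!r}" for key, value in merged.items()) + "\n"


def generate_project(root, name, overrides=None):
    """Create <root>/<name> with a Scrapy-style layout; return the settings path."""
    project = Path(root) / name
    package = project / name
    (package / "spiders").mkdir(parents=True, exist_ok=True)
    (package / "__init__.py").touch()
    (package / "spiders" / "__init__.py").touch()
    # scrapy.cfg tells Scrapy which settings module the project uses.
    (project / "scrapy.cfg").write_text(f"[settings]\ndefault = {name}.settings\n")
    settings = package / "settings.py"
    settings.write_text(render_settings(name, overrides or {}))
    return settings


def configure_all(root, overrides):
    """Rewrite settings.py for every project folder under root with the same overrides."""
    projects = sorted(p for p in Path(root).iterdir() if p.is_dir())
    return [generate_project(root, p.name, overrides) for p in projects]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        generate_project(root, "news")
        generate_project(root, "blogs")
        for path in configure_all(root, {"DOWNLOAD_DELAY": 2.5}):
            print(path.relative_to(root))
```

Treating each project as a subfolder of one root means a single loop can apply the same configuration to all of them, which is one plausible reading of the folder-based multi-project management described above. Note that `configure_all` regenerates `settings.py` from the preset, so any manual edits to that file would be overwritten in this sketch.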