Abstract: Retrieving information from a target website whose content changes dynamically is difficult, and it requires search engines to use a highly efficient web crawler. In this study, we combined two web crawlers, Selenium with parallel computing capabilities and Scrapy, to gather electron-molecule collision cross-section data from the National Fusion Research Institute (NFRI) database. The method effectively combines static and dynamic web crawling. The primary challenges are that dynamic web crawling with Selenium is time-consuming and that Scrapy offers only limited support for parallel computing within the "downloader middleware". Nevertheless, the combined approach is well suited to extracting data from an online database that comprises multiple web pages whose URLs do not change when specific keywords are submitted. We applied natural language processing techniques to identify species and to split reaction formulas into their various states. With these methods, we extracted a total of 76,893 data points pertaining to 112 species. These data, all collected within ten minutes, offer detailed insight into the processes taking place within the plasma. Compared with traditional web crawling methods, our approach is roughly 100 times faster than dynamic web crawlers and more flexible than static web crawlers. In this report, we present the retrieved results, including reaction formulas, reference information, species metadata, and a time comparison among the various methods.
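The combination described in the abstract can be illustrated with a minimal sketch: a pool of workers runs the slow dynamic step (one keyword submission per worker), and a lightweight static parser then extracts the table cells from each rendered page. This is an illustration under stated assumptions, not the authors' implementation. The function `render_page`, the `CellCollector` class, and the table layout are hypothetical; in a real crawler `render_page` would drive a browser such as Selenium, and the static step could be handled by Scrapy.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


def render_page(keyword: str) -> str:
    """Hypothetical stand-in for the dynamic step.

    A real crawler would drive a browser here, submit `keyword` to the
    search form, and return the rendered HTML. This stub fabricates a
    small table so that the sketch runs offline.
    """
    rows = "".join(
        f"<tr><td>{keyword}</td><td>reaction {i}</td></tr>" for i in range(3)
    )
    return f"<table>{rows}</table>"


class CellCollector(HTMLParser):
    """Static step: collect the text of every <td> cell, grouped by row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:  # tolerate a cell that appears outside a row
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def parse_rows(html: str) -> list[list[str]]:
    """Parse one rendered page into a list of rows of cell strings."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


def crawl(keywords, max_workers: int = 4) -> dict[str, list[list[str]]]:
    """Render all keyword pages in parallel, then parse each one statically."""
    keywords = list(keywords)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input keywords.
        pages = list(pool.map(render_page, keywords))
    return {kw: parse_rows(page) for kw, page in zip(keywords, pages)}


if __name__ == "__main__":
    for species, rows in crawl(["H2", "N2"]).items():
        print(species, len(rows), "rows")
```

Threads are used here because the dynamic step mostly waits on the browser and the network. A process pool is a drop-in alternative if each worker needs its own isolated browser instance.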


A novel combining method of dynamic and static web crawler with parallel computing
