Abstract: Distributed crawling is a mainstream technology for collecting text data and is essential for mining the vast amount of data available on the Internet. Internet information is clustered by keyword correlation, and users query search engines with relevant keywords to retrieve the information they care about. Distributed crawlers mine Internet information by simulating this user behavior: the more important a keyword is to users, the higher the correlation between the collected data and that keyword. To preferentially collect the information users care about with minimal resource consumption, we design a scheduling framework and propose a novel hunger-based scheduling strategy for distributed crawlers. We first define the load capacity of a distributed crawler as its hunger, which reflects its ability to complete tasks, and divide the keyword queues into sub-queues according to the hunger of the crawlers. We then use the vector space model and cosine similarity to learn the correlation between keywords and text data, and apply an optimized logistic algorithm to measure the importance of keywords. Meanwhile, we design a comprehensive evaluation algorithm that quantifies the contribution of each keyword, and we use it to update the order of the sub-queues. Finally, the reordered sub-queues are used in deeper scheduling to preferentially retrieve the data users want while consuming the fewest resources. Experimental results demonstrate that our method optimizes the scheduling procedure and makes crawling more efficient, with a shorter run time.
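Two of the steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formula for hunger, so the sketch assumes hunger is a non-negative number per crawler and splits the queue in proportion to it, and it uses plain term-frequency vectors for the vector space model. The names `term_vector`, `cosine_similarity`, and `split_by_hunger` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def split_by_hunger(keywords, hunger):
    """Split an ordered keyword queue into one contiguous sub-queue per crawler.

    Assumption (not from the paper): sub-queue sizes are proportional to
    each crawler's hunger value, which must be non-negative.
    """
    total = sum(hunger.values())
    if total <= 0:
        raise ValueError("at least one crawler must have positive hunger")
    sub_queues = {}
    start = 0
    cumulative = 0.0
    for crawler, h in hunger.items():
        cumulative += h
        # Cumulative rounding keeps the sub-queues contiguous and exhaustive.
        end = round(len(keywords) * (cumulative / total))
        sub_queues[crawler] = keywords[start:end]
        start = end
    return sub_queues


if __name__ == "__main__":
    doc = term_vector("distributed web crawler scheduling for web data")
    keywords = ["web crawler", "scheduling", "cooking recipes"]
    # Order keywords by their similarity to the document, most similar first.
    ranked = sorted(
        keywords,
        key=lambda k: cosine_similarity(term_vector(k), doc),
        reverse=True,
    )
    print(ranked)
    print(split_by_hunger(ranked, {"crawler_a": 2, "crawler_b": 1}))
```

Here the hungrier crawler receives the larger, higher-ranked share of the queue. The paper's importance measure (the optimized logistic algorithm) and its comprehensive evaluation algorithm are not specified in the abstract, so they are left out of the sketch.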

BUbiNG: Massive Crawling for the Masses

Analysis of a Statistical Hypothesis Based Learning Mechanism for Faster crawling

Implementation of large-scale distributed information retrieval system

Design and Research of Web Crawler Based on Distributed Architecture

Crowd Crawling

An Automatic and Scalable Application Crawler for Large-Scale Mobile Internet Content Retrieval

Effective performance of information retrieval on web by using web crawling

Architectural Design and Evaluation of an Efficient Web-Crawling System

WebParF: A Web partitioning framework for Parallel Crawlers

A Brief History of Web Crawlers

A hunger-based scheduling strategy for distributed crawler

A novel multi-threaded web crawling model

The Implementation of Hadoop-based Crawler System and Graphlite-based PageRank-Calculation In Search Engine

Implementation of Distributed Crawler System Based on Spark for Massive Data Mining

A novel focused crawler based on breadcrumb navigation

Towards A Quality-Oriented Real-Time Web Crawler

LEARNING-based Focused WEB Crawler

Smart Bilingual Focused Crawling of Parallel Documents

An Efficient Adaptive Focused Crawler Based on Ontology Learning