Abstract: Distributed crawling is one of the mainstream technologies for collecting text data and is essential for mining the vast amount of data available on the Internet. Internet information is clustered by keyword correlation, and users query search engines with relevant keywords to retrieve the information they care about. Distributed crawlers mine Internet information by simulating this user behavior: the more important a keyword is to users, the higher the correlation of the collected data to that keyword. To collect the information users care about first while consuming minimal resources, we design a scheduling framework and propose a novel hunger-based scheduling strategy for distributed crawlers. We first define the load capacity of a distributed crawler as its hunger, which reflects its ability to complete tasks, and divide the keyword queue into sub-queues according to the hunger of each crawler. We then use the vector space model and cosine similarity to learn the correlation between keywords and text data, and apply an optimized logistic algorithm to measure the importance of keywords. We also design a comprehensive evaluation algorithm that quantifies the contribution of each keyword and is used to update the order of the sub-queues. Finally, the reordered sub-queues are used in deeper scheduling to collect the data users want first at the lowest resource cost. Experimental results demonstrate that our method optimizes the scheduling procedure and makes crawling more efficient, with a shorter run time.
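The abstract does not give the formulas for hunger, the optimized logistic algorithm, or the comprehensive evaluation algorithm, so the following is only a minimal sketch of two of the named steps under stated assumptions: cosine similarity over raw term-frequency vectors (one simple form of the vector space model), and a split of the keyword queue into sub-queues sized in proportion to each crawler's hunger. The function names, the whitespace tokenizer, the proportional split rule, and the treatment of hunger as a positive number are all illustrative assumptions, not the paper's method.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as uncorrelated
    return dot / (norm_a * norm_b)


def keyword_correlation(keyword: str, text: str) -> float:
    """Correlation of a keyword (possibly multi-word) to a text.

    Assumption: lowercase whitespace tokens and raw term counts.
    """
    return cosine_similarity(Counter(keyword.lower().split()),
                             Counter(text.lower().split()))


def split_by_hunger(keywords, hunger):
    """Split an ordered keyword queue into one sub-queue per crawler.

    Assumption: `hunger` maps crawler name -> positive number, and each
    crawler receives a contiguous share proportional to its hunger.
    """
    total = sum(hunger.values())
    crawlers = list(hunger)
    sub_queues = {}
    start = 0
    for i, name in enumerate(crawlers):
        if i == len(crawlers) - 1:
            end = len(keywords)  # last crawler takes the remainder
        else:
            end = start + round(len(keywords) * hunger[name] / total)
        sub_queues[name] = keywords[start:end]
        start = end
    return sub_queues


def reorder_by_correlation(sub_queue, text):
    """Put the keywords most correlated with the collected text first."""
    return sorted(sub_queue,
                  key=lambda k: keyword_correlation(k, text),
                  reverse=True)
```

For example, `split_by_hunger(list("abcdefgh"), {"c1": 3, "c2": 1})` gives crawler `c1` the first six keywords and `c2` the last two. Calling `reorder_by_correlation` on a sub-queue then moves the keywords that best match the already collected text to the front. In the paper, the ordering is driven by a combined evaluation of correlation and importance, not by correlation alone.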

A hunger-based scheduling strategy for distributed crawler