Abstract: In this paper, we design a cognitive crawler that dramatically reduces the cost of crawling a website and extracts useful content from web pages through an unsupervised procedure. The main idea for reducing the crawling cost is to retrieve only recently modified and newly added pages. In practice, however, a traditional crawler cannot judge whether a page has been modified or newly added without crawling the whole site. We propose a method to predict recently modified and newly added pages without doing any actual crawling, and we identify a feasible and stable feature, the "structure pattern", that better indicates the probability that a given page has been modified. We also develop a hybrid clustering method, combining K-means with agglomerative hierarchical clustering, to automatically find all the structure patterns in a given website. Using structure patterns, we develop an unsupervised algorithm to generate a website's templates; with these templates, the crawler can extract useful information from web pages more easily and precisely. We also introduce formulas to predict pages' modification probabilities and crawling time intervals, and we propose three new indicators to evaluate the performance of an incremental crawling algorithm. With the proposed algorithm, page content can be extracted with high performance. The experimental results show that the structure pattern is a useful feature, that the performance of the cognitive crawler is promising, and that the crawler saves a large amount of bandwidth and is suitable for websites of various scales.
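To make the idea of a structure pattern more concrete, the following is a minimal sketch and not the method of the paper. It assumes that a page's structure can be summarized by the set of tag paths in its HTML, groups pages whose sets match exactly (a simple stand-in for the hybrid K-means and agglomerative clustering described above), and estimates a modification probability per pattern as the fraction of past observations in which a page of that pattern had changed. The names `TagPathCollector`, `structure_signature`, and `modification_rates` are hypothetical, and the frequency estimate is an illustrative assumption, not the paper's formula.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    # Void elements have no closing tag, so they are never pushed on the stack.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def structure_signature(html):
    """Assumed stand-in for a structure pattern: the page's set of tag paths."""
    parser = TagPathCollector()
    parser.feed(html)
    return frozenset(parser.paths)


def modification_rates(history):
    """Estimate, per signature, the fraction of observations that were modified.

    history: iterable of (signature, was_modified) pairs from past crawls.
    """
    seen = defaultdict(int)
    changed = defaultdict(int)
    for signature, was_modified in history:
        seen[signature] += 1
        changed[signature] += int(was_modified)
    return {sig: changed[sig] / seen[sig] for sig in seen}


# Two article pages share one structure; the index page has another.
article_a = "<html><body><div><h1>Title</h1><p>text</p></div></body></html>"
article_b = "<html><body><div><h1>Other</h1><p>more</p></div></body></html>"
index_page = "<html><body><ul><li><a href='#'>link</a></li></ul></body></html>"

sig_article = structure_signature(article_a)
sig_index = structure_signature(index_page)

# Hypothetical crawl history: index pages changed on every visit, articles on half.
history = [
    (sig_index, True),
    (sig_index, True),
    (sig_article, False),
    (structure_signature(article_b), True),
]
rates = modification_rates(history)
```

Under these assumptions, a crawler could revisit pages whose pattern has a high estimated rate more often and skip patterns that rarely change, which is the bandwidth saving the abstract refers to.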


A cognitive crawler using structure pattern for incremental crawling and content extraction