Abstract: In this paper, we design a cognitive crawler that dramatically reduces the cost of crawling a website and extracts useful content from web pages in an unsupervised manner. The main idea for reducing the crawling cost is to retrieve only recently modified and newly added pages. In practice, however, a traditional crawler cannot judge whether a page has been modified or newly added without performing a full crawl. We propose a method to predict recently modified and newly added pages without doing any actual crawling, and we identify a feasible and stable feature, the "structure pattern", that better indicates the probability that a given page has been modified. We also develop a hybrid clustering method, combining K-means and agglomerative hierarchical clustering, to automatically find all the structure patterns in a given website. Using structure patterns, we develop an unsupervised algorithm to generate a website's templates; using these templates, the crawler can extract useful information from web pages much more easily and precisely. We also introduce feasible formulas to predict pages' modification probabilities and crawling time intervals. To evaluate the performance of an incremental crawling algorithm, we propose three new indicators. Using the proposed algorithm, we can extract page content with high performance. The experimental results show that the structure pattern is very useful and that the performance of this cognitive crawler is quite promising: it can save a large amount of bandwidth and is suitable for websites of various scales.

Web Evolution and Incremental Crawling

Implementation and Evaluation of Incremental Crawler Based on TianWang Search Engine

The Evolution of Link-Attributes for Pages and Its Implications on Web Crawling

System Model of Incremental Spider for the Chinese Web and Its Implementation

Probabilistically ranking web article quality based on evolution patterns

Towards A Quality-Oriented Real-Time Web Crawler

A cognitive crawler using structure pattern for incremental crawling and content extraction

Incremental Structured Web Database Crawling Via History Versions

Incorporating Site-Level Knowledge For Incremental Crawling Of Web Forums: A List-Wise Strategy

A Thread-wise Strategy for Incremental Crawling of Web Forums

A Sample-Guided Approach to Incremental Structured Web Database Crawling

An Efficient Valid Page Crawling Approach for Websites with Dynamic Scripts

A Brief History of Web Crawlers

Analysis of Statistical Hypothesis based Learning Mechanism for Faster Crawling

Analysis of a Statistical Hypothesis Based Learning Mechanism for Faster crawling

Effective performance of information retrieval on web by using web crawling

LEARNING-based Focused WEB Crawler

An Efficient Adaptive Focused Crawler Based on Ontology Learning

Modeling Updates of Scholarly Webpages Using Archived Data

News Page Discovery Policy for Instant Crawlers