Abstract: In this paper, we design a cognitive crawler that dramatically reduces the cost of crawling a website and extracts useful content from web pages in an unsupervised manner. The main idea for reducing the crawling cost is to retrieve only recently modified and newly added pages. In practice, however, a traditional crawler cannot judge whether a page has been modified or newly added without performing a full crawl. We propose a method to predict which pages have been recently modified or newly added without doing any actual crawling. We also identify a feasible and stable feature, the "structure pattern", that better indicates the probability that a given page has been modified. In addition, we develop a hybrid clustering method that combines K-means and agglomerative hierarchical clustering to automatically find all the structure patterns in a given website. Using structure patterns, we develop an unsupervised algorithm to generate a website's templates; with these templates, the crawler can extract useful information from web pages much more easily and precisely. We also introduce formulas to predict pages' modification probabilities and crawling time intervals. To evaluate the performance of an incremental crawling algorithm, we propose three new indicators. Using the proposed algorithm, we can extract page content with high performance. The experimental results show that the structure pattern is very useful and that the performance of the cognitive crawler is promising: it saves a large amount of bandwidth and is suitable for websites of various scales.
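The idea of predicting modified pages from a structure pattern can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the actual definition of a structure pattern or the prediction formulas. Here the pattern is assumed, purely for illustration, to be the URL path with digit runs replaced by a placeholder, and the modification probability is assumed to be the observed change rate per pattern in earlier crawls. The names `structure_pattern`, `modified_probability_by_pattern`, `select_for_recrawl`, and the `threshold` parameter are all hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def structure_pattern(url):
    """Hypothetical stand-in for a structure pattern (an assumption, not the
    paper's definition): the URL path with each run of digits replaced."""
    path = urlparse(url).path
    return re.sub(r"\d+", "<num>", path)


def modified_probability_by_pattern(history):
    """Estimate a modification probability for each pattern.

    history: iterable of (url, was_modified) pairs from earlier crawls.
    Returns {pattern: fraction of observations in which the page had changed}.
    """
    counts = defaultdict(lambda: [0, 0])  # pattern -> [modified, total]
    for url, was_modified in history:
        entry = counts[structure_pattern(url)]
        if was_modified:
            entry[0] += 1
        entry[1] += 1
    return {pattern: m / t for pattern, (m, t) in counts.items()}


def select_for_recrawl(urls, probabilities, threshold=0.5):
    """Keep URLs whose pattern's estimated probability reaches the threshold.

    URLs with an unseen pattern are treated as probability 1.0 (assumed to be
    newly added pages), so they are always selected.
    """
    return [
        url for url in urls
        if probabilities.get(structure_pattern(url), 1.0) >= threshold
    ]


if __name__ == "__main__":
    history = [
        ("http://example.com/news/1.html", True),
        ("http://example.com/news/2.html", True),
        ("http://example.com/news/3.html", False),
        ("http://example.com/about/team.html", False),
    ]
    probs = modified_probability_by_pattern(history)
    candidates = [
        "http://example.com/news/9.html",
        "http://example.com/about/team.html",
    ]
    print(select_for_recrawl(candidates, probs))
```

In this sketch, pages that share a pattern share one estimate, so a page never crawled before can still be scored from the history of structurally similar pages. That is the property the abstract attributes to structure patterns.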

News Page Discovery Policy for Instant Crawlers

Identify Temporal Websites Based on User Behavior Analysis

Towards A Quality-Oriented Real-Time Web Crawler

Method of Collecting and Analyzing News Pages on Internet

SiteRank-Based Crawling Ordering Strategy for Search Engines

Web Evolution and Incremental Crawling

Efficient World-Wide-Web Information Gathering

Context-aware advertisement recommendation for high-speed social news feeding

A cognitive crawler using structure pattern for incremental crawling and content extraction

Incorporating Site-Level Knowledge For Incremental Crawling Of Web Forums: A List-Wise Strategy

A Predication-Based Approach for Effective Resource Discovery in Topical Web

Research of Vertical Search Engine in News Industry

Workload-Aware Web Crawling and Server Workload Detection

An Efficient Valid Page Crawling Approach for Websites with Dynamic Scripts

Selective Recrawling for Object-Level Vertical Search

A pattern-based selective recrawling approach for object-level vertical search

LEARNING-based Focused WEB Crawler

A Novel Combine Forecasting Method for Predicting News Update Time

Data acquisition strategy for FTP search engine

Modeling Updates of Scholarly Webpages Using Archived Data

Schedule Web Forum Crawling with a Freshness-First Strategy