Abstract: In this paper, we design a cognitive crawler that dramatically reduces the cost of crawling a website and extracts useful content from web pages in an unsupervised manner. The main idea for reducing the crawling cost is to retrieve only recently modified and newly added pages. In practice, however, a traditional crawler cannot judge whether a page has been modified or newly added without crawling the whole site. We propose a method to predict recently modified and newly added pages without any actual crawling, and we identify a feasible and stable feature, the "structure pattern", that better indicates the probability that a given page has been modified. We also develop a hybrid clustering method that combines K-means with agglomerative hierarchical clustering to automatically find all the structure patterns in a given website. Using structure patterns, we develop an unsupervised algorithm to generate a website's templates; using these templates, the crawler can extract useful information from web pages much more easily and precisely. We also introduce formulas to predict pages' modification probabilities and crawling time intervals. To evaluate the performance of an incremental crawling algorithm, we propose three new indicators. Using the proposed algorithm, we can extract page content with high performance. The experimental results illustrate that the structure pattern is very useful and that the performance of this cognitive crawler is promising: it can save a large amount of bandwidth and is suitable for websites of various scales.

HIDDEN WEBPAGE INFORMATION EXTRACTION ALGORITHM USING DOM STATE TRANSFER

Analysis and Implementation of Extraction Algorithm of Web Hierarchy Structure

Extracting Web Content by Exploiting Multi-Category Characteristics

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

Web Information Segmentation Method Based on DOM Structure Tree

Automatic Extraction Of Commodity Attributes On Webpages Based On Hierarchical Structure

The Technology of Extracting Content Information from Web Page Based on DOM Tree

Defense of Hidden Backdoor Technology for Web

DOM-Based Automatic Extraction of Topical Information from Web Pages

A hybrid approach for content extraction with text density and visual importance of DOM nodes

Web Page Content Extraction Based on Multi-feature Fusion

Simplified DOM Trees for Transferable Attribute Extraction from the Web

DOM-based Content Extraction of HTML Documents

Detecting and Monitoring Dynamic Content Blocks of a Web Page by Merging its Historical Versions

FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

Design and Implementation of Domain based Semantic Hidden Web Crawler

A Comparative Study of Hidden Web Crawlers

Optimal Algorithms for Crawling a Hidden Database in the Web

A cognitive crawler using structure pattern for incremental crawling and content extraction

An Efficient Valid Page Crawling Approach for Websites with Dynamic Scripts

Duplicate Web Page Elimination Based on HTML and Extraction of Long Sentence