Abstract: Simplifying the key tasks of search engine users by directly returning structured knowledge that matches their query intents has attracted much attention from both industry and academia. The challenge lies in automatically extracting structured knowledge from noisy and complex websites at web scale. Although various automatic wrapper induction algorithms have been proposed, many of them are too ineffective or inefficient for web-scale applications. In this paper, we propose an unsupervised automatic wrapper induction algorithm, named SKES, to efficiently extract knowledge from semi-structured websites. SKES induces the wrapper in a divide-and-conquer manner: it divides the general wrapper into sub-wrappers that can be learned from data independently, which makes it efficient and easy to implement in parallel. Moreover, by employing techniques such as the tag path representation of web pages, SKES dramatically reduces the number of tags and naturally differentiates their roles. We applied and evaluated the proposed solution on a large number of real websites and compared it with the two most closely related existing methods. The proposed method is much more efficient than the existing methods while providing high extraction accuracy. We have extracted 2.5 million entities and 29 million data fields from over 10 thousand high-traffic websites, which demonstrates the applicability of the method. Furthermore, based on the automatically extracted data, we built a prototype that serves structured knowledge to simplify the key search tasks of end users. The feedback received for the prototype was highly positive.
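
The abstract mentions a tag path representation of web pages without defining it. As a rough illustration only, and not the SKES algorithm itself, the sketch below assumes that a tag path is the sequence of tag names from the document root down to a text node, and groups text values by that path. The names `TagPathCollector`, `tag_paths`, and `VOID_TAGS` are hypothetical and chosen for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: a small set of void elements that never have a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Groups text nodes by their root-to-node tag path (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.stack = []                         # currently open tags
        self.texts_by_path = defaultdict(list)  # tag path -> text values

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts_by_path["/".join(self.stack)].append(text)


def tag_paths(html):
    """Return a dict mapping each tag path to the text found under it."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.texts_by_path)


if __name__ == "__main__":
    page = (
        "<html><body><div><h1>Camera X</h1>"
        "<ul><li>Price: $10</li><li>Weight: 1kg</li></ul>"
        "</div></body></html>"
    )
    for path, texts in tag_paths(page).items():
        print(path, texts)
```

In this toy page the two `li` elements share a single path, so many individual tags collapse into a few distinct paths, and the heading path differs from the list-item path. That is one plausible reading of how such a representation could reduce the number of tags and separate their roles; the paper's actual definition may differ.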

Scalable and Noise Tolerant Web Knowledge Extraction for Search Task Simplification.
