Abstract: Web content extraction obtains the required data embedded in web pages, which usually includes structured records (such as product information) and text content (such as news). Web pages use a large number of HTML tags to organize and present many kinds of information. Because little is known in advance about a page's structure, and because different kinds of information are mixed within a page, it is challenging to guarantee both extraction performance and extraction adaptability. This study proposes a unified web content extraction framework that can be applied in various web environments to extract both structured records and text content. First, we construct a characteristic container that holds the kinds of characteristics related to the extraction objectives, including visual text information, content semantics (instead of HTML tag semantics), and web page structures. Second, these characteristics are integrated into an extraction framework that makes extraction decisions on different web sites. In particular, we put forward two strategies for locating the extraction area by exploiting these characteristics: path aggregation for text content and a hidden Markov model (HMM) for structured records. Comparative experiments on multiple web sites against popular extraction methods, including CETR, CETD and CNBE, show that the proposed method provides better extraction precision and adaptability.
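
The abstract names path aggregation as the strategy for text content but does not spell out the procedure. As a rough illustration only, the sketch below assumes one simple reading of the idea: text length is summed per DOM tag path, and the path carrying the most text is treated as the likely content area. The names `PathTextCounter` and `densest_path`, and the character-count scoring, are hypothetical choices made for this example and are not taken from the paper.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is not visible page content.
SKIP_TAGS = {"script", "style"}


class PathTextCounter(HTMLParser):
    """Sums visible text length per tag path (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []                       # currently open tags
        self.chars_by_path = defaultdict(int)  # "html/body/div/p" -> char count

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not any(t in SKIP_TAGS for t in self.stack):
            self.chars_by_path["/".join(self.stack)] += len(text)


def densest_path(html):
    """Return (path, char_count) for the path with the most text, or None."""
    counter = PathTextCounter()
    counter.feed(html)
    counter.close()
    if not counter.chars_by_path:
        return None
    return max(counter.chars_by_path.items(), key=lambda kv: kv[1])
```

On a page where short navigation links share one path and long paragraphs share another, `densest_path` would pick the paragraph path. A real extractor in the spirit of the abstract would combine this signal with the other characteristics it lists (visual text information, content semantics) rather than relying on text length alone.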