Abstract: Web content extraction is the task of obtaining the required data embedded in web pages, typically structured records (such as product information) and text content (such as news articles). Web pages use a large number of HTML tags to organize and present many kinds of information. Because page structures are rarely known in advance and a single page mixes several kinds of information, it is difficult to guarantee both extraction performance and extraction adaptability. This study proposes a unified web content extraction framework that can be applied in various web environments to extract both structured records and text content. First, we construct a characteristic container that holds the characteristics relevant to the extraction objectives, including visual text information, content semantics (rather than HTML tag semantics), and web page structure, among others. Second, these characteristics are integrated into an extraction framework that makes extraction decisions across different web sites. In particular, we put forward two strategies for locating the extraction area, path aggregation for text content and a hidden Markov model (HMM) for structured records, both of which exploit these characteristics. Comparative experiments on multiple web sites against popular extraction methods, including CETR, CETD and CNBE, show that the proposed method achieves better extraction precision and extraction adaptability.
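
To make the idea of path aggregation for text content more concrete, the following is a minimal sketch and not the method evaluated in this study. It assumes that text nodes sharing the same DOM tag path belong together and that the path carrying the most visible text marks the main content area. It uses text length only and ignores the content semantics and visual characteristics described above. The names `PathTextCollector` and `extract_main_text` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag
SKIP_TAGS = {"script", "style"}                            # non-visible text


class PathTextCollector(HTMLParser):
    """Group visible text by DOM tag path, e.g. 'html/body/div/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []                        # currently open tags
        self.text_by_path = defaultdict(list)  # tag path -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (SKIP_TAGS & set(self.stack)):
            self.text_by_path["/".join(self.stack)].append(text)


def extract_main_text(html):
    """Return the text of the tag path that carries the most characters."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    groups = collector.text_by_path
    if not groups:
        return ""
    best_path = max(groups, key=lambda p: sum(len(t) for t in groups[p]))
    return "\n".join(groups[best_path])


if __name__ == "__main__":
    page = (
        "<html><body><div><a>Home</a><a>News</a></div>"
        "<div><p>First long paragraph of the article body.</p>"
        "<p>Second long paragraph of the article body.</p></div>"
        "</body></html>"
    )
    print(extract_main_text(page))
```

In this sketch the short navigation links aggregate under `html/body/div/a` and the article paragraphs under `html/body/div/p`, so the paragraph path is selected. A fuller implementation would presumably weight each path by additional characteristics rather than by raw text length alone.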

Dom Semantic Expansion-Based Extraction Of Topical Information From Web Pages

DOM-Based Automatic Extraction of Topical Information from Web Pages

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

Web Information Segmentation Method Based on DOM Structure Tree

A Semantic DOM Approach for Webpage Information Extraction

Automatic Extraction Of Commodity Attributes On Webpages Based On Hierarchical Structure

Attributes extraction of Deep Web query interface based on DOM

Adaptive Web Information Extraction Based on DOM Tree

DOM-Based Multi-Factor Web Information Extraction Study

Managing Knowledge on the Web - Extracting Ontology from Html Web

DOM-based Content Extraction of HTML Documents

Snextractor: A Prototype For Extracting Semantic Networks From Web Documents

The Technology of Extracting Content Information from Web Page Based on DOM Tree

Data extraction from web pages based on structural-semantic entropy.

The Web information extraction technology research based on XML description

Combing Node Frequency and Semantic Feature for Webpage Informative Content Extraction

An Adaptive Web Information Extraction Approach Based on Stu-Dom Tree

FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents

Extraction of Relevant Snippets from Web Pages Using Hybrid Features.

Simplified DOM Trees for Transferable Attribute Extraction from the Web

An Algorithm on Web Article Automatic Extraction Based on DOM Structure