Abstract: Web content extraction obtains the required data embedded in web pages, usually structured records, such as product information, and text content, such as news articles. Web pages use a large number of HTML tags to organize and present information. Because the structure of a page is rarely known in advance and a single page mixes many kinds of information, it is difficult to guarantee both extraction performance and extraction adaptability. This study proposes a unified web content extraction framework that can be applied in various web environments to extract both structured records and text content. First, we construct a characteristic container that holds the characteristics relevant to the extraction objectives, including visual text information, content semantics (rather than HTML tag semantics), and web page structure. Second, these characteristics are integrated into an extraction framework that makes extraction decisions across different web sites. In particular, we propose two strategies for locating the extraction area, path aggregation for text content and a hidden Markov model (HMM) for structured records, both of which exploit these characteristics. Comparative experiments on multiple web sites against popular extraction methods, including CETR, CETD and CNBE, show that the proposed method achieves better extraction precision and adaptability.
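The abstract names path aggregation as the strategy for locating text content but does not give its details. As a rough illustration of the general idea only, the following minimal sketch groups visible text by DOM tag path and picks the path that carries the most text. The class and function names (`PathTextCollector`, `extract_main_text`), the skipped-tag sets, and the "longest total text wins" rule are all assumptions made for this example, not the authors' method, which also draws on visual and semantic characteristics.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption for this sketch: text inside these tags is never visible content.
SKIP_TAGS = {"script", "style"}
# Void elements have no closing tag, so they are not pushed on the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathTextCollector(HTMLParser):
    """Groups visible text fragments by the tag path that leads to them."""

    def __init__(self):
        super().__init__()
        self.stack = []                          # currently open tags
        self.text_by_path = defaultdict(list)    # "html/body/div/p" -> [text, ...]

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (SKIP_TAGS & set(self.stack)):
            self.text_by_path["/".join(self.stack)].append(text)


def extract_main_text(html):
    """Return (tag_path, text) for the path holding the most visible text."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    if not collector.text_by_path:
        return "", ""
    best_path = max(
        collector.text_by_path,
        key=lambda path: sum(len(t) for t in collector.text_by_path[path]),
    )
    return best_path, " ".join(collector.text_by_path[best_path])


if __name__ == "__main__":
    page = (
        "<html><body>"
        "<div><ul><li><a>Home</a></li><li><a>News</a></li></ul></div>"
        "<div><p>First paragraph of the article body.</p>"
        "<p>Second paragraph continues the story.</p></div>"
        "</body></html>"
    )
    path, text = extract_main_text(page)
    print(path)
    print(text)
```

Aggregating by path rather than by individual node lets sibling paragraphs that share a template position reinforce each other, while short navigation links fall on a different path and are left out.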

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content
