Abstract: Web content extraction aims to obtain the required data embedded in web pages, which usually includes structured records, such as product information, and text content, such as news. Web pages use a large number of HTML tags to organize and present various kinds of information. Because little is known in advance about the structure of a page, and because different kinds of information are mixed within it, it is challenging to guarantee both extraction performance and extraction adaptability. This study proposes a unified web content extraction framework that can be applied in various web environments to extract both structured records and text content. First, we construct a characteristic container that holds several kinds of characteristics related to the extraction objectives, including visual text information, content semantics (rather than HTML tag semantics), and web page structure. Second, these characteristics are integrated into an extraction framework that makes extraction decisions on different web sites. In particular, we put forward two strategies for locating the extraction area by exploiting these characteristics: path aggregation for text content and a hidden Markov model (HMM) for structured records. Comparative experiments on multiple web sites against popular extraction methods, including CETR, CETD and CNBE, show that the proposed method provides better extraction precision and extraction adaptability.
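The abstract names path aggregation only at a high level, so the following is a minimal illustrative sketch of the general idea rather than the authors' algorithm. It assumes that "path" means the chain of tag names from the document root to a text node, and that the main text lies under the path that accumulates the most characters. The names `PathTextCollector` and `extract_main_text` are hypothetical, and the sketch ignores the visual and semantic characteristics the framework also uses.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is never visible page content.
SKIP_TAGS = {"script", "style"}


class PathTextCollector(HTMLParser):
    """Groups visible text fragments by the tag path that leads to them."""

    def __init__(self):
        super().__init__()
        self.stack = []                        # currently open tags, root first
        self.text_by_path = defaultdict(list)  # tag path -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (set(self.stack) & SKIP_TAGS):
            path = "/" + "/".join(self.stack)
            self.text_by_path[path].append(text)


def extract_main_text(html):
    """Return (tag path, fragments) for the path holding the most characters.

    Assumption: total character count per path is the aggregation score.
    """
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    if not collector.text_by_path:
        return "", []
    best_path = max(
        collector.text_by_path,
        key=lambda p: sum(len(t) for t in collector.text_by_path[p]),
    )
    return best_path, collector.text_by_path[best_path]
```

Calling `extract_main_text(page_source)` on a page whose article paragraphs share one tag path returns that path together with the paragraph texts, while short navigation links grouped under a different path score lower and are left out.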

Using XPath to Discover Informative Content Blocks of Web Pages

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

Extracting Web Content by Exploiting Multi-Category Characteristics

Web Information Segmentation Method Based on DOM Structure Tree

Analysis and Implementation of Extraction Algorithm of Web Hierarchy Structure

Discovering Informative Contents of Web Pages

The Technology of Extracting Content Information from Web Page Based on DOM Tree

A hybrid approach for content extraction with text density and visual importance of DOM nodes

Detecting and Monitoring Dynamic Content Blocks of a Web Page by Merging its Historical Versions

Chinese web page content extraction based on page content analysis

Content Extraction of Web Pages Based on Characteristic Symbols

Effective Blog Pages Extractor for Better UGC Accessing

Navigation Objects Extraction for Better Content Structure Understanding

DOM-based Content Extraction of HTML Documents

Improving XML Search by Generating and Utilizing Informative Result Snippets

DOM-Based Automatic Extraction of Topical Information from Web Pages

A HTML Parser to Improve Chinese Search Engines

An Efficient Valid Page Crawling Approach for Websites with Dynamic Scripts

On-Line Topical Importance Estimation: an Effective Focused Crawling Algorithm Combining Link and Content Analysis

Learning Important Models for Web Page Blocks Based on Layout and Content Analysis

Tag Tree Template for Web Information and Schema Extraction