Abstract: Web content extraction obtains the required data embedded in web pages, typically structured records (such as product information) and text content (such as news). Web pages use a large number of HTML tags to organize and present information. Because the structure of a page is rarely known in advance and different kinds of information are mixed within it, it is difficult to guarantee both extraction performance and extraction adaptability. This study proposes a unified web content extraction framework that can be applied in various web environments to extract both structured records and text content. First, we construct a characteristic container that holds several kinds of characteristics related to the extraction objectives, including visual text information, content semantics (instead of HTML tag semantics), and web page structure. Second, these characteristics are integrated into an extraction framework that makes extraction decisions on different web sites. In particular, we put forward two strategies for locating the extraction area by exploiting these characteristics: path aggregation for text content and a hidden Markov model (HMM) for structured records. Comparative experiments on multiple web sites against popular extraction methods, including CETR, CETD and CNBE, show that the proposed method provides better extraction precision and adaptability.

Content Extraction Algorithm of HTML Pages Based on Optimized Weight

Extracting Web Content by Exploiting Multi-Category Characteristics

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

Automatic Extraction Of Commodity Attributes On Webpages Based On Hierarchical Structure

Relevance-based Content Extraction of HTML Documents

Extraction of Relevant Snippets from Web Pages Using Hybrid Features

Density-Based Clustering Algorithm for Hybrid Coding Detection in Search Engines

An Improved PageRank Algorithm Based on Web Content

Extraction of Content from Web Pages Based on Magnitude of Reduction of Information Quantity

Web content extraction method based on text feature value

Duplicate Web Page Elimination Based on HTML and Extraction of Long Sentence

Research and implementation of FFT-based extraction algorithm of webpage content main body

Extracting Content from Web Pages Using the Sliding Window

An Algorithm on Web Article Automatic Extraction Based on DOM Structure

A hybrid approach for content extraction with text density and visual importance of DOM nodes

Using XPath to Discover Informative Content Blocks of Web Pages

A HTML Parser to Improve Chinese Search Engines

Adaptive Approach for Content Extraction Based on Tag Density

Extracting Novel Features for E-Commerce Page Quality Classification

DOM-based Content Extraction of HTML Documents

Extracting Content from Web Pages Based on RSS