Abstract: Web content extraction obtains the required data embedded in web pages, typically structured records (such as product information) and text content (such as news). Web pages use a large number of HTML tags to organize and present many kinds of information. Because page structures are rarely known in advance and a single page mixes several kinds of information, it is difficult to guarantee both extraction performance and extraction adaptability. This study proposes a unified web content extraction framework that can be applied in various web environments to extract both structured records and text content. First, we construct a characteristic container that holds the characteristics relevant to the extraction objectives, including visual text information, content semantics (rather than HTML tag semantics), and web page structure. Second, we integrate these characteristics into an extraction framework that makes extraction decisions across different web sites. In particular, we put forward two strategies for locating the extraction area by exploiting these characteristics: path aggregation for text content and a hidden Markov model (HMM) for structured records. Comparative experiments on multiple web sites against popular extraction methods, including CETR, CETD, and CNBE, show that the proposed method provides better extraction precision and adaptability.

Study On Method Of Web Content Mining For Non-Xml Documents

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

Web Content Extraction Based on Maximum Continuous Sum of Text Density

Automatic Extraction Of Commodity Attributes On Webpages Based On Hierarchical Structure

Research on Web Mining Technique Facing Electronic Business and Application

Web mining of relations from XML and construct database schema

The Web information extraction technology research based on XML description

Content Extraction of Web Pages Based on Characteristic Symbols

Web Content Extraction & Its Data Management Method

Using XPath to Discover Informative Content Blocks of Web Pages

Application of Web Text Mining in Study Assistance

Feature Matrix Extraction And Classification Of Xml Pages

Chinese web page content extraction based on page content analysis

A Data Mining Approach To Xml Dissemination

Research of Web Information Mining by Using Crawler Techniques

Analysis and Comparison of Web Information Extraction Technologies

Content Extraction Method Combining Web Page Structure and Text Feature

A Statistical Approach for Content Extraction from Web Page

A New Web Mining Data Integration Model Based on XML

Relevance-based Content Extraction of HTML Documents

Web mining: knowledge discovery on the Web