Abstract: Web content extraction aims to obtain the required data embedded in web pages, usually structured records, such as product information, and text content, such as news. Web pages use a large number of HTML tags to organize and present information. Because the structure of a page is rarely known in advance and a single page mixes several kinds of information, it is difficult to guarantee both extraction performance and extraction adaptability. This study proposes a unified web content extraction framework that can be applied in various web environments to extract both structured records and text content. First, we construct a characteristic container that holds the characteristics relevant to the extraction objectives, including visual text information, content semantics (instead of HTML tag semantics), and web page structure, among others. Second, these characteristics are integrated into an extraction framework that makes extraction decisions for different web sites. In particular, we put forward two strategies that exploit these characteristics to locate the extraction area: path aggregation for text content and a hidden Markov model (HMM) for structured records. Comparative experiments on multiple web sites against popular extraction methods, including CETR, CETD, and CNBE, show that the proposed method achieves better extraction precision and extraction adaptability.
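The abstract names path aggregation as the strategy for locating text content but does not spell out the procedure. As a rough illustration only, the sketch below assumes the simplest possible reading: text nodes are grouped by their DOM tag path, and the path carrying the most visible text is taken as the content area. The names `PathTextCollector` and `extract_main_text`, the `SKIP_TAGS` and `VOID_TAGS` sets, and the text-length scoring rule are all assumptions made for this example; they are not the paper's method, which also draws on visual and semantic characteristics.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags whose text is never visible content (an assumption of this sketch).
SKIP_TAGS = {"script", "style"}
# Void elements have no closing tag, so they are kept off the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathTextCollector(HTMLParser):
    """Group visible text fragments by the tag path that leads to them."""

    def __init__(self):
        super().__init__()
        self.stack = []                         # currently open tags
        self.text_by_path = defaultdict(list)   # "html/body/div/p" -> [text, ...]

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP_TAGS.intersection(self.stack):
            self.text_by_path["/".join(self.stack)].append(text)


def extract_main_text(html):
    """Return (tag_path, fragments) for the path holding the most text.

    Hypothetical scoring rule: total character count per tag path.
    """
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    if not collector.text_by_path:
        return "", []
    best_path = max(
        collector.text_by_path,
        key=lambda path: sum(len(t) for t in collector.text_by_path[path]),
    )
    return best_path, collector.text_by_path[best_path]
```

Calling `extract_main_text(page_source)` on a news page would return the dominant tag path together with its text fragments. Scoring by raw text length is the weakest part of this sketch: it ignores link density and page layout, which is where the characteristics listed in the abstract would come in.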

Web information extraction algorithm based on Web page segmentation

Web Information Segmentation Method Based on DOM Structure Tree

Analysis and Implementation of Extraction Algorithm of Web Hierarchy Structure

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

An Algorithm on Web Article Automatic Extraction Based on DOM Structure

Web Page Segmentation and Its Application for Web Information Crawling

Web Content Extraction Using Clustering with Web Structure

Content Extraction Method Combining Web Page Structure and Text Feature

DeSeA: A Page Segmentation Based Algorithm for Information Extraction

A Block Segmentation Based Approach for Web Information Extraction

A Novel Method for the Web Page Segmentation and Identification

Study of Web Information Extraction and Classification Method

Tag Tree Template for Web Information and Schema Extraction

TPS: An unsupervised web page segmentation algorithm based on DOM tree structure mining

Web Pages Information Retrieval Based on Keywords Cluster and Node Instance

A Method of Web Information Extraction Based on Classification Algorithm

Automatic Information Extraction from Semi-structured and Multi-record Web Pages

Chinese web page content extraction based on page content analysis

Web Information Extraction Based on Probabilistic Model

Research on Rapid Information Extraction in Web Content Security

Web Information Extractor Based on Extended Tag Graph