Abstract: In addition to the news content, most web news pages also contain navigation panels, advertisements, related news links, and similar items. These non-news items appear not only outside the news region but also inside the news content region. Effectively extracting the news content and filtering out the noise is important for downstream content management and analysis. Our extensive case studies indicate that web content layouts are related to their tag paths. Based on this observation, we design two tag path features to measure the importance of nodes, the Text to tag Path Ratio (TPR) and the Extended Text to tag Path Ratio (ETPR), and describe how TPR is calculated by traversing the parse tree of a web news page. In this paper, we present Content Extraction via Path Ratios (CEPR), a fast, accurate, and general online method that distinguishes news content from non-news content using the TPR/ETPR histogram. To improve the ability of CEPR to extract short texts, we propose a Gaussian smoothing method weighted by a tag path edit distance. This approach raises the importance of internal-link nodes while ignoring noise nodes that occur within the news content. Experimental results on the CleanEval datasets and on web news pages randomly selected from well-known websites show that CEPR can extract news content across multiple sources, styles, and languages. The average F and average score of CEPR are 8.69% and 14.25% higher, respectively, than those of CETR, which demonstrates better web news extraction performance than most existing methods.
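
The abstract does not give a formula for TPR, so the following is only a minimal sketch under a stated assumption: the ratio for a tag path is taken to be the total number of text characters found under that path divided by the number of text nodes that share it. The names `TagPathCollector` and `text_to_path_ratios` are hypothetical, and the sketch covers neither ETPR nor the Gaussian smoothing step.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Text inside these elements is not visible page content.
SKIP_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Records (tag_path, text_length) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.nodes = []   # one (tag_path, text_length) pair per text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; this tolerates
        # unclosed inner tags and ignores stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (SKIP_TAGS & set(self.stack)):
            self.nodes.append((".".join(self.stack), len(text)))


def text_to_path_ratios(html):
    """Assumed TPR: characters under a tag path / text nodes with that path."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    chars = defaultdict(int)
    count = defaultdict(int)
    for path, length in collector.nodes:
        chars[path] += length
        count[path] += 1
    return {path: chars[path] / count[path] for path in chars}
```

Under this assumption, paths that carry long paragraphs receive a high ratio, while paths repeated across many short navigation links receive a low one, which is the kind of contrast a histogram-based threshold could exploit.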

Web News Extraction Based on Path Pattern Mining

Extracting Web News Using Tag Path Patterns

Web News Extraction Via Tag Path Feature Fusion Using DS Theory

Web news extraction via path ratios.

A Novel Approach To Automatically Extracting Main Content of Web News

Automatic Elements Extraction of Chinese Web News Using Prior Information of Content and Structure

Web Content Extraction Based on Maximum Continuous Sum of Text Density.

A generic Web news extraction approach

Learning to Extract Web News Title in Template Independent Way

A Template Independent Approach for Web News and Blog Content Extraction

Web Information Extraction Based on Similar Patterns

Extracting Various Types of Informative Web Content Via Fuzzy Sequential Pattern Mining.

An efficient method for extracting web news content

Title-Based Extraction of News Contents for Text Mining.

Hybrid method for automated news content extraction from the web

Can We Learn a Template-Independent Wrapper for News Article Extraction from a Single Training Site?

Web Information Extraction Based on News Domain Ontology Theory

Chinese Web News Source Extraction Algorithm Based On Rules And Region Recognition

Design and Implementation of a Web News Extraction System.

A method of Web news extraction based on decision tree

Research on Model of Network Information Extraction Based on Improved Topic-Focused Web Crawler Key Technology