Chinese web page content extraction based on page content analysis

Bin Zhou, Chanjuan Wang, Qi Su
2009-01-01
Journal of Computational Information Systems
Abstract: The content of a web page is the textual information related to the topic of the page, and it is the focus of web data mining and information retrieval. For Chinese web pages, the page content is the target of word segmentation and indexing for search engines, and of corpus collection (news, reviews, blogs, etc.) for knowledge management research. Extracting page content correctly and efficiently improves the accuracy of subsequent analysis, because it significantly reduces the noise in the pages, and it also lightens the workload of indexing and segmentation. In this paper we propose a method that divides a web page into blocks by tag and then selects paragraphs within those blocks as page content by content analysis. We also provide some criteria for evaluating web page content. Because it is based on an analysis of page features, the approach can extract content from web pages effectively. Experiments show good results compared with related work. 1553-9105/ Copyright © 2009 Binary Information Press.
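The two-step idea in the abstract (split the page into blocks by tag, then pick the content blocks by analyzing their text) can be illustrated with a minimal sketch. The abstract does not state the paper's actual criteria, so everything specific here is an assumption made for illustration: the set of block tags, the three heuristics (text length, link density, count of Chinese punctuation marks), the threshold values, and the names `BlockSplitter` and `extract_content`. It should not be read as the authors' algorithm.

```python
from html.parser import HTMLParser

# Assumed for illustration: tags that start or end a block, and tags to ignore.
BLOCK_TAGS = {"div", "p", "td", "li", "tr", "table", "ul", "section", "article"}
SKIP_TAGS = {"script", "style"}
# Assumed signal: running Chinese prose tends to contain sentence punctuation.
PUNCTUATION = set("，。！？；：、")


class BlockSplitter(HTMLParser):
    """Collects the text of each tag-delimited block plus its link-text length."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars)
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        text = "".join(self._text).strip()
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_content(html, min_len=20, max_link_density=0.3, min_punct=1):
    """Return the text of blocks that pass the assumed content heuristics."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()  # keep any trailing text outside a block tag
    content = []
    for text, link_chars in splitter.blocks:
        punct = sum(1 for ch in text if ch in PUNCTUATION)
        link_density = link_chars / len(text)
        if len(text) >= min_len and link_density <= max_link_density and punct >= min_punct:
            content.append(text)
    return content


SAMPLE = (
    '<html><body>'
    '<div><a href="/">首页</a> <a href="/news">新闻</a> <a href="/blog">博客</a></div>'
    '<p>网页正文是与页面主题相关的文本信息，它是网络数据挖掘和信息检索的重点。</p>'
    '<div>版权所有</div><script>var x = 1;</script>'
    '</body></html>'
)
```

Calling `extract_content(SAMPLE)` keeps only the paragraph of running prose: the navigation block is rejected for its high link density, and the short footer block for its length and lack of punctuation. Real pages would need the thresholds tuned, and nested block tags are handled only crudely, since any block tag boundary closes the current block.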