Noise Elimination Method in Web Pages Based on the Similarity of Same Layer Pages

YUAN Mingxuan, ZHANG Xuanping, JIANG Yu, ZHAO Zhongmeng
DOI: https://doi.org/10.3969/j.issn.1000-3428.2006.23.022
2006-01-01
Abstract: The content of a typical Web page can be divided into two categories: valuable segments and noise segments. The first step of information retrieval on the Web is to eliminate the noise segments, or blocks. This paper studies the properties of Web pages and finds that pages with a common URL prefix always have similar presentation styles and noise segments. Building on vision-based page segmentation (VIPS), it proposes an approximate sub-tree matching algorithm that can be used to eliminate noise segments in a Web page. The implemented algorithm achieves 95% accuracy in identifying noise blocks.
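The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the matching rule, so the `Block` tree, the `url_prefix` grouping, the recursive `similarity` score, and the `0.9` threshold are all assumptions chosen for illustration, and a real system would obtain the block trees from a VIPS segmentation rather than build them by hand.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from urllib.parse import urlsplit


@dataclass
class Block:
    """Hypothetical stand-in for a segmented page block (not the VIPS data model)."""
    tag: str
    text: str = ""
    children: list["Block"] = field(default_factory=list)


def url_prefix(url: str) -> str:
    """Assumed notion of 'same layer': host plus the path up to the last '/'."""
    parts = urlsplit(url)
    return parts.netloc + parts.path.rsplit("/", 1)[0]


def similarity(a: Block, b: Block) -> float:
    """Approximate sub-tree match score in [0, 1] (an assumed scoring rule)."""
    if a.tag != b.tag:
        return 0.0
    if not a.children and not b.children:
        # Leaf blocks: compare their text content.
        return SequenceMatcher(None, a.text, b.text).ratio()
    # Inner blocks: pair each child of `a` with its best match among b's children,
    # then normalise by the larger child count so extra children lower the score.
    total = sum(
        max((similarity(ca, cb) for cb in b.children), default=0.0)
        for ca in a.children
    )
    return total / max(len(a.children), len(b.children))


def noise_blocks(page: list[Block], siblings: list[list[Block]],
                 threshold: float = 0.9) -> list[Block]:
    """Flag blocks of `page` that approximately recur in a same-layer sibling page."""
    return [
        block for block in page
        if any(similarity(block, other) >= threshold
               for sibling in siblings for other in sibling)
    ]
```

Under these assumptions, a navigation bar repeated across pages sharing a URL prefix scores close to 1 and is flagged, while the article body, whose text differs from page to page, falls below the threshold and is kept.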