Approach of Eliminating Noise Based on Framework of Web Pages and Rules
SHI Da-ming, LIN Hong-fei, YANG Zhi-hao
DOI: https://doi.org/10.3969/j.issn.1000-3428.2007.19.098
2007-01-01
Abstract: This paper presents an approach to eliminating noise from Web pages based on page framework and rules. The approach first divides a page into several parts according to the HTML table tags it contains, then compares the ratio of the width and height attributes of each table and deletes the parts with the larger ratio. In the remaining tables, topic content and noise content are distinguished by the paragraph-related tags p and br, and the noise content is eliminated on that basis. Experiments on a set of 132,559 Web pages from CWT200G show that the approach eliminates noise content effectively and decreases the size of the index files to about 75%. As a result, information retrieval becomes faster and retrieval accuracy improves.
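The two-stage filter described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how "bigger ratio" is decided, so the sketch assumes a table is noise when width divided by height exceeds a hypothetical cutoff `RATIO_THRESHOLD`, and it assumes a remaining table counts as topic content when it contains at least one p or br tag. The names `TableCollector`, `aspect_ratio`, and `extract_topic_text` are invented for this example.

```python
from html.parser import HTMLParser

RATIO_THRESHOLD = 4.0  # hypothetical cutoff; the abstract gives no value


class TableCollector(HTMLParser):
    """Collect each <table> with its width/height, text, and p/br count."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tables, innermost last
        self.tables = []  # finished tables, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            a = dict(attrs)
            self.stack.append({
                "width": a.get("width"),
                "height": a.get("height"),
                "text": [],
                "para_tags": 0,
            })
        elif tag in ("p", "br") and self.stack:
            # Paragraph-related tags are credited to the innermost table.
            self.stack[-1]["para_tags"] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            self.tables.append(self.stack.pop())

    def handle_data(self, data):
        if self.stack and data.strip():
            self.stack[-1]["text"].append(data.strip())


def aspect_ratio(table):
    """Return width/height, or None if either attribute is missing or non-numeric."""
    try:
        width = float(table["width"])
        height = float(table["height"])
    except (TypeError, ValueError):
        return None
    return width / height if height > 0 else None


def extract_topic_text(html):
    """Return the text of tables that survive both filtering stages."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for table in parser.tables:
        ratio = aspect_ratio(table)
        if ratio is not None and ratio > RATIO_THRESHOLD:
            continue  # assumed noise: wide, flat table such as a banner
        if table["para_tags"] == 0:
            continue  # assumed noise: no paragraph-related tags
        kept.append(" ".join(table["text"]))
    return kept


sample_html = (
    '<table width="760" height="60"><tr><td>Home | News | Sports</td></tr></table>'
    '<table width="600" height="800"><tr><td>'
    "<p>First paragraph.</p><p>Second paragraph.</p>"
    "</td></tr></table>"
    '<table width="200" height="400"><tr><td>Ad link</td></tr></table>'
)
topic_text = extract_topic_text(sample_html)
```

In this sample, the first table is dropped by the ratio test, the third is dropped because it has no p or br tags, and only the text of the second table is kept. Tables whose width or height is missing or given as a percentage skip the ratio test in this sketch and are judged by the paragraph-tag rule alone; how the paper handles such cases is not stated in the abstract.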