A HTML Parser to Improve Chinese Search Engines

Song Ruihua (宋睿华), Ma Shaoping (马少平), Chen Gang (陈刚), Li Jingyang (李景阳)
DOI: https://doi.org/10.3969/j.issn.1003-0077.2003.04.003
2003-01-01
Abstract: When using a search engine, people often find many irrelevant or only peripherally relevant items in the result list. Most of these items are retrieved because of words on a web page that are unrelated to the page's topic. Removing such items with traditional keyword methods is costly or even impossible. In this paper, we define the concept of noise in web pages and propose a novel approach that cleans noise from web pages in the pre-processing stage. We build a novel model of Chinese web pages and four simple rules to discard noise from HTML files. Experimental results show that all the indirect items that appear in the results of site grouping are removed correctly, and that our approach removes about 11% of the irrelevant or indirect items that commercial Chinese search engines cannot exclude.