Topic-independent web high-quality page selection based on k-means clustering

Canhui Wang, Yiqun Liu, Min Zhang, Shaoping Ma
DOI: https://doi.org/10.1007/11562382_43
2005-01-01
Abstract: One challenge for web search engines is to assess the quality of web pages independently of any given user request. High-quality web pages give readers suitable entry points to more concentrated, relevant information on the web. This paper focuses on topic-independent selection of high-quality web pages in order to reduce redundant web information and remove noise. Different non-content features and their effects on high-quality page selection are studied. K-means clustering on these features is then performed to separate high-quality pages from ordinary ones. Experiments were conducted on the 19 GB (document size) TREC web data set (.GOV data). With the proposed approach, fewer than 50% of the web pages are selected as high-quality ones, yet they cover about 90% of the key information in the whole set. Information retrieval on this high-quality page set achieves an improvement of more than 40% compared with retrieval on the whole data collection.
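
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the non-content features, so the three used here (in-link count, URL depth, document length), the sample pages, and the rule that the cluster with the larger mean in-link value is treated as "high quality" are all assumptions made for illustration. The code is a plain Lloyd's-style k-means with k = 2 and a deterministic start.

```python
from math import dist

# Hypothetical pages: (in-link count, URL depth, document length in KB).
# These features and values are illustrative assumptions, not the paper's data.
pages = {
    "a": (120, 1, 40),
    "b": (2, 5, 3),
    "c": (95, 2, 35),
    "d": (1, 6, 2),
    "e": (110, 1, 50),
    "f": (3, 4, 4),
}

def normalize(rows):
    """Min-max scale each feature column to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi))
        for row in rows
    ]

def kmeans(points, k, max_iters=100):
    """Lloyd's k-means; the first k points serve as initial centroids."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

names = list(pages)
points = normalize([pages[n] for n in names])
labels, centroids = kmeans(points, k=2)

# Assumption: the cluster whose centroid has the larger in-link value
# (feature index 0) is the one labelled "high quality".
best = max(range(2), key=lambda c: centroids[c][0])
high_quality = sorted(n for n, lab in zip(names, labels) if lab == best)
print(high_quality)
```

In practice the choice of features, the scaling, and the rule for deciding which cluster counts as high quality would all need to follow the paper's own feature study rather than the placeholders used here.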