Web Key Resource Page Selection Based on Non-Content Information

LIU Yi-qun, ZHANG Min, MA Shao-ping
DOI: https://doi.org/10.3969/j.issn.1673-4785.2007.01.006
2007-01-01
Abstract: Information growth makes it impossible for search engines to crawl and index every page on the Web. Meanwhile, the indexed page set is filled with low-quality information and spam. Selecting high-quality Web pages (key resource pages) query-independently is therefore a considerable challenge. Based on an analysis of the non-content features of key resources, a pre-selection method was introduced for topic distillation research. A decision tree was constructed to locate key resource pages using query-independent non-content features, including in-degree, document length, URL type, and two newly proposed features based on analysis of a site's self-link structure. Although the resulting page set contained only about 20% of the pages in the whole collection, it covered more than 70% of the key resources. Furthermore, information retrieval on this page set achieved an improvement of more than 60% over retrieval on all pages. This shows an effective way to obtain better topic distillation performance with a smaller data set.
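The pre-selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' decision tree: the `Page` fields, the URL-type labels ("root", "subroot", "file"), and the thresholds in `is_candidate` are all hypothetical, and the two self-link structure features are not modeled because the abstract does not define them. The sketch only shows how query-independent rules select a subset of pages, and how the two quantities reported in the abstract (share of the collection kept, share of key resources covered) would be computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    in_degree: int    # number of incoming links
    doc_length: int   # document length, e.g. in words
    url_type: str     # hypothetical labels: "root", "subroot", "file"


def is_candidate(page, min_in_degree=10, min_length=1000):
    """Toy hand-written rules standing in for a learned decision tree.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if page.url_type in ("root", "subroot"):
        return True
    if page.in_degree >= min_in_degree:
        return page.doc_length >= min_length
    return False


def selection_stats(pages, key_urls):
    """Return (share of collection selected, share of key resources covered)."""
    selected = {p.url for p in pages if is_candidate(p)}
    size_ratio = len(selected) / len(pages)
    coverage = len(selected & key_urls) / len(key_urls)
    return size_ratio, coverage
```

In practice the rules would be learned from labeled key resource pages rather than written by hand; the evaluation in `selection_stats` stays the same either way.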