Abstract: Understanding what kinds of Web pages are the most useful for Web search engine users is a critical task in Web information retrieval (IR). Most previous work has used hyperlink analysis algorithms to address this problem. However, little research has focused on query-independent Web data cleansing for Web IR. In this paper, we first analyze the differences between retrieval target pages and ordinary ones, based on more than 30 million Web pages obtained from both the Text Retrieval Conference (TREC) and a widely used Chinese search engine, SOGOU (www.sogou.com). We further propose a learning-based data cleansing algorithm for filtering out Web pages that are unlikely to be useful for user requests. We found that a large proportion of the Web pages in both the English and the Chinese corpora are of low quality, and that retrieval target pages can be identified using query-independent features and cleansing algorithms. The experimental results showed that our algorithm is effective in removing a large portion of Web pages with only a small loss in retrieval target pages. This makes it possible for Web IR tools to meet a large fraction of users' needs with only a small part of the pages on the Web. These results may help Web search engines make better use of their limited storage and computation resources to improve search performance. © 2007 Wiley Periodicals, Inc.

Web Key Resource Page Selection Based on Non-Content Information

Effective Topic Distillation with Key Resource Pre-Selection

Web data cleansing for information retrieval using key resource page selection.

Web Key Resource Page Judgment Based on Improved Decision Tree Algorithm

Topic-independent web high-quality page selection based on k-means clustering

Web Data Cleansing for Effective Information Retrieval

An Ontology-based Approach to Topic-specific Web Resource Discovery

Research on Indexing Page Collection Selection Method for Search Engine.

A LDA Topic Model Based Collection Selection Method for Distributed Information Retrieval

Learning-based Web Data Cleansing for Information Retrieval

A Predication-Based Approach for Effective Resource Discovery in Topical Web

Topic Distillation Via Sub-Site Retrieval

THU TREC2002 Web Track Experiments

Subsite Retrieval: A Novel Concept for Topic Distillation.

LCA-Based Keyword Search for Effectively Retrieving "Information Unit" from Web Pages

On-Line Selection Of Distinguishing Elements For Focused Information Retrieval

An Improved PageRank Algorithm Based on Web Content

C4-2: Combining Link and Contents in Clustering Web Search Results to Improve Information Interpretation

Website Crawling for Specific Topics

Data Cleansing for Web Information Retrieval Using Query Independent Features

Discovering Informative Contents of Web Pages.