Abstract: This year the IR group of Tsinghua University used its TMiner text retrieval system for indexing and retrieval in the ad hoc and named-page subtasks of the Terabyte track. For both tasks, we built the indices from the in-link anchor texts (the anchor texts of the URLs in the collection that point to the current page) together with the content text of the web pages. At retrieval time, we used the word-pair method (1), which proved effective on the 2004 and 2005 Terabyte ad hoc topics and on the 2005 named-page task. We further analyze the performance of the word-pair method in comparison with the Markov random field term dependence model of (2) and with another generative phrase model that we proposed, which fits more naturally into the language modeling framework (3).

1. TMiner at Terabyte 2005

On a PC with 2GB of memory, one CPU, and IDE hard disks, TMiner can index 50GB of text (about 200GB of HTML files) in a tolerable amount of time. However, the terabyte collection contains about 100GB of pure text (110GB including anchor texts), so building a single index for such a large collection would take TMiner too long. In our experiments we therefore built 27 indices, one for each of the 27 parts of the collection. At retrieval time, we summed the DF values of the query terms over all indices and assigned the BM2500 RSV to each document in the collection according to the summed DF. This distributed index system returns exactly the same RSV as a single index built over the whole collection would, at the expense of additional query processing time.

In the ad hoc and named-page tasks, we used the index that combines in-link anchor text with page content. In our observation this is the most effective way of using anchor text for retrieval, and we did not build indices without in-link anchor text for comparison. In addition, since the indices we built contain full position information for the index terms, the word-pair method (1) was used in both tasks.
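The distributed scoring described above (summing per-index DF values before computing the RSV) can be illustrated with a minimal sketch. It is not TMiner's implementation: the `Shard` class, the function names, the parameter values `K1` and `B`, and the particular BM25 formula are all assumptions made for illustration, and the sketch also assumes that the document count and average document length are aggregated across indices in the same way as DF.

```python
import math
from collections import Counter

# Hypothetical BM25 parameters; the text does not state the values used.
K1 = 1.2
B = 0.75


class Shard:
    """One partial index holding term counts and lengths for its own documents."""

    def __init__(self, docs):
        # docs: mapping of doc_id -> list of tokens
        self.tf = {d: Counter(tokens) for d, tokens in docs.items()}
        self.length = {d: len(tokens) for d, tokens in docs.items()}

    def df(self, term):
        # Number of documents in this shard that contain the term.
        return sum(1 for counts in self.tf.values() if term in counts)


def global_stats(shards, query):
    """Aggregate collection statistics over all shards."""
    n_docs = sum(len(s.tf) for s in shards)
    total_len = sum(sum(s.length.values()) for s in shards)
    # The key step: the DF of each query term is the sum of the per-shard DFs.
    df = {t: sum(s.df(t) for s in shards) for t in set(query)}
    return n_docs, total_len / n_docs, df


def bm25(counts, doc_len, query, n_docs, avg_len, df):
    """A common BM25 form, used here as a stand-in for the BM2500 RSV."""
    score = 0.0
    for t in set(query):
        f = counts.get(t, 0)
        if f == 0:
            continue
        idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * f * (K1 + 1) / (f + K1 * (1 - B + B * doc_len / avg_len))
    return score


def search(shards, query):
    """Score every document in every shard using the global statistics."""
    n_docs, avg_len, df = global_stats(shards, query)
    results = {}
    for s in shards:
        for d, counts in s.tf.items():
            score = bm25(counts, s.length[d], query, n_docs, avg_len, df)
            if score > 0:
                results[d] = score
    return sorted(results.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because every shard scores its documents with the same global statistics, `search` returns the same scores whether the documents sit in one `Shard` or are split across several. The cost is the extra pass that collects DF values from every shard before scoring, which corresponds to the additional query processing time mentioned above.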

THU TREC2002 Web Track Experiments

THU TREC 2002: Novelty Track Experiments.

TREC-10 Web Track Experiments at MSRA.

At the web track of TREC 2003

Web Key Resource Page Selection Based on Non-Content Information

Web Key Resource Page Judgment Based on Improved Decision Tree Algorithm

THUIR at TREC2007: Enterprise track 1

THUIR at TREC 2009 Web Track: Finding Relevant and Diverse Results for Large Scale Web Search

Expansion-Based Technologies in Finding Relevant and New Information: THU TREC2002 Novelty Track Experiments

Effective Topic Distillation with Key Resource Pre-Selection

THUIR at TREC 2003: Novelty, Robust and Web.

THUIR at TREC 2008: Relevance Feedback Track

Topic-independent web high-quality page selection based on k-means clustering

Towards A Quality-Oriented Real-Time Web Crawler

THUIR at TREC2008: Enterprise track

THUIR at TREC 2005 Terabyte Track.

A Predication-Based Approach for Effective Resource Discovery in Topical Web

Finding "abstract Fields" of Web Pages and Query Specific Retrieval - THUIR at TREC 2004 Web Track

On-Line Selection Of Distinguishing Elements For Focused Information Retrieval

Study on Link-based Approaches for Web IR in TREC Experiments