Abstract: This year the IR group of Tsinghua University used its TMiner text retrieval system for indexing and retrieval in the ad hoc and named-page subtasks of the Terabyte track. For both tasks, we built the indices from the in-link anchor texts (the anchor texts of the URLs in the collection that point to the current page) together with the content text of the web pages. At retrieval time we used the word-pair method (1), which proved effective on the 2004 and 2005 Terabyte ad hoc topics and on the 2005 named-page task. We provide further analysis of the performance of the word-pair method in comparison with the Markov random field term dependence model of (2) and with another generative phrase model that we proposed, which fits more naturally into the language modeling framework (3).

1. TMiner at Terabyte 2005

On a PC with 2GB of memory, one CPU and IDE hard disks, TMiner could index 50GB of text (about 200GB of HTML files) in tolerable time. However, the terabyte collection contains about 100GB of pure text (110GB including anchor texts), so building a single index for the whole collection would take TMiner too long. In our experiments we therefore built 27 indices, one for each of the 27 parts of the collection. At retrieval time, we summed the DF values of the query terms over all indices and assigned BM2500 RSVs to the documents in the collection according to the DF sums. This distributed index system returns exactly the same RSVs as a single index built over the whole collection would, at the expense of additional query processing time.

In the ad hoc and named-page tasks, we used the index that combines in-link anchor text with page content. In our observation this is the most effective way of using anchor text for retrieval, and we did not build indices without in-link anchor text for comparison. In addition to using anchor text, we applied the word-pair method (1) in both tasks, since the indices we built contain full position information for the index terms.
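The idea of summing DF values across the per-part indices can be illustrated with a minimal sketch. This is not TMiner's code: the `ShardIndex` class, the function names, and the parameters `k1` and `b` are hypothetical, and a common Okapi BM25-style formula stands in for the BM2500 weighting, whose exact form is not given here. The sketch also assumes that the collection size and average document length are aggregated across indices in the same way as the DF values, and that document identifiers are unique across parts.

```python
import math
from collections import Counter


class ShardIndex:
    """Toy in-memory index for one part of the collection (hypothetical)."""

    def __init__(self, docs):
        # docs: dict mapping doc_id -> list of tokens
        self.tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
        self.doc_len = {doc_id: len(tokens) for doc_id, tokens in docs.items()}
        # Local document frequency: number of documents in this part containing each term.
        self.df = Counter()
        for counts in self.tf.values():
            self.df.update(counts.keys())


def global_stats(shards, query_terms):
    """Sum per-part statistics so that scores match a single global index."""
    n_docs = sum(len(s.tf) for s in shards)
    total_len = sum(sum(s.doc_len.values()) for s in shards)
    df = {t: sum(s.df.get(t, 0) for s in shards) for t in query_terms}
    return n_docs, total_len / n_docs, df


def bm25_scores(shard, query_terms, n_docs, avg_len, df, k1=1.2, b=0.75):
    """Score the documents of one part using the global statistics.

    Assumption: a standard BM25-style formula; k1 and b are illustrative values.
    """
    scores = {}
    for doc_id, counts in shard.tf.items():
        score = 0.0
        for t in query_terms:
            tf = counts.get(t, 0)
            if tf == 0 or df[t] == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf + k1 * (1 - b + b * shard.doc_len[doc_id] / avg_len)
            score += idf * tf * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return scores


def search(shards, query_terms):
    """Query every part with shared global statistics and merge the rankings."""
    n_docs, avg_len, df = global_stats(shards, query_terms)
    merged = {}
    for s in shards:
        merged.update(bm25_scores(s, query_terms, n_docs, avg_len, df))
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

Because every part is scored with the same summed statistics, calling `search` on several `ShardIndex` objects gives each document the same score as calling it on one `ShardIndex` built over all documents. The cost is an extra pass to collect the statistics before scoring.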

THUIR at TREC 2009 Web Track: Finding Relevant and Diverse Results for Large Scale Web Search

THUIR at TREC 2003: Novelty, Robust and Web.

THUIR at TREC 2008: Relevance Feedback Track

Finding "abstract Fields" of Web Pages and Query Specific Retrieval - THUIR at TREC 2004 Web Track

Improved Feature Selection and Redundance Computing - THUIR at TREC 2004 Novelty Track.

THUIR at TREC 2004: Genomics Track.

THUIR at TREC2008: Enterprise track

THUIR at TREC 2005 Terabyte Track.

THUIR at TREC2008: Blog Track

THUIR at NTCIR-9 INTENT Task.

THUIR at TREC 2004: QA.

THUIR at TREC 2005: Enterprise Track.

THUIR at TREC 2007: Enterprise Track

THUIR at TREC2007: Enterprise track

THUIR at NTCIR-10 INTENT-2 Task

THUIR at TREC 2003: HARD Experiments.

THUIR at the NTCIR-14 WWW-2 Task.

THUIR at NTCIR-13 WWW Task.

THUIR at NTCIR-12 IMine Task.

At the web track of TREC 2003

Learning Domain-Specific Knowledge from Context--THUIR at TREC 2005 Genomics Track.