Document Routing as Statistical Classification

Jan O. Pedersen
Abstract: In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques whose decision rules are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks.

Of the two classical information retrieval tasks [1], document routing is the more amenable to machine learning. A fixed, standing query and a training collection of judged documents [2] are provided, and the task is to assess the relevance of a fresh set of test documents. This can clearly be approached as a problem of statistical text classification: documents are to be assigned to one of two categories, relevant or non-relevant, and inference is possible from the labeled documents. In contrast, the classical ad-hoc search problem presumes that only a query and an unlabelled collection are provided.

The standard approach to document routing models document content as a bag of words, represented as a sparse, very high-dimensional vector with one component for each unique term in the vocabulary (Salton, Wong, & Yang 1975). Vector weights are proportional to term frequency and inversely proportional to collection frequency [3]. The general technique is to score test documents with respect to their closeness to the query (also represented as a sparse, high-dimensional vector), where closeness is measured by the cosine between vectors.

Authors listed in alphabetic order.
[1] As defined and evaluated by the TREC conferences (Harman 1994; 1995).
[2] Actually, only a few documents are explicitly labeled, including most of the relevant documents and a few of the irrelevant documents. All other documents are implicitly assumed to be irrelevant.
[3] The exact expression varies across systems, but is typ…
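The bag-of-words weighting and cosine scoring described above can be illustrated with a minimal sketch. It is not the weighting used by any particular system: the formula tf * log(N / df) is an assumed stand-in for "proportional to term frequency and inversely proportional to collection frequency", and the function names (`idf_table`, `vectorize`, `cosine`, `rank`) and the toy documents are hypothetical.

```python
import math
from collections import Counter


def idf_table(token_lists):
    """Map each term to log(N / df), where df counts documents containing it."""
    n_docs = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def vectorize(tokens, idf):
    """Sparse vector (dict term -> weight); terms unseen in the collection are dropped."""
    tf = Counter(tokens)
    # Assumed weighting: term frequency times inverse document frequency.
    return {term: tf[term] * idf[term] for term in tf if term in idf}


def cosine(u, v):
    """Cosine between two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs):
    """Document indices in decreasing order of cosine closeness to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: -scores[i])


# Toy collection (hypothetical data, already tokenized).
docs = [
    ["oil", "price", "rise"],
    ["oil", "spill", "coast"],
    ["football", "match", "score"],
]
idf = idf_table(docs)
doc_vecs = [vectorize(d, idf) for d in docs]
query_vec = vectorize(["oil", "price"], idf)
ranking = rank(query_vec, doc_vecs)  # [0, 1, 2] for this toy data
```

Storing each vector as a dictionary keeps only the non-zero components, which matches the sparse, high-dimensional representation the text describes.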
A modified and expanded query is learned from the training set via Rocchio-expansion Relevance Feedback (Buckley, Salton, & Allan 1994), which essentially constructs a linear combination of the query vector, the centroid of the relevant documents and, occasionally, the centroid of selected irrelevant documents [4]. The net result is a scored list of test documents, which may be ranked in decreasing order of score for presentation and evaluation. Evaluation typically proceeds by averaging precision [5] at a number of recall [6] thresholds.

Rocchio-expansion Relevance Feedback employs a weak learning method. However, the application of stronger methods faces two problems: the …
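The Rocchio expansion and the evaluation procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the configuration of Buckley, Salton, & Allan: the coefficients `alpha`, `beta`, and `gamma`, the recall thresholds, and all function names are hypothetical, and precision is taken at the first rank where each recall threshold is reached (no interpolation), which is one of several conventions.

```python
def centroid(vectors):
    """Mean of sparse vectors (dict term -> weight); an empty input gives {}."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio(query, relevant, irrelevant=(), alpha=1.0, beta=1.0, gamma=0.5):
    """Linear combination of the query and the two centroids.

    alpha, beta, gamma are assumed values chosen for illustration only.
    """
    parts = (
        (alpha, query),
        (beta, centroid(relevant)),
        (-gamma, centroid(irrelevant)),  # irrelevant centroid is subtracted
    )
    expanded = {}
    for coeff, vec in parts:
        for term, weight in vec.items():
            expanded[term] = expanded.get(term, 0.0) + coeff * weight
    return expanded


def mean_precision_at_recall(ranked_labels, thresholds=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Average the precision observed when each recall threshold is first reached.

    ranked_labels holds booleans in ranked order (True = relevant document).
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0.0
    precisions = []
    for threshold in thresholds:
        hits = 0
        for position, is_relevant in enumerate(ranked_labels, start=1):
            hits += is_relevant
            if hits / total_relevant >= threshold:
                precisions.append(hits / position)  # precision at this rank
                break
    return sum(precisions) / len(thresholds)


# Toy training data (hypothetical sparse vectors).
query = {"oil": 1.0}
relevant_docs = [{"oil": 1.0, "price": 2.0}, {"price": 2.0, "opec": 2.0}]
irrelevant_docs = [{"spill": 2.0}]
expanded_query = rocchio(query, relevant_docs, irrelevant_docs)
```

The expanded query gains weight on terms such as "price" and "opec" that never appeared in the original query, so test documents can score well without sharing any term with it. Test documents would then be scored against `expanded_query` by the same cosine measure used for the original query.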