Abstract: Relevance is generally understood as a multi-level and multi-dimensional relationship between an information need and an information object. However, traditional IR evaluation metrics naively assume mono-dimensionality. We ask: How to deal with multidimensional and graded relevance assessments in IR evaluation? Moreover, search result evaluation metrics neglect document overlaps and naively assume gains piling up as the searcher examines the ranked list into greater length. Consequently, we examine: How to deal with document overlap in IR evaluation? The usability of a document for a person-in-need also depends on document usability attributes beyond relevance. Therefore, we ask: How to deal with usability attributes, and how to combine this with multidimensional relevance assessments in IR evaluation? Finally, we ask how to define a formal model, which deals with multidimensional graded relevance assessments, document overlaps, and document usability attributes in a coherent framework serving IR evaluation?
What problem does this paper attempt to address?
This paper aims to improve evaluation methods in information retrieval (IR). Traditional metrics such as Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG) focus narrowly on the topical relevance between documents and search topics, and treat that relationship as single-dimensional and often binary. They also ignore overlap in document content and naively assume that gain keeps accumulating as the searcher examines the ranked list at greater length. In addition, these methods leave the searcher in the background and do not model the searcher's role.
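To make the single-dimension assumption concrete, here is a minimal sketch of the standard DCG and NDCG computation over one graded relevance label per document. The function names and the sample labels are illustrative, and normalizing against a reordering of the same list (rather than the full recall base) is a simplification.

```python
import math


def dcg(gains):
    """Discounted cumulated gain: the gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the same gains sorted in descending order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical graded topical relevance labels (0-3) for a ranked list.
ranked_gains = [3, 2, 3, 0, 1]
score = ndcg(ranked_gains)

# A verbatim duplicate of the top document still adds (discounted) gain,
# because the metric sees only one relevance number per rank position.
with_duplicate = dcg([3, 3])
without_duplicate = dcg([3])
```

Each document contributes a single number, so nothing in this computation can express several relevance dimensions, content overlap between documents, or usability attributes.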
To address these problems, the paper proposes a new test collection and search-result evaluation metric based on multi-dimensional, non-binary relevance assessments. The metric explicitly models document overlap and accounts for factors that affect document usability, not just relevance. Specifically, the paper addresses the following key questions:
1. **How to handle multi-dimensional and graded relevance assessments**: Traditional evaluation methods usually consider only a single relevance dimension, and in most cases treat it as binary (relevant or not relevant). The proposed method assesses the multi-dimensional relevance of documents to a search task in context, where each dimension may contain one or more content topics, as well as zero or more document usability attributes.
2. **How to handle overlap in document content**: Traditional evaluation methods often ignore overlapping content, so their results do not reflect the searcher's actual experience. The proposed method adjusts the evaluation by estimating how much documents overlap on different topics, so that the score better reflects the actual value of each document.
3. **How to handle document usability attributes**: The usability of a document depends not only on the relevance of its content to the search task, but also on other factors, such as the document's readability, credibility, and language. The proposed method incorporates these attributes into the evaluation framework, making the results more comprehensive and practical.
4. **How to define a formal model that handles multi-dimensional graded relevance assessments, document overlap, and document usability attributes within a coherent framework**: The paper's new evaluation metric, Multi-Dimensional Cumulated Utility (MDCU), considers all of these factors within a single framework, with the aim of producing more accurate and comprehensive evaluation results.
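The four points above can be illustrated with a toy calculation. The sketch below is not the paper's MDCU definition, which this summary does not give. It is a hypothetical scheme built on three stated assumptions: per-dimension grades lie in [0, 1], a single `usability` factor in [0, 1] scales each document, and overlap is modeled by crediting only the part of a grade that exceeds what earlier documents already covered in that dimension. All names (`cumulated_utility`, `dim_weights`, `grades`, `usability`) are invented for this example.

```python
def cumulated_utility(ranked_docs, dim_weights):
    """Return the running utility after each rank (toy model, not MDCU).

    ranked_docs: list of dicts with "grades" (dimension -> grade in [0, 1])
                 and "usability" (factor in [0, 1]); both are assumptions.
    dim_weights: dimension -> weight; assumed to sum to 1.
    """
    covered = {dim: 0.0 for dim in dim_weights}  # best effective grade so far
    total = 0.0
    curve = []
    for doc in ranked_docs:
        for dim, weight in dim_weights.items():
            # Usability scales how much of the graded content is usable.
            effective = doc["grades"].get(dim, 0.0) * doc["usability"]
            # Overlap: only content beyond what is already covered counts.
            total += weight * max(0.0, effective - covered[dim])
            covered[dim] = max(covered[dim], effective)
        curve.append(total)
    return curve


# Hypothetical ranked list with two relevance dimensions.
docs = [
    {"grades": {"topicA": 1.0, "topicB": 0.0}, "usability": 1.0},
    {"grades": {"topicA": 1.0, "topicB": 0.0}, "usability": 1.0},  # duplicate
    {"grades": {"topicA": 0.5, "topicB": 1.0}, "usability": 0.5},  # hard to read
]
weights = {"topicA": 0.5, "topicB": 0.5}
curve = cumulated_utility(docs, weights)
```

In this toy model the duplicate at rank 2 adds nothing, and the third document contributes only through the dimension it newly covers, scaled down by its low usability. A real metric would need principled choices for the overlap estimate and for how usability attributes are combined.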
In summary, the paper's main objective is to improve existing IR evaluation methods so that they better match actual search scenarios and yield more accurate and practical evaluation results.