An Exam-based Evaluation Approach Beyond Traditional Relevance Judgments

Naghmeh Farzi, Laura Dietz
2024-02-01
Abstract: Current IR evaluation is based on relevance judgments, created either manually or automatically, with decisions outsourced to Large Language Models (LLMs). We offer an alternative paradigm that never relies on relevance judgments in any form. Instead, a text is defined as relevant if it contains information that enables the answering of key questions. We use this idea to design the EXAM Answerability Metric to evaluate information retrieval/generation systems for their ability to provide topically relevant information.
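The core idea of the abstract, that a text counts as relevant when it lets key questions be answered, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `answerability_score` and `keyword_can_answer`, the (question, expected answer) pair format, and the substring check standing in for a real answer-verification component are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: a key question paired with an expected answer string.
Question = Tuple[str, str]


def keyword_can_answer(question: Question, passage: str) -> bool:
    """Toy stand-in for answer verification (an assumption, not the paper's method):
    the question counts as answerable if the expected answer appears in the passage."""
    _, expected_answer = question
    return expected_answer.lower() in passage.lower()


def answerability_score(
    passages: List[str],
    questions: List[Question],
    can_answer: Callable[[Question, str], bool],
) -> float:
    """Fraction of key questions answerable from at least one system passage."""
    if not questions:
        return 0.0
    answered = sum(
        1 for q in questions if any(can_answer(q, p) for p in passages)
    )
    return answered / len(questions)


# Illustrative data only; not taken from the paper.
questions = [
    ("What gas do plants absorb?", "carbon dioxide"),
    ("What pigment captures light?", "chlorophyll"),
    ("What sugar is produced?", "glucose"),
]
system_a = [
    "Plants absorb carbon dioxide through stomata.",
    "Chlorophyll captures light energy.",
]
system_b = ["Plants are green."]

score_a = answerability_score(system_a, questions, keyword_can_answer)
score_b = answerability_score(system_b, questions, keyword_can_answer)
```

Here `system_a` covers two of the three key questions and `system_b` covers none, so the first system scores higher without any passage-level relevance judgment being made.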
Information Retrieval