A Workbench for Autograding Retrieve/Generate Systems

Laura Dietz
2024-05-22
Abstract: This resource paper addresses the challenge of evaluating Information Retrieval (IR) systems in the era of autoregressive Large Language Models (LLMs). Traditional methods relying on passage-level judgments are no longer effective due to the diversity of responses generated by LLM-based systems. We provide a workbench to explore several alternative evaluation approaches to judge the relevance of a system's response that incorporate LLMs: 1. Asking an LLM whether the response is relevant; 2. Asking the LLM which set of nuggets (i.e., relevant key facts) is covered in the response; 3. Asking the LLM to answer a set of exam questions with the response.
Information Retrieval
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper addresses the challenge of evaluating Information Retrieval (IR) systems in the era of autoregressive Large Language Models (LLMs). Traditional evaluation methods rely on passage-level relevance judgments, but this approach is no longer effective when LLM-based systems generate diverse responses that differ slightly each time.

Specifically, the paper provides a workbench to explore several alternative, LLM-based methods for judging the relevance of a system's response:

1. **Ask the LLM whether the response is relevant**: The LLM judges whether the response is relevant to the query.
2. **Ask the LLM which nuggets are covered**: The LLM identifies which nuggets (i.e., relevant key facts) the response covers.
3. **Ask the LLM to answer exam questions with the response**: The LLM answers a set of exam questions using the response.

The goal of this workbench is to facilitate the development of new, reusable test collections. Researchers can manually refine the nugget sets and exam questions and observe the impact on system evaluation and leaderboard rankings. In this way, the paper attempts to provide a new evaluation paradigm for the challenges that current LLMs bring to IR evaluation.
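The three approaches above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workbench's actual code: the `LLM` callable, the function names, the prompt wording, and the yes/no and substring-matching checks are all hypothetical choices made for this example.

```python
from typing import Callable, Dict, List

# Hypothetical interface (an assumption): any function that maps a prompt
# string to the model's text reply can be plugged in here.
LLM = Callable[[str], str]


def _says_yes(reply: str) -> bool:
    """Interpret a free-text reply as yes/no (simplistic, for illustration)."""
    return reply.strip().lower().startswith("yes")


def grade_direct(llm: LLM, query: str, response: str) -> bool:
    """Approach 1: ask the LLM whether the response is relevant to the query."""
    prompt = (
        f"Query: {query}\nResponse: {response}\n"
        "Is the response relevant to the query? Answer yes or no."
    )
    return _says_yes(llm(prompt))


def grade_nuggets(llm: LLM, nuggets: List[str], response: str) -> List[str]:
    """Approach 2: return the nuggets (key facts) the LLM finds in the response."""
    covered = []
    for nugget in nuggets:
        prompt = (
            f"Key fact: {nugget}\nResponse: {response}\n"
            "Does the response mention this key fact? Answer yes or no."
        )
        if _says_yes(llm(prompt)):
            covered.append(nugget)
    return covered


def grade_exam(llm: LLM, exam: Dict[str, str], response: str) -> List[str]:
    """Approach 3: the LLM answers each exam question using only the response.

    `exam` maps each question to a gold answer. Checking the reply by
    substring match against the gold answer is an assumption of this sketch.
    """
    answered = []
    for question, gold in exam.items():
        prompt = (
            f"Question: {question}\nPassage: {response}\n"
            "Answer the question using only the passage."
        )
        if gold.lower() in llm(prompt).lower():
            answered.append(question)
    return answered


def coverage(hits: List[str], total: int) -> float:
    """Fraction of nuggets or exam questions addressed by a response."""
    return len(hits) / total if total else 0.0
```

Because nuggets and exam questions are plain lists and dictionaries here, a researcher could edit them by hand and re-run the grading to see how the coverage scores, and hence a leaderboard built from them, change.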