Abstract: Information retrieval systems increasingly incorporate generative components. For example, in a retrieval-augmented generation (RAG) system, a retrieval component might provide a source of ground truth, while a generative component summarizes and augments its responses. In other systems, a large language model (LLM) might directly generate responses without consulting a retrieval component. While there are multiple definitions of generative information retrieval (Gen-IR) systems, in this paper we focus on those systems where the system's response is not drawn from a fixed collection of documents or passages. The response to a query may be entirely new text. Since traditional IR evaluation methods break down under this model, we explore various methods that extend traditional offline evaluation approaches to the Gen-IR context. Offline IR evaluation traditionally employs paid human assessors, but LLMs are increasingly replacing human assessment, producing labels similar or superior in quality to crowdsourced labels. Given that Gen-IR systems do not generate responses from a fixed set, we assume that methods for Gen-IR evaluation must largely depend on LLM-generated labels. Along with methods based on binary and graded relevance, we explore methods based on explicit subtopics, pairwise preferences, and embeddings. We first validate these methods against human assessments on several TREC Deep Learning Track tasks; we then apply these methods to evaluate the output of several purely generative systems. For each method we consider both its ability to act autonomously, without the need for human labels or other input, and its ability to support human auditing. To trust these methods, we must be assured that their results align with human assessments. That assurance requires transparent evaluation criteria, so that outcomes can be audited by human assessors.