Abstract: Information retrieval systems increasingly incorporate generative components. For example, in a retrieval augmented generation (RAG) system, a retrieval component might provide a source of ground truth, while a generative component summarizes and augments its responses. In other systems, a large language model (LLM) might directly generate responses without consulting a retrieval component. While there are multiple definitions of generative information retrieval (Gen-IR) systems, in this paper we focus on those systems where the system's response is not drawn from a fixed collection of documents or passages. The response to a query may be entirely new text. Since traditional IR evaluation methods break down under this model, we explore various methods that extend traditional offline evaluation approaches to the Gen-IR context. Offline IR evaluation traditionally employs paid human assessors, but LLMs are increasingly replacing human assessment, producing labels of similar or superior quality to crowdsourced labels. Given that Gen-IR systems do not generate responses from a fixed set, we assume that methods for Gen-IR evaluation must largely depend on LLM-generated labels. Along with methods based on binary and graded relevance, we explore methods based on explicit subtopics, pairwise preferences, and embeddings. We first validate these methods against human assessments on several TREC Deep Learning Track tasks; we then apply these methods to evaluate the output of several purely generative systems. For each method we consider both its ability to act autonomously, without the need for human labels or other input, and its ability to support human auditing. To trust these methods, we must be assured that their results align with human assessments. For that assurance to be possible, evaluation criteria must be transparent, so that outcomes can be audited by human assessors.
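As an illustration of what validating LLM-generated labels against human assessments could look like in the binary-relevance case, the following is a minimal sketch that computes Cohen's kappa between two label lists. The abstract does not say which agreement measure the paper uses, so the choice of kappa, the function name `cohen_kappa`, and the toy labels are all assumptions made for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label lists.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both sources.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement under independence, from each source's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both sources used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (invented): 1 = relevant, 0 = not relevant.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
llm_labels = [1, 0, 1, 0, 0, 0, 1, 1]
print(cohen_kappa(human_labels, llm_labels))
```

A kappa near 1 would indicate that the LLM labels track the human labels closely, while a value near 0 would indicate agreement no better than chance. Graded relevance, pairwise preferences, and embedding-based methods would need different comparison measures.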

A Workbench for Autograding Retrieve/Generate Systems

Towards LLM-based Autograding for Short Textual Answers

Generative Information Retrieval Evaluation

A Comparison of Methods for Evaluating Generative IR

An Exam-based Evaluation Approach Beyond Traditional Relevance Judgments

Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need

Evaluating the Retrieval Component in LLM-Based Question Answering Systems

Can LLMs Grade Short-Answer Reading Comprehension Questions : An Empirical Study with a Novel Dataset

When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively

Grade Like a Human: Rethinking Automated Assessment with Large Language Models

ResearchArena: Benchmarking LLMs' Ability to Collect and Organize Information as Research Agents

Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation

A Survey on LLM-as-a-Judge

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text

How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?

Synthetic Test Collections for Retrieval Evaluation

PEDANTS: Cheap but Effective and Interpretable Answer Equivalence

LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation

LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help?

Best in Tau@LLMJudge: Criteria-Based Relevance Evaluation with Llama3

ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models