Adapting Standard Retrieval Benchmarks to Evaluate Generated Answers

Negar Arabzadeh, Amin Bigdeli, Charles L. A. Clarke

2024-01-10

Abstract: Large language models can now directly generate answers to many factual questions without referencing external sources. Unfortunately, relatively little attention has been paid to methods for evaluating the quality and correctness of these answers, for comparing the performance of one model to another, or for comparing one prompt to another. In addition, the quality of generated answers is rarely directly compared to the quality of retrieved answers. As models evolve and prompts are modified, we have no systematic way to measure improvements without resorting to expensive human judgments. To address this problem, we adapt standard retrieval benchmarks to evaluate answers generated by large language models. Inspired by the BERTScore metric for summarization, we explore two approaches. In the first, we base our evaluation on the benchmark relevance judgments, and we examine empirically how information retrieval relevance judgments can serve as an anchor for evaluating generated answers. In the second, we compare generated answers to the top results retrieved by a diverse set of retrieval models, ranging from traditional approaches to advanced methods, allowing us to measure improvements without human judgments. In both cases, we measure the similarity between an embedded representation of the generated answer and an embedded representation of a known, or assumed, relevant passage from the retrieval benchmark.

Information Retrieval

What problem does this paper attempt to address?

This paper addresses how to evaluate the quality and correctness of answers that large language models (LLMs) generate directly, without referencing external sources. Effective methods are currently lacking for comparing one model or prompt against another, and for comparing the quality of generated answers with that of retrieved answers. Inspired by the BERTScore metric for summarization, the researchers propose two ways to adapt standard retrieval benchmarks to evaluate LLM-generated answers. First, they use the benchmark's relevance judgments as anchors and show empirically how these judgments can be used to evaluate generated answers. Second, they compare generated answers with the top results from a range of traditional and neural retrieval models, which allows improvements to be quantified without human judgments. In both cases, the score is the similarity between an embedded representation of the generated answer and an embedded representation of a known, or assumed, relevant passage. They run experiments on the MS MARCO, TREC Deep Learning 2019, and TREC Deep Learning 2020 datasets and report that the approach is effective. The paper points out that although traditional metrics may not fully capture the subtle differences between LLMs, comparing generated answers with retrieval results helps reveal the strengths and weaknesses of both generation and retrieval models, and so helps identify areas for improvement. The experimental results show that improvements in LLM-generated answers can be measured systematically using information retrieval benchmarks, without relying on expensive human judgments.
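The scoring idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The `embed` function is a toy bag-of-words stand-in for whatever neural encoder is actually used, taking the maximum similarity over the reference passages is an assumed aggregation choice, and all passages and answers are made-up examples.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a neural encoder (assumption): term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_answer(answer, reference_passages):
    """Score a generated answer by its best match among reference passages.

    Using the maximum is an assumption made for this sketch; a mean or
    another aggregate would fit the same description.
    """
    answer_vec = embed(answer)
    return max(
        (cosine(answer_vec, embed(p)) for p in reference_passages),
        default=0.0,
    )


# Approach 1: references are passages judged relevant in the benchmark.
judged_relevant = ["the capital of france is paris"]

# Approach 2: references are the top results of some retrieval model,
# which are only assumed to be relevant (no human judgments needed).
top_retrieved = [
    "paris is the capital and largest city of france",
    "france is in europe",
]

good_answer = "paris is the capital of france"
bad_answer = "bananas are rich in potassium"

for name, refs in [("judged", judged_relevant), ("retrieved", top_retrieved)]:
    print(name, score_answer(good_answer, refs), score_answer(bad_answer, refs))
```

Both approaches share the same scoring function; only the source of the reference passages differs. Swapping `embed` for a real sentence encoder would change the scores but not the structure of the comparison.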
