Abstract: Retrieval-augmented generation (RAG) is considered a promising approach to alleviating the hallucination issue of large language models (LLMs), and it has recently received widespread attention from researchers. Because retrieval models are limited in semantic understanding, the success of RAG relies heavily on the ability of LLMs to identify passages with utility. Recent efforts have explored the ability of LLMs to assess the relevance of passages in retrieval, but there has been limited work on evaluating the utility of passages in supporting question answering. In this work, we conduct a comprehensive study of the capabilities of LLMs in utility evaluation for open-domain QA. Specifically, we introduce a benchmarking procedure and a collection of candidate passages with different characteristics, facilitating a series of experiments with five representative LLMs. Our experiments reveal that (i) well-instructed LLMs can distinguish between relevance and utility, and LLMs are highly receptive to newly generated counterfactual passages. Moreover, (ii) we scrutinize key factors that affect utility judgments in the instruction design. (iii) To verify the efficacy of utility judgments in practical retrieval augmentation applications, we delve into LLMs' QA capabilities using the evidence judged with utility and direct dense retrieval results. Finally, (iv) we propose a k-sampling, listwise approach to reduce the dependency of LLMs on the sequence of input passages, thereby facilitating subsequent answer generation. We believe that the way we formalize and study the problem, along with our findings, contributes to a critical assessment of retrieval-augmented LLMs. Our code and benchmark can be found at \url{https://github.com/ict-bigdatalab/utility_judgments}.

Optimizing Science Question Ranking through Model and Retrieval-Augmented Generation

Aggregated Knowledge Model: Enhancing Domain-Specific QA with Fine-Tuned and Retrieval-Augmented Generation Models

Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately

A Retrieval-Augmented Generation Based Large Language Model Benchmarked On a Novel Dataset

Towards Optimizing a Retrieval Augmented Generation using Large Language Model on Academic Data

Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions

RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Advancing Question-Answering in Ophthalmology with Retrieval Augmented Generations (RAG): Benchmarking Open-source and Proprietary Large Language Models

Are Large Language Models Good at Utility Judgments?

Think-then-Act: A Dual-Angle Evaluated Retrieval-Augmented Generation

Enhancing Q&A with Domain-Specific Fine-Tuning and Iterative Reasoning: A Comparative Study

Enhancing Retrieval Processes for Language Generation with Augmented Queries

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering