UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models

Zhaoheng Huang, Zhicheng Dou, Yutao Zhu, Ji-rong Wen
2024-02-23
Abstract: Large language models (LLMs) may generate text that lacks consistency with human knowledge, leading to factual inaccuracies or hallucination. Existing research on evaluating the factuality of LLMs involves extracting fact claims using an LLM and verifying them against a predefined fact source. However, these evaluation metrics are task-specific and not scalable, and the substitutability of fact sources in different tasks is under-explored. To address these challenges, we categorize four available fact sources: human-written evidence, reference documents, search engine results, and LLM knowledge, along with five text generation tasks containing six representative datasets. Then, we propose UFO, an LLM-based unified and flexible evaluation framework to verify facts against plug-and-play fact sources. We implement five evaluation scenarios based on this framework. Experimental results show that for most QA tasks, human-written evidence and reference documents are crucial, and they can substitute for each other in retrieval-augmented QA tasks. In news fact generation tasks, search engine results and LLM knowledge are essential. Our dataset and code are available at https://github.com/WaldenRUC/UFO.
Computation and Language
What problem does this paper attempt to address?
The paper aims to address the issue of factual errors or hallucinations that occur when large language models (LLMs) generate text. Specifically, existing evaluation methods have the following shortcomings when verifying the factual accuracy of text generated by LLMs:

1. **Task Specificity**: Existing evaluation metrics are usually designed for specific tasks and lack generality.
2. **Poor Scalability**: Existing methods are difficult to adapt to the needs of new tasks.
3. **Insufficient Exploration of Fact Source Substitutability**: The interchangeability of fact sources in different tasks has not been fully studied.

To address these issues, the authors propose a unified and flexible framework called UFO to evaluate the factual accuracy of text generated by LLMs. The UFO framework integrates four different types of fact sources and conducts experimental analysis in five different evaluation scenarios. Through these experiments, the authors hope to reveal the importance and interchangeability of different fact sources in various tasks, thereby improving the reliability and generality of evaluation results.
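
The idea of verifying extracted claims against plug-and-play fact sources can be illustrated with a minimal Python sketch. This is not the UFO implementation: every name here (`FactSource`, `extract_claims`, `verify`, `factuality_score`, `make_static_source`) is hypothetical, claim extraction is replaced by naive sentence splitting instead of an LLM, verification is a simple substring match, and scoring as the fraction of supported claims is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical interface: a fact source maps a claim to evidence text,
# or returns None when it has nothing relevant.
FactSource = Callable[[str], Optional[str]]


@dataclass
class Verdict:
    claim: str
    supported: bool
    source: Optional[str]  # name of the source that supported the claim


def make_static_source(corpus: str) -> FactSource:
    """Toy fact source that always returns the same evidence text."""
    return lambda claim: corpus


def extract_claims(text: str) -> List[str]:
    """Stand-in for LLM-based claim extraction: one claim per sentence."""
    return [s.strip() for s in text.split(".") if s.strip()]


def verify(claim: str, sources: Dict[str, FactSource], order: List[str]) -> Verdict:
    """Try each fact source in the given order; stop at the first that supports the claim."""
    for name in order:
        evidence = sources[name](claim)
        # Stand-in for LLM-based verification: case-insensitive substring match.
        if evidence is not None and claim.lower() in evidence.lower():
            return Verdict(claim, True, name)
    return Verdict(claim, False, None)


def factuality_score(text: str, sources: Dict[str, FactSource], order: List[str]) -> float:
    """Fraction of extracted claims supported by some source (assumed scoring rule)."""
    claims = extract_claims(text)
    if not claims:
        return 0.0
    verdicts = [verify(c, sources, order) for c in claims]
    return sum(v.supported for v in verdicts) / len(verdicts)


# Toy data: two interchangeable sources with different coverage.
sources = {
    "reference_documents": make_static_source("Paris is the capital of France."),
    "llm_knowledge": make_static_source("Berlin is the capital of Germany."),
}
generated = (
    "Paris is the capital of France. "
    "Berlin is the capital of Germany. "
    "Rome is the capital of Spain."
)

print(factuality_score(generated, sources, ["reference_documents", "llm_knowledge"]))
print(factuality_score(generated, sources, ["reference_documents"]))
```

Because the sources are passed in as a dictionary and an ordered list of names, swapping or dropping a source only changes the `order` argument, which is one simple way to compare how much each source contributes on a given task.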