FELM: Benchmarking Factuality Evaluation of Large Language Models

Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, Junxian He

2023-11-28

Abstract: Assessing factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators assessing factuality necessitate suitable evaluation themselves to gauge progress and foster advancements. This direction remains under-explored, resulting in substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated from LLMs and annotate factuality labels in a fine-grained manner. Contrary to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM focuses on factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which can help pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict the statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory to faithfully detect factual errors.

Computation and Language

What problem does this paper attempt to address?

This paper addresses how to evaluate factuality evaluators, the tools that judge whether text generated by large language models (LLMs) is factually correct. Although LLMs have achieved remarkable success on many tasks, they tend to generate false information or hallucinate content, which limits their broader application. Factuality evaluators that detect factual errors in LLM responses are therefore urgently needed, both to alert users to potential risks and to guide the development of more reliable LLMs. These evaluators, however, must themselves be evaluated in order to measure progress and drive improvement. Little research exists in this direction, which significantly hinders their development. To alleviate this problem, the paper introduces FELM (Factuality Evaluation of Large Language Models), a benchmark for assessing factuality evaluators. FELM collects LLM-generated responses and annotates them with fine-grained factuality labels. It covers not only world knowledge but also domains such as math and reasoning. Annotations are made at the level of text segments, which helps pinpoint specific factual errors, and they are supplemented by predefined error types and reference links that either support or contradict the statement. The paper also examines how several LLM-based factuality evaluators perform on FELM, including vanilla LLMs and LLMs augmented with retrieval mechanisms and chain-of-thought processes. It finds that while retrieval aids factuality evaluation, current LLMs are still far from satisfactory at faithfully detecting factual errors.
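To make the segment-level setup concrete, the following is a minimal Python sketch of how an evaluator's per-segment judgments could be scored against human annotations. The `Segment` schema, its field names, and the `score_evaluator` helper are hypothetical and are not taken from the FELM release. Treating a factual error as the positive class and reporting precision, recall, F1, and balanced accuracy is likewise an assumption made for illustration; the text above does not state which metrics the paper uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Segment:
    """One annotated text segment of an LLM response (hypothetical schema)."""
    text: str
    is_factual: bool                     # gold label from human annotation
    error_type: Optional[str] = None     # predefined error type, set when not factual
    references: List[str] = field(default_factory=list)  # supporting or contradicting links


def score_evaluator(gold: List[bool], predicted: List[bool]) -> Dict[str, float]:
    """Score per-segment factuality judgments; a factual error is the positive class.

    Both lists hold True for "factual" and False for "contains an error".
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if not g and not p)  # error correctly flagged
    fp = sum(1 for g, p in pairs if g and not p)      # factual segment wrongly flagged
    fn = sum(1 for g, p in pairs if not g and p)      # error missed
    tn = sum(1 for g, p in pairs if g and p)          # factual segment correctly passed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }


if __name__ == "__main__":
    # Made-up segments, used only to exercise the scoring function.
    segments = [
        Segment("Paris is the capital of France.", True),
        Segment("The Eiffel Tower was completed in 1920.", False, "knowledge_error"),
        Segment("2 + 2 = 4.", True),
        Segment("Therefore 17 is an even number.", False, "reasoning_error"),
    ]
    gold = [s.is_factual for s in segments]
    predicted = [True, False, False, True]  # a hypothetical evaluator's output
    print(score_evaluator(gold, predicted))
```

Scoring at the segment level rather than the response level matches the annotation granularity described above: an evaluator gets credit only when it flags the specific segment that contains the error.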

