Abstract: While large language models (LLMs) have proven to be effective on a wide variety of tasks, they are also known to hallucinate information. To measure whether an LLM prefers factually consistent continuations of its input, we propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization. Specifically, our benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factually inconsistent summary for an input news article. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. To generate summaries that are factually inconsistent, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent. A model's factual consistency is then measured according to its accuracy, i.e., the proportion of documents where it assigns a higher score to the factually consistent summary. To validate the usefulness of FIB, we evaluate 23 large language models ranging from 1B to 176B parameters from six different model families including BLOOM and OPT. We find that existing LLMs generally assign a higher score to factually consistent summaries than to factually inconsistent summaries. However, if the factually inconsistent summaries occur verbatim in the document, then LLMs assign a higher score to these factually inconsistent summaries than to factually consistent summaries. We validate design choices in our benchmark including the scoring method and source of distractor summaries. Our code and benchmark data can be found at https://github.com/r-three/fib.
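
The accuracy metric described in the abstract can be illustrated with a minimal sketch. It assumes that each document has already been reduced to one pair of scores (factually consistent summary, factually inconsistent summary); the function name `fib_style_accuracy` and the example scores are hypothetical and are not taken from the benchmark's code. The abstract does not specify the scoring method, so the sketch treats scores as opaque numbers where higher means preferred.

```python
from typing import Sequence, Tuple


def fib_style_accuracy(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Return the fraction of documents where the factually consistent
    summary receives a strictly higher score than the inconsistent one.

    Each element of score_pairs is (consistent_score, inconsistent_score).
    How the scores are produced is left open (assumption: higher = preferred).
    """
    if not score_pairs:
        raise ValueError("need at least one (consistent, inconsistent) score pair")
    wins = sum(
        1 for consistent, inconsistent in score_pairs if consistent > inconsistent
    )
    return wins / len(score_pairs)


# Hypothetical scores for four documents (illustrative values only).
example_pairs = [(-1.2, -1.9), (-0.8, -0.5), (-2.1, -2.4), (-1.0, -1.0)]
accuracy = fib_style_accuracy(example_pairs)
print(accuracy)
```

In this sketch a tie counts against the model, since the abstract defines accuracy in terms of a higher score for the consistent summary; a different tie-breaking rule would be an equally defensible choice.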

Benchmarking Large Language Models for News Summarization

On Learning to Summarize with Large Language Models as References

Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization

Text Summarization Using Large Language Models: A Comparative Study of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI Chat-GPT Models

Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers

Scaling Up Video Summarization Pretraining with Large Language Models

Can Large Language Model Summarizers Adapt to Diverse Scientific Communication Goals?

Can Large Language Models Serve as Evaluators for Code Summarization?

Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method

An End-to-End Speech Summarization Using Large Language Model

Analyzing the Performance of Large Language Models on Code Summarization

Assessing LLMs for Zero-shot Abstractive Summarization Through the Lens of Relevance Paraphrasing

On Context Utilization in Summarization with Large Language Models

UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs

Evaluating the Factual Consistency of Large Language Models Through News Summarization

Prompting and Fine-Tuning of Small LLMs for Length-Controllable Telephone Call Summarization

In-context Learning of Large Language Models for Controlled Dialogue Summarization: A Holistic Benchmark and Empirical Analysis

Summarization is (Almost) Dead

UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches

TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale