Abstract: While large language models (LLMs) have proven to be effective on a wide variety of tasks, they are also known to hallucinate information. To measure whether an LLM prefers factually consistent continuations of its input, we propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization. Specifically, our benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factually inconsistent summary for an input news article. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. To generate summaries that are factually inconsistent, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent. A model's factual consistency is then measured according to its accuracy, i.e., the proportion of documents where it assigns a higher score to the factually consistent summary. To validate the usefulness of FIB, we evaluate 23 large language models ranging from 1B to 176B parameters from six different model families including BLOOM and OPT. We find that existing LLMs generally assign a higher score to factually consistent summaries than to factually inconsistent summaries. However, if the factually inconsistent summaries occur verbatim in the document, then LLMs assign a higher score to these factually inconsistent summaries than to factually consistent summaries. We validate design choices in our benchmark including the scoring method and source of distractor summaries. Our code and benchmark data can be found at https://github.com/r-three/fib.
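The accuracy metric described in the abstract can be illustrated with a minimal sketch. It assumes the per-summary scores have already been computed by some scoring method (the abstract does not specify which), and the function name `binary_choice_accuracy`, the example scores, and the choice to count ties as incorrect are all hypothetical choices made for illustration rather than details of the released benchmark code.

```python
from typing import Iterable, Tuple


def binary_choice_accuracy(score_pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of documents where the consistent summary outscores the inconsistent one.

    Each pair is (score_of_consistent_summary, score_of_inconsistent_summary)
    for a single document. Ties count as incorrect (an assumption of this sketch).
    """
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("need at least one scored document")
    correct = sum(1 for consistent, inconsistent in pairs if consistent > inconsistent)
    return correct / len(pairs)


# Made-up scores for four documents (e.g. log-likelihood-style values).
example_scores = [(-1.2, -2.5), (-0.8, -0.3), (-3.1, -3.4), (-2.0, -2.0)]
accuracy = binary_choice_accuracy(example_scores)
print(accuracy)
```

In this toy input the consistent summary wins on the first and third documents, loses on the second, and ties on the fourth, so half of the documents count as correct.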

Identifying Factual Inconsistencies in Summaries: Grounding LLM Inference via Task Taxonomy

Evaluating Factual Consistency of Summaries with Large Language Models

Evaluating the Factual Consistency of Large Language Models Through News Summarization

Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends

Joint Contrastive Learning for Factual Consistency Evaluation of Cross-Lingual Abstract Summarization

Factual Dialogue Summarization via Learning from Large Language Models

Falsesum: Generating Document-level NLI Examples for Recognizing Factual Inconsistency in Summarization

Improving Factual Consistency of Text Summarization by Adversarially Decoupling Comprehension and Embellishment Abilities of LLMs

CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning

Fine-Grained Natural Language Inference Based Faithfulness Evaluation for Diverse Summarisation Tasks

Factual Consistency Evaluation of Summarisation in the Era of Large Language Models

On Learning to Summarize with Large Language Models as References

Annotating and Modeling Fine-grained Factuality in Summarization

On Improving Summarization Factual Consistency from Natural Language Feedback

TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Improving Faithfulness of Large Language Models in Summarization via Sliding Generation and Self-Consistency

Evaluating the Factuality of Zero-shot Summarizers Across Varied Domains

Learning to Summarize from LLM-generated Feedback

Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

Using Similarity to Evaluate Factual Consistency in Summaries

Calibrating Likelihoods towards Consistency in Summarization Models