Abstract: While large language models (LLMs) have proven to be effective on a wide variety of tasks, they are also known to hallucinate information. To measure whether an LLM prefers factually consistent continuations of its input, we propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization. Specifically, our benchmark involves comparing the scores an LLM assigns to a factually consistent versus a factually inconsistent summary for an input news article. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. To obtain factually inconsistent summaries, we generate summaries from a suite of summarization models and keep those that we have manually annotated as factually inconsistent. A model's factual consistency is then measured according to its accuracy, i.e., the proportion of documents where it assigns a higher score to the factually consistent summary. To validate the usefulness of FIB, we evaluate 23 large language models ranging from 1B to 176B parameters from six different model families including BLOOM and OPT. We find that existing LLMs generally assign a higher score to factually consistent summaries than to factually inconsistent summaries. However, if the factually inconsistent summaries occur verbatim in the document, then LLMs assign a higher score to these factually inconsistent summaries than to factually consistent summaries. We validate design choices in our benchmark including the scoring method and source of distractor summaries. Our code and benchmark data can be found at https://github.com/r-three/fib.
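The accuracy metric described in the abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced to a pair of scores, one for the factually consistent summary and one for the factually inconsistent one. The function name `pairwise_accuracy`, the strict-inequality handling of ties, and the example scores are all hypothetical; the abstract does not specify how scores are computed or how ties are treated.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Return the fraction of documents where the consistent summary wins.

    Each pair is (score_of_consistent_summary, score_of_inconsistent_summary).
    Assumption: a tie counts as a miss, since the abstract only mentions
    assigning a "higher score" to the consistent summary.
    """
    if not score_pairs:
        raise ValueError("need at least one (consistent, inconsistent) score pair")
    wins = sum(
        1 for consistent, inconsistent in score_pairs if consistent > inconsistent
    )
    return wins / len(score_pairs)


# Made-up scores for four documents (for illustration only, not benchmark data).
example_scores = [(-1.2, -1.9), (-0.8, -0.5), (-2.1, -2.4), (-1.0, -1.6)]
accuracy = pairwise_accuracy(example_scores)
print(accuracy)  # 0.75: the consistent summary scores higher in 3 of 4 documents
```

Under this reading, a model that cannot tell the two summaries apart would score about 0.5, because each document contributes a single binary comparison.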

Learning to Verify Summary Facts with Fine-Grained LLM Feedback

Learning to Summarize from LLM-generated Feedback

On Learning to Summarize with Large Language Models as References

Factual Dialogue Summarization via Learning from Large Language Models

Evaluating Factual Consistency of Summaries with Large Language Models

FactLens: Benchmarking Fine-Grained Fact Verification

Factual Consistency Evaluation of Summarisation in the Era of Large Language Models

FactLLaMA: Optimizing Instruction-Following Language Models with External Knowledge for Automated Fact-Checking

Evaluating the Factual Consistency of Large Language Models Through News Summarization

Mining the Explainability and Generalization: Fact Verification Based on Self-Instruction

Annotating and Modeling Fine-grained Factuality in Summarization

TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models

On Improving Summarization Factual Consistency from Natural Language Feedback

Long-form factuality in large language models

Improving Model Factuality with Fine-grained Critique-based Evaluator

Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

FELM: Benchmarking Factuality Evaluation of Large Language Models

Improving Factuality of Abstractive Summarization via Contrastive Reward Learning

Are Large Language Models Table-based Fact-Checkers?

From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning