Factual Consistency Evaluation of Summarisation in the Era of Large Language Models

Zheheng Luo, Qianqian Xie, Sophia Ananiadou

2024-02-21

Abstract: Factual inconsistency with source documents in automatically generated summaries can lead to misinformation or pose risks. Existing factual consistency (FC) metrics are constrained by their performance, efficiency, and explainability. Recent advances in large language models (LLMs) have demonstrated remarkable potential in text evaluation, but their effectiveness in assessing FC in summarisation remains underexplored. Prior research has mostly focused on proprietary LLMs, leaving essential factors that affect their assessment capabilities unexplored. Additionally, current FC evaluation benchmarks are restricted to news articles, casting doubt on the generality of the FC methods tested on them. In this paper, we first address the gap by introducing TreatFact, a dataset of LLM-generated summaries of clinical texts, annotated for FC by domain experts. Moreover, we benchmark 11 LLMs for FC evaluation across news and clinical domains and analyse the impact of model size, prompts, pre-training and fine-tuning data. Our findings reveal that proprietary models prevail on the task, while open-source LLMs lag behind. Nevertheless, there is potential for enhancing the performance of open-source LLMs through increasing model size, expanding pre-training data, and developing well-curated fine-tuning data. Experiments on TreatFact suggest that both previous methods and LLM-based evaluators are unable to capture factual inconsistencies in clinical summaries, posing a new challenge for FC evaluation.

Computation and Language

What problem does this paper attempt to address?

The paper makes three main contributions:

- **Developed a new evaluation dataset**: Proposed TreatFact, the first dataset of clinical summaries generated by large language models (LLMs) and annotated for factual consistency (FC) by medical experts under a comprehensive multi-dimensional protocol. The dataset aims to complement existing factual consistency benchmarks and to support evaluation of LLM summarization in the clinical domain.
- **Systematically studied the application of LLMs in factual consistency evaluation**: Conducted a systematic study of LLMs as FC evaluators in the news and clinical domains, analyzing how model size, prompting methods, pre-training data, and fine-tuning data affect their performance on the task.
- **Revealed the limitations of existing evaluation methods**: Found that existing factual consistency evaluation methods struggle to detect factual inconsistencies in the LLM-generated clinical summaries in TreatFact, indicating that evaluating the factual consistency of such summaries is a challenging task.

In summary, the paper aims to advance research on factual consistency evaluation, particularly for the summarization of clinical texts, by introducing a dataset of LLM-generated clinical summaries and conducting extensive experiments on the factors that affect LLM performance as FC evaluators. It also highlights the limitations of current factual consistency evaluation methods when applied to clinical summaries.
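To make the evaluation setup concrete, the following is a minimal sketch of how an LLM-based FC evaluator might be scored against expert labels. Everything in it is an assumption for illustration: the prompt wording, the function names (`build_prompt`, `parse_verdict`, `evaluate`), the binary Yes/No framing, and the choice of balanced accuracy as the agreement measure are not taken from the paper, whose actual prompts and metrics are not given in this summary.

```python
from typing import Callable, List, Tuple

# Hypothetical prompt; the paper's real prompts are not shown in this summary.
PROMPT_TEMPLATE = (
    "Document:\n{document}\n\n"
    "Summary:\n{summary}\n\n"
    "Is the summary factually consistent with the document? Answer Yes or No."
)


def build_prompt(document: str, summary: str) -> str:
    """Fill the (assumed) zero-shot prompt with one document/summary pair."""
    return PROMPT_TEMPLATE.format(document=document, summary=summary)


def parse_verdict(response: str) -> bool:
    """Map a free-text answer to a binary label (True = consistent)."""
    return response.strip().lower().startswith("yes")


def balanced_accuracy(predictions: List[bool], labels: List[bool]) -> float:
    """Mean of recall on consistent and on inconsistent summaries."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one example of each class")
    true_pos = sum(p and l for p, l in zip(predictions, labels))
    true_neg = sum((not p) and (not l) for p, l in zip(predictions, labels))
    return 0.5 * (true_pos / positives + true_neg / negatives)


def evaluate(
    judge: Callable[[str], str],
    examples: List[Tuple[str, str, bool]],
) -> float:
    """Score a judge on (document, summary, expert_label) triples.

    `judge` is any callable that takes a prompt and returns the model's
    text response, e.g. a thin wrapper around an LLM client.
    """
    predictions = [
        parse_verdict(judge(build_prompt(doc, summ)))
        for doc, summ, _ in examples
    ]
    labels = [label for _, _, label in examples]
    return balanced_accuracy(predictions, labels)
```

In use, `judge` would wrap whichever proprietary or open-source model is being benchmarked, so the same scoring loop can be reused across models and across the news and clinical domains. Balanced accuracy is chosen here only because it is robust to an uneven mix of consistent and inconsistent summaries; other agreement measures would slot into `evaluate` the same way.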

Evaluating Factual Consistency of Summaries with Large Language Models

Evaluating the Factual Consistency of Large Language Models Through News Summarization

Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity

Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

Factual Dialogue Summarization via Learning from Large Language Models

Fine-grained Factual Consistency Assessment for Abstractive Summarization Models

TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

FELM: Benchmarking Factuality Evaluation of Large Language Models

Joint Contrastive Learning for Factual Consistency Evaluation of Cross-Lingual Abstract Summarization

Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models

LongDocFACTScore: Evaluating the Factuality of Long Document Abstractive Summarisation

Evaluating Factuality in Cross-lingual Summarization

On Learning to Summarize with Large Language Models as References

Identifying Factual Inconsistencies in Summaries: Grounding LLM Inference via Task Taxonomy

FABLES: Evaluating faithfulness and content selection in book-length summarization

Rethinking Scientific Summarization Evaluation: Grounding Explainable Metrics on Facet-aware Benchmark

Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency

Learning to Verify Summary Facts with Fine-Grained LLM Feedback

An Extensive Evaluation of Factual Consistency in Large Language Models for Data-to-Text Generation

Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall