LongDocFACTScore: Evaluating the Factuality of Long Document Abstractive Summarisation

Jennifer A Bishop, Qianqian Xie, Sophia Ananiadou
2024-05-28
Abstract: Maintaining factual consistency is a critical issue in abstractive text summarisation; however, it cannot be assessed by traditional automatic metrics used for evaluating text summarisation, such as ROUGE scoring. Recent efforts have been devoted to developing improved metrics for measuring factual consistency using pre-trained language models, but these metrics have restrictive token limits and are therefore not suitable for evaluating long document text summarisation. Moreover, there is limited research and few resources available for evaluating whether existing automatic evaluation metrics are fit for purpose when applied in long document settings. In this work, we evaluate the efficacy of automatic metrics for assessing the factual consistency of long document text summarisation. We create a human-annotated data set for evaluating automatic factuality metrics, LongSciVerify, which contains fine-grained factual consistency annotations for long document summaries from the scientific domain. We also propose a new evaluation framework, LongDocFACTScore, which is suitable for evaluating long document summarisation. This framework allows metrics to be efficiently extended to any length document and outperforms existing state-of-the-art metrics in its ability to correlate with human measures of factuality when used to evaluate long document summarisation data sets. We make our code and LongSciVerify data set publicly available: https://github.com/jbshp/LongDocFACTScore
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the shortcomings of existing automatic metrics when they are used to evaluate the factual consistency of long-document summaries. Specifically:

1. **Factual consistency evaluation**: Traditional automatic metrics such as ROUGE cannot assess whether a summary is factually consistent with its source, so their scores correlate poorly with human judgements of factuality.
2. **Long-document limitations**: Many metrics based on pre-trained language models have restrictive token limits. Applied to a long document, they must truncate the source, often discarding most of its content, which makes the resulting scores unreliable.
3. **Lack of evaluation resources**: Few resources and data sets exist for testing whether automatic factuality metrics are fit for purpose in long-document settings, which limits research progress.

To address these problems, the paper proposes a reference-free evaluation framework, **LongDocFACTScore**, designed to evaluate the factual consistency of long-document summaries. The framework improves on existing methods in the following ways:

- **Fine-grained evaluation**: LongDocFACTScore computes a document-level factual consistency score from fine-grained, sentence-level evaluations.
- **Efficient extension**: The framework allows a metric to be extended efficiently to documents of any length, without truncating the source.
- **Selection of the most relevant parts**: Sentence embeddings and cosine similarity are used to select the parts of the source document most relevant to the text being evaluated, so the underlying metric only has to score a small amount of relevant source text.
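The selection and aggregation steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bag-of-words `embed` function stands in for a real sentence encoder, `overlap_score` stands in for a model-based factuality metric, and the choices of `k = 3`, taking the maximum over the selected source sentences, and averaging over summary sentences are all assumptions made for this example. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(sentence: str) -> Counter:
    """Toy bag-of-words vector (assumption: stand-in for a sentence encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def overlap_score(source_snippet: str, summary_sentence: str) -> float:
    """Toy scorer (assumption): fraction of summary tokens present in the snippet."""
    summary_tokens = embed(summary_sentence)
    if not summary_tokens:
        return 0.0
    source_tokens = embed(source_snippet)
    hits = sum(1 for token in summary_tokens if token in source_tokens)
    return hits / len(summary_tokens)


def long_doc_score(
    source: str,
    summary: str,
    k: int = 3,
    score_fn: Callable[[str, str], float] = overlap_score,
) -> float:
    """Score each summary sentence against its k most similar source sentences."""
    source_sents = split_sentences(source)
    summary_sents = split_sentences(summary)
    if not source_sents or not summary_sents:
        return 0.0
    source_vecs = [embed(s) for s in source_sents]

    per_sentence = []
    for sent in summary_sents:
        vec = embed(sent)
        # Rank source sentences by cosine similarity to this summary sentence.
        ranked = sorted(
            range(len(source_sents)),
            key=lambda i: cosine(vec, source_vecs[i]),
            reverse=True,
        )
        # Assumption: keep the best score among the top-k candidates.
        per_sentence.append(max(score_fn(source_sents[i], sent) for i in ranked[:k]))

    # Assumption: the document-level score is the mean of the sentence-level scores.
    return sum(per_sentence) / len(per_sentence)
```

Because the scorer only ever sees one summary sentence and a handful of source sentences at a time, the input length passed to `score_fn` stays bounded regardless of how long the source document is. In practice `score_fn` would be replaced by a language-model-based metric and `embed` by a trained sentence encoder.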
The paper also introduces a human-annotated data set, **LongSciVerify**, which contains fine-grained factual consistency annotations for long-document summaries from the scientific domain. It is intended for evaluating automatic factuality metrics and helps fill the gap in evaluation resources for long-document summarisation. Overall, the paper aims to provide a more effective and accurate way of evaluating the factual consistency of long-document summaries, together with resources to support future research.
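A data set of human annotations is typically used to check how well a metric's scores agree with human judgements. The sketch below shows one common way to do this, a Kendall rank correlation (the tau-a variant, which ignores tied pairs in the numerator). The choice of correlation statistic is an assumption for illustration; the summary above does not state which statistic the paper reports, and the function name is hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Kendall tau-a between two equally long score lists."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")

    concordant = 0
    discordant = 0
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_scores), 2):
        product = (m1 - m2) * (h1 - h2)
        if product > 0:
            concordant += 1  # both rankings order this pair the same way
        elif product < 0:
            discordant += 1  # the rankings disagree on this pair
        # product == 0 means a tie in at least one list; tau-a skips it

    n_pairs = len(metric_scores) * (len(metric_scores) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0
```

A value near 1 means the metric ranks summaries in almost the same order as the human annotators, a value near 0 means little agreement, and a negative value means the metric tends to rank them in the opposite order.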