Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics

Artidoro Pagnoni, Vidhisha Balachandran, Yulia Tsvetkov
DOI: https://doi.org/10.48550/arXiv.2104.13346
2021-07-24
Abstract: Modern summarization models generate highly fluent but often factually unreliable outputs. This motivated a surge of metrics attempting to measure the factuality of automatically generated summaries. Due to the lack of common benchmarks, these metrics cannot be compared. Moreover, all these methods treat factuality as a binary concept and fail to provide deeper insights into the kinds of inconsistencies made by different systems. To address these limitations, we devise a typology of factual errors and use it to collect human annotations of generated summaries from state-of-the-art summarization systems for the CNN/DM and XSum datasets. Through these annotations, we identify the proportion of different categories of factual errors in various summarization models and benchmark factuality metrics, showing their correlation with human judgment as well as their specific strengths and weaknesses.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the fact that modern abstractive summarization models generate highly fluent text that is often factually inconsistent with the source article. Specifically, it points out three problems:

1. **Lack of a common benchmark**: The metrics currently used to evaluate the factuality of summaries cannot be compared with one another because there is no shared benchmark.
2. **Binary treatment of factuality**: Existing methods treat factuality as a binary concept, in which a summary is either factual or not, and offer no deeper analysis of the types of errors involved.
3. **Insufficient factuality evaluation**: Common evaluation metrics based on n-gram overlap (such as BLEU, ROUGE, and METEOR) do not measure the factual correctness of summaries well and correlate poorly with human judgments of factuality.

To address these problems, the paper makes the following contributions:

- **Typology of factual errors**: The paper proposes a typology of factual errors grounded in frame semantics and linguistic discourse analysis. It divides factual errors into fine-grained categories: predicate error (PredE), entity error (EntE), circumstance error (CircE), coreference error (CorefE), discourse link error (LinkE), out-of-article error (OutE), and grammatical error (GramE).
- **Dataset creation**: A large-scale set of human annotations was collected through a crowdsourcing platform. The annotations cover summaries generated by different summarization systems on the CNN/DM and XSum datasets.
- **Benchmarking**: Existing factuality metrics are benchmarked against the collected annotations, and their ability to detect each type of factual error is evaluated.

Through these contributions, the paper aims to provide a framework for finer-grained analysis and evaluation of the factuality of abstractive summarization systems, helping researchers and developers better understand and improve them.
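
To make the two analyses described above more concrete, the following is a minimal Python sketch of (a) computing the proportion of each error category in a set of annotated summaries and (b) correlating a metric's scores with human factuality judgments. It is an illustration under stated assumptions, not the paper's implementation: the names `FactualError`, `error_proportions`, and `pearson` are hypothetical, the toy annotations and metric scores are invented, and a plain Pearson correlation stands in for whatever correlation procedure the paper actually uses.

```python
from collections import Counter
from enum import Enum
from math import sqrt


class FactualError(Enum):
    """Error categories listed in the summary above (hypothetical encoding)."""
    PRED_E = "PredE"    # predicate error
    ENT_E = "EntE"      # entity error
    CIRC_E = "CircE"    # circumstance error
    COREF_E = "CorefE"  # coreference error
    LINK_E = "LinkE"    # discourse link error
    OUT_E = "OutE"      # out-of-article error
    GRAM_E = "GramE"    # grammatical error


def error_proportions(annotations):
    """Return each category's share of all annotated errors."""
    counts = Counter(err for errs in annotations for err in errs)
    total = sum(counts.values())
    return {err: n / total for err, n in counts.items()} if total else {}


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Invented toy data: one list of error labels per summary (empty = no error).
annotations = [
    [FactualError.ENT_E],
    [],
    [FactualError.ENT_E, FactualError.OUT_E],
    [FactualError.PRED_E],
]
# Assumption: a summary counts as factual (1.0) only if it has no errors.
human_scores = [0.0 if errs else 1.0 for errs in annotations]
# Made-up scores from some hypothetical factuality metric.
metric_scores = [0.2, 0.9, 0.1, 0.4]

proportions = error_proportions(annotations)
r = pearson(metric_scores, human_scores)

for err, share in proportions.items():
    print(f"{err.value}: {share:.2f}")
print(f"Pearson r between metric and human judgments: {r:.3f}")
```

Keeping the error labels per summary, rather than collapsing them to a single factual/non-factual flag up front, is what allows the same annotations to serve both purposes: the binary human score can always be derived from the labels, but not the other way around.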