Abstract: Modern summarization models generate highly fluent but often factually unreliable outputs. This motivated a surge of metrics attempting to measure the factuality of automatically generated summaries. Due to the lack of common benchmarks, these metrics cannot be compared. Moreover, all these methods treat factuality as a binary concept and fail to provide deeper insights into the kinds of inconsistencies made by different systems. To address these limitations, we devise a typology of factual errors and use it to collect human annotations of generated summaries from state-of-the-art summarization systems for the CNN/DM and XSum datasets. Through these annotations, we identify the proportion of different categories of factual errors in various summarization models and benchmark factuality metrics, showing their correlation with human judgment as well as their specific strengths and weaknesses.

What problem does this paper attempt to address?

The main problem this paper addresses is that modern abstractive summarization models generate highly fluent text that is often factually inconsistent with the source. Specifically, the paper points out:

1. **Lack of a common benchmark**: The metrics currently used to evaluate the factuality of summaries cannot be compared with one another because there is no unified benchmark.
2. **Binary treatment of factuality**: Existing methods treat factuality as a binary concept, in which a summary is either factual or not, and provide no deeper analysis of error types.
3. **Insufficient factuality evaluation**: Common evaluation metrics based on n-gram overlap (such as BLEU, ROUGE, and METEOR) do not measure the factual correctness of summaries well and correlate poorly with human judgments of factuality.

To address these problems, the paper proposes the following:

- **Typology of factual errors**: The paper proposes a typology of factual errors grounded in frame semantics and linguistic discourse analysis. It divides factual errors into fine-grained categories: predicate error (PredE), entity error (EntE), circumstance error (CircE), coreference error (CorefE), discourse link error (LinkE), out-of-article error (OutE), and grammatical error (GramE).
- **Dataset creation**: A large-scale set of human annotations was collected through a crowdsourcing platform. The annotations cover summaries generated by different summarization systems on the CNN/DM and XSum datasets.
- **Benchmarking**: Existing factuality metrics are benchmarked against the collected annotations, and their performance in detecting different types of factual errors is evaluated.

Through these methods, the paper aims to provide a comprehensive framework for analyzing and evaluating the factuality of summarization systems at a finer granularity, helping researchers and developers better understand and improve these systems.
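As a rough illustration of the two analyses described above (the share of each error category, and the correlation of a metric with human judgment), the following minimal sketch may help. It is not the paper's code: the `ErrorType` enum, the helper names, the toy annotations, and the metric scores are all hypothetical, and plain Pearson correlation is used only as a simple stand-in for whatever correlation measure the paper actually reports.

```python
from collections import Counter
from enum import Enum
from math import sqrt


class ErrorType(Enum):
    """Error categories named in the typology (enum itself is illustrative)."""
    PRED_E = "PredE"
    ENT_E = "EntE"
    CIRC_E = "CircE"
    COREF_E = "CorefE"
    LINK_E = "LinkE"
    OUT_E = "OutE"
    GRAM_E = "GramE"


def error_proportions(annotations):
    """Return each category's share of all annotated errors.

    `annotations` holds one list of ErrorType values per summary;
    an empty list means the summary has no annotated error.
    """
    counts = Counter(err for errs in annotations for err in errs)
    total = sum(counts.values())
    return {t: (counts[t] / total if total else 0.0) for t in ErrorType}


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: four summaries with their annotated errors.
toy_annotations = [
    [ErrorType.ENT_E],
    [],
    [ErrorType.PRED_E, ErrorType.ENT_E],
    [ErrorType.OUT_E],
]
proportions = error_proportions(toy_annotations)

# Assumed convention for this sketch: human score is 1.0 for an
# error-free summary and 0.0 otherwise.
human_scores = [0.0 if errs else 1.0 for errs in toy_annotations]
# Made-up scores from an imaginary automatic factuality metric.
metric_scores = [0.2, 0.9, 0.4, 0.1]
correlation = pearson(metric_scores, human_scores)

if __name__ == "__main__":
    for error_type, share in proportions.items():
        print(f"{error_type.value}: {share:.2f}")
    print(f"metric vs. human correlation: {correlation:.3f}")
```

Restricting `toy_annotations` to summaries with or without a given category before recomputing the correlation is one simple way to probe how well a metric detects a specific error type, although the paper's own procedure may differ.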

Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics

Annotating and Modeling Fine-grained Factuality in Summarization

Are Factuality Checkers Reliable? Adversarial Meta-evaluation of Factuality in Summarization

Evaluating the Tradeoff Between Abstractiveness and Factuality in Abstractive Summarization

Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation

Evaluating Factuality in Cross-lingual Summarization

Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization

LongDocFACTScore: Evaluating the Factuality of Long Document Abstractive Summarisation

Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency

The Factual Inconsistency Problem in Abstractive Text Summarization: A Survey

Using Similarity to Evaluate Factual Consistency in Summaries

AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation

FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction

Improving Factuality of Abstractive Summarization via Contrastive Reward Learning

FAR-ASS: Fact-aware reinforced abstractive sentence summarization

QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization

Evaluating the Factual Consistency of Large Language Models Through News Summarization

Faithful to the Original: Fact Aware Neural Abstractive Summarization

Evaluating Factuality in Text Simplification