AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation

Haoyi Qiu, Kung-Hsiang Huang, Jingnong Qu, Nanyun Peng
2024-10-04
Abstract: Ensuring factual consistency is crucial for natural language generation tasks, particularly in abstractive summarization, where preserving the integrity of information is paramount. Prior works on evaluating the factual consistency of summarization often take entailment-based approaches that first generate perturbed (factually inconsistent) summaries and then train a classifier on the generated data to detect factual inconsistencies at test time. However, the perturbed summaries generated by previous approaches either have low coherence or lack error-type coverage. To address these issues, we propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs). Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage. Additionally, we present a data selection module, NegFilter, based on natural language inference and BARTScore to ensure the quality of the generated negative samples. Experimental results demonstrate that our approach significantly outperforms previous systems on the AggreFact-SOTA benchmark, showcasing its efficacy in evaluating the factuality of abstractive summarization.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the problem of evaluating factual consistency in abstractive summaries. Specifically, the authors point out two main problems with how existing evaluation methods generate factually inconsistent samples for training:

1. **Lack of coherence**: The factually inconsistent samples generated by some methods are not semantically coherent, which lowers the quality of these samples.
2. **Insufficient coverage of error types**: The samples generated by existing methods do not cover the full range of factual error types, so a model trained on them may fail to detect certain types of factual inconsistency.

To solve these problems, the authors propose a new framework named AMRFact. The framework uses Abstract Meaning Representation (AMR) to generate factually inconsistent samples that are coherent and cover multiple error types. With these higher-quality negative samples, AMRFact improves the accuracy of factual consistency evaluation.

### Main contributions

1. **Propose the AMRFact framework**: Use AMR to generate factually inconsistent summaries that are both coherent and cover multiple error types.
2. **Design the data selection module NegFilter**: Filter out invalid negative samples to further improve the quality of the generated data.
3. **Achieve the best performance on the AggreFact-FtSOTA benchmark**: The experimental results show that AMRFact significantly outperforms existing methods in evaluating the factual consistency of abstractive summaries.

### Method overview

1. **AMR parsing**: Parse the factually consistent summary into an AMR graph.
2. **Introduce factual errors**: Inject common factual errors into the AMR graph to produce a modified AMR graph.
3. **Back-translation**: Convert the modified AMR graph back into a natural-language summary, which serves as a negative sample.
4. **Data screening**: Use the NegFilter module to keep only valid negative samples.
5. **Train a classifier**: Combine positive and negative samples to train a RoBERTa-based classifier that evaluates the factual consistency of summaries.

### Error types

AMRFact targets five common types of factual errors:

- **Predicate error**: The predicate is inconsistent with the information in the source document.
- **Entity error**: The entity or attribute related to the predicate is wrong.
- **Context error**: The contextual information about the predicate's interaction (such as place, time, or modality) is wrong.
- **Discourse-link error**: The logical connection between statements in the summary is wrong.
- **Out-of-article-scope error**: The summary contains information not mentioned in the source document.

### Experimental setup

- **Training data set**: Negative samples are generated from the training set of the CNN/DM corpus.
- **Evaluation data set**: Evaluation is performed on the AggreFact-FtSOTA benchmark, which integrates multiple existing data sets so that different evaluation systems can be compared in a more fine-grained manner.

### Experimental results

- **Performance comparison**: The balanced binary accuracy of AMRFact on the AggreFact-FtSOTA test set is significantly higher than that of other methods, especially on the CNN/DM subset.

Through these contributions, AMRFact provides a more effective method for evaluating the factual consistency of abstractive summaries, which helps to improve the quality of natural language generation tasks.
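
The error-injection and screening steps in the method overview can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the AMR graph is a plain list of `(source, role, target)` triples instead of parser output, the helper names `swap_agent_patient`, `negate_predicate`, and `keep_negative` are hypothetical, and the filter thresholds and the direction of each comparison are placeholders. The scores passed to `keep_negative` stand in for an NLI entailment probability and a BARTScore that would be computed by real models.

```python
from typing import List, Tuple

# Toy AMR graph: (source node, role, target node). Real AMR parsers emit
# PENMAN notation; this flat form is an assumption made for illustration.
Triple = Tuple[str, str, str]


def swap_agent_patient(graph: List[Triple]) -> List[Triple]:
    """Hypothetical entity-error perturbation: exchange :ARG0 and :ARG1."""
    swapped = {":ARG0": ":ARG1", ":ARG1": ":ARG0"}
    return [(src, swapped.get(role, role), tgt) for src, role, tgt in graph]


def negate_predicate(graph: List[Triple], predicate: str) -> List[Triple]:
    """Hypothetical predicate-error perturbation: toggle ':polarity -'."""
    polarity = (predicate, ":polarity", "-")
    if polarity in graph:
        # Already negated: remove the marker to flip the meaning back.
        return [t for t in graph if t != polarity]
    return graph + [polarity]


def keep_negative(entail_prob: float, bart_score: float,
                  max_entail: float = 0.5, min_bart: float = -3.0) -> bool:
    """Keep a perturbed summary only if it is not entailed by the original
    (so it is plausibly inconsistent) yet still scores as close to it.
    Both thresholds are placeholder values, not taken from the paper."""
    return entail_prob < max_entail and bart_score > min_bart


# "The dog chased the cat."
graph = [("chase-01", ":ARG0", "dog"), ("chase-01", ":ARG1", "cat")]
entity_error = swap_agent_patient(graph)             # the cat chased the dog
predicate_error = negate_predicate(graph, "chase-01")  # the dog did not chase the cat
```

In a full pipeline, each perturbed graph would be converted back to text by an AMR-to-text model before scoring; that step is omitted here because it depends on the specific parser and generator used.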
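
The reported metric, balanced binary accuracy, is the mean of the recall on each of the two classes, which keeps a classifier from scoring well simply by predicting the majority label. The sketch below is a minimal pure-Python version under the assumption that label `1` means factually consistent and `0` means inconsistent; the function name and the example labels are illustrative and not drawn from the benchmark.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall for binary labels (0 = inconsistent, 1 = consistent)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    recalls = []
    for label in (0, 1):
        # Indices of all examples whose gold label is `label`.
        idx = [i for i, y in enumerate(y_true) if y == label]
        if not idx:
            raise ValueError(f"no examples with label {label}")
        correct = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Four consistent summaries and two inconsistent ones (made-up labels).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)  # (3/4 + 1/2) / 2 = 0.625
```

Plain accuracy on the same example is 4/6, which is higher because the majority class dominates; balanced accuracy weights both classes equally.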