CASPR: Automated Evaluation Metric for Contrastive Summarization

Nirupan Ananthamurugan, Dat Duong, Philip George, Ankita Gupta, Sandeep Tata, Beliz Gunel
2024-05-14
Abstract: Summarizing comparative opinions about entities (e.g., hotels, phones) from a set of source reviews, often referred to as contrastive summarization, can considerably aid users in decision making. However, reliably measuring the contrastiveness of the output summaries without relying on human evaluations remains an open problem. Prior work has proposed token-overlap based metrics, Distinctiveness Score, to measure contrast which does not take into account the sensitivity to meaning-preserving lexical variations. In this work, we propose an automated evaluation metric CASPR to better measure contrast between a pair of summaries. Our metric is based on a simple and light-weight method that leverages natural language inference (NLI) task to measure contrast by segmenting reviews into single-claim sentences and carefully aggregating NLI scores between them to come up with a summary-level score. We compare CASPR with Distinctiveness Score and a simple yet powerful baseline based on BERTScore. Our results on a prior dataset CoCoTRIP demonstrate that CASPR can more reliably capture the contrastiveness of the summary pairs compared to the baselines.
Computation and Language
What problem does this paper attempt to address?
The paper addresses how to reliably automate the evaluation of contrast between summaries in contrastive summarization, without relying on human evaluation. Existing metrics such as Distinctiveness Score (DS) measure contrast through lexical overlap, so they handle meaning-preserving lexical variations (for example, paraphrases) poorly, which limits their usefulness in practice. The paper therefore proposes a new automated evaluation metric, CASPR, that aims to measure the logical contrast between a pair of summaries more accurately.

### Background of the paper

- **Contrastive summarization**: Extract contrastive opinions about entities (such as hotels or mobile phones) from a set of source reviews and generate summaries containing these opinions to help users make decisions.
- **Existing problems**: There is no reliable automated metric for the contrast in contrastive summaries, so evaluation depends on large-scale human annotation, which hinders further research progress.

### Limitations of existing methods

- **Distinctiveness Score (DS)**: Measures contrast based on lexical overlap and does not account for wording changes that preserve meaning, so it can give inaccurate results in some cases.
- **Inverted BERTScore (BS⁻¹)**: More robust to lexical variation, but insensitive to logical relationships at the sentence level, so it cannot detect logical contradictions.

### Design and implementation of CASPR

- **Natural Language Inference (NLI)**: Use the NLI task to classify the logical relationship between two sentences as entailment, contradiction, or neutral.
- **Single-claim sentence splitting**: Split complex multi-claim sentences into several single-claim sentences so that logical relationships can be evaluated more precisely.
- **Scoring and aggregation**:
  - Run NLI on each pair of single-claim sentences and assign a score according to the predicted label (entailment, contradiction, neutral).
  - Aggregate the scores of all sentence pairs into a summary-level contrast score.

### Experimental verification

- **Dataset**: The CoCoTRIP dataset, which contains multiple pairs of contrastive summaries, is used to compare the methods.
- **Experimental setup**: Several synthetic datasets are constructed, including one based on logical negation (Synthetic High Contrast) and one based on paraphrasing (Synthetic Low Contrast), to test CASPR.
- **Results**: CASPR scores close to 100 on the synthetic high-contrast dataset and close to 0 on the synthetic low-contrast dataset, discriminating between the two better than DS and BS⁻¹.

### Conclusion

The paper proposes a new automated evaluation metric, CASPR. By combining natural language inference with single-claim sentence splitting, it measures the logical contrast in contrastive summaries more accurately and overcomes the limitations of the existing metrics. On CoCoTRIP and the synthetic datasets derived from it, CASPR separates high-contrast from low-contrast summary pairs more reliably than DS and BS⁻¹.
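
The scoring-and-aggregation idea described above can be illustrated with a short Python sketch. It is not the paper's implementation: the summary does not give the exact per-label scores or aggregation rule, so the rule used here (the share of claims contradicted by at least one claim on the other side, averaged over both directions and scaled to 0-100) is an assumption. `toy_nli` is a hypothetical rule-based stand-in for a real NLI model, and `overlap_distinctiveness` is a Jaccard-style stand-in for a token-overlap metric, not the exact definition of DS.

```python
import re
from typing import Callable, List, Set

NliFn = Callable[[str, str], str]  # returns "entailment", "contradiction" or "neutral"


def tokens(text: str) -> Set[str]:
    """Lowercased word set of a text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def overlap_distinctiveness(summary_a: str, summary_b: str) -> float:
    """Assumed token-overlap contrast: 1 minus the Jaccard similarity."""
    a, b = tokens(summary_a), tokens(summary_b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def toy_nli(premise: str, hypothesis: str) -> str:
    """Hypothetical stand-in for an NLI model; it only understands 'not'."""
    p, h = tokens(premise), tokens(hypothesis)
    if p - {"not"} != h - {"not"}:
        return "neutral"  # different content words: unrelated claims
    return "contradiction" if ("not" in p) != ("not" in h) else "entailment"


def contradicted_fraction(claims: List[str], others: List[str], nli: NliFn) -> float:
    """Share of claims contradicted by at least one claim on the other side."""
    hits = sum(any(nli(c, o) == "contradiction" for o in others) for c in claims)
    return hits / len(claims)


def contrast_score(claims_a: List[str], claims_b: List[str], nli: NliFn) -> float:
    """Assumed summary-level aggregation, scaled to the range 0-100."""
    if not claims_a or not claims_b:
        return 0.0
    forward = contradicted_fraction(claims_a, claims_b, nli)
    backward = contradicted_fraction(claims_b, claims_a, nli)
    return 100.0 * (forward + backward) / 2.0


# Single-claim sentences for two summaries; the second negates the first.
summary_a = ["The room is clean.", "The staff is friendly."]
summary_neg = ["The room is not clean.", "The staff is not friendly."]

print(contrast_score(summary_a, summary_neg, toy_nli))   # 100.0
print(contrast_score(summary_a, summary_a, toy_nli))     # 0.0
print(overlap_distinctiveness(" ".join(summary_a), " ".join(summary_neg)))
```

On the negated pair the NLI-based score reaches its maximum, while the overlap-based value stays low because the two summaries differ by a single token. This mirrors the limitation of lexical-overlap metrics noted above. In practice `toy_nli` would be replaced by a trained NLI classifier and the inputs would come from a sentence-splitting step.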