Abstract: Summarizing comparative opinions about entities (e.g., hotels, phones) from a set of source reviews, often referred to as contrastive summarization, can considerably aid users in decision making. However, reliably measuring the contrastiveness of the output summaries without relying on human evaluations remains an open problem. Prior work has proposed a token-overlap based metric, Distinctiveness Score, to measure contrast, but it does not account for meaning-preserving lexical variations. In this work, we propose an automated evaluation metric, CASPR, to better measure contrast between a pair of summaries. Our metric is based on a simple and lightweight method that leverages the natural language inference (NLI) task to measure contrast by segmenting reviews into single-claim sentences and carefully aggregating NLI scores between them to produce a summary-level score. We compare CASPR with Distinctiveness Score and a simple yet powerful baseline based on BERTScore. Our results on a prior dataset, CoCoTRIP, demonstrate that CASPR captures the contrastiveness of summary pairs more reliably than the baselines.

What problem does this paper attempt to address?

The paper addresses how to evaluate, automatically and reliably, the contrast between summaries in the contrastive summarization task without relying on human evaluation. Existing metrics such as Distinctiveness Score (DS) measure contrast through lexical overlap, so they handle meaning-preserving lexical changes poorly, which limits their usefulness in practice. The paper therefore proposes a new automated evaluation metric, CASPR, that aims to measure the logical contrast between a pair of summaries more accurately.

### Background of the paper

- **Contrastive summarization**: Extract contrastive opinions about entities (such as hotels or mobile phones) from a set of source reviews and generate summaries containing these opinions to help users make decisions.
- **Existing problems**: There is no reliable automated metric for measuring contrast in contrastive summaries, so progress depends on large-scale human evaluation, which hinders further research.

### Limitations of existing methods

- **Distinctiveness Score (DS)**: Measures contrast based on lexical overlap but ignores meaning-preserving lexical changes, which makes its results inaccurate in some cases.
- **Inverted BERTScore (BS−1)**: More robust to lexical changes, but insensitive to sentence-level logical relationships, so it cannot detect logical contradictions.

### Design and implementation of CASPR

- **Natural Language Inference (NLI)**: Use the NLI task to compare two sentences and determine the inference relationship between them (entailment, contradiction, or neutral).
- **Single-claim sentence splitting**: Split complex multi-claim sentences into several single-claim sentences so that logical relationships can be evaluated more precisely.
- **Scoring and aggregation**:
  - Run NLI on each pair of single-claim sentences and assign a score according to the predicted label (entailment, contradiction, or neutral).
  - Aggregate the scores of all sentence pairs into a summary-level contrast score.

### Experimental verification

- **Dataset**: The CoCoTRIP dataset, which contains multiple pairs of contrastive summaries, is used to compare the methods.
- **Experimental setup**: Several synthetic datasets are constructed, including logical negation (Synthetic High Contrast) and paraphrasing (Synthetic Low Contrast), to test CASPR.
- **Results**: CASPR scores close to 100 on the synthetic high-contrast dataset and close to 0 on the synthetic low-contrast dataset, discriminating between the two better than DS and BS−1.

### Conclusion

The paper proposes a new automated evaluation metric, CASPR. By combining natural language inference with single-claim sentence splitting, it measures the logical contrast in contrastive summaries more accurately and overcomes the limitations of existing methods. The experiments on CoCoTRIP and the synthetic datasets show that CASPR separates high-contrast from low-contrast summary pairs more reliably than DS and BS−1.
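The scoring and aggregation step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the label-to-score mapping (contradiction = +1, neutral = 0, entailment = -1), the choice to keep the largest-magnitude score per sentence, the averaging over both directions, and the rescaling to a 0-100 range are all assumptions made for illustration. The names `contrast_score`, `directional_score`, and `toy_nli` are hypothetical, and `toy_nli` is a rule-based stand-in for a real NLI model.

```python
from typing import Callable, List

# Assumed label-to-score mapping (illustrative, not taken from the paper).
LABEL_SCORE = {"contradiction": 1.0, "neutral": 0.0, "entailment": -1.0}

# An NLI function takes (premise, hypothesis) and returns one of the labels above.
NliFn = Callable[[str, str], str]


def directional_score(src: List[str], tgt: List[str], nli: NliFn) -> float:
    """Average, over sentences in src, of the strongest NLI signal against tgt."""
    per_sentence = []
    for premise in src:
        scores = [LABEL_SCORE[nli(premise, hyp)] for hyp in tgt]
        # Keep the score with the largest magnitude (assumed aggregation rule).
        per_sentence.append(max(scores, key=abs))
    return sum(per_sentence) / len(per_sentence)


def contrast_score(a: List[str], b: List[str], nli: NliFn) -> float:
    """Symmetric summary-level score, rescaled from [-1, 1] to [0, 100]."""
    if not a or not b:
        raise ValueError("both summaries need at least one sentence")
    mean = (directional_score(a, b, nli) + directional_score(b, a, nli)) / 2
    return 50.0 * (mean + 1.0)


def toy_nli(premise: str, hypothesis: str) -> str:
    """Rule-based stand-in for illustration only; a real NLI model goes here."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    if p == h:
        return "entailment"
    if p ^ h == {"not"}:  # the sentences differ only by the word "not"
        return "contradiction"
    return "neutral"


summary_a = ["the room is clean", "the staff is friendly"]
negated = ["the room is not clean", "the staff is not friendly"]

high = contrast_score(summary_a, negated, toy_nli)    # negated pair
low = contrast_score(summary_a, summary_a, toy_nli)   # identical pair
```

Under these assumptions the negated pair lands at the top of the scale and the identical pair at the bottom, which mirrors the behaviour reported for the synthetic high-contrast and low-contrast datasets. The input sentences are assumed to be single-claim already; the splitting step is not covered by this sketch.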

CASPR: Automated Evaluation Metric for Contrastive Summarization
