Abstract: Effective summarisation evaluation metrics enable researchers and practitioners to compare different summarisation systems efficiently. Estimating the effectiveness of an automatic evaluation metric, termed meta-evaluation, is a critically important research question. In this position paper, we review recent meta-evaluation practices for summarisation evaluation metrics and find that (1) evaluation metrics are primarily meta-evaluated on datasets consisting of examples from news summarisation datasets, and (2) there has been a noticeable shift in research focus towards evaluating the faithfulness of generated summaries. We argue that the time is ripe to build more diverse benchmarks that enable the development of more robust evaluation metrics and analyze the generalization ability of existing evaluation metrics. In addition, we call for research focusing on user-centric quality dimensions that consider the generated summary's communicative goal and the role of summarisation in the workflow.

What problem does this paper attempt to address?

The paper aims to explore the effectiveness and reliability of automatic summarization evaluation metrics. Specifically:

1. **Research Background and Issues**: The evaluation of natural language processing systems is crucial to ensure their effectiveness and reliability in practical applications. Although human evaluation is considered the most reliable method for assessing natural language generation systems, automatic evaluation metrics are more commonly used due to cost-effectiveness, ease of use, reproducibility, and speed.
2. **Research Focus**: The paper focuses on how to evaluate the automatic summarization evaluation metrics themselves (i.e., meta-evaluation). It points out that current research mainly concentrates on news summarization datasets and that the research focus is gradually shifting towards the faithfulness of generated summaries. The authors believe it is time to establish more diverse benchmarks to develop more robust evaluation metrics and analyze the generalization ability of existing evaluation metrics.
3. **Main Findings and Recommendations**: The authors find that the datasets currently used for meta-evaluation mostly come from the news domain, which limits the applicability of these evaluation metrics in other fields. Additionally, the lack of uniformity in the definition of quality dimensions and the inadequacies in the selection and training of human annotators further affect the reliability and comparability of evaluation results. Therefore, the authors call for the establishment of more diverse benchmarks and the standardization of human evaluation practices to improve the reproducibility and scalability of evaluation results.
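To make the notion of meta-evaluation concrete, here is a minimal sketch of one common way it is carried out: measuring how well a metric's scores agree with human judgments on the same summaries, here using Kendall's tau rank correlation. This is an illustration under stated assumptions, not the procedure of the paper. The metric names, the scores, and the `kendall_tau` helper are all hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy data: one human quality rating per summary, plus the
# scores that two made-up automatic metrics assign to the same summaries.
human_scores = [4.5, 2.0, 3.5, 1.0, 5.0]
metric_a = [0.62, 0.30, 0.51, 0.12, 0.70]  # ranks summaries like the humans do
metric_b = [0.40, 0.55, 0.35, 0.60, 0.45]  # frequently disagrees with humans

# Meta-evaluation step: a higher correlation with human judgments suggests
# a more trustworthy metric, at least on this particular dataset.
for name, scores in {"metric_a": metric_a, "metric_b": metric_b}.items():
    print(name, round(kendall_tau(scores, human_scores), 2))
```

A correlation computed this way only describes the dataset it was measured on. That is why the concentration of meta-evaluation data in the news domain matters: a metric that tracks human judgments on news summaries may not do so elsewhere.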

A Critical Look at Meta-evaluating Summarisation Evaluation Metrics
