On the Role of Summary Content Units in Text Summarization Evaluation

Marcel Nawrath, Agnieszka Nowak, Tristan Ratz, Danilo C. Walenta, Juri Opitz, Leonardo F. R. Ribeiro, João Sedoc, Daniel Deutsch, Simon Mille, Yixin Liu, Lining Zhang, Sebastian Gehrmann, Saad Mahamood, Miruna Clinciu, Khyathi Chandu, Yufang Hou

2024-04-02

Abstract: At the heart of the Pyramid evaluation method for text summarization lie human written summary content units (SCUs). These SCUs are concise sentences that decompose a summary into small facts. Such SCUs can be used to judge the quality of a candidate summary, possibly partially automated via natural language inference (NLI) systems. Interestingly, with the aim to fully automate the Pyramid evaluation, Zhang and Bansal (2021) show that SCUs can be approximated by automatically generated semantic role triplets (STUs). However, several questions currently lack answers, in particular: i) Are there other ways of approximating SCUs that can offer advantages? ii) Under which conditions are SCUs (or their approximations) offering the most value? In this work, we examine two novel strategies to approximate SCUs: generating SCU approximations from AMR meaning representations (SMUs) and from large language models (SGUs), respectively. We find that while STUs and SMUs are competitive, the best approximation quality is achieved by SGUs. We also show through a simple sentence-decomposition baseline (SSUs) that SCUs (and their approximations) offer the most value when ranking short summaries, but may not help as much when ranking systems or longer summaries.

Computation and Language

What problem does this paper attempt to address?

The paper examines a core question in text summarization evaluation: how well human-written Summary Content Units (SCUs) can be approximated automatically, and under which conditions SCUs offer the most value. SCUs are concise sentences that decompose a summary into small facts. In the Pyramid method they are used to judge the quality of a candidate summary, a step that can be partially automated with Natural Language Inference (NLI) systems. The authors examine two new SCU approximation strategies: Semantic Meaning Units (SMUs), derived from Abstract Meaning Representation (AMR), and Semantic GPT Units (SGUs), generated with large language models (LLMs). The experiments show that semantic role triplets (STUs, from Zhang and Bansal, 2021) and SMUs are competitive, but SGUs give the best approximation quality. Using a simple sentence-decomposition baseline (SSUs), the paper also shows that SCUs and their approximations offer the most value when ranking short summaries. When ranking systems or longer summaries they may not help as much, and simple sentence splitting may suffice. Furthermore, human-created SCUs and automatically generated SGUs are reported to show similarly high quality in human evaluation.
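To make the idea of scoring a candidate summary against content units concrete, here is a minimal sketch, not the authors' implementation. It assumes an unweighted score (the fraction of reference units that the candidate supports), whereas the Pyramid method proper weights units. The names `split_sentences`, `token_overlap_entails`, and `unit_score` are hypothetical, and the token-overlap check is only a toy placeholder for a real NLI model.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter, standing in for a sentence-decomposition (SSU-style) baseline."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def token_overlap_entails(premise: str, hypothesis: str, threshold: float = 0.8) -> bool:
    """Toy placeholder for an NLI system (assumption, not a real entailment model).

    Returns True when at least `threshold` of the hypothesis tokens occur in the premise.
    """
    def tokens(s: str) -> set:
        return set(re.findall(r"[a-z0-9]+", s.lower()))

    hyp = tokens(hypothesis)
    if not hyp:
        return False
    return len(hyp & tokens(premise)) / len(hyp) >= threshold


def unit_score(
    candidate: str,
    units: List[str],
    entails: Callable[[str, str], bool] = token_overlap_entails,
) -> float:
    """Fraction of content units judged to be supported by the candidate summary."""
    if not units:
        return 0.0
    supported = sum(1 for unit in units if entails(candidate, unit))
    return supported / len(units)


reference = "The cat sat on the mat. The dog barked loudly."
candidate = "The cat sat on the mat while birds sang."

# Units could be human-written SCUs or any approximation; here: plain sentences.
units = split_sentences(reference)
score = unit_score(candidate, units)
```

Swapping `units` for human-written SCUs, or for STUs, SMUs, or SGUs, changes only the input list; the scoring loop stays the same, which is what makes the different approximations directly comparable.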

QAPyramid: Fine-grained Evaluation of Content Selection for Text Summarization

What Have We Achieved on Text Summarization?

SummScore: A Comprehensive Evaluation Metric for Summary Quality Based on Cross-Encoder

UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs

RCSUM: To build a summarization system directly generating summaries with evaluation metrics

Human-like Summarization Evaluation with ChatGPT

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

Text Summarization Based on Sentence Selection with Semantic Representation

Write Summary Step-by-Step: A Pilot Study of Stepwise Summarization

On the Evaluation of Neural Code Summarization

Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks

SgSum: Transforming Multi-document Summarization into Sub-graph Selection

Hybrid Long Document Summarization using C2F-FAR and ChatGPT: A Practical Study

EDU-level Extractive Summarization with Varying Summary Lengths

Sentiment Lossless Summarization

Automatic text summarization based on sentences clustering and extraction

AsU-OSum: Aspect-augmented unsupervised opinion summarization

How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation

OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization

Rouge-C: A Fully Automated Evaluation Method for Multi-Document Summarization