On the Role of Summary Content Units in Text Summarization Evaluation

Marcel Nawrath, Agnieszka Nowak, Tristan Ratz, Danilo C. Walenta, Juri Opitz, Leonardo F. R. Ribeiro, João Sedoc, Daniel Deutsch, Simon Mille, Yixin Liu, Lining Zhang, Sebastian Gehrmann, Saad Mahamood, Miruna Clinciu, Khyathi Chandu, Yufang Hou
2024-04-02
Abstract: At the heart of the Pyramid evaluation method for text summarization lie human written summary content units (SCUs). These SCUs are concise sentences that decompose a summary into small facts. Such SCUs can be used to judge the quality of a candidate summary, possibly partially automated via natural language inference (NLI) systems. Interestingly, with the aim to fully automate the Pyramid evaluation, Zhang and Bansal (2021) show that SCUs can be approximated by automatically generated semantic role triplets (STUs). However, several questions currently lack answers, in particular: i) Are there other ways of approximating SCUs that can offer advantages? ii) Under which conditions are SCUs (or their approximations) offering the most value? In this work, we examine two novel strategies to approximate SCUs: generating SCU approximations from AMR meaning representations (SMUs) and from large language models (SGUs), respectively. We find that while STUs and SMUs are competitive, the best approximation quality is achieved by SGUs. We also show through a simple sentence-decomposition baseline (SSUs) that SCUs (and their approximations) offer the most value when ranking short summaries, but may not help as much when ranking systems or longer summaries.
Computation and Language
What problem does this paper attempt to address?
The paper examines a core question in text summarization evaluation: how well human-written Summary Content Units (SCUs) can be approximated automatically, and under which conditions they are actually worth using. SCUs are concise sentences that break a summary down into small facts. They can be used to judge the quality of a candidate summary, and this judgment can be partially automated with Natural Language Inference (NLI) systems. Building on the Semantic Triplet Units (STUs) of Zhang and Bansal (2021), the authors study two new approximation strategies: Semantic Meaning Units (SMUs), derived from Abstract Meaning Representation (AMR), and Semantic GPT Units (SGUs), generated with large language models (LLMs). The experiments show that STUs and SMUs are competitive, but SGUs give the best approximation quality. Using a simple sentence-splitting baseline (SSUs), the authors further show that SCUs and their approximations offer the most value when ranking short summaries. When ranking systems or longer summaries, they may not help as much, and simple sentence splitting may suffice. Finally, human-written SCUs and automatically generated SGUs show similarly high quality in human evaluation.
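To make the unit-based scoring idea concrete, here is a minimal Python sketch, not the authors' implementation. It splits a reference summary into sentence-level units (a rough analogue of the SSU baseline) and scores a candidate summary by the fraction of units it supports. The function names, the token-overlap check, and the 0.8 threshold are all assumptions made for illustration. A real setup would replace `token_overlap_entails` with an NLI model, and the original Pyramid method additionally weights each SCU by how many reference summaries contain it, which this sketch omits.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (hypothetical stand-in for an SSU-style baseline)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def _tokens(text: str) -> set:
    """Lowercased word tokens of a string."""
    return set(re.findall(r"\w+", text.lower()))


def token_overlap_entails(premise: str, hypothesis: str, threshold: float = 0.8) -> bool:
    """Toy placeholder for an NLI system (an assumption, not a real entailment model).

    Treats the hypothesis as supported when at least `threshold` of its
    tokens also occur in the premise.
    """
    hyp = _tokens(hypothesis)
    if not hyp:
        return False
    return len(hyp & _tokens(premise)) / len(hyp) >= threshold


def unit_score(
    candidate: str,
    units: List[str],
    entails: Callable[[str, str], bool] = token_overlap_entails,
) -> float:
    """Unweighted fraction of content units supported by the candidate summary."""
    if not units:
        return 0.0
    return sum(1 for unit in units if entails(candidate, unit)) / len(units)


reference = "The cat sat on the mat. It rained all day."
candidate = "The cat sat on the mat while the sun was shining."

units = split_sentences(reference)
score = unit_score(candidate, units)
print(units)
print(score)
```

Because `entails` is passed in as a parameter, the same scoring loop can be reused with human-written SCUs or with any automatic approximation (STUs, SMUs, SGUs) simply by changing the `units` list and the entailment function.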