Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method

Yiming Wang, Zhuosheng Zhang, Rui Wang

2023-05-23

Abstract: Automatic summarization generates concise summaries that contain key ideas of source documents. As the most mainstream datasets for the news sub-domain, CNN/DailyMail and BBC XSum have been widely used for performance benchmarking. However, the reference summaries of those datasets turn out to be noisy, mainly in terms of factual hallucination and information redundancy. To address this challenge, we first annotate new expert-writing Element-aware test sets following the "Lasswell Communication Model" proposed by Lasswell (1948), allowing reference summaries to focus on more fine-grained news elements objectively and comprehensively. Utilizing the new test sets, we observe the surprising zero-shot summary ability of LLMs, which addresses the issue of the inconsistent results between human preference and automatic evaluation metrics of LLMs' zero-shot summaries in prior work. Further, we propose a Summary Chain-of-Thought (SumCoT) technique to elicit LLMs to generate summaries step by step, which helps them integrate more fine-grained details of source documents into the final summaries that correlate with the human writing mindset. Experimental results show our method outperforms state-of-the-art fine-tuned PLMs and zero-shot LLMs by +4.33/+4.77 in ROUGE-L on the two datasets, respectively. Dataset and code are publicly available at https://github.com/Alsace08/SumCoT.
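The step-by-step generation described in the abstract can be pictured as a two-stage prompting loop: first ask the model for fine-grained news elements, then ask it to integrate them into a summary. The sketch below is only an illustration under stated assumptions. The guiding questions, the prompt wording, and the names `ELEMENT_QUESTIONS`, `build_extraction_prompt`, `build_summary_prompt`, and `sumcot_summarize` are hypothetical and are not taken from the abstract. The `llm` argument stands for any text-in, text-out callable. The authors' actual prompts are in their repository.

```python
from typing import Callable, List

# Hypothetical guiding questions; the abstract does not give the exact wording.
ELEMENT_QUESTIONS: List[str] = [
    "What are the important entities in this document?",
    "What are the important dates in this document?",
    "What events are happening in this document?",
    "What is the result of these events?",
]


def build_extraction_prompt(article: str, questions: List[str]) -> str:
    """Stage 1 prompt: ask the model to answer element questions about the article."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return f"Article: {article}\n\nAnswer the following questions:\n{numbered}\n"


def build_summary_prompt(article: str, element_answers: str) -> str:
    """Stage 2 prompt: ask the model to fold the extracted elements into a summary."""
    return (
        f"Article: {article}\n\n"
        f"Extracted elements:\n{element_answers}\n\n"
        "Integrate the information above and summarize the article:"
    )


def sumcot_summarize(article: str, llm: Callable[[str], str]) -> str:
    """Run the two stages in order and return the final summary text."""
    element_answers = llm(build_extraction_prompt(article, ELEMENT_QUESTIONS))
    return llm(build_summary_prompt(article, element_answers))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client, so the same two-stage flow can be exercised with a stub function before wiring in a real model.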

Computation and Language

What problem does this paper attempt to address?

### The Problem This Paper Attempts to Solve

This paper addresses two kinds of noise in the reference summaries of the most widely used news summarization datasets, CNN/DailyMail and BBC XSum:

1. **Factual Hallucination**: The reference summaries in these datasets contain factual hallucinations.
2. **Information Redundancy**: The reference summaries in these datasets contain redundant information.

To solve these problems, the authors propose the following methods:

- **Constructing new expert-written Element-aware test sets**: The annotation follows the Lasswell Communication Model (Lasswell, 1948), so that reference summaries cover fine-grained news elements objectively and comprehensively.
- **Proposing the Summary Chain-of-Thought (SumCoT) technique**: LLMs are guided to generate summaries step by step, which helps them integrate more fine-grained details of the source document into the final summary.

On the new test sets, the authors also observe strong zero-shot summarization ability in LLMs, which helps explain the mismatch between human preference and automatic metrics reported in prior work. Experimental results show that SumCoT outperforms state-of-the-art fine-tuned pre-trained language models (PLMs) and zero-shot LLMs by +4.33/+4.77 in ROUGE-L on the two datasets, respectively.
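Since the reported gains are stated in ROUGE-L, it may help to recall how that metric works: it scores a candidate summary by the longest common subsequence (LCS) it shares with a reference, then turns the LCS length into precision, recall, and an F-measure. The sketch below is a simplified illustration, assuming lowercased whitespace tokenization, no stemming, and an equally weighted F-measure. The function names `lcs_length` and `rouge_l` are chosen here for illustration. Published scores come from standard toolkits with their own preprocessing, so numbers from this sketch will not match them exactly.

```python
from typing import List, Tuple


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of a sentence-level ROUGE-L sketch."""
    cand = candidate.lower().split()  # simplification: whitespace tokens only
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, `rouge_l("the cat sat on the mat", "the cat is on the mat")` finds the common subsequence "the cat on the mat" (5 tokens out of 6 on each side), so precision, recall, and F1 all equal 5/6.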
