Topic-Centric Unsupervised Multi-Document Summarization of Scientific and News Articles

Amanuel Alambo, Cori Lohstroh, Erik Madaus, Swati Padhee, Brandy Foster, Tanvi Banerjee, Krishnaprasad Thirunarayan, Michael Raymer

DOI: https://doi.org/10.48550/arXiv.2011.08072

2020-11-03

Abstract: Recent advances in natural language processing have enabled automation of a wide range of tasks, including machine translation, named entity recognition, and sentiment analysis. Automated summarization of documents, or groups of documents, however, has remained elusive, with many efforts limited to extraction of keywords, key phrases, or key sentences. Accurate abstractive summarization has yet to be achieved due to the inherent difficulty of the problem, and limited availability of training data. In this paper, we propose a topic-centric unsupervised multi-document summarization framework to generate extractive and abstractive summaries for groups of scientific articles across 20 Fields of Study (FoS) in Microsoft Academic Graph (MAG) and news articles from DUC-2004 Task 2. The proposed algorithm generates an abstractive summary by developing salient language unit selection and text generation techniques. Our approach matches the state-of-the-art when evaluated on automated extractive evaluation metrics and performs better for abstractive summarization on five human evaluation metrics (entailment, coherence, conciseness, readability, and grammar). We achieve a kappa score of 0.68 between two co-author linguists who evaluated our results. We plan to publicly share MAG-20, a human-validated gold standard dataset of topic-clustered research articles and their summaries to promote research in abstractive summarization.
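The abstract reports a kappa score of 0.68 between two evaluators but does not state which kappa statistic was used. Assuming Cohen's kappa for two raters, the sketch below shows how such an agreement score can be computed. The function name `cohen_kappa` and the toy ratings are illustrative assumptions, not data or code from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: the paper's kappa is Cohen's kappa; the source does not say.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability both raters pick the same label
    # if each labelled independently according to their own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical binary judgements (1 = acceptable summary, 0 = not acceptable).
ratings_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
ratings_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
print(round(cohen_kappa(ratings_a, ratings_b), 2))  # 0.6 for these toy labels
```

With 80% raw agreement and 50% chance agreement, the toy example gives a kappa of 0.6, which shows why kappa is lower than raw agreement: it discounts the matches expected by chance.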

Computation and Language, Information Retrieval, Machine Learning

What problem does this paper attempt to address?

This paper addresses the immaturity of automatic multi-document summarization, and of abstractive summarization in particular. Extractive methods have made progress, but accurate abstractive summaries remain hard to generate because the task is inherently difficult and training data is scarce. The authors propose a topic-centric, unsupervised multi-document summarization framework that generates both extractive and abstractive summaries for groups of scientific articles across 20 Fields of Study in the Microsoft Academic Graph (MAG) and for news articles from DUC-2004 Task 2.

The main contributions of the paper are:

1. A technique for generating Abstract Language Units (ALU) using GPT-2.
2. A new algorithm based on Multi-Sentence Compression (MSC) for selecting informative paths.
3. MAG-20, a gold-standard dataset of topic-clustered articles from MAG and their multi-document abstractive summaries.

Together, these contributions aim to improve automatically generated multi-document summaries, particularly their coherence, conciseness, readability, and grammatical correctness.
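The MSC-based contribution builds on the general idea of Multi-Sentence Compression: related sentences are merged into a word graph, and a path through that graph is selected as the compressed sentence. The sketch below illustrates only the classic word-graph idea in a heavily simplified form, not the paper's path-selection algorithm. Merging nodes by lowercase token, weighting edges by inverse transition frequency, and the names `build_word_graph` and `compress` are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict

START, END = "<s>", "</s>"


def build_word_graph(sentences):
    """Merge sentences into a directed graph keyed by lowercase token.

    edge_counts[u][v] is how often token v directly follows token u.
    """
    edge_counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = [START] + sentence.lower().split() + [END]
        for left, right in zip(tokens, tokens[1:]):
            edge_counts[left][right] += 1
    return edge_counts


def compress(sentences):
    """Return the cheapest START-to-END path, where frequent transitions are cheap."""
    graph = build_word_graph(sentences)
    heap = [(0.0, [START])]  # (accumulated cost, path so far)
    settled = set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == END:
            return " ".join(path[1:-1])  # drop the START and END markers
        if node in settled:
            continue
        settled.add(node)
        for nxt, count in graph[node].items():
            if nxt not in settled:
                # Edge cost is the inverse of how often the transition was seen.
                heapq.heappush(heap, (cost + 1.0 / count, path + [nxt]))
    return ""


# Hypothetical cluster of near-duplicate sentences.
cluster = [
    "the model generates a summary",
    "the model generates a short summary",
    "our model generates a summary quickly",
]
print(compress(cluster))
```

Because this sketch simply takes the cheapest path, it tends to favor short, frequently attested word sequences. Published MSC methods typically add constraints such as a minimum path length or a requirement that the path contain a verb, and they rerank several candidate paths rather than returning one.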

SKT5SciSumm -- Revisiting Extractive-Generative Approach for Multi-Document Scientific Summarization

Synthesizing Scientific Summaries: An Extractive and Abstractive Approach

Topic-Aware Abstractive Text Summarization

Integrating Topic-Aware Heterogeneous Graph Neural Network With Transformer Model for Medical Scientific Document Abstractive Summarization

A Supervised Approach to Extractive Summarisation of Scientific Papers

Scientific document summarization via citation contextualization and scientific discourse

Topic-Guided Abstractive Text Summarization: a Joint Learning Approach

Personalized Summarization of Scientific Scholarly Texts

uMedSum: A Unified Framework for Advancing Medical Abstractive Summarization

GATSum: Graph-Based Topic-Aware Abstract Text Summarization

Data-driven Summarization of Scientific Articles

Abstract Meaning Representation for Multi-Document Summarization

Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles

A New Approach for Multi-Document Update Summarization

Summaformers @ LaySumm 20, LongSumm 20

What is in the news on a subject : automatic and sparse summarization of large document corpora

Multi-view multi-objective clustering-based framework for scientific document summarization using citation context

LLM Based Multi-Document Summarization Exploiting Main-Event Biased Monotone Submodular Content Extraction

I Want This, Not That: Personalized Summarization of Scientific Scholarly Texts

A Topic-aware Summarization Framework with Different Modal Side Information