Topic-Centric Unsupervised Multi-Document Summarization of Scientific and News Articles

Amanuel Alambo, Cori Lohstroh, Erik Madaus, Swati Padhee, Brandy Foster, Tanvi Banerjee, Krishnaprasad Thirunarayan, Michael Raymer
DOI: https://doi.org/10.48550/arXiv.2011.08072
2020-11-03
Abstract: Recent advances in natural language processing have enabled automation of a wide range of tasks, including machine translation, named entity recognition, and sentiment analysis. Automated summarization of documents, or groups of documents, however, has remained elusive, with many efforts limited to extraction of keywords, key phrases, or key sentences. Accurate abstractive summarization has yet to be achieved due to the inherent difficulty of the problem, and limited availability of training data. In this paper, we propose a topic-centric unsupervised multi-document summarization framework to generate extractive and abstractive summaries for groups of scientific articles across 20 Fields of Study (FoS) in Microsoft Academic Graph (MAG) and news articles from DUC-2004 Task 2. The proposed algorithm generates an abstractive summary by developing salient language unit selection and text generation techniques. Our approach matches the state-of-the-art when evaluated on automated extractive evaluation metrics and performs better for abstractive summarization on five human evaluation metrics (entailment, coherence, conciseness, readability, and grammar). We achieve a kappa score of 0.68 between two co-author linguists who evaluated our results. We plan to publicly share MAG-20, a human-validated gold standard dataset of topic-clustered research articles and their summaries to promote research in abstractive summarization.
Computation and Language, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
This paper addresses the immaturity of automatic multi-document summarization, and of abstractive summarization in particular. Extractive summarization has made progress, but accurate abstractive summarization remains out of reach because the problem is inherently difficult and training data is scarce. The paper proposes a topic-centric, unsupervised multi-document summarization framework that generates both extractive and abstractive summaries. The framework targets scientific articles from 20 Fields of Study in the Microsoft Academic Graph (MAG) and news articles from DUC-2004 Task 2. The main contributions are: 1. a technique for generating Abstract Language Units (ALU) using GPT-2; 2. a new algorithm based on Multi-Sentence Compression (MSC) for selecting informative paths; 3. MAG-20, a gold-standard dataset of topic-clustered articles and their multi-document abstractive summaries. With these methods, the paper aims to advance automatic generation of high-quality multi-document summaries, particularly with respect to coherence, conciseness, readability, and grammatical correctness.
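To make the idea of path selection in Multi-Sentence Compression more concrete, here is a minimal Python sketch of the generic word-graph formulation: related sentences are merged into a graph whose edges link adjacent words, and a compression is read off the cheapest path from a start marker to an end marker. This is not the paper's algorithm, whose path-scoring details are not given in this summary. The function names `build_word_graph` and `compress`, the inverse-frequency edge cost, the `min_words` parameter, and the example sentences are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def build_word_graph(sentences):
    """Merge sentences into one graph; identical lowercased words share a node."""
    edge_counts = defaultdict(int)
    for sentence in sentences:
        tokens = [START] + sentence.lower().split() + [END]
        for left, right in zip(tokens, tokens[1:]):
            edge_counts[(left, right)] += 1
    graph = defaultdict(dict)
    for (left, right), count in edge_counts.items():
        # Assumed cost: transitions seen in many sentences are cheap to follow.
        graph[left][right] = 1.0 / count
    return graph


def compress(sentences, min_words=3):
    """Return the cheapest START-to-END path that has at least min_words words."""
    graph = build_word_graph(sentences)
    heap = [(0.0, [START])]  # entries are (path cost, path so far)
    seen = set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        n_words = len(path) - 1  # START is not a word
        if node == END:
            return " ".join(path[1:-1])
        # Track the word count (capped) so the length constraint is respected.
        state = (node, min(n_words, min_words))
        if state in seen:
            continue
        seen.add(state)
        for nxt, weight in graph[node].items():
            if nxt == END and n_words < min_words:
                continue  # path would be too short to count as a summary
            heapq.heappush(heap, (cost + weight, path + [nxt]))
    return ""  # no path satisfies the length constraint


example = [
    "the model generates short summaries",
    "the model generates fluent summaries",
    "the model generates short summaries quickly",
]
print(compress(example))
```

On the three example sentences, the sketch prefers the transitions shared by most inputs and returns "the model generates short summaries". A practical system would need more careful word-to-node mapping (for example, handling repeated words and stopwords) and a richer path score than raw adjacency counts.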