Abstractive Summarization of Spoken and Written Instructions with BERT

Alexandra Savelieva, Bryan Au-Yeung, Vasanth Ramani
DOI: https://doi.org/10.48550/arXiv.2008.09676
2020-08-27
Abstract: Summarization of speech is a difficult problem due to the spontaneity of the flow, disfluencies, and other issues that are not usually encountered in written texts. Our work presents the first application of the BERTSum model to conversational language. We generate abstractive summaries of narrated instructional videos across a wide variety of topics, from gardening and cooking to software configuration and sports. In order to enrich the vocabulary, we use transfer learning and pretrain the model on a few large cross-domain datasets in both written and spoken English. We also do preprocessing of transcripts to restore sentence segmentation and punctuation in the output of an ASR system. The results are evaluated with ROUGE and Content-F1 scoring for the How2 and WikiHow datasets. We engage human judges to score a set of summaries randomly selected from a dataset curated from HowTo100M and YouTube. Based on blind evaluation, we achieve a level of textual fluency and utility close to that of summaries written by human content creators. The model beats current SOTA when applied to WikiHow articles that vary widely in style and topic, while showing no performance regression on the canonical CNN/DailyMail dataset. Due to the high generalizability of the model across different styles and domains, it has great potential to improve accessibility and discoverability of internet content. We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to improve automatic summarization of user-generated online content, in particular spoken instructional videos (such as those on YouTube) and written instructional texts. Specifically, the researchers developed a BERT-based model that generates abstractive summaries of narrated instructional videos on topics ranging from gardening and cooking to software configuration and sports. Their approach has the following components:

1. **Pre-training on cross-domain datasets**: To enrich the vocabulary, the researchers used transfer learning and pre-trained the model on several large cross-domain datasets covering both written and spoken English.
2. **Pre-processing of transcripts**: The researchers pre-processed the transcripts to restore the sentence segmentation and punctuation that are missing from the output of the automatic speech recognition (ASR) system.
3. **Evaluation**: The researchers evaluated the results with ROUGE and Content-F1 scores on the How2 and WikiHow datasets, and asked human judges to score a randomly selected set of summaries in a blind evaluation. The results show that the model comes close to summaries written by human content creators in textual fluency and utility. In addition, the model outperforms the previous state-of-the-art methods on WikiHow articles, which vary widely in style and topic, and shows no performance regression on the canonical CNN/DailyMail dataset.
4. **Future work**: Although this research focuses on text summarization, the authors also mention a direction for future work: applying these summarization models to human-chatbot conversations to broaden their scope of application.
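To make the ROUGE scoring mentioned above more concrete, the following is a minimal sketch of ROUGE-N computed as clipped n-gram overlap between a generated summary and a reference. It is not the authors' evaluation code and not the official ROUGE toolkit: the function names `ngram_counts` and `rouge_n`, the lowercase whitespace tokenization, and the absence of stemming are all simplifying assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: precision, recall, and F1 over clipped n-gram overlap.

    Assumption: text is lowercased and split on whitespace; real ROUGE
    implementations usually apply more careful tokenization and optional stemming.
    """
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)

    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    generated = "how to plant tomato seeds indoors"
    human = "learn how to plant tomato seeds"
    print(rouge_n(generated, human, n=1))
    print(rouge_n(generated, human, n=2))
```

In this toy example the two summaries share five of six unigrams and four of five bigrams, so ROUGE-1 is higher than ROUGE-2. Scores reported in papers are typically produced with an established ROUGE package rather than a hand-rolled function like this one.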
In conclusion, this paper aims to improve the accessibility and discoverability of internet content by developing a general-purpose tool that can generate high-quality summaries across different domains.