Abstract: Summarization of speech is a difficult problem due to the spontaneity of the flow, disfluencies, and other issues that are not usually encountered in written texts. Our work presents the first application of the BERTSum model to conversational language. We generate abstractive summaries of narrated instructional videos across a wide variety of topics, from gardening and cooking to software configuration and sports. In order to enrich the vocabulary, we use transfer learning and pretrain the model on a few large cross-domain datasets in both written and spoken English. We also do preprocessing of transcripts to restore sentence segmentation and punctuation in the output of an ASR system. The results are evaluated with ROUGE and Content-F1 scoring for the How2 and WikiHow datasets. We engage human judges to score a set of summaries randomly selected from a dataset curated from HowTo100M and YouTube. Based on blind evaluation, we achieve a level of textual fluency and utility close to that of summaries written by human content creators. The model beats current SOTA when applied to WikiHow articles that vary widely in style and topic, while showing no performance regression on the canonical CNN/DailyMail dataset. Due to the high generalizability of the model across different styles and domains, it has great potential to improve accessibility and discoverability of internet content. We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.

What problem does this paper attempt to address?

This paper aims to improve automatic summarization of user-generated online content, both spoken (such as narrated instructional videos on YouTube) and written. Specifically, the researchers developed a BERT-based model that generates abstractive summaries of narrated instructional videos covering topics from gardening and cooking to software configuration and sports. The approach and findings are as follows:

1. **Pre-training on cross-domain datasets**: To enrich the vocabulary, the researchers used transfer learning and pre-trained the model on several large cross-domain datasets that include both written and spoken English.
2. **Pre-processing of transcripts**: The researchers pre-processed the transcripts to restore the sentence segmentation and punctuation that are missing from the output of the automatic speech recognition (ASR) system.
3. **Evaluation**: The researchers evaluated the results with ROUGE and Content-F1 scores, and engaged human judges to score a randomly selected set of summaries in a blind evaluation. The results show that the model comes close to summaries written by human content creators in textual fluency and utility. In addition, when applied to WikiHow articles that vary widely in style and topic, the model outperforms the current state-of-the-art methods, with no performance regression on the canonical CNN/DailyMail dataset.
4. **Future directions**: Although this research focuses mainly on text summarization, the authors also mention future work on applying these summarization models to human-chatbot conversations to broaden their scope of application.
In conclusion, this paper aims to improve the accessibility and discoverability of Internet content by developing a general-purpose tool that can generate high-quality summaries across different domains.
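
Since the evaluation relies on ROUGE, it may help to see what that metric measures. The following is a minimal sketch of ROUGE-N (n-gram overlap between a generated summary and a reference summary), not the paper's evaluation code. The function name `rouge_n`, the whitespace tokenization, and the lowercasing are assumptions made for illustration; published results are normally computed with an established ROUGE package that also handles stemming and ROUGE-L.

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Illustrative ROUGE-N: clipped n-gram overlap between two texts.

    Assumption: tokens are lowercased and split on whitespace, which is
    simpler than the tokenization used by standard ROUGE tooling.
    """

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge_n("how to plant a tree", "how to plant a small tree in spring")
    print(scores)
```

Recall rewards covering the reference content, precision penalizes extra material, and the F1 value balances the two.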

Abstractive Summarization of Spoken and Written Instructions with BERT

T-BERTSum: Topic-Aware Text Summarization Based on BERT

Assessment of Transformer-Based Encoder-Decoder Model for Human-Like Summarization

Enhancing Semantic Understanding with Self-supervised Methods for Abstractive Dialogue Summarization

An Effective Contextual Language Modeling Framework for Speech Summarization with Augmented Features

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

Unified extractive-abstractive summarization: a hybrid approach utilizing BERT and transformer models for enhanced document summarization

Multimodal Abstractive Summarization using bidirectional encoder representations from transformers with attention mechanism

Leveraging BERT for Extractive Text Summarization on Lectures

Fine-tune BERT for Extractive Summarization

RetrievalSum: A Retrieval Enhanced Framework for Abstractive Summarization

Automatic Summarization of Long Documents

Curriculum-Guided Abstractive Summarization

Efficient Two-stage Approach for Long Document Summarization

Topic-Aware Abstractive Text Summarization

See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization

Abstractive method-based Text Summarization using Bidirectional Long Short-Term Memory and Pointer Generator Mode

Balancing Lexical and Semantic Quality in Abstractive Summarization

Multimodal Abstractive Summarization for How2 Videos

Hierarchical Summarization for Longform Spoken Dialog

Text Summarization Using Large Language Models: A Comparative Study of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI Chat-GPT Models