Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization

Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Takatomo Kano, Atsunori Ogawa, Marc Delcroix

2023-06-07

Abstract: End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model. This approach is promising because, in contrast to the conventional cascade approach, it can utilize full acoustical information and mitigate the propagation of transcription errors. However, due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and to output unnatural sentences. To overcome this drawback, we propose for the first time to integrate a pre-trained language model (LM), which is highly capable of generating natural sentences, into the E2E SSum decoder via transfer learning. In addition, to reduce the gap between the independently pre-trained encoder and decoder, we also propose to transfer the baseline E2E SSum encoder instead of the commonly used automatic speech recognition encoder. Experimental results show that the proposed model outperforms baseline and data-augmented models.

Computation and Language, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses two problems in End-to-End Speech Summarization (E2E SSum): scarce training data and unnatural generated summaries. E2E SSum generates concise, readable summary sentences directly from speech with a single model. Compared to traditional cascade methods, which first transcribe the speech and then summarize the transcript, it can use the full acoustic information in the speech and mitigate the propagation of Automatic Speech Recognition (ASR) errors. However, E2E SSum models need a large number of speech-summary pairs for training, and such pairs are costly to collect, so the models are often trained on too little data and produce low-quality summaries.

To solve these problems, the authors integrate a pre-trained language model (LM) into the E2E SSum decoder through transfer learning, so that the model inherits the LM's ability to generate natural, fluent text. Because the encoder and decoder are then pre-trained independently, the authors also propose initializing the encoder from the baseline E2E SSum model instead of from the commonly used ASR encoder, which reduces the gap between the two components. Experimental results show that the proposed model outperforms the baseline model and text-to-speech (TTS) data augmentation methods on multiple metrics, with a significant improvement in METEOR scores. This supports both the effectiveness of integrating a pre-trained LM into E2E SSum and the importance of initializing with the E2E SSum encoder.
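The initialization scheme described above can be illustrated with a toy sketch. It is not the authors' implementation: parameters are plain Python lists standing in for tensors, and every name in it (`transfer_parameters`, the parameter keys, the three toy models) is hypothetical. It only shows the general idea of copying decoder weights from a pre-trained LM and encoder weights from a baseline E2E SSum model, while leaving parameters with no pre-trained counterpart at their fresh initialization.

```python
from typing import Dict, List, Tuple

# Parameter name -> flat list of weights (a stand-in for a real tensor).
Params = Dict[str, List[float]]


def transfer_parameters(
    target: Params, source: Params, prefix: str
) -> Tuple[Params, List[str]]:
    """Copy entries of `source` under `prefix` into a copy of `target`.

    Only entries whose name exists in `target` with the same size are copied.
    Returns the merged parameters and the sorted names that were transferred.
    """
    merged = dict(target)
    copied = []
    for name, values in source.items():
        if not name.startswith(prefix):
            continue  # e.g. skip the baseline model's decoder
        if name in merged and len(merged[name]) == len(values):
            merged[name] = list(values)
            copied.append(name)
    return merged, sorted(copied)


# Hypothetical, freshly initialized speech summarization model.
ssum_model: Params = {
    "encoder.layer0.weight": [0.0, 0.0],
    "decoder.self_attn.weight": [0.0, 0.0, 0.0],
    "decoder.cross_attn.weight": [0.0, 0.0],
}
# Hypothetical pre-trained text LM: supplies decoder weights.
pretrained_lm: Params = {
    "decoder.self_attn.weight": [0.1, 0.2, 0.3],
    "decoder.embed.weight": [0.5],  # no counterpart in the target, skipped
}
# Hypothetical baseline E2E SSum model: supplies encoder weights only.
baseline_ssum: Params = {
    "encoder.layer0.weight": [1.0, 2.0],
    "decoder.self_attn.weight": [9.0, 9.0, 9.0],
}

model, from_lm = transfer_parameters(ssum_model, pretrained_lm, "decoder.")
model, from_encoder = transfer_parameters(model, baseline_ssum, "encoder.")
```

In this sketch the toy cross-attention weights keep their initial values because neither source model provides a match for them; how the actual system handles such parameters is not stated in the summary above. After initialization, the whole model would be fine-tuned on speech-summary pairs.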
