Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization

Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Takatomo Kano, Atsunori Ogawa, Marc Delcroix
2023-06-07
Abstract: End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model. This approach is promising because, in contrast to the conventional cascade approach, it can utilize full acoustical information and mitigate the propagation of transcription errors. However, due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and to output unnatural sentences. To overcome this drawback, we propose for the first time to integrate a pre-trained language model (LM), which is highly capable of generating natural sentences, into the E2E SSum decoder via transfer learning. In addition, to reduce the gap between the independently pre-trained encoder and decoder, we also propose to transfer the baseline E2E SSum encoder instead of the commonly used automatic speech recognition encoder. Experimental results show that the proposed model outperforms baseline and data-augmented models.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses two problems in end-to-end speech summarization (E2E SSum): scarce training data and unnatural generated summaries. E2E SSum generates concise, readable summary sentences directly from speech with a single model. Compared with the conventional cascade approach, it can use the full acoustic information in the speech and mitigate the propagation of automatic speech recognition (ASR) errors. However, speech-summary pairs are costly to collect, so E2E SSum models tend to be trained on too little data and to produce unnatural sentences. To address this, the authors integrate a pre-trained language model (LM) into the E2E SSum decoder through transfer learning, drawing on the LM's ability to generate natural, fluent text. In addition, to reduce the gap between the independently pre-trained encoder and decoder, they transfer the baseline E2E SSum encoder instead of the commonly used ASR encoder. Experimental results show that the proposed model outperforms the baseline model and text-to-speech (TTS) data augmentation methods on multiple metrics, with a significant improvement in METEOR scores. These results support both the integration of a pre-trained LM into E2E SSum and the choice to initialize from the E2E SSum encoder.
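The initialization scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function `transfer_init`, the parameter names, and the use of plain Python lists in place of tensors are all assumptions made for illustration. The sketch copies encoder weights from a baseline E2E SSum encoder and decoder weights from a pre-trained LM, and leaves any parameter that has no match in either source at its original initialization.

```python
from typing import Dict, List, Tuple

# Toy stand-in for a checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def transfer_init(
    model: Params,
    encoder_source: Params,
    decoder_source: Params,
) -> Tuple[Params, List[str]]:
    """Return a copy of `model` with encoder.* and decoder.* weights
    taken from the two sources (hypothetical helper, for illustration).

    Parameters that are absent from the matching source, or whose size
    differs, keep their original values and are listed in the second
    return value.
    """
    sources = {"encoder.": encoder_source, "decoder.": decoder_source}
    initialized: Params = {}
    untouched: List[str] = []
    for name, value in model.items():
        new_value = None
        for prefix, source in sources.items():
            if name.startswith(prefix):
                candidate = source.get(name[len(prefix):])
                if candidate is not None and len(candidate) == len(value):
                    new_value = list(candidate)
        if new_value is None:
            # No usable pre-trained weight: keep the model's own initialization.
            untouched.append(name)
            new_value = list(value)
        initialized[name] = new_value
    return initialized, untouched


# Hypothetical parameter names and values.
model = {
    "encoder.layer0.weight": [0.0, 0.0],
    "decoder.embed.weight": [0.0, 0.0, 0.0],
    "decoder.cross_attn.weight": [0.0],
}
ssum_encoder = {"layer0.weight": [0.5, -0.5]}   # baseline E2E SSum encoder
pretrained_lm = {"embed.weight": [0.1, 0.2, 0.3]}  # pre-trained LM
params, untouched = transfer_init(model, ssum_encoder, pretrained_lm)
```

In this toy setup, `untouched` lists the parameters that still need to be learned from speech-summary pairs during fine-tuning. Which decoder parameters a real pre-trained LM can actually supply depends on its architecture and is not specified here.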