BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing

Chen Wang, Minpeng Liao, Zhongqiang Huang, Jinliang Lu, Junhong Wu, Yuchen Liu, Chengqing Zong, Jiajun Zhang
2024-05-28
Abstract: The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text still remains an open problem. Current solutions can be categorized into two strategies. One is a cascaded approach where outputs (tokens or states) of a separately trained speech recognition system are used as inputs for LLMs, which limits their potential in modeling alignment between speech and text. The other is an end-to-end approach that relies on speech instruction data, which is very difficult to collect in large quantities. In this paper, we address these issues and propose the BLSP approach that Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM, ensuring that the LLM exhibits the same generation behavior regardless of the modality of input: a speech segment or its transcript. The training process can be divided into two steps. The first step prompts an LLM to generate texts with speech transcripts as prefixes, obtaining text continuations. In the second step, these continuations are used as supervised signals to train the modality adapter in an end-to-end manner. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The main goal of this paper is to enable large language models (LLMs) to understand and process speech input without sacrificing their original text processing capabilities. To that end, it proposes BLSP, an approach that bootstraps language-speech pre-training via behavior alignment of continuation writing.

### Main Issues Addressed in the Paper

1. **Cross-modal Alignment**: How to connect the speech modality to a large language model so that the model can understand both text and speech. The paper treats this as an open problem.
2. **Limitations of Existing Methods**: Existing methods fall into two categories: cascaded methods and end-to-end methods. Cascaded methods feed the outputs (tokens or states) of a separately trained speech recognition system into the LLM, which limits how well alignment between speech and text can be modeled. End-to-end methods rely on speech instruction data, which is very difficult to collect in large quantities.
3. **Utilizing Existing Resources**: The paper explores whether general alignment between speech and text can be achieved with existing cross-modal datasets, such as automatic speech recognition (ASR) data, without collecting new, specialized speech instruction data.

### Core Ideas of the BLSP Method

- **Behavior Alignment**: Speech and text are aligned by ensuring that the LLM exhibits the same generation behavior whether the input is a speech segment or its transcript.
- **Lightweight Modality Adapter**: A lightweight modality adapter is placed between a frozen speech encoder and the LLM, and it is trained so that the LLM generates consistently regardless of the input modality.
- **Continuation Writing**: The behavior chosen for alignment is continuation writing. The LLM is first prompted to continue speech transcripts, and the resulting continuations then serve as supervision for training the adapter end to end. Continuing a text resembles the way LLMs are trained on large amounts of text, so it produces diverse outputs and avoids overfitting to specific tasks.
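The two-step process described in the abstract can be illustrated with a minimal sketch of the data-construction side. This is not the authors' code: the names `AsrPair`, `AdapterExample`, `build_adapter_data`, and `toy_llm`, as well as the prompt wording in `CONTINUE_PROMPT`, are assumptions made for illustration. The sketch only shows how continuations generated from transcripts could be paired with the corresponding speech segments as training targets; the actual adapter training loop is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class AsrPair:
    audio_path: str   # hypothetical path to a speech segment
    transcript: str   # reference transcript of that segment


@dataclass
class AdapterExample:
    audio_path: str   # speech input (would pass through encoder + adapter)
    instruction: str  # the same continuation prompt used in step 1
    target: str       # LLM-written continuation, used as the supervised label


# Assumed prompt wording; the paper's exact prompt is not given in this summary.
CONTINUE_PROMPT = "Continue the following text in a coherent way:"


def build_adapter_data(
    pairs: Sequence[AsrPair],
    generate: Callable[[str], str],
) -> List[AdapterExample]:
    """Turn ASR pairs into (speech, continuation) training examples."""
    examples: List[AdapterExample] = []
    for pair in pairs:
        # Step 1: the text-only LLM continues the transcript.
        continuation = generate(f"{CONTINUE_PROMPT}\n{pair.transcript}").strip()
        if not continuation:
            continue  # skip empty generations; they carry no supervision
        # Step 2 (data side): the same continuation becomes the target
        # when the input is the speech segment instead of the transcript.
        examples.append(
            AdapterExample(pair.audio_path, CONTINUE_PROMPT, continuation)
        )
    return examples


def toy_llm(prompt: str) -> str:
    """Stand-in for a real LLM call, so the sketch runs without a model."""
    transcript = prompt.splitlines()[-1]
    return f"and then {len(transcript.split())} more things happened."


if __name__ == "__main__":
    demo = [AsrPair("a.wav", "the cat sat"), AsrPair("b.wav", "hello world")]
    for example in build_adapter_data(demo, toy_llm):
        print(example)
```

In a real setup, `generate` would wrap the LLM's text generation, and each `AdapterExample` would be used to train only the adapter so that the LLM produces `target` when given the speech input.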
In summary, the paper seeks an effective way to extend LLMs to speech input while preserving their original language processing capabilities and overcoming the limitations of existing cascaded and end-to-end methods.