Abstract: The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text remains an open problem. Current solutions can be categorized into two strategies. One is a cascaded approach where outputs (tokens or states) of a separately trained speech recognition system are used as inputs for LLMs, which limits their potential in modeling alignment between speech and text. The other is an end-to-end approach that relies on speech instruction data, which is very difficult to collect in large quantities. In this paper, we address these issues and propose the BLSP approach that Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM, ensuring that the LLM exhibits the same generation behavior regardless of the modality of input: a speech segment or its transcript. The training process can be divided into two steps. The first step prompts an LLM to generate texts with speech transcripts as prefixes, obtaining text continuations. In the second step, these continuations are used as supervised signals to train the modality adapter in an end-to-end manner. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
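
The first of the two training steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `AlignmentExample` structure, the `build_continuation_targets` helper, the file names, and the `toy_generate` stand-in for an LLM call are all hypothetical. It only shows how transcripts from ASR-style pairs could be used as prefixes and how the resulting continuations could be stored as targets for the second step.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class AlignmentExample:
    """One training example for the modality adapter (hypothetical structure)."""
    audio_path: str      # speech segment, later fed through the speech encoder
    transcript: str      # text prefix shown to the LLM in step 1
    continuation: str    # LLM output, used as the supervision target in step 2


def build_continuation_targets(
    asr_pairs: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
) -> List[AlignmentExample]:
    """Step 1: prompt the LLM with each transcript and keep its continuation."""
    examples = []
    for audio_path, transcript in asr_pairs:
        continuation = generate(transcript).strip()
        if not continuation:
            continue  # skip empty generations; they carry no training signal
        examples.append(AlignmentExample(audio_path, transcript, continuation))
    return examples


def toy_generate(prefix: str) -> str:
    """Placeholder for a real LLM call; returns a canned continuation."""
    return " and then the rain finally stopped." if prefix else ""


examples = build_continuation_targets(
    [("clip_001.wav", "We waited under the bridge"), ("clip_002.wav", "")],
    toy_generate,
)
```

In a real pipeline, `generate` would wrap the same LLM that later receives the adapted speech features, so that the targets reflect that model's own behavior on the transcript.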

What problem does this paper attempt to address?

The main goal of this paper is to enable large language models (LLMs) to understand and process speech input without sacrificing their original text processing capabilities. Specifically, the paper proposes the Bootstrapping Language-Speech Pre-training via Behavior Alignment (BLSP) method, which integrates speech and text through behavior alignment.

### Main Issues Addressed in the Paper

1. **Cross-modal Alignment**: How to combine the speech modality with the capabilities of large language models so that the model can understand both text and speech. This remains an open problem in current research.
2. **Limitations of Existing Methods**: Existing methods fall into two categories: cascaded methods and end-to-end methods. Cascaded methods limit the interaction between speech and LLMs; end-to-end methods rely on scarce speech instruction data, making large-scale application difficult.
3. **Utilizing Existing Resources**: The paper explores whether general alignment between speech and text can be achieved with existing cross-modal datasets (such as automatic speech recognition (ASR) data), without collecting new, specialized speech instruction data.

### Core Ideas of the BLSP Method

- **Behavior Alignment**: Aligning speech and text by ensuring that the LLM exhibits the same behavior whether the input is a speech segment or its transcript.
- **Lightweight Modality Adapter**: Introducing a lightweight modality adapter between the frozen speech encoder and the LLM, and optimizing it so that the LLM's generation behavior is consistent regardless of the input modality.
- **Continuation Writing Behavior**: Focusing on continuation writing, since this behavior resembles training on extensive, broad data: it produces diverse text and avoids overfitting to specific tasks.

In summary, the paper seeks an effective way to extend LLMs to speech input while preserving their original language processing capabilities and overcoming the limitations of existing methods.
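
The behavior-alignment idea can be written as a training objective. The notation here is an assumption introduced for illustration and does not come from the source: $s$ is a speech segment, $x$ its transcript, $y = (y_1, \dots, y_T)$ the continuation the LLM generated from $x$, $E$ the frozen speech encoder, and $A_\theta$ the modality adapter with trainable parameters $\theta$. Under these assumptions, training the adapter on the continuations amounts to minimizing the negative log-likelihood of $y$ when the LLM is conditioned on adapted speech features instead of text:

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\mathrm{LLM}}\big(y_t \mid A_\theta(E(s)),\, y_{<t}\big),
\qquad y \sim p_{\mathrm{LLM}}(\cdot \mid x)
$$

Because $y$ is produced by the LLM itself from the transcript, driving this loss down pushes the speech-conditioned behavior toward the text-conditioned behavior, which is the sense of "same generation behavior regardless of modality" above.
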

BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing
