Abstract: The rapid development of large language models has enabled many new intelligent applications; in particular, the multimodal human-computer interaction in GPT-4o has given users an impressive experience. Against this background, researchers have recently proposed many multimodal LLMs capable of speech-to-speech dialogue. In this paper, we propose a speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be connected to the LLM while keeping the LLM frozen throughout the training process. We design three-stage training strategies for modeling both speech input and speech output, enabling Freeze-Omni to obtain speech-to-speech dialogue ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples on 8 GPUs. Moreover, we can effectively ensure that the intelligence of Freeze-Omni in the speech modality is at the same level as that of its backbone LLM in the text modality, while the end-to-end latency of the spoken response remains low. In addition, we design a method to achieve duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural dialogue style with users. Freeze-Omni mainly offers researchers a way to build multimodal LLMs on top of a frozen LLM, avoiding the catastrophic forgetting that arises when an LLM is fine-tuned with limited data and training resources.
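The idea of connecting new modules to a frozen backbone can be illustrated with a toy example. The sketch below is not Freeze-Omni's code: the scalar "backbone", the scalar "adapter", the data, and all names (`FROZEN_BACKBONE_WEIGHT`, `backbone`, `train_adapter`) are hypothetical stand-ins. It only shows the general pattern in which gradients pass through a fixed component while a small trainable component in front of it is the only thing updated.

```python
# Toy illustration: train an adapter in front of a frozen "backbone".
# All names and numbers are hypothetical; this is not Freeze-Omni's code.

FROZEN_BACKBONE_WEIGHT = 2.0  # stands in for the frozen LLM; never updated


def backbone(hidden: float) -> float:
    """Frozen stand-in for the LLM: a fixed linear map."""
    return FROZEN_BACKBONE_WEIGHT * hidden


def train_adapter(data, lr=0.01, epochs=200):
    """Fit a scalar adapter `a` so that backbone(a * x) matches the targets.

    The gradient is computed through the backbone, but only `a` changes.
    """
    a = 0.0  # the only trainable parameter
    for _ in range(epochs):
        for x, target in data:
            pred = backbone(a * x)
            error = pred - target
            # d(0.5 * error**2)/da = error * d(pred)/da = error * W * x
            grad_a = error * FROZEN_BACKBONE_WEIGHT * x
            a -= lr * grad_a
    return a


# Targets follow y = 3 * x, so the ideal adapter is a = 3 / 2 = 1.5.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
adapter = train_adapter(data)
```

Because the backbone weight is never written to, whatever behaviour it had before training is preserved by construction, which is the property the abstract relies on to avoid catastrophic forgetting.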

What problem does this paper attempt to address?

This paper attempts to address the problem of achieving low-latency speech-to-speech conversation while keeping the parameters of a large language model (LLM) unchanged. Specifically, it proposes a multimodal LLM architecture named Freeze-Omni, which connects speech input and output modalities to the LLM without fine-tuning the LLM, thereby achieving low-latency spoken conversation while preserving the LLM's original intelligence level.

### Main Issues and Challenges:

1. **Avoiding Catastrophic Forgetting**: In some existing multimodal LLMs, aligning the speech modality with the LLM usually requires fine-tuning the LLM. With limited data and insufficient training resources, this fine-tuning can lead to "catastrophic forgetting," where the LLM loses previously learned knowledge.
2. **Low Latency**: Traditional cascade methods (ASR + LLM + TTS) can achieve speech interaction but often entail high engineering complexity and long interaction delays.
3. **Performance Gap**: Existing multimodal LLMs generally perform significantly worse on speech question-answering tasks than on text question-answering tasks.

### Solutions:

1. **Freezing the LLM**: By keeping the LLM's parameters frozen, Freeze-Omni leaves the LLM's intelligence level unaffected.
2. **Three-Stage Training Strategy**:
   - **Stage 1**: Train the speech encoder on a large amount of ASR data to convert speech features into high-dimensional representations, which an adapter module maps into the LLM's embedding space.
   - **Stage 2**: Train the model on a small amount of question-answering data while keeping the LLM frozen, enabling it to go from speech input to text output.
   - **Stage 3**: Use multi-turn question-answering datasets to generate multi-turn answers, and further train the model on speech data generated by a multi-speaker TTS system, enabling it to go from text input to speech output.
3. **Duplex Dialogue Design**: Achieve natural duplex spoken conversation through chunk-wise state prediction learned via multi-task training, allowing the model to interrupt or reject user input.

### Experimental Results:

- **Speech Input Understanding Ability**: Freeze-Omni's ASR performance is strong across multiple evaluation sets, especially with the chunk=∞ decoding method under dynamic chunk training.
- **Speech Output Quality**: In single-speaker scenarios, Freeze-Omni's synthesized speech has a low CER under different AR decoding parameters, indicating high speech output quality.
- **Speech Question-Answering Accuracy**: Results on three datasets show that Freeze-Omni's accuracy on speech question-answering tasks is close to that of the underlying LLM, supporting the claim of comparable intelligence in the text and speech modalities.

### Conclusion:

Freeze-Omni achieves low-latency speech-to-speech conversation while keeping the LLM's parameters frozen and its intelligence level intact. Future work will explore further spoken-conversation capabilities, such as emotion understanding and multi-speaker synthesis.
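The duplex dialogue design summarized above can be sketched as a small control loop that consumes one predicted state per audio chunk. This is a minimal sketch under stated assumptions: the summary only says that a per-chunk state is predicted and that the model can interrupt or reject user input, so the state labels (`CONTINUE`, `INTERRUPT`, `REJECT`), the action strings, and the function `run_duplex` are all hypothetical and are not taken from the paper's implementation.

```python
from enum import Enum
from typing import Iterable, List


class ChunkState(Enum):
    # Hypothetical labels; the source only says a per-chunk state is predicted.
    CONTINUE = 0   # user is still speaking, keep listening
    INTERRUPT = 1  # user turn ended and warrants a new reply
    REJECT = 2     # user turn ended but should be ignored


def run_duplex(chunk_states: Iterable[ChunkState]) -> List[str]:
    """Turn a stream of per-chunk state predictions into dialogue actions."""
    actions: List[str] = []
    speaking = False  # whether the system is currently playing a reply
    for state in chunk_states:
        if state is ChunkState.CONTINUE:
            actions.append("listen")
        elif state is ChunkState.INTERRUPT:
            if speaking:
                # Cut off the reply in progress before answering the new turn.
                actions.append("stop_playback")
            actions.append("start_reply")
            speaking = True
        else:  # ChunkState.REJECT: leave any ongoing reply untouched
            actions.append("ignore")
    return actions


predicted = [
    ChunkState.CONTINUE,
    ChunkState.CONTINUE,
    ChunkState.INTERRUPT,
    ChunkState.CONTINUE,
    ChunkState.REJECT,
    ChunkState.INTERRUPT,
]
actions = run_duplex(predicted)
```

In this sketch the state classifier is assumed to exist already; in a real system its predictions would come from a head trained jointly with the other objectives, which is what the multi-task training in the summary refers to.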

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
