Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

Xiong Wang, Yangze Li, Chaoyou Fu, Lei Xie, Ke Li, Xing Sun, Long Ma
2024-11-02
Abstract: The rapid development of large language models has brought many new smart applications, and the multimodal human-computer interaction in GPT-4o in particular has given users an impressive experience. Against this background, researchers have recently proposed many multimodal LLMs that can achieve speech-to-speech dialogue. In this paper, we propose a speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be connected to the LLM while keeping the LLM frozen throughout the training process. We designed 3-stage training strategies for both the modeling of speech input and output, enabling Freeze-Omni to obtain speech-to-speech dialogue ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples on 8 GPUs. Moreover, we can effectively ensure that the intelligence of Freeze-Omni in the speech modality is at the same level as that of its backbone LLM in the text modality, while the end-to-end latency of the spoken response remains low. In addition, we designed a method to achieve duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural dialogue style with users. Freeze-Omni mainly offers researchers a way to build multimodal LLMs with a frozen LLM, avoiding the catastrophic forgetting that limited data and training resources can cause.
Sound, Artificial Intelligence, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper attempts to address the problem of achieving low-latency speech-to-speech conversational capabilities while keeping the parameters of large language models (LLMs) unchanged. Specifically, the paper proposes a multimodal LLM architecture named Freeze-Omni, which connects speech input and output modalities to the LLM without fine-tuning the LLM itself, thereby achieving low-latency speech conversation while maintaining the original intelligence level of the LLM.

### Main Issues and Challenges:

1. **Avoiding Catastrophic Forgetting**: In some existing multimodal LLMs, aligning the speech modality with the LLM usually requires fine-tuning the LLM. However, with limited data and insufficient training resources, this fine-tuning can lead to "catastrophic forgetting," where the LLM forgets previously learned knowledge.
2. **Low Latency**: Traditional cascade methods (ASR + LLM + TTS) can achieve speech interaction but often result in high engineering complexity and long interaction delays.
3. **Performance Gap**: Existing multimodal LLMs generally perform worse on speech question-answering tasks than on text question-answering tasks, showing a significant performance gap.

### Solutions:

1. **Freezing the LLM**: By freezing the LLM's parameters, Freeze-Omni ensures that the intelligence level of the LLM is not affected by training.
2. **Three-Stage Training Strategy**:
   - **Stage 1**: Train the speech encoder using a large amount of ASR data to convert speech features into high-dimensional representations and map them to the LLM's embedding space through an adapter module.
   - **Stage 2**: Train the model with a small amount of question-answering data while keeping the LLM frozen, enabling it to handle speech input to text output.
   - **Stage 3**: Use multi-turn question-answering datasets to generate multi-turn answers and train the model further with speech modality data generated by a multi-speaker TTS system, enabling it to handle text input to speech output.
3. **Duplex Dialogue Design**: Achieve natural duplex speech conversation by implementing chunk-level state prediction through multi-task training, allowing the model to interrupt or reject user input.

### Experimental Results:

- **Speech Input Understanding Ability**: Freeze-Omni's ASR performance is strong across multiple evaluation sets, especially with the chunk=∞ decoding method in dynamic chunk training.
- **Speech Output Quality**: In single-speaker scenarios, Freeze-Omni's synthesized speech has a low CER under different AR decoding parameters, indicating high speech output quality.
- **Speech Question-Answering Accuracy**: Experimental results on three datasets show that Freeze-Omni's performance on speech question-answering tasks is close to that of the underlying LLM, validating its comparable intelligence level in both text and speech modalities.

### Conclusion:

Freeze-Omni achieves low-latency speech-to-speech conversational capabilities by freezing LLM parameters while maintaining the LLM's intelligence level. Future work will explore further speech conversation capabilities, such as emotion understanding and multi-speaker synthesis.
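
The central idea summarized above, training only the modules around a frozen LLM, can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: the "LLM" here is a fixed linear readout, the "speech feature" is a single number, and every name (`frozen_llm`, `adapter`, `train_adapter`, `LLM_WEIGHTS`) and every data value is a hypothetical stand-in. It only shows the mechanism: gradients pass through the frozen weights, but updates are applied to the adapter alone.

```python
import random

def frozen_llm(embedding, llm_weights):
    # Stand-in for the frozen LLM: a fixed linear readout that is never updated.
    return sum(w * e for w, e in zip(llm_weights, embedding))

def adapter(feature, adapter_params):
    # Trainable adapter: maps one scalar "speech feature" into the embedding space.
    return [a * feature + b for a, b in adapter_params]

def train_adapter(data, llm_weights, steps=2000, lr=0.01, seed=0):
    rng = random.Random(seed)
    # One (scale, bias) pair per embedding dimension; only these are trained.
    adapter_params = [[rng.uniform(-0.1, 0.1), 0.0] for _ in llm_weights]
    for _ in range(steps):
        x, target = rng.choice(data)
        err = frozen_llm(adapter(x, adapter_params), llm_weights) - target
        # Squared-error gradients pass through the frozen weights,
        # but only the adapter parameters are updated.
        for p, w in zip(adapter_params, llm_weights):
            p[0] -= lr * 2 * err * w * x
            p[1] -= lr * 2 * err * w
    return adapter_params

LLM_WEIGHTS = [0.5, -1.0, 2.0]      # hypothetical frozen parameters
weights_before = list(LLM_WEIGHTS)  # snapshot to confirm nothing changed

# Hypothetical paired data: feature x with target 3x + 1.
data = [(x, 3 * x + 1) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

adapter_params = train_adapter(data, LLM_WEIGHTS)
prediction = frozen_llm(adapter(2.0, adapter_params), LLM_WEIGHTS)
print(prediction, LLM_WEIGHTS == weights_before)
```

In a real system the same separation would typically be expressed by excluding the LLM's parameters from the optimizer. The point of the sketch is that the frozen component's behavior on its original inputs cannot degrade, because its weights never move.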