EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Jun Yao, Lanqing Hong, Lu Hou, Hang Xu
2024-10-29
Abstract: GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or no vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we notice, surprisingly, that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style controls (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while also supporting omni-modal spoken dialogue with vivid emotions.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses how to build an end-to-end multimodal large language model (LLM) that processes three modalities (vision, text, and speech), supports voice conversations with vivid emotions, and maintains state-of-the-art performance on vision-language tasks. Specifically:

1. **Multimodal data processing**: Existing multimodal large language models can usually process only two modalities, such as vision-language or speech-language. How to give a large language model the ability to process vision, text, and speech together in an end-to-end manner remains an open problem.
2. **Speech generation and understanding**: Existing vision-language models rely on external tools (such as TTS tools) for speech generation, which limits their real-time interaction capabilities. Speech-language models, in turn, have limited or no vision-understanding abilities.
3. **Emotion control**: Existing work has not yet explored how to achieve flexible control of voice styles (such as emotion and intonation) in large language models, which is important for real-life human-machine conversations.

To solve these problems, the paper proposes EMOVA (EMotionally Omni-present Voice Assistant), a new end-to-end multimodal large language model with the following characteristics:

- **Visual encoder**: A continuous visual encoder captures fine-grained visual details.
- **Semantic-acoustic decoupled speech tokenizer**: The input speech waveform is converted into discrete speech units that integrate seamlessly with the large language model while supporting diverse voice-style control.
- **Lightweight style module**: A lightweight style module supports voice conversations with vivid emotions and intonations.

Through these designs, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks and, for the first time, supports omni-modal spoken dialogue with vivid emotions.

### Formula representation

The description of the model architecture and training process involves the following key formulas:

1. **Joint probability calculation**:
   \[ P(U_o^T, U_o^S \mid U^T, U^S, H^V) = \prod_{i=1}^{L} P(x_i \mid U_{o,<i}^T, U_{o,<i}^S, U^T, U^S, H^V) \]
   where \(x_i \in U_o^T \cup U_o^S\) and \(L = |U_o^T| + |U_o^S|\). Here \(U_o^T\) and \(U_o^S\) are the output text and speech-unit sequences, and \(U_{o,<i}^T\), \(U_{o,<i}^S\) denote the output elements that precede position \(i\).
2. **Visual feature projection**:
   \[ H^V = p(E^V) \]
   where \(E^V = v(X^V)\) is the continuous visual feature output by the visual encoder, and \(p(\cdot)\) is the projection function.
3. **Speech unit quantization**:
   \[ U^S = q(E^S) \]
   where \(E^S = s(X^S)\) is the continuous speech feature output by the speech encoder, and \(q(\cdot)\) is the quantization function.

Together, these formulas describe how visual and speech inputs are mapped into the language model's input space and how the text and speech outputs are generated autoregressively, conditioned on all three modalities.
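
To make the three formulas concrete, the following is a minimal Python sketch under stated assumptions, not the paper's implementation. The function names `project`, `quantize`, and `sequence_log_prob` are hypothetical. The linear projection and the nearest-neighbour codebook lookup are simple stand-ins for \(p(\cdot)\) and \(q(\cdot)\), whose actual form is not specified in this summary. The per-step probabilities are supplied by hand rather than produced by a model.

```python
import math


def project(features, weight):
    """H^V = p(E^V): toy linear projection (assumed form of p).

    features: list of rows, each of length d_in
    weight:   d_in x d_out matrix given as a list of rows
    """
    columns = list(zip(*weight))
    return [[sum(f * w for f, w in zip(row, col)) for col in columns]
            for row in features]


def quantize(features, codebook):
    """U^S = q(E^S): map each continuous frame to the index of its
    nearest codebook vector (assumed form of q)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
            for frame in features]


def sequence_log_prob(step_probs):
    """Log of the product over i of P(x_i | earlier outputs, inputs).

    step_probs[i] is the probability assigned to the i-th output element
    given everything before it; summing logs is the chain-rule product.
    """
    return sum(math.log(p) for p in step_probs)


# Toy inputs: one visual feature row, two speech frames, two output steps.
H_V = project([[1.0, 2.0]], [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
U_S = quantize([[0.1, 0.0], [0.9, 1.1]], [[0.0, 0.0], [1.0, 1.0]])
log_p = sequence_log_prob([0.5, 0.5])
```

In this sketch the discrete indices in `U_S` play the role of speech units that can share a vocabulary with text tokens, while `H_V` stays continuous, mirroring the contrast between the quantized speech path and the continuous visual path described above.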