Abstract: GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or even no vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we find, surprisingly, that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible control of speech style (e.g., emotion and pitch). For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while supporting omni-modal spoken dialogue with vivid emotions.

What problem does this paper attempt to address?

This paper asks how to build an end-to-end multimodal large language model (LLM) that processes three modalities (vision, text, and speech), supports voice conversations with vivid emotions, and still maintains state-of-the-art performance on vision-language tasks. Specifically:

1. **Multimodal data processing**: Existing multimodal LLMs can usually process only two modalities, such as vision-language or speech-language. Giving an LLM the ability to process vision, text, and speech in an end-to-end manner remains an open problem.
2. **Speech generation and understanding**: Existing vision-language models rely on external tools (such as TTS tools) for speech generation, which limits their real-time interaction capabilities. Speech-language models, in turn, have limited or no vision understanding and perform especially poorly on high-resolution images.
3. **Emotion control**: Existing work has not yet explored flexible control of voice style (such as emotion and intonation) in LLMs, which matters for real-world human-machine conversation.

To solve these problems, the paper proposes EMOVA (EMotionally Omni-present Voice Assistant), a new end-to-end multimodal LLM with the following characteristics:

- **Visual encoder**: A continuous visual encoder captures fine-grained visual details.
- **Semantic-acoustic disentangled speech tokenizer**: Converts the input speech waveform into discrete speech units that integrate seamlessly with the LLM while supporting diverse voice-style control.
- **Lightweight style module**: Supports voice conversations with vivid emotions and intonations.

With these designs, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks and, for the first time, does so in a multimodal LLM that supports voice conversations with vivid emotions.

### Formula representation

The description of the model architecture and training process involves a few key formulas:

1. **Joint probability calculation**:
   \[ P(U_o^T, U_o^S \mid U^T, U^S, H^V) = \prod_{i=1}^{L} P(x_i \mid U^T_{o,<i}, U^S_{o,<i}, U^T, U^S, H^V) \]
   where \(x_i \in U_o^T \cup U_o^S\) and \(L = |U_o^T| + |U_o^S|\).
2. **Visual feature projection**:
   \[ H^V = p(E^V) \]
   where \(E^V = v(X^V)\) is the continuous visual feature output by the visual encoder, and \(p(\cdot)\) is the projection function.
3. **Speech unit quantization**:
   \[ U^S = q(E^S) \]
   where \(E^S = s(X^S)\) is the continuous speech feature output by the speech encoder, and \(q(\cdot)\) is the quantization function.

These formulas describe how visual and speech inputs are mapped into the representations the language model conditions on, and how the output text and speech units are modeled jointly.
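
The three formulas can be made concrete with a toy example. The sketch below is illustrative only and is not EMOVA's implementation: it assumes that the quantizer q(·) is a nearest-neighbour lookup in a codebook, that the projector p(·) is a single linear map, and that per-token probabilities are already given. The function names `quantize`, `project`, and `sequence_log_prob`, and all numeric values, are hypothetical.

```python
import math


def quantize(features, codebook):
    """Assumed form of q(.): map each continuous feature vector to the
    index of its nearest codebook entry (squared Euclidean distance)."""
    units = []
    for vec in features:
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        units.append(dists.index(min(dists)))
    return units


def project(features, weight):
    """Assumed form of p(.): a linear map applied to every feature vector.
    `weight` has one row per output dimension."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight]
            for vec in features]


def sequence_log_prob(step_probs):
    """Chain rule behind the joint probability: the log-probability of the
    whole output sequence is the sum of per-token conditional log-probs."""
    return sum(math.log(p) for p in step_probs)


# Toy speech features E^S (3 frames, 2 dims) and a 2-entry codebook.
E_S = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]]
codebook = [[0.0, 0.0], [1.0, 1.0]]
U_S = quantize(E_S, codebook)  # -> [0, 1, 0]

# Toy visual feature E^V (1 patch, 2 dims) projected to 3 dims.
E_V = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H_V = project(E_V, W)  # -> [[1.0, 2.0, 3.0]]

# Made-up conditional probabilities P(x_i | previous outputs, inputs)
# for an output sequence of length L = 3.
log_p = sequence_log_prob([0.5, 0.25, 0.5])
```

Here `U_S` plays the role of the discrete speech units fed to the language model, `H_V` the projected visual features, and `log_p` the logarithm of the product in the joint probability formula.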

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

BLSP-Emo: Towards Empathetic Large Speech-Language Models

EmoLLM: Multimodal Emotional Understanding Meets Large Language Models

Advancing Speech Language Models by Scaling Supervised Fine-Tuning with Over 60,000 Hours of Synthetic Speech Dialogue Data

EVLM: An Efficient Vision-Language Model for Visual Understanding

VITA: Towards Open-Source Interactive Omni Multimodal LLM

Emo-Tts:Parallel Transformer-based Text-to-Speech Model with Emotional Awareness

Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction

EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation

Emotional Audio-Visual Speech Synthesis Based on PAD

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues

MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis

VCEMO: Multi-Modal Emotion Recognition for Chinese Voiceprints

EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models

Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model

Tackling Vision Language Tasks Through Learning Inner Monologues

Emotion Inferring from Large-scale Internet Voice Data: A Multimodal Deep Learning Approach

EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning