EmoLLM: Multimodal Emotional Understanding Meets Large Language Models

Qu Yang, Mang Ye, Bo Du

2024-06-29

Abstract: Multi-modal large language models (MLLMs) have achieved remarkable performance on objective multimodal perception tasks, but their ability to interpret subjective, emotionally nuanced multimodal content remains largely unexplored. This gap impedes their ability to understand and react to the intricate emotions that humans express through multimodal media. To bridge it, we introduce EmoBench, the first comprehensive benchmark designed specifically to evaluate the emotional capabilities of MLLMs across five popular emotional tasks, using a diverse dataset of 287k images and videos paired with corresponding textual instructions. We also propose EmoLLM, a novel model for multimodal emotional understanding that incorporates two core techniques: 1) Multi-perspective Visual Projection, which captures diverse emotional cues in visual data from multiple perspectives; and 2) EmoPrompt, which guides MLLMs to reason about emotions in the correct direction. Experimental results demonstrate that EmoLLM significantly improves multimodal emotional understanding, with an average improvement of 12.1% across multiple foundation models on EmoBench. Our work contributes to the advancement of MLLMs by facilitating a deeper and more nuanced comprehension of intricate human emotions, paving the way for artificial emotional intelligence with wide-ranging applications in areas such as human-computer interaction, mental health support, and empathetic AI systems. Code, data, and model will be released.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the shortcomings of Multimodal Large Language Models (MLLMs) in emotional understanding. Although current MLLMs perform well on objective multimodal perception tasks, their ability to interpret subjective and emotionally nuanced content still needs improvement. Specifically, existing models perform poorly on complex emotions (such as anger and sadness), especially in emotional tasks that require integrating visual, auditory, and textual cues.

To address this issue, the paper makes two main contributions:

1. **EmoBench**: A comprehensive benchmark dataset of approximately 287,000 images and videos with corresponding textual instructions, used to evaluate MLLM performance on five popular emotional tasks. EmoBench groups its tasks into two categories: general emotional tasks (such as multimodal emotion recognition and intent understanding) and emotional application tasks (such as hate, sarcasm, and humor detection).
2. **EmoLLM**: A novel model that combines two core techniques:
   - Multi-perspective Visual Projection: Captures diverse emotional cues in visual data from multiple perspectives.
   - EmoPrompt: Guides the MLLM toward the correct reasoning direction to improve the accuracy of emotional understanding.

Experimental results show that EmoLLM significantly improves multimodal emotional understanding on EmoBench, with an average improvement of 12.1%, and that the gains hold across multiple foundation models. This work contributes to the advancement of MLLMs and opens new possibilities for applications in human-computer interaction, mental health support, and empathetic AI systems.
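The task grouping and the idea of an average improvement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the task labels are hypothetical names chosen here, the definition of "average improvement" as a mean per-task gain in percentage points is an assumption (the summary does not say how the 12.1% figure is computed), and the scores in the demo are placeholders, not results from the paper.

```python
from statistics import mean

# Task grouping as described in the summary above. The key names are
# hypothetical labels chosen for this sketch, not identifiers from the paper.
EMOBENCH_TASKS = {
    "general": ["emotion_recognition", "intent_understanding"],
    "application": ["hate_detection", "sarcasm_detection", "humor_detection"],
}


def all_tasks(taxonomy=EMOBENCH_TASKS):
    """Flatten the two task categories into a single list of task names."""
    return [task for tasks in taxonomy.values() for task in tasks]


def average_improvement(baseline, improved):
    """Mean per-task gain in percentage points (an assumed definition).

    Both arguments map task name -> score in percent and must cover
    exactly the same tasks.
    """
    if baseline.keys() != improved.keys():
        raise ValueError("score tables must cover the same tasks")
    return mean(improved[t] - baseline[t] for t in baseline)


if __name__ == "__main__":
    # Placeholder scores for illustration only; NOT results from the paper.
    tasks = all_tasks()
    baseline = {t: 50.0 for t in tasks}
    improved = {t: 50.0 + i for i, t in enumerate(tasks)}
    print(average_improvement(baseline, improved))
```

Averaging per task, rather than pooling all samples, keeps a large task from dominating the figure; whether the paper does the same is not stated in the summary.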
