Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, Alexander Hauptmann

2024-06-17

Abstract: Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on the MER2023 challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on the DFEW dataset.

Artificial Intelligence, Multimedia

What problem does this paper attempt to address?

The paper addresses challenges in multimodal emotion recognition and reasoning. Existing unimodal methods fail to capture the complexity of real-world emotional expressions, and multimodal large language models (MLLMs) struggle to integrate audio information and to recognize subtle facial micro-expressions. To tackle these issues, the paper makes the following contributions:

1. **MERR Dataset**: It contains 28,618 coarse-grained annotated samples and 4,487 fine-grained annotated samples, covering various emotion categories such as "suspicion" and "contempt." These diverse annotations enable the model to learn from different scenarios and generalize to real-world applications.
2. **Emotion-LLaMA Model**: This model integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and performing instruction tuning on a modified LLaMA model, it significantly enhances emotion recognition and reasoning.
3. **Experimental Results**: Emotion-LLaMA outperforms other MLLMs on multiple datasets, including EMER, MER2023, and DFEW. It achieves the top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on MER2023, and the highest zero-shot Unweighted Average Recall (UAR, 45.59) and Weighted Average Recall (WAR, 59.37) on DFEW.

In summary, the paper primarily addresses audio integration and subtle facial expression recognition in multimodal emotion recognition and reasoning, and proposes a new dataset and model framework to improve the accuracy and depth of emotion recognition.
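The two zero-shot metrics reported for DFEW, UAR and WAR, are easy to confuse, so a small sketch may help. It assumes the standard definitions: UAR is the plain mean of per-class recalls, and WAR is the mean of per-class recalls weighted by class frequency, which equals overall accuracy. The function name `recall_metrics` and the toy labels are illustrative only and do not come from the paper. The function returns fractions, while the paper's figures (45.59, 59.37) are presumably the same quantities expressed as percentages.

```python
from collections import defaultdict


def recall_metrics(y_true, y_pred):
    """Return (UAR, WAR) as fractions in [0, 1].

    UAR: unweighted mean of per-class recall (every class counts equally).
    WAR: per-class recall weighted by class support, i.e. overall accuracy.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    per_class_recall = [correct[c] / total[c] for c in total]
    uar = sum(per_class_recall) / len(per_class_recall)
    war = sum(correct.values()) / len(y_true)
    return uar, war
```

For example, with four "happy" clips and one "sad" clip, a model that always predicts "happy" gets `recall_metrics(["happy"] * 4 + ["sad"], ["happy"] * 5)` equal to a UAR of 0.5 and a WAR of 0.8. The gap shows why both are reported on class-imbalanced emotion datasets: WAR rewards getting the frequent classes right, while UAR penalizes ignoring rare ones.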
