Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, Alexander Hauptmann
2024-06-17
Abstract: Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on the MER2023 challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on the DFEW dataset.
Artificial Intelligence, Multimedia
What problem does this paper attempt to address?
The paper addresses challenges in multimodal emotion recognition and reasoning. Existing unimodal methods fail to capture the complexity of real-world emotional expressions, and multimodal large language models (MLLMs) struggle to integrate audio information and to recognize subtle facial micro-expressions. To tackle these issues, the paper makes the following contributions:

1. **MERR Dataset**: It contains 28,618 coarse-grained and 4,487 fine-grained annotated samples, covering various emotion categories such as "suspicion" and "contempt." These diverse annotations enable the model to learn from different scenarios and generalize to real-world applications.
2. **Emotion-LLaMA Model**: This model integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and performing instruction tuning on a modified LLaMA model, it enhances both emotion recognition and emotion reasoning.
3. **Experimental Results**: Emotion-LLaMA outperforms other MLLMs on multiple datasets, including EMER, MER2023, and DFEW. It achieves the top Clue Overlap (7.83) and Label Overlap (6.25) scores on EMER, an F1 score of 0.9036 on the MER2023 challenge, and the highest Unweighted Average Recall (UAR, 45.59) and Weighted Average Recall (WAR, 59.37) in zero-shot evaluation on DFEW.

In summary, the paper targets audio integration and subtle facial expression recognition in multimodal emotion recognition and reasoning, and proposes a new dataset and model framework to improve the accuracy and depth of emotion recognition.
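For readers unfamiliar with the two recall metrics named above, the following is a minimal sketch of how UAR and WAR are commonly computed. It is not taken from the paper's evaluation code: the function name `recall_metrics` and the toy labels are assumptions made for illustration. UAR averages the per-class recalls so that every emotion class counts equally, while WAR weights each class by its number of samples, which makes it equal to overall accuracy.

```python
from collections import Counter


def recall_metrics(y_true, y_pred):
    """Return (UAR, WAR) as fractions in [0, 1].

    Hypothetical helper for illustration; not the paper's evaluation script.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Number of ground-truth samples per class.
    support = Counter(y_true)
    # Number of correctly predicted samples per class.
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # Recall of each class that appears in the ground truth.
    per_class_recall = {c: hits[c] / n for c, n in support.items()}

    # UAR: plain mean of per-class recalls (each class counts equally).
    uar = sum(per_class_recall.values()) / len(per_class_recall)
    # WAR: support-weighted mean of per-class recalls, i.e. overall accuracy.
    war = sum(hits.values()) / len(y_true)
    return uar, war


# Toy, made-up labels: an imbalanced set where the minority class is always missed.
y_true = ["happy", "happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "happy", "happy", "happy"]
uar, war = recall_metrics(y_true, y_pred)
print(f"UAR={uar:.2f} WAR={war:.2f}")  # UAR=0.50 WAR=0.80
```

The toy example shows why both numbers are reported: a model that ignores a rare emotion can still score a high WAR, whereas UAR drops because the missed class contributes a recall of zero. The figures quoted for DFEW (45.59 and 59.37) are these quantities expressed as percentages.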