Abstract: In this paper, we present our solution for the Second Multimodal Emotion Recognition Challenge Track 1 (MER2024-SEMI). To enhance the accuracy and generalization performance of emotion recognition, we propose several methods for multimodal emotion recognition. First, we introduce EmoVCLIP, a model fine-tuned from CLIP using vision-language prompt learning, designed for video-based emotion recognition tasks. By leveraging prompt learning on CLIP, EmoVCLIP improves the performance of pre-trained CLIP on emotional videos. Additionally, to address the issue of modality dependence in multimodal fusion, we employ modality dropout for robust information fusion. Furthermore, to aid Baichuan in better extracting emotional information, we suggest using GPT-4 as the prompt for Baichuan. Lastly, we utilize a self-training strategy to leverage unlabeled videos: we take unlabeled videos with high-confidence pseudo-labels generated by our model and incorporate them into the training set. Experimental results demonstrate that our model ranks 1st in the MER2024-SEMI track, achieving an accuracy of 90.15% on the test set.
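The self-training step described in the abstract can be illustrated with a minimal sketch. It assumes the model outputs one list of class probabilities per unlabeled video; the function names and the `threshold` value of 0.9 are hypothetical and are not taken from the paper.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (index, label) pairs for unlabeled samples predicted with high confidence.

    probabilities: one list of class probabilities per unlabeled sample.
    threshold: hypothetical confidence cut-off, not a value from the paper.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        # Predicted class = position of the largest probability.
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((index, label))
    return selected


def extend_training_set(labeled, unlabeled, probabilities, threshold=0.9):
    """Return a copy of the labeled set extended with confident pseudo-labeled samples."""
    extended = list(labeled)
    for index, label in select_pseudo_labels(probabilities, threshold):
        extended.append((unlabeled[index], label))
    return extended
```

In a full pipeline the model would then be retrained on the extended set. Whether this is done once or repeated over several rounds is not stated in the abstract.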

What problem does this paper attempt to address?

This paper addresses several key problems in multimodal emotion recognition (MER), with the goal of improving model accuracy and generalization in the MER2024-SEMI challenge. Specifically, the authors target the following problems:

1. **Difficulty of data annotation and limited data volume**
   - Multimodal emotion recognition requires high-quality annotated data, but collecting such data is difficult and costly. The resulting scarcity of training data limits model performance.
2. **Modality dependence and modality competition**
   - During multimodal fusion, modalities depend on and compete with one another. Some modalities may dominate the fusion while information from others is ignored, which hurts overall performance.
3. **Loss of temporal information in video emotion recognition**
   - When pre-trained models such as CLIP are used to extract video features, the temporal information of the video is easily lost, which degrades emotion recognition.
4. **Making full use of unlabeled data**
   - How to effectively use a large amount of unlabeled data to improve the model is an important open issue.

To solve these problems, the authors propose the following methods:

- **EmoVCLIP**: Fine-tune the CLIP model through vision-language prompt learning so that it better fits the video emotion recognition task.
  \[ \text{EmoVCLIP} = \text{CLIP} + \text{Vision-Language Prompt Learning} \]
- **Modality Dropout**: Randomly discard the information of certain modalities during training to make the model more robust and improve its generalization across modalities.
  \[ p_{\text{pred}} = g(\text{concat}(e_S, e_I, e_T, e_V)) \]
  \[ e_i = 0, \quad i \in \{S, I, T, V\}, \quad p = p_1 \]
- **GPT4-Baichuan**: Combine GPT-4's language understanding with Baichuan's Chinese processing ability to enhance the extraction of textual emotion features.
  \[ \text{GPT4-Baichuan} = \text{GPT4 (Emotion Extraction)} + \text{Baichuan (Chinese Language Processing)} \]
- **Self-training**: Add unlabeled data with pseudo-labels to the training set to make full use of unlabeled data.

With these methods, the authors' model ranked 1st in the MER2024-SEMI track, reaching an accuracy of 90.15% on the test set.
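The modality dropout formulas above can be illustrated with a minimal sketch. It assumes each modality embedding is zeroed independently with probability `p_drop` and that at least one modality is always kept. Both are assumptions made for illustration, and the paper's exact sampling scheme may differ. The names `modality_dropout` and `concat` are hypothetical, and the keys S, I, T, V simply mirror the subscripts in the formula.

```python
import random

MODALITIES = ("S", "I", "T", "V")  # subscripts taken from the formula above


def modality_dropout(embeddings, p_drop, rng=random):
    """Zero out each modality embedding independently with probability p_drop.

    Assumption: independent per-modality dropout that always keeps at least
    one modality; this is an illustrative choice, not the paper's stated rule.
    """
    dropped = [m for m in MODALITIES if rng.random() < p_drop]
    if len(dropped) == len(MODALITIES):
        # Keep one modality so the fused input is never all zeros.
        dropped.remove(rng.choice(dropped))
    return {
        m: [0.0] * len(vec) if m in dropped else list(vec)
        for m, vec in embeddings.items()
    }


def concat(embeddings):
    """Concatenate embeddings in fixed modality order: concat(e_S, e_I, e_T, e_V)."""
    fused = []
    for m in MODALITIES:
        fused.extend(embeddings[m])
    return fused
```

During training, the output of `concat(modality_dropout(...))` would be passed to the classifier g. At inference time dropout is normally disabled (`p_drop = 0`), so all modalities are used.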

Multimodal Emotion Recognition with Vision-language Prompting and Modality Dropout

Emotion Recognition in Videos via Fusing Multimodal Features

Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better

Improving Multimodal Emotion Recognition by Leveraging Acoustic Adaptation and Visual Alignment

MER 2023: Multi-label Learning, Modality Robustness, and Semi-Supervised Learning

Audio-Guided Fusion Techniques for Multimodal Emotion Analysis

Multimodal Emotion Recognition and Sentiment Analysis via Attention Enhanced Recurrent Model

Investigation of Multimodal Features, Classifiers and Fusion Methods for Emotion Recognition

Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation

SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition

Multimodal Emotion Recognition by Extracting Common and Modality-Specific Information

Multimodal interaction enhanced representation learning for video emotion recognition

First-order Multi-label Learning with Cross-modal Interactions for Multimodal Emotion Recognition

MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition

Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model

MEmoBERT: Pre-Training Model with Prompt-Based Learning for Multimodal Emotion Recognition

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference