Multimodal Emotion Recognition with Vision-language Prompting and Modality Dropout

Anbin QI, Zhongliang Liu, Xinyong Zhou, Jinba Xiao, Fengrun Zhang, Qi Gan, Ming Tao, Gaozheng Zhang, Lu Zhang
2024-09-11
Abstract: In this paper, we present our solution for the Second Multimodal Emotion Recognition Challenge Track 1 (MER2024-SEMI). To enhance the accuracy and generalization performance of emotion recognition, we propose several methods for Multimodal Emotion Recognition. Firstly, we introduce EmoVCLIP, a model fine-tuned based on CLIP using vision-language prompt learning, designed for video-based emotion recognition tasks. By leveraging prompt learning on CLIP, EmoVCLIP improves the performance of pre-trained CLIP on emotional videos. Additionally, to address the issue of modality dependence in multimodal fusion, we employ modality dropout for robust information fusion. Furthermore, to aid Baichuan in better extracting emotional information, we suggest using GPT-4 as the prompt for Baichuan. Lastly, we utilize a self-training strategy to leverage unlabeled videos. In this process, we use unlabeled videos with high-confidence pseudo-labels generated by our model and incorporate them into the training set. Experimental results demonstrate that our model ranks 1st in the MER2024-SEMI track, achieving an accuracy of 90.15% on the test set.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses several key problems in multimodal emotion recognition (MER), with the goal of improving model accuracy and generalization in the MER2024-SEMI challenge. Specifically, the authors' methods target the following problems:

1. **Difficulty of data annotation and limited data volume**:
   - Multimodal emotion recognition requires high-quality annotated data, but collecting such data is difficult and costly. The resulting scarcity of training data limits model performance.
2. **Modality dependence and modality competition**:
   - During multimodal fusion, modalities can depend on and compete with one another. Some modalities may dominate the fusion while the information in others is ignored, which hurts overall performance.
3. **Loss of temporal information in video emotion recognition**:
   - When pre-trained models such as CLIP are used to extract video features, the temporal information of the video is easily lost, which weakens emotion recognition.
4. **Making full use of unlabeled data**:
   - How to use a large amount of unlabeled data effectively to improve the model is an important open issue.

To solve these problems, the authors propose the following methods:

- **EmoVCLIP**: Fine-tune CLIP with vision-language prompt learning so that it adapts better to video emotion recognition.
  \[ \text{EmoVCLIP}=\text{CLIP}+\text{Vision-Language Prompt Learning} \]
- **Modality Dropout**: Randomly discard the information of certain modalities during training to make the model more robust and improve its generalization across modalities. The prediction is computed from the concatenated modality embeddings, and during training an embedding is set to zero with probability \(p_1\):
  \[ p_{\text{pred}} = g(\text{concat}(e_S, e_I, e_T, e_V)) \]
  \[ e_i = 0 \quad \text{with probability } p = p_1, \quad i\in\{S, I, T, V\} \]
- **GPT4-Baichuan**: Combine GPT-4's language understanding with Baichuan's Chinese processing ability to improve the extraction of emotional features from text.
  \[ \text{GPT4-Baichuan}=\text{GPT4 (Emotion Extraction)}+\text{Baichuan (Chinese Language Processing)} \]
- **Self-training**: Add unlabeled data with pseudo-labels to the training set so that the unlabeled data is put to use.

With these methods, the authors' model ranked 1st in the MER2024-SEMI track, reaching an accuracy of 90.15% on the test set.
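The modality dropout and self-training steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `modality_dropout` and `select_pseudo_labels`, the parameters `p_drop` and `threshold`, the use of NumPy arrays as stand-ins for modality embeddings, the choice to drop each modality independently, and the fixed confidence threshold are all hypothetical.

```python
import numpy as np

# Modality keys taken from the formula above; their order fixes the concat layout.
MODALITIES = ("S", "I", "T", "V")


def modality_dropout(embeddings, p_drop, rng):
    """Zero each modality embedding with probability p_drop, then concatenate.

    embeddings: dict mapping a modality key to an array of shape (batch, dim).
    Assumption: each modality is dropped independently, for the whole batch.
    """
    parts = []
    for key in MODALITIES:
        e = embeddings[key]
        if rng.random() < p_drop:
            e = np.zeros_like(e)  # e_i = 0 for the dropped modality
        parts.append(e)
    # Corresponds to concat(e_S, e_I, e_T, e_V), the input to the classifier g.
    return np.concatenate(parts, axis=-1)


def select_pseudo_labels(probs, threshold):
    """Keep unlabeled samples whose top predicted probability meets the threshold.

    probs: array of shape (n_samples, n_classes) with predicted probabilities.
    Returns the indices of the kept samples and their pseudo-labels.
    """
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = np.flatnonzero(confidence >= threshold)
    return keep, labels[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = {key: rng.normal(size=(4, 8)) for key in MODALITIES}
    fused = modality_dropout(batch, p_drop=0.3, rng=rng)
    print(fused.shape)

    probs = np.array([[0.95, 0.05], [0.60, 0.40], [0.10, 0.90]])
    keep, labels = select_pseudo_labels(probs, threshold=0.9)
    print(keep, labels)
```

In a setup like this, dropout would typically be applied only during training, with all modality embeddings passed through unchanged at inference. The selected pseudo-labeled samples would then be appended to the labeled training set before retraining.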