Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better

Mengying Ge, Mingyang Li, Dongkai Tang, Pengbo Li, Kuo Liu, Shuhao Deng, Songbai Pu, Long Liu, Yang Song, Tao Zhang
2024-09-12
Abstract: In this paper, we present our solutions for emotion recognition in the sub-challenges of the Multimodal Emotion Recognition Challenge (MER2024). To mitigate modal competition between audio and text, we adopt an early fusion strategy based on a large language model, in which audio and text are first trained jointly. The resulting joint audio-text feature is then late-fused with the other unimodal features. To address data insufficiency and class imbalance, we use multiple rounds of multi-model voting for data mining. Moreover, to enhance the quality of audio features, we apply speech source separation to preprocess the audio. Our model ranks 2nd in both MER2024-SEMI and MER2024-NOISE, validating the effectiveness of our method.
Multimedia, Artificial Intelligence, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper mainly proposes solutions to several key challenges in multimodal emotion recognition. Specifically, the paper attempts to solve the following problems:

1. **Emotion recognition under semi-supervised learning**:
   - In multimodal emotion recognition, the scarcity of high-quality labeled data is an important issue. The paper improves the performance of the model by introducing semi-supervised learning methods and using a large amount of unlabeled data.
2. **Emotion recognition in noisy environments**:
   - In practical applications, audio data is often affected by background noise, which can significantly reduce the accuracy of emotion recognition. The paper proposes a variety of methods to enhance the robustness of the model in noisy environments, including audio denoising and noise-robust automatic speech recognition (ASR) techniques.
3. **Effective fusion of multimodal information**:
   - A core challenge in multimodal emotion recognition is how to effectively fuse information from different modalities. The paper proposes an early-fusion strategy, especially between audio and text modalities. Through joint training and attention mechanisms, it optimizes feature representations, thereby improving the overall performance of the model.
4. **Insufficient data and class imbalance**:
   - In practical applications, the amount of data for different emotion categories may be unbalanced, which will affect the generalization ability of the model. The paper effectively solves the problems of insufficient data and class imbalance through a multi-round, multi-model voting data-mining method.

### Main contributions

1. **Emotion ViT model**:
   - A model based on Vision Transformer (ViT) is trained, which can efficiently extract facial expression features. Through self-supervised learning, this model is pre-trained on large-scale unlabeled data and fine-tuned on emotion data, significantly enhancing the ability to express emotion features.
2. **Audio-text joint training architecture**:
   - To solve the expression conflict in multimodal emotion recognition, the paper proposes an innovative audio-text joint training structure. Through early fusion, this structure effectively integrates the information of audio and text modalities, avoiding the information loss caused by late fusion, and its performance is better than the simple bimodal late-fusion method.
3. **Enhancement of noise robustness**:
   - To deal with the interference of noise on the performance of emotion recognition in complex environments, the paper implements an audio-denoising strategy and optimizes the noise robustness of the ASR system. These measures jointly improve the stability and recognition accuracy of the model in noisy environments.
4. **Efficient utilization of unlabeled data**:
   - To fully utilize the potential of unlabeled data, the paper introduces a cyclically enhanced data-mining method. By iteratively using unlabeled data for model training, this method significantly improves the accuracy and generalization ability of the model, achieving efficient utilization of unlabeled data.

### Summary

Through a series of innovative methods and techniques, this paper effectively solves several key challenges in multimodal emotion recognition, especially making significant progress in semi-supervised learning, noise robustness, multimodal information fusion, and insufficient data and class imbalance. These contributions not only improve the performance of the model but also provide a valuable reference for further research in the field of multimodal emotion recognition.
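
The multi-round, multi-model voting used for data mining can be illustrated with a minimal Python sketch of a single voting round. This is not the authors' implementation: the function name `vote_pseudo_labels`, the thresholds `min_agree` and `min_conf`, and the optional `per_class_cap` are all hypothetical choices made for illustration. The sketch assumes each model outputs class probabilities for every unlabeled sample and keeps a sample only when enough models agree with sufficient average confidence.

```python
from collections import Counter, defaultdict


def vote_pseudo_labels(model_probs, min_agree=2, min_conf=0.8, per_class_cap=None):
    """Select pseudo-labels for unlabeled samples by multi-model voting.

    model_probs: one entry per model; each entry is a list (one per sample)
                 of class-probability lists.
    Returns {sample_index: label} for samples that pass the vote.
    All thresholds are illustrative assumptions, not values from the paper.
    """
    n_samples = len(model_probs[0])
    candidates = []  # (mean_confidence, sample_index, label)
    for i in range(n_samples):
        # Each model votes for its most probable class.
        votes = [max(range(len(p[i])), key=p[i].__getitem__) for p in model_probs]
        label, count = Counter(votes).most_common(1)[0]
        if count < min_agree:
            continue
        # Average the confidence of the models that agree with the winner.
        conf = sum(p[i][label] for p, v in zip(model_probs, votes) if v == label) / count
        if conf >= min_conf:
            candidates.append((conf, i, label))

    # Take the most confident samples first; optionally cap each class so
    # that frequent emotions do not dominate the mined set.
    selected, per_class = {}, defaultdict(int)
    for conf, i, label in sorted(candidates, reverse=True):
        if per_class_cap is not None and per_class[label] >= per_class_cap:
            continue
        selected[i] = label
        per_class[label] += 1
    return selected


# Three hypothetical models scoring three unlabeled samples over two classes.
probs_a = [[0.90, 0.10], [0.60, 0.40], [0.20, 0.80]]
probs_b = [[0.85, 0.15], [0.30, 0.70], [0.10, 0.90]]
probs_c = [[0.95, 0.05], [0.55, 0.45], [0.60, 0.40]]
print(vote_pseudo_labels([probs_a, probs_b, probs_c]))  # {0: 0, 2: 1}
```

In a multi-round setting, the selected samples would be added to the labeled set, the models retrained, and the vote repeated on the remaining unlabeled data. The per-class cap is one simple way to keep the mined data from worsening class imbalance; the paper may handle this differently.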