Abstract: In this paper, we present our solutions for emotion recognition in the sub-challenges of the Multimodal Emotion Recognition Challenge (MER2024). To mitigate modal competition between audio and text, we adopt an early fusion strategy based on a large language model, in which audio and text are first trained jointly. The resulting joint audio-text feature is then late-fused with other unimodal features. To address data insufficiency and class imbalance, we use multiple rounds of multi-model voting for data mining. Moreover, to enhance the quality of audio features, we apply speech source separation to preprocess the audio. Our model ranks **2nd** in both MER2024-SEMI and MER2024-NOISE, validating our method's effectiveness.
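The late-fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which fusion operator is used, so plain concatenation followed by a linear classifier is an assumption here, and the names `late_fuse`, `classify`, `weights`, and `bias` are hypothetical.

```python
import math


def late_fuse(joint_audio_text, unimodal_features):
    """Concatenate the joint audio-text vector with each unimodal vector.

    Concatenation is an assumed fusion operator, chosen for illustration only.
    """
    fused = list(joint_audio_text)
    for vec in unimodal_features:
        fused.extend(vec)
    return fused


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, bias):
    """Apply a linear layer (one weight row per emotion class), then softmax."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy example: a 2-d joint audio-text feature and a 1-d visual feature.
fused = late_fuse([0.5, -1.0], [[2.0]])
probs = classify(fused, weights=[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], bias=[0.0, 0.0])
```

In this sketch the early-fused audio-text vector enters the final classifier as a single input alongside the other modalities, which is the ordering the abstract describes.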

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper mainly proposes solutions to several key challenges in multi-modal emotion recognition. Specifically, the paper attempts to solve the following problems:

1. **Emotion recognition under semi-supervised learning**:
   - In multi-modal emotion recognition, the scarcity of high-quality labeled data is an important issue. The paper improves the performance of the model by introducing semi-supervised learning methods and using a large amount of unlabeled data.
2. **Emotion recognition in noisy environments**:
   - In practical applications, audio data is often affected by background noise, which can significantly reduce the accuracy of emotion recognition. The paper proposes a variety of methods to enhance the robustness of the model in noisy environments, including audio denoising and noise-robust automatic speech recognition (ASR) techniques.
3. **Effective fusion of multi-modal information**:
   - A core challenge in multi-modal emotion recognition is how to effectively fuse information from different modalities. The paper proposes an early-fusion strategy, especially between audio and text modalities. Through joint training and attention mechanisms, it optimizes feature representations, thereby improving the overall performance of the model.
4. **Insufficient data and class imbalance**:
   - In practical applications, the amount of data for different emotion categories may be unbalanced, which will affect the generalization ability of the model. The paper addresses the problems of insufficient data and class imbalance through a multi-round, multi-model voting data-mining method.

### Main contributions

1. **Emotion ViT model**:
   - A model based on Vision Transformer (ViT) is trained, which can efficiently extract facial expression features. Through self-supervised learning, this model is pre-trained on large-scale unlabeled data and fine-tuned on emotion data, significantly enhancing the ability to express emotion features.
2. **Audio-text joint training architecture**:
   - To solve the expression conflict in multi-modal emotion recognition, the paper proposes an innovative audio-text joint training structure. Through early fusion, this structure effectively integrates the information of audio and text modalities, avoiding the information loss caused by late fusion, and its performance is better than the simple bimodal late-fusion method.
3. **Enhancement of noise robustness**:
   - To deal with the interference of noise on the performance of emotion recognition in complex environments, the paper implements an audio-denoising strategy and optimizes the noise robustness of the ASR system. These measures jointly improve the stability and recognition accuracy of the model in noisy environments.
4. **Efficient utilization of unlabeled data**:
   - To fully utilize the potential of unlabeled data, the paper introduces a cyclically enhanced data-mining method. By iteratively using unlabeled data for model training, this method significantly improves the accuracy and generalization ability of the model, achieving efficient utilization of unlabeled data.

### Summary

Through a series of innovative methods and techniques, this paper addresses several key challenges in multi-modal emotion recognition, making significant progress in semi-supervised learning, noise robustness, multi-modal information fusion, and the handling of insufficient data and class imbalance. These contributions not only improve the performance of the model but also provide a valuable reference for further research in the field of multi-modal emotion recognition.
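The multi-round, multi-model voting idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the agreement threshold `min_agreement`, the optional `retrain` hook, and the function names are hypothetical, and the summary does not describe the paper's actual voting rule or retraining schedule.

```python
from collections import Counter


def vote(predictions, min_agreement):
    """Return the most common label if enough models agree, else None."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count >= min_agreement else None


def mine_pseudo_labels(models, unlabeled, min_agreement, rounds, retrain=None):
    """Assign pseudo-labels over several rounds of multi-model voting.

    models:    callables mapping a sample id to a predicted label (assumed interface)
    unlabeled: hashable sample ids
    retrain:   optional hypothetical hook returning updated models after each round
    """
    pseudo = {}
    remaining = list(unlabeled)
    for _ in range(rounds):
        still_unlabeled = []
        for sample in remaining:
            label = vote([model(sample) for model in models], min_agreement)
            if label is None:
                still_unlabeled.append(sample)  # no consensus yet, retry next round
            else:
                pseudo[sample] = label
        remaining = still_unlabeled
        if retrain is not None:
            models = retrain(models, pseudo)
    return pseudo, remaining


# Toy models standing in for trained classifiers.
toy_models = [
    lambda s: "happy" if s.startswith("h") else "sad",
    lambda s: "happy" if "a" in s else "sad",
    lambda s: "sad",
]
strict, leftover = mine_pseudo_labels(toy_models, ["ha", "hb", "xa", "xb"], 3, rounds=1)
loose, _ = mine_pseudo_labels(toy_models, ["ha", "hb", "xa", "xb"], 2, rounds=1)
```

A stricter threshold yields fewer but more reliable pseudo-labels. Samples without consensus stay in the pool, so later rounds can pick them up once the models have been retrained on the newly mined data.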

Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better
