Leveraging Retrieval Augment Approach for Multimodal Emotion Recognition Under Missing Modalities

Qi Fan, Hongyu Yuan, Haolin Zuo, Rui Liu, Guanglai Gao
2024-09-19
Abstract: Multimodal emotion recognition (MER) relies on complete multimodal information and a robust multimodal joint representation to achieve high performance. However, the ideal condition of full modality integrity often does not hold in reality, and some modalities are frequently missing. For example, video, audio, or text data may be missing due to sensor failure or network bandwidth problems, which presents a great challenge to MER research. Traditional methods extract useful information from the available modalities and reconstruct the missing modalities to learn a robust multimodal joint representation. These methods have laid a solid foundation for research in this field and have, to a certain extent, alleviated the difficulty of multimodal emotion recognition under missing modalities. However, relying solely on internal reconstruction and multimodal joint learning has its limitations, especially when the missing information is critical for emotion recognition. To address this challenge, we propose a novel framework, Retrieval Augment for Missing Modality Multimodal Emotion Recognition (RAMER), which introduces similar multimodal emotion data to enhance the performance of emotion recognition under missing modalities. By leveraging databases that contain related multimodal emotion data, we can retrieve similar multimodal emotion information to fill in the gaps left by missing modalities. Experimental results demonstrate that our framework is superior to existing state-of-the-art approaches in missing-modality MER tasks. Our whole project is publicly available at https://github.com/WooyoohL/Retrieval_Augment_MER.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the performance degradation of Multimodal Emotion Recognition (MER) when some modality data is missing. Specifically, when video, audio, or text data is missing due to sensor failure or network bandwidth issues, traditional multimodal emotion recognition methods often struggle to recognize emotions accurately. Existing methods mainly rely on internal reconstruction and multimodal joint learning to handle missing modalities, but these methods are limited when critical information is missing. The paper therefore proposes a new framework, Retrieval Augment for Missing Modality Multimodal Emotion Recognition (RAMER), which improves emotion recognition performance by introducing similar multimodal emotion data.

### Main Contributions

1. **Exploration of a New Application Field for Retrieval Augmentation**: Constructing a multimodal emotion feature database and applying it to enhance multimodal emotion recognition.
2. **Proposing the RAMER Framework**: Introducing an emotion feature retrieval method to compensate for the information lost to missing modalities, thereby improving the accuracy and robustness of emotion recognition.
3. **Experimental Results**: Under various missing-modality conditions, the framework outperforms existing state-of-the-art methods and remains robust in the face of missing data.

### Method Overview

1. **Full Modality Pre-training**: Training a basic multimodal emotion recognition model on complete annotated data, so that the model captures comprehensive unimodal emotion features.
2. **Retrieval Database Construction**: Running the pre-trained model over the entire dataset (both annotated and unannotated data), saving the unimodal emotion hidden-layer features taken before each unimodal classifier, and building a feature index over them.
3. **Missing Modality Training**: Training the model under various missing-modality conditions and supplementing the missing modality information from the retrieval database, so that the model can still predict emotions when some modalities are missing.

### Experimental Results

- **Main Results**: The proposed model improves performance under all missing-modality conditions, with a particularly noticeable boost when the text modality is involved.
- **Ablation Study**: Various experimental settings were used to verify the effectiveness of each module. In particular, the model's performance remained stable across different retrieval database scales and different similar-feature fusion strategies.
- **Visualization Analysis**: Emotion features of different modalities were visualized with the t-SNE algorithm; the results show that the model can effectively learn and represent emotion features.

### Conclusion

This paper proposes a new method that addresses the performance degradation of multimodal emotion recognition under missing modalities by constructing a multimodal emotion feature database and applying retrieval augmentation. Experimental results show that this method outperforms existing methods under various missing-modality conditions and demonstrates strong robustness.

### Limitations

Despite the encouraging results, the method has two main limitations:

1. **Retrieval Time**: As the database grows, retrieval time increases accordingly, which may pose challenges in practical applications.
2. **Content Filtering**: The current feature fusion strategy may inadvertently introduce irrelevant emotion features, degrading overall performance. Developing more accurate feature selection algorithms will be a key direction for future improvements.
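
To make the retrieval step from the Method Overview more concrete, here is a minimal Python sketch of how a missing modality's feature might be filled in from a feature database. It is an illustration under stated assumptions, not the authors' implementation: the class `FeatureDatabase`, the function `fill_missing`, the use of cosine similarity for retrieval, and mean-averaging as the fusion strategy are all hypothetical choices, since the summary does not specify the similarity measure or the fusion method.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class FeatureDatabase:
    """Hypothetical store of unimodal hidden features.

    features[modality][i] is the feature of sample i in that modality,
    so the same index refers to the same sample across modalities.
    """

    def __init__(self, features: Dict[str, List[Vector]]) -> None:
        self.features = features

    def retrieve(
        self, query: Vector, query_modality: str, target_modality: str, k: int = 2
    ) -> Tuple[List[int], List[Vector]]:
        """Find the k samples most similar to `query` in `query_modality`
        and return their indices and their `target_modality` features."""
        sims = [cosine_similarity(query, f) for f in self.features[query_modality]]
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
        top = ranked[:k]
        return top, [self.features[target_modality][i] for i in top]


def fill_missing(
    db: FeatureDatabase, available: Dict[str, Vector], missing_modality: str, k: int = 2
) -> List[float]:
    """Approximate a missing modality's feature by averaging the features
    retrieved via each available modality (mean fusion is an assumption)."""
    retrieved: List[Vector] = []
    for modality, feature in available.items():
        _, feats = db.retrieve(feature, modality, missing_modality, k)
        retrieved.extend(feats)
    dim = len(retrieved[0])
    return [sum(f[d] for f in retrieved) / len(retrieved) for d in range(dim)]


if __name__ == "__main__":
    # Toy database: three samples with 2-dimensional audio and text features.
    db = FeatureDatabase({
        "audio": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
        "text": [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]],
    })
    # The text modality is missing; only an audio feature is available.
    print(fill_missing(db, {"audio": [0.9, 0.1]}, "text", k=2))  # [7.5, 2.5]
```

The linear scan in `retrieve` costs time proportional to the database size, which mirrors the retrieval-time limitation noted above; a real system would likely need an indexed or approximate nearest-neighbor search instead.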