Abstract: Multimodal emotion recognition (MER) relies on complete multimodal information and a robust multimodal joint representation to achieve high performance. However, the ideal condition of full modality integrity often does not hold in practice, and some modalities are frequently missing. For example, video, audio, or text data may be lost due to sensor failure or network bandwidth problems, which presents a great challenge to MER research. Traditional methods extract useful information from the available modalities and reconstruct the missing ones to learn a robust multimodal joint representation. These methods have laid a solid foundation for research in this field and, to a certain extent, alleviated the difficulty of multimodal emotion recognition under missing modalities. However, relying solely on internal reconstruction and multimodal joint learning has its limitations, especially when the missing information is critical for emotion recognition. To address this challenge, we propose a novel framework of Retrieval Augment for Missing Modality Multimodal Emotion Recognition (RAMER), which introduces similar multimodal emotion data to enhance the performance of emotion recognition under missing modalities. By leveraging databases that contain related multimodal emotion data, we can retrieve similar multimodal emotion information to fill in the gaps left by missing modalities. Various experimental results demonstrate that our framework is superior to existing state-of-the-art approaches in missing modality MER tasks. Our whole project is publicly available at https://github.com/WooyoohL/Retrieval_Augment_MER.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the performance degradation in Multimodal Emotion Recognition (MER) when some modality data is missing. Specifically, when video, audio, or text data is missing due to sensor failure or network bandwidth issues, traditional multimodal emotion recognition methods often struggle to recognize emotions accurately. Existing methods mainly rely on internal reconstruction and multimodal joint learning to handle missing modalities, but these methods are limited when critical information is missing. Therefore, this paper proposes a new framework, Retrieval Augment for Missing Modality Multimodal Emotion Recognition (RAMER), which enhances emotion recognition performance by introducing similar multimodal emotion data.

### Main Contributions

1. **Exploration of a New Application Field for Retrieval Augmentation**: Constructing a multimodal emotion feature database and applying it to enhance multimodal emotion recognition.
2. **Proposing the RAMER Framework**: Introducing an emotion feature retrieval method to compensate for the information loss caused by missing modalities, thereby improving the accuracy and robustness of emotion recognition.
3. **Experimental Results**: Under various missing modality conditions, experimental results show that this framework outperforms existing state-of-the-art methods and remains robust when data is missing.

### Method Overview

1. **Full Modality Pre-training**: Using complete annotated data to train a basic multimodal emotion recognition model, ensuring the model can capture comprehensive unimodal emotion features.
2. **Retrieval Database Construction**: Running the pre-trained model over the entire dataset (including annotated and unannotated data), saving the unimodal emotion hidden-layer features that precede each unimodal classifier, and constructing a feature index.
3. **Missing Modality Training**: Training the model under various missing modality conditions and supplementing the missing modality information through the retrieval database, enabling the model to predict emotions effectively even when some modalities are missing.

### Experimental Results

- **Main Results**: Under all missing modality conditions, the proposed model significantly improves performance, with a particularly noticeable boost when the text modality is involved.
- **Ablation Study**: Various experimental settings were used to verify the effectiveness of each module. In particular, the model's performance remained stable across different retrieval database scales and different strategies for fusing similar features.
- **Visualization Analysis**: Visualizing the emotion features of different modalities with the t-SNE algorithm shows that the model can effectively learn and represent emotion features.

### Conclusion

This paper proposes a new method that addresses the performance degradation in multimodal emotion recognition when some modality data is missing, by constructing a multimodal emotion feature database and applying retrieval augmentation. Experimental results show that this method outperforms existing methods under various missing modality conditions and demonstrates strong robustness.

### Limitations

Despite the encouraging results achieved by this method, two main limitations remain:

1. **Retrieval Time**: As the database size increases, retrieval time increases accordingly, which may pose challenges in practical applications.
2. **Content Filtering**: The current feature fusion strategy may inadvertently introduce irrelevant emotion features, leading to overall performance degradation. Developing more accurate feature selection algorithms will be a key direction for future improvements.
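
To make the retrieval step from the Method Overview more concrete, the following is a minimal sketch of how a missing modality's feature might be estimated from a feature database. It is not the authors' implementation. The names `cosine`, `retrieve_missing`, `database`, and `query` are hypothetical, and the use of cosine similarity and mean fusion over the top-k matches are assumptions, since the summary does not state which similarity measure or fusion strategy RAMER uses.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]
Entry = Dict[str, Vector]  # modality name -> unimodal feature vector


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_missing(query: Entry, database: List[Entry], missing: str, k: int = 2) -> List[float]:
    """Estimate the `missing` modality feature for `query`.

    Assumption: database entries are ranked by their average cosine similarity
    to the query over the modalities that are still available, and the missing
    feature is the mean of that modality over the top-k entries.
    """
    available = [m for m in query if m != missing]
    if not available or not database:
        raise ValueError("need at least one available modality and a non-empty database")

    def score(entry: Entry) -> float:
        return sum(cosine(query[m], entry[m]) for m in available) / len(available)

    # Brute-force scan: cost grows linearly with the number of database entries.
    top = sorted(database, key=score, reverse=True)[:k]
    dim = len(top[0][missing])
    return [sum(e[missing][i] for e in top) / len(top) for i in range(dim)]


# Toy database of saved unimodal features (values are illustrative only).
database = [
    {"audio": [1.0, 0.0], "text": [1.0, 0.0], "video": [0.9, 0.1]},
    {"audio": [0.9, 0.1], "text": [0.8, 0.2], "video": [0.7, 0.3]},
    {"audio": [0.0, 1.0], "text": [0.0, 1.0], "video": [0.1, 0.9]},
]

# A sample whose video modality is missing.
query = {"audio": [1.0, 0.05], "text": [0.9, 0.1]}
estimate = retrieve_missing(query, database, missing="video", k=2)
print(estimate)
```

Because this sketch scans every entry, its cost grows with the database, which is in line with the retrieval-time limitation noted above. An approximate nearest-neighbor index would be one way to reduce that cost, although the summary does not say whether the authors use one.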

Leveraging Retrieval Augment Approach for Multimodal Emotion Recognition Under Missing Modalities

A Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Learning Noise-Robust Joint Representation for Multimodal Emotion Recognition under Incomplete Data Scenarios

Contrastive Learning based Modality-Invariant Feature Acquisition for Robust Multimodal Emotion Recognition with Missing Modalities

Bridging the Emotional Semantic Gap via Multimodal Relevance Estimation

Multiplex graph aggregation and feature refinement for unsupervised incomplete multimodal emotion recognition

Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities

Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework

Multimodal Emotion Recognition by Extracting Common and Modality-Specific Information.

Dynamic Modality and View Selection for Multimodal Emotion Recognition with Missing Modalities

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG

Leveraging Label Information for Multimodal Emotion Recognition

Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition

Improving Multimodal Emotion Recognition by Leveraging Acoustic Adaptation and Visual Alignment

Multimodal emotion recognition based on audio and text by using hybrid attention networks

Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition

Fine-grained Disentangled Representation Learning for Multimodal Emotion Recognition

Generating and encouraging: An effective framework for solving class imbalance in multimodal emotion recognition conversation

A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face

Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation