Abstract: With the growing number of open-source emotion recognition datasets from social media platforms and the rapid development of computing resources, multimodal emotion recognition (MER) tasks have begun to receive widespread research attention. The MER task extracts and fuses complementary semantic information from different modalities in order to classify the speaker's emotions. However, existing feature fusion methods usually map the features of different modalities into the same feature space for information fusion, which cannot eliminate the heterogeneity between modalities. This makes the subsequent learning of emotion class boundaries challenging. To tackle these problems, we propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition (AR-IIGCN) method. Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces. Secondly, we build a generator and a discriminator for the three modal features through adversarial representation, which achieves information interaction between modalities and eliminates heterogeneity among them. Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information and to learn intra-class and inter-class boundary information of emotion categories. Specifically, we construct a graph structure for the three modal features and perform contrastive representation learning on nodes with different emotions in the same modality and nodes with the same emotion in different modalities, which improves the feature representation ability of the nodes. Extensive experiments show that the AR-IIGCN method can significantly improve emotion recognition accuracy on the IEMOCAP and MELD datasets.
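The first two steps of the abstract (one MLP per modality, then a discriminator over the projected features) can be illustrated with a minimal sketch. This is not the authors' implementation: the layer sizes, the random weights, the linear discriminator, and the helper names `make_mlp`, `mlp_forward`, and `discriminator_loss` are all hypothetical, and the choice of a cross-entropy discriminator that guesses the source modality is an assumption borrowed from common adversarial-representation setups.

```python
import math
import random

def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Return randomly initialised weights for a two-layer MLP (no biases)."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, w2

def mlp_forward(params, x):
    """Apply linear -> ReLU -> linear to a single feature vector."""
    w1, w2 = params
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discriminator_loss(disc_weights, feature, modality_index):
    """Cross-entropy of a linear discriminator guessing the source modality."""
    logits = [sum(w * f for w, f in zip(row, feature)) for row in disc_weights]
    return -math.log(softmax(logits)[modality_index])

rng = random.Random(0)
input_dims = {"text": 8, "audio": 6, "video": 10}  # hypothetical feature sizes
out_dim = 4                                        # hypothetical projection size

# One encoder per modality, so each modality is mapped by its own MLP.
encoders = {name: make_mlp(d, 5, out_dim, rng) for name, d in input_dims.items()}
utterance = {name: [rng.random() for _ in range(d)] for name, d in input_dims.items()}
projected = {name: mlp_forward(encoders[name], utterance[name]) for name in input_dims}

# Assumed adversarial objective: the discriminator minimises this loss,
# while the encoders (acting as generators) would be trained to increase it.
disc = [[rng.gauss(0, 0.1) for _ in range(out_dim)] for _ in range(len(input_dims))]
losses = {name: discriminator_loss(disc, projected[name], i)
          for i, name in enumerate(input_dims)}
```

In a setup like this, features from which the discriminator can no longer identify the modality are treated as having less modality-specific heterogeneity. Whether AR-IIGCN uses exactly this objective is not stated in the abstract.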

What problem does this paper attempt to address?

The problem this paper addresses is that, in multimodal emotion recognition, existing methods cannot eliminate the heterogeneity between data from different modalities during feature fusion, which makes it difficult to learn emotion classification boundaries effectively. Specifically, existing multimodal feature fusion methods usually map features of different modalities into the same feature space for information fusion. This approach cannot eliminate the heterogeneity between modalities and therefore degrades the subsequent learning of emotion class boundaries. To overcome this problem, the authors propose a new method: multimodal emotion recognition with adversarial representation and intra-modal and inter-modal graph contrastive learning (AR-IIGCN). The main innovations of this method are:

1. **Multi-Layer Perceptron (MLP) Mapping**: First, an MLP maps the text, video, and audio features into separate feature spaces, rather than a unified feature space, in order to reduce the heterogeneity between modalities.
2. **Generative Adversarial Network (GAN)**: A generator and a discriminator are built, and adversarial learning is used to achieve cross-modal feature fusion and eliminate the heterogeneity between modalities.
3. **Graph Contrastive Representation Learning**: Graph contrastive representation learning is introduced to capture complementary semantic information within and between modalities, as well as differences within and between categories. Specifically, a graph structure is constructed, and contrastive representation learning is performed on nodes with different emotions in the same modality and nodes with the same emotion in different modalities, so as to improve the representation ability of the node features.
4. **Multi-Loss Function Design**: A new multi-loss function is designed for graph contrastive representation learning, which pulls positive sample embeddings toward the anchor embedding and pushes negative sample embeddings away from it.

With these innovations, the AR-IIGCN method significantly improves emotion recognition accuracy on the IEMOCAP and MELD datasets. Because the method is general, it can also be applied to other multimodal tasks, such as humor detection.
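The contrastive objective described above (positive embeddings pulled toward the anchor, negative embeddings pushed away) can be illustrated with a small sketch. The paper's exact loss is not given here, so this uses an InfoNCE-style loss over cosine similarity as a stand-in. The function names, the temperature value, and the toy two-dimensional embeddings are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(anchor, positives, negatives, temperature=0.5):
    """InfoNCE-style loss: small when positives are near the anchor
    and negatives are far from it (assumed form, not the paper's)."""
    pos = sum(math.exp(cosine(anchor, p) / temperature) for p in positives)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy node embeddings (hypothetical values).
anchor = [1.0, 0.0]              # e.g. a text node labelled "happy"
same_emotion_other_modality = [[0.9, 0.1]]    # e.g. audio node, also "happy"
other_emotion_same_modality = [[-1.0, 0.2]]   # e.g. text node labelled "sad"

# Positives: same emotion in a different modality.
# Negatives: different emotion in the same modality.
loss_aligned = contrastive_loss(anchor, same_emotion_other_modality,
                                other_emotion_same_modality)

# Swapping the roles simulates a poorly organised embedding space.
loss_misaligned = contrastive_loss(anchor, other_emotion_same_modality,
                                   same_emotion_other_modality)
```

Here `loss_aligned` comes out lower than `loss_misaligned`, which is the behaviour a loss of this kind rewards during training: same-emotion nodes across modalities cluster together while different-emotion nodes within a modality separate.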

Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition

Cross-Culture Multimodal Emotion Recognition With Adversarial Learning

A Two-Stage Multimodal Emotion Recognition Model Based on Graph Contrastive Learning

Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation

Joyful: Joint Modality Fusion and Graph Contrastive Learning for Multimodal Emotion Recognition

Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation

Contrastive Learning based Modality-Invariant Feature Acquisition for Robust Multimodal Emotion Recognition with Missing Modalities

Multi-modal fusion network with complementarity and importance for emotion recognition

Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations

Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation

Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition

CARAT: Contrastive Feature Reconstruction and Aggregation for Multi-Modal Multi-Label Emotion Recognition

Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition

Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

A Versatile Multimodal Learning Framework For Zero-shot Emotion Recognition

Multiplex graph aggregation and feature refinement for unsupervised incomplete multimodal emotion recognition

Learning Robust Multi-Modal Representation for Multi-Label Emotion Recognition Via Adversarial Masking and Perturbation

Multimodal emotion recognition from facial expression and speech based on feature fusion

Multimodal Emotion Recognition Based on Cascaded Multichannel and Hierarchical Fusion

Multimodal emotion recognition based on audio and text by using hybrid attention networks

Fine-grained Disentangled Representation Learning for Multimodal Emotion Recognition