Abstract: With the continuous development of deep learning (DL), multimodal dialogue emotion recognition (MDER), an essential branch of DL, has recently received extensive research attention. MDER aims to identify the emotional information contained in different modalities, e.g., text, video, and audio, in different dialogue scenes. However, existing research has focused on modeling contextual semantic information and dialogue relations between speakers while ignoring the impact of event relations on emotion. To tackle this issue, we propose a novel Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition (DER-GCN) method. It models dialogue relations between speakers and captures latent event relation information. Specifically, we construct a weighted multi-relationship graph to simultaneously capture the dependencies between speakers and event relations in a dialogue. Moreover, we introduce a Self-Supervised Masked Graph Autoencoder (SMGAE) to improve the fusion representation ability of features and structures. Next, we design a new Multiple Information Transformer (MIT) to capture the correlation between different relations, which better fuses the multivariate information between relations. Finally, we propose a loss optimization strategy based on contrastive learning to enhance the representation learning ability of minority class features. We conduct extensive experiments on the IEMOCAP and MELD benchmark datasets, which verify the effectiveness of the DER-GCN model. The results demonstrate that our model significantly improves both the average accuracy and the F1 score of emotion recognition.
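The weighted multi-relationship graph mentioned in the abstract can be pictured with a small sketch. The abstract does not specify how edges or weights are defined, so everything below is an assumption made for illustration: the `speaker`, `event`, and `feat` fields, the `window` parameter, the rule that same-speaker edges connect utterances within a few turns, the rule that event edges connect utterances sharing an event tag, and the use of cosine similarity as the edge weight.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_multi_relation_graph(utterances, window=2):
    """Return a list of (i, j, relation, weight) edges.

    Hypothetical edge rules, chosen only for illustration:
      - "speaker": utterances i and j come from the same speaker
                   and are at most `window` turns apart
      - "event":   utterances i and j share the same (non-empty) event tag
    The weight is the cosine similarity of the two utterance features
    (an assumption, not the paper's stated weighting).
    """
    edges = []
    for i, j in combinations(range(len(utterances)), 2):
        a, b = utterances[i], utterances[j]
        weight = cosine(a["feat"], b["feat"])
        if a["speaker"] == b["speaker"] and j - i <= window:
            edges.append((i, j, "speaker", weight))
        if a["event"] is not None and a["event"] == b["event"]:
            edges.append((i, j, "event", weight))
    return edges


# Toy dialogue with made-up features and event tags.
dialogue = [
    {"speaker": "A", "event": "job_offer", "feat": [1.0, 0.0]},
    {"speaker": "B", "event": None, "feat": [0.0, 1.0]},
    {"speaker": "A", "event": "job_offer", "feat": [1.0, 0.0]},
    {"speaker": "B", "event": "job_offer", "feat": [1.0, 1.0]},
]
edges = build_multi_relation_graph(dialogue)
for edge in edges:
    print(edge)
```

Keeping the relation type on each edge lets a downstream model aggregate the speaker subgraph and the event subgraph separately before combining them.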

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address several key issues in the task of Multimodal Dialogue Emotion Recognition (MDER):

1. **Influence of External Factors in Emotion Recognition**:
   - Existing research mainly focuses on modeling contextual semantic information and the relationship between interlocutors, while ignoring the impact of event relationships on emotions. For example, during a conversation, the speaker's emotions are influenced not only by internal factors (such as textual information) but also by external factors (such as events, locations, and keywords).
2. **Data Imbalance Issue**:
   - Due to high annotation costs, MDER datasets usually exhibit a long-tail distribution, so models perform poorly when recognizing minority class emotions.
3. **Cross-Modal Feature Fusion**:
   - How to effectively fuse information from multiple modalities such as text, video, and audio to improve the accuracy of emotion recognition.

### Solution

To address the above issues, the authors propose a new method called **Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition (DER-GCN)**. The main innovations of this method include:

1. **Constructing a Weighted Multi-Relation Graph**:
   - By constructing a weighted multi-relation graph, it simultaneously captures the dependencies between interlocutors and event relationships, thereby considering both internal and external factors of emotional changes more comprehensively.
2. **Self-Supervised Masked Graph Autoencoder (SMGAE)**:
   - Introducing SMGAE to enhance the fusion representation capability of features and structures. It masks and reconstructs nodes and edges simultaneously, which improves the model's noise resistance.
3. **Multiple Information Transformer (MIT)**:
   - Designing MIT to capture the correlations between different relationships, better fusing multivariate information to obtain more discriminative feature embeddings.
4. **Contrastive Learning Loss Optimization Strategy**:
   - Using a contrastive learning-based loss optimization strategy to alleviate the data imbalance issue, balancing the proportion of each emotion category during training.
5. **Emotion Classifier**:
   - Constructing a linear layer with residual connections as the emotion classifier to provide more gradient information, promoting sufficient training of the model in the emotion classification process.

### Experimental Validation

The authors conducted extensive experiments on two popular benchmark datasets, IEMOCAP and MELD, and the results show that the DER-GCN model significantly outperforms existing comparison algorithms in terms of average accuracy and F1 score.

### Main Contributions

1. **Proposing a New Emotion Representation Learning Architecture Aware of Dialogue and Event Relationships**:
   - DER-GCN can achieve cross-modal feature fusion, alleviate the data imbalance issue, and learn more discriminative emotion category boundaries.
2. **Designing a New Self-Supervised Graph Representation Learning Framework**:
   - SMGAE enhances node feature representation capability, optimizes graph structure representation, and has stronger noise resistance.
3. **Implementing a New Weighted Relation-Aware Multi-Subgraph Information Aggregation Method**:
   - MIT is used to learn the importance of different relationships in information aggregation, thereby obtaining more discriminative feature embeddings.
4. **Conducting Extensive Experiments on Two Popular Datasets**:
   - Experimental results show that DER-GCN performs well in the multimodal emotion recognition task, especially in terms of weighted accuracy and F1 score.
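
To make the contrastive learning idea from the solution above more concrete, here is a minimal sketch of a generic supervised contrastive loss. The summary does not give the paper's exact loss, so this standard formulation is an assumption and should not be read as the authors' objective. The function name `sup_con_loss`, the `temperature` parameter, and the toy embeddings are hypothetical. The loss pulls embeddings with the same emotion label together and pushes others apart, which is one common way to give minority classes tighter, better separated clusters.

```python
import math


def sup_con_loss(embeddings, labels, temperature=0.5):
    """Generic supervised contrastive loss (illustrative, not the paper's exact loss).

    For each anchor i that has at least one same-label positive p:
        loss_i = -mean_p log( exp(z_i.z_p / t) / sum_{a != i} exp(z_i.z_a / t) )
    where z are L2-normalised embeddings. Anchors without positives are skipped.
    Assumes every embedding is a non-zero vector.
    """
    def normalise(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    z = [normalise(v) for v in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        # Scaled similarity of anchor i to every other sample.
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                for a in range(n) if a != i}
        positives = [a for a in sims if labels[a] == labels[i]]
        if not positives:
            continue
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        losses.append(-sum(sims[p] - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


labels = ["happy", "happy", "sad", "sad"]
# Same-label embeddings are close together.
separated = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
# Same-label embeddings are far apart.
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]

loss_separated = sup_con_loss(separated, labels)
loss_mixed = sup_con_loss(mixed, labels)
print(loss_separated, loss_mixed)
```

In this toy example the loss is lower when same-label embeddings cluster together, which is the behaviour a contrastive objective rewards during training.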

DER-GCN: Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialogue Emotion Recognition

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Dynamic Graph Neural Ordinary Differential Equation Network for Multi-modal Emotion Recognition in Conversation

Dense Graph Convolutional with Joint Cross-Attention Network for Multimodal Emotion Recognition

MMGCN: Multimodal Fusion Via Deep Graph Convolution Network for Emotion Recognition in Conversation

MMDAG: Multimodal Directed Acyclic Graph Network for Emotion Recognition in Conversation

DGSNet: Dual Graph Structure Network for Emotion Recognition in Multimodal Conversations

Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations

CONSK-GCN: Conversational Semantic- and Knowledge-Oriented Graph Convolutional Network for Multimodal Emotion Recognition

Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation

GraphMFT: A Graph Network based Multimodal Fusion Technique for Emotion Recognition in Conversation

MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations

Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation

Dynamic Emotion-Dependent Network with Relational Subgraph Interaction for Multimodal Emotion Recognition

Multiple Knowledge-Enhanced Interactive Graph Network for Multimodal Conversational Emotion Recognition

Conversational emotion recognition studies based on graph convolutional neural networks and a dependent syntactic analysis

Emotion Recognition in Conversation Based on a Dynamic Complementary Graph Convolutional Network

Context- and Knowledge-Aware Graph Convolutional Network for Multimodal Emotion Recognition

DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation