Abstract: Multimodal machine learning is an emerging area of research, which has received a great deal of scholarly attention in recent years. Up to now, there are few studies on multimodal Emotion Recognition in Conversation (ERC). Since Graph Neural Networks (GNNs) possess the powerful capacity of relational modeling, they have an inherent advantage in the field of multimodal learning. GNNs leverage the graph constructed from multimodal data to perform intra- and inter-modal information interaction, which effectively facilitates the integration and complementation of multimodal data. In this work, we propose a novel Graph network based Multimodal Fusion Technique (GraphMFT) for emotion recognition in conversation. Multimodal data can be modeled as a graph, where each data object is regarded as a node, and both intra- and inter-modal dependencies existing between data objects can be regarded as edges. GraphMFT utilizes multiple improved graph attention networks to capture intra-modal contextual information and inter-modal complementary information. In addition, the proposed GraphMFT attempts to address the challenges of existing graph-based multimodal conversational emotion recognition models such as MMGCN. Empirical results on two public multimodal datasets reveal that our model outperforms the State-Of-The-Art (SOTA) approaches with the accuracy of 67.90% and 61.30%.
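The graph view described in the abstract (data objects as nodes, intra- and inter-modal dependencies as edges) can be illustrated with a minimal sketch. It is not the authors' construction: the function name `build_bimodal_edges`, the `window` parameter, and the rule that inter-modal edges link the two views of the same utterance are all assumptions made for illustration.

```python
from typing import List, Tuple

Edge = Tuple[int, int, str]


def build_bimodal_edges(num_utterances: int, window: int) -> List[Edge]:
    """Return undirected edges for a two-modality conversation graph.

    Node i (0 <= i < n) is utterance i in the first modality; node n + i is
    the same utterance in the second modality. `window` is a hypothetical
    context size, not a value taken from the paper.
    """
    n = num_utterances
    edges: List[Edge] = []

    # Intra-modal edges: within each modality, link an utterance only to the
    # next `window` utterances instead of to every other node.
    for offset in (0, n):
        for i in range(n):
            last = min(i + window, n - 1)
            for j in range(i + 1, last + 1):
                edges.append((offset + i, offset + j, "intra"))

    # Inter-modal edges (assumption): link the two modality views of the
    # same utterance.
    for i in range(n):
        edges.append((i, n + i, "inter"))

    return edges
```

For a four-utterance conversation, `build_bimodal_edges(4, 1)` yields three intra-modal edges per modality plus four inter-modal edges. Restricting intra-modal links to a window is one simple way to avoid connecting every node to every other node within a modality.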

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to address several key challenges in multimodal Emotion Recognition in Conversation (ERC). Specifically, it proposes a graph-network-based multimodal fusion technique (GraphMFT) to improve the accuracy of emotion recognition in dialogue. The main problems are:

1. **Multimodal information fusion**:
   - Existing multimodal emotion recognition methods either ignore multimodal information or fail to handle the interaction between modalities effectively. GraphMFT captures the relationships between modalities by constructing multiple graphs, and so fuses multimodal data better.
2. **Long-distance context information capture**:
   - Traditional methods based on recurrent neural networks have difficulty capturing long-distance context information, whereas graph neural networks (GNNs) can capture it through their relational modeling capability. GraphMFT uses improved graph attention networks (GATs) to extract long-distance context information.
3. **Data heterogeneity**:
   - Different modalities have different characteristics, so feeding all of them directly into the same model leads to a data heterogeneity problem. GraphMFT reduces its impact by processing the modalities two at a time.
4. **Noise reduction**:
   - Existing methods usually connect the current node to all other nodes within the same modality, which introduces additional noise. GraphMFT reduces potential noise by selectively connecting the current node to context nodes.

### Solutions

To address the above challenges, the paper proposes GraphMFT, whose main features are:

- **Multi-graph construction**:
  - Construct three graphs (the V-A graph, the V-T graph, and the A-T graph), each containing information from two modalities. This reduces data heterogeneity and captures the interaction between modalities more effectively.
- **Improved graph attention networks (GATs)**:
  - Use multiple improved GATs to extract the intra- and inter-modal dependencies. The improved GATs alleviate the over-smoothing problem of GNNs by connecting the output of the previous layer to the output of the next layer.
- **Multimodal fusion**:
  - Obtain the final multimodal fusion feature matrix by adding the feature matrices of the same modality. These feature matrices are then used for emotion prediction.
- **Emotion prediction**:
  - Feed the fused multimodal feature matrix into a fully connected network for emotion prediction, and train the model with the cross-entropy loss.

### Experimental results

The experimental results show that GraphMFT outperforms existing baseline methods on two public multimodal datasets (IEMOCAP and MELD), achieving accuracies of 67.90% and 61.30% respectively.

In conclusion, by proposing GraphMFT, the paper addresses several key problems in multimodal conversational emotion recognition and improves recognition accuracy.
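The fusion step summarized above (adding the feature matrices of the same modality across the V-A, V-T, and A-T graphs) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the function names, the nested-dictionary input layout, and the final per-utterance concatenation across modalities are hypothetical, since the summary only states that same-modality matrices are added.

```python
from typing import Dict, List

Matrix = List[List[float]]  # one row of features per utterance


def add_matrices(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped feature matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fuse_same_modality(graph_outputs: Dict[str, Dict[str, Matrix]]) -> Matrix:
    """Sum the features each modality receives from the graphs it appears in.

    `graph_outputs` maps a graph name (e.g. "V-A") to the per-modality
    feature matrices produced on that graph (hypothetical layout).
    """
    summed: Dict[str, Matrix] = {}
    for per_modality in graph_outputs.values():
        for modality, feats in per_modality.items():
            if modality in summed:
                summed[modality] = add_matrices(summed[modality], feats)
            else:
                summed[modality] = feats

    # Assumption: concatenate the summed modalities per utterance, in sorted
    # modality order, to form the input of the classifier.
    order = sorted(summed)
    num_utterances = len(summed[order[0]])
    return [
        [value for m in order for value in summed[m][i]]
        for i in range(num_utterances)
    ]
```

With one utterance and two features per modality, passing V features `[[1.0, 2.0]]` and `[[3.0, 4.0]]` from the V-A and V-T graphs gives a summed V row of `[4.0, 6.0]`; the A and T rows are built the same way before concatenation.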

GraphMFT: A Graph Network based Multimodal Fusion Technique for Emotion Recognition in Conversation

GraphCFC: A Directed Graph Based Cross-Modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Bi-stream graph learning based multimodal fusion for emotion recognition in conversation

GA2MIF: Graph and Attention Based Two-Stage Multi-Source Information Fusion for Conversational Emotion Detection

MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations

MLGAT: multi-layer graph attention networks for multimodal emotion recognition in conversations

Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition

Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum

Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations

Synch-Graph: Multisensory Emotion Recognition Through Neural Synchrony Via Graph Convolutional Networks.

MMDAG: Multimodal Directed Acyclic Graph Network for Emotion Recognition in Conversation

Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation

Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation

DGFN Multimodal Emotion Analysis Model Based on Dynamic Graph Fusion Network

MF-Net: a multimodal fusion network for emotion recognition based on multiple physiological signals

SDR-GNN: Spectral Domain Reconstruction Graph Neural Network for Incomplete Multimodal Learning in Conversational Emotion Recognition

A Two-Stage Multimodal Emotion Recognition Model Based on Graph Contrastive Learning