GraphMFT: A Graph Network based Multimodal Fusion Technique for Emotion Recognition in Conversation

Jiang Li, Xiaoping Wang, Guoqing Lv, Zhigang Zeng
DOI: https://doi.org/10.1016/j.neucom.2023.126427
2023-12-02
Abstract: Multimodal machine learning is an emerging area of research, which has received a great deal of scholarly attention in recent years. Up to now, there are few studies on multimodal Emotion Recognition in Conversation (ERC). Since Graph Neural Networks (GNNs) possess the powerful capacity of relational modeling, they have an inherent advantage in the field of multimodal learning. GNNs leverage the graph constructed from multimodal data to perform intra- and inter-modal information interaction, which effectively facilitates the integration and complementation of multimodal data. In this work, we propose a novel Graph network based Multimodal Fusion Technique (GraphMFT) for emotion recognition in conversation. Multimodal data can be modeled as a graph, where each data object is regarded as a node, and both intra- and inter-modal dependencies existing between data objects can be regarded as edges. GraphMFT utilizes multiple improved graph attention networks to capture intra-modal contextual information and inter-modal complementary information. In addition, the proposed GraphMFT attempts to address the challenges of existing graph-based multimodal conversational emotion recognition models such as MMGCN. Empirical results on two public multimodal datasets reveal that our model outperforms the State-Of-The-Art (SOTA) approaches with the accuracy of 67.90% and 61.30%.
Multimedia
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to address several key challenges in multi-modal emotion recognition in conversation (ERC). Specifically, it proposes a graph-network-based multi-modal fusion technique (GraphMFT) to improve the accuracy of emotion recognition in dialogue. The main problems include:

1. **Multi-modal information fusion**:
   - Existing emotion recognition methods either ignore multi-modal information or fail to handle the interaction between modalities effectively. GraphMFT captures the relationships between different modalities by constructing multiple graphs, and so fuses multi-modal data more effectively.
2. **Long-distance context information capture**:
   - Traditional methods based on recurrent neural networks have difficulty capturing long-distance context information, while graph neural networks (GNNs) can capture such information through their relational modeling capabilities. GraphMFT uses improved graph attention networks (GATs) to extract long-distance context information.
3. **Data heterogeneity**:
   - Data from different modalities have different characteristics, so feeding all modalities directly into the same model leads to a data heterogeneity problem. GraphMFT reduces the impact of data heterogeneity by processing each pair of modalities separately.
4. **Noise reduction**:
   - Existing methods usually connect the current node to all other nodes within the same modality, which introduces additional noise. GraphMFT reduces potential noise by selectively connecting the current node to context nodes.

### Solutions

To address the above challenges, the paper proposes GraphMFT, whose main features include:

- **Multi-graph construction**:
  - Construct three graphs (V-A graph, V-T graph, and A-T graph), each containing information from two modalities. This reduces data heterogeneity and captures the interaction information between modalities more effectively.
- **Improved graph attention networks (GATs)**:
  - Use multiple improved GATs to extract intra- and inter-modal dependencies. The improved GATs alleviate the over-smoothing problem of GNNs by connecting the output of the previous layer to the output of the next layer.
- **Multi-modal fusion**:
  - Obtain the final multi-modal fusion feature matrix by adding the feature matrices of the same modality. These feature matrices are then used for emotion prediction.
- **Emotion prediction**:
  - Feed the fused multi-modal feature matrix into a fully connected network for emotion prediction, and train the model with the cross-entropy loss function.

### Experimental results

The experimental results show that GraphMFT outperforms existing baseline methods on two public multi-modal datasets (IEMOCAP and MELD), achieving accuracies of 67.90% and 61.30% respectively.

In conclusion, by proposing GraphMFT, this paper addresses several key problems in multi-modal emotion recognition in conversation and improves recognition accuracy.
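
The multi-graph construction, the selective context connections, and the same-modality fusion described above can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names, the `window` parameter, the rule that inter-modal edges link the two nodes of the same utterance, and the toy feature values are all assumptions made for illustration. The graph attention layers themselves are omitted.

```python
from itertools import combinations


def build_pair_graph(num_utterances, window=2):
    """Return directed edges for one two-modality graph (e.g. V-A).

    Nodes 0..n-1 belong to the first modality, n..2n-1 to the second.
    `window` is a hypothetical parameter: intra-modal edges reach only
    utterances at most `window` positions away, instead of every node.
    """
    n = num_utterances
    edges = set()
    for offset in (0, n):  # one block of nodes per modality
        for i in range(n):
            for j in range(max(0, i - window), min(n, i + window + 1)):
                if i != j:
                    edges.add((offset + i, offset + j))  # intra-modal context edge
    for i in range(n):
        # Assumed inter-modal rule: link the two views of the same utterance.
        edges.add((i, n + i))
        edges.add((n + i, i))
    return edges


def fuse_same_modality(pair_outputs):
    """Add the feature matrices that belong to the same modality.

    `pair_outputs` maps a modality pair such as ("V", "A") to a dict
    {modality: matrix}; a matrix is a list of per-utterance feature rows.
    """
    fused = {}
    for outputs in pair_outputs.values():
        for modality, matrix in outputs.items():
            if modality not in fused:
                fused[modality] = [row[:] for row in matrix]
            else:
                fused[modality] = [
                    [a + b for a, b in zip(row_a, row_b)]
                    for row_a, row_b in zip(fused[modality], matrix)
                ]
    return fused


# Three graphs, one per pair of modalities: V-A, V-T, A-T.
pairs = list(combinations(["V", "A", "T"], 2))
graphs = {pair: build_pair_graph(num_utterances=5, window=1) for pair in pairs}

# Toy per-graph outputs (one utterance, two feature dimensions), standing in
# for what the graph attention layers would produce.
toy_outputs = {
    ("V", "A"): {"V": [[1.0, 2.0]], "A": [[0.5, 0.5]]},
    ("V", "T"): {"V": [[3.0, 4.0]], "T": [[1.0, 1.0]]},
    ("A", "T"): {"A": [[0.5, 1.5]], "T": [[2.0, 0.0]]},
}
fused = fuse_same_modality(toy_outputs)
```

Because each modality appears in exactly two of the three graphs, `fuse_same_modality` sums two matrices per modality. How the three fused matrices are combined before the fully connected classifier is not stated in the summary, so the sketch stops at that point.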