Bi-stream graph learning based multimodal fusion for emotion recognition in conversation

Nannan Lu, Zhiyuan Han, Min Han, Jiansheng Qian
DOI: https://doi.org/10.2139/ssrn.4614720
IF: 18.6
2024-02-03
Information Fusion
Abstract: Emotion Recognition in Conversation (ERC) is the process of automatically detecting and understanding emotions expressed in a conversation, and it plays an important role in human–computer interaction. A conversation generates data in several modalities, including words, tone of voice and facial expressions. Multimodal ERC can fuse the information from these multiple views to comprehensively model emotion dynamics in a conversation. Graph Neural Networks (GNNs) are employed in multimodal ERC to learn intra-modal long-range contextual information and inter-modal interactions. However, fusing different modalities within a graph may cause conflicts among the multimodal information and suffer from data heterogeneity. In this paper, we propose a novel Bi-stream Graph Learning based Multimodal Fusion (BiGMF) approach for ERC. It consists of a unimodal stream graph learning module that models intra-modal long-range context information and a cross-modal stream graph learning module that models inter-modal interactions; the two streams use GNNs to learn the intra- and inter-modal information in parallel. This separated learning scheme can successfully alleviate the conflict and heterogeneity in multimodal data fusion, and promotes the explicit modeling of cross-modal relations. Experimental results on two public datasets further verify the superiority of the proposed approach over state-of-the-art (SOTA) approaches.
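The two-stream idea in the abstract can be illustrated with a toy sketch. This is not the authors' BiGMF implementation: the abstract does not specify graph construction, layer types or the fusion rule, so everything below is an assumption made for illustration. In particular, the function names (`unimodal_stream`, `cross_modal_stream`, `fuse`), the context-window graph, the parameter-free mean aggregation (a real GNN would use learned weights), and fusion by concatenation are all hypothetical choices.

```python
from typing import Dict, List

Vector = List[float]
# Assumed layout: modality name -> one feature vector per utterance,
# with the same feature dimension in every modality.
Features = Dict[str, List[Vector]]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def unimodal_stream(feats: Features, window: int = 1) -> Features:
    """Intra-modal stream (assumed form): within each modality, average
    every utterance with its neighbours inside a context window."""
    out: Features = {}
    for modality, seq in feats.items():
        out[modality] = [
            mean(seq[max(0, i - window): i + window + 1])
            for i in range(len(seq))
        ]
    return out


def cross_modal_stream(feats: Features) -> Features:
    """Inter-modal stream (assumed form): for each utterance, average the
    features of the other modalities. Needs at least two modalities."""
    modalities = list(feats)
    n = len(feats[modalities[0]])
    out: Features = {m: [] for m in modalities}
    for i in range(n):
        for m in modalities:
            others = [feats[o][i] for o in modalities if o != m]
            out[m].append(mean(others))
    return out


def fuse(feats: Features) -> List[Vector]:
    """Compute both streams independently from the same input, then
    concatenate their outputs per utterance (hypothetical fusion rule)."""
    intra = unimodal_stream(feats)
    inter = cross_modal_stream(feats)
    n = len(next(iter(feats.values())))
    fused: List[Vector] = []
    for i in range(n):
        row: Vector = []
        for m in feats:
            row += intra[m][i] + inter[m][i]
        fused.append(row)
    return fused
```

The point of the sketch is only the separation: each stream reads the original features and neither sees the other's output until fusion, which mirrors the parallel intra- and inter-modal learning the abstract describes.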
computer science, artificial intelligence, theory & methods