CFN-ESA: A Cross-Modal Fusion Network with Emotion-Shift Awareness for Dialogue Emotion Recognition.

Jiang Li, Xiaoping Wang, Yingjian Liu, Zhigang Zeng
DOI: https://doi.org/10.1109/taffc.2024.3389453
2023-01-01
Abstract: Multimodal emotion recognition in conversation (ERC) has garnered growing attention from research communities in various fields. In this paper, we propose a Cross-modal Fusion Network with Emotion-Shift Awareness (CFN-ESA) for ERC. Extant approaches employ each modality equally without distinguishing the amount of emotional information in these modalities, rendering it hard to adequately extract complementary information from multimodal data. To cope with this problem, in CFN-ESA, we treat the textual modality as the primary source of emotional information, while the visual and acoustic modalities are taken as secondary sources. Besides, most multimodal ERC models ignore emotion-shift information and overfocus on contextual information, leading to the failure of emotion recognition in emotion-shift scenarios. We elaborate an emotion-shift module to address this challenge. CFN-ESA mainly consists of a unimodal encoder (RUME), a cross-modal encoder (ACME), and an emotion-shift module (LESM). RUME is applied to extract conversation-level contextual emotional cues while pulling together data distributions between modalities; ACME is utilized to perform multimodal interaction centered on the textual modality; LESM is used to model emotion shift and capture emotion-shift information, thereby guiding the learning of the main task. Experimental results demonstrate that CFN-ESA can effectively promote performance for ERC and remarkably outperform state-of-the-art models.
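To make the notion of emotion shift concrete, the following is a minimal sketch of how binary shift labels could be derived from a sequence of per-utterance emotion labels. It assumes that a shift is defined between consecutive utterances of a dialogue; the abstract does not specify how LESM builds its targets, so the function name `emotion_shift_labels` and this definition are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def emotion_shift_labels(emotions: Sequence[str]) -> List[int]:
    """Mark each consecutive utterance pair with 1 if the emotion changes, else 0.

    Assumption (not stated in the abstract): a shift is measured between
    adjacent utterances only, regardless of speaker.
    """
    # Pair every utterance with its successor and compare the two labels.
    return [int(cur != prev) for prev, cur in zip(emotions, emotions[1:])]


# Hypothetical five-utterance dialogue, used only for illustration.
dialogue = ["neutral", "neutral", "angry", "angry", "sad"]
print(emotion_shift_labels(dialogue))  # -> [0, 1, 0, 1]
```

Labels of this kind could serve as targets for an auxiliary shift-prediction task trained alongside the main emotion classifier, which is one plausible reading of "guiding the learning of the main task".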