AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

Sheng Wu, Jiaxing Liu, Longbiao Wang, Dongxiao He, Xiaobao Wang, Jianwu Dang
2024-04-12
Abstract: Emotion Recognition in Conversations (ERC) is a popular task in natural language processing, which aims to recognize the emotional state of the speaker in conversations. While current research primarily emphasizes contextual modeling, there exists a dearth of investigation into effective multimodal fusion methods. We propose a novel framework called AIMDiT to solve the problem of multimodal fusion of deep features. Specifically, we design a Modality Augmentation Network which performs rich representation learning through dimension transformation of different modalities and parameter-efficient inception block. On the other hand, the Modality Interaction Network performs interaction fusion of extracted inter-modal features and intra-modal features. Experiments conducted using our AIMDiT framework on the public benchmark dataset MELD reveal 2.34% and 2.87% improvements in terms of the Acc-7 and w-F1 metrics compared to the state-of-the-art (SOTA) models.
Multimedia, Artificial Intelligence, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses Emotion Recognition in Conversations (ERC). Current research mainly focuses on context modeling, with less attention given to effective multimodal fusion methods. To fill this gap, the authors propose AIMDiT, a framework for fusing deep features of multimodal data. The framework has two components:

1. **Modality Augmentation Network (MAN)**: Conducts rich representation learning through dimension transformation of different modalities and parameter-efficient Inception blocks.
2. **Modality Interaction Network (MIN)**: Performs interactive fusion of the extracted inter-modal and intra-modal features.

Experimental results show that AIMDiT improves the Acc-7 and w-F1 metrics on the public benchmark dataset MELD by 2.34% and 2.87%, respectively, over existing state-of-the-art models. The paper also reports ablation experiments that verify the effectiveness of each module and explores the impact of different modality combinations.
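The two reported metrics can be illustrated with a minimal sketch. It assumes that Acc-7 means plain accuracy over MELD's seven emotion classes and that w-F1 means the support-weighted average of per-class F1 scores. The function names and the toy labels are hypothetical and are not taken from the paper or its code.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted emotion matches the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Toy labels for illustration only (not MELD data).
y_true = ["joy", "joy", "anger", "neutral", "neutral", "neutral"]
y_pred = ["joy", "neutral", "anger", "neutral", "neutral", "joy"]

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
print(f"accuracy={acc:.4f}  weighted F1={wf1:.4f}")
```

Weighting by support matters on MELD because the class distribution is skewed toward the neutral class, so an unweighted (macro) average would give rare emotions the same influence as frequent ones.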