Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition

Bobo Li, Hao Fei, Lizi Liao, Yu Zhao, Chong Teng, Tat-Seng Chua, Donghong Ji, Fei Li
2023-08-12
Abstract: It has been a hot research topic to enable machines to understand human emotions in multimodal contexts under dialogue scenarios, which is tasked with multimodal emotion analysis in conversation (MM-ERC). MM-ERC has received consistent attention in recent years, where a diverse range of methods has been proposed for securing better task performance. Most existing works treat MM-ERC as a standard multimodal classification problem and perform multimodal feature disentanglement and fusion for maximizing feature utility. Yet after revisiting the characteristic of MM-ERC, we argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps. In this work, we target further pushing the task performance by taking full consideration of the above insights. On the one hand, during feature disentanglement, based on the contrastive learning technique, we devise a Dual-level Disentanglement Mechanism (DDM) to decouple the features into both the modality space and utterance space. On the other hand, during the feature fusion stage, we propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration, respectively. They together schedule the proper integrations of multimodal and context features. Specifically, CFM explicitly manages the multimodal feature contributions dynamically, while CRM flexibly coordinates the introduction of dialogue contexts. On two public MM-ERC datasets, our system achieves new state-of-the-art performance consistently. Further analyses demonstrate that all our proposed mechanisms greatly facilitate the MM-ERC task by making full use of the multimodal and context features adaptively. Note that our proposed methods have the great potential to facilitate a broader range of other conversational multimodal tasks.
Computation and Language
What problem does this paper attempt to address?
The problem this paper addresses is how to decouple and fuse modality features and conversation context effectively in multimodal emotion recognition in conversation (MM-ERC), so as to improve task performance. The paper points out that most existing methods treat MM-ERC as a standard multimodal classification problem: they focus on decoupling and fusing multimodal features but overlook the relationship between the conversation context and the consistency of multimodal features. The paper therefore proposes a new framework, DF-ERC (Disentanglement & Fusion for Emotion Recognition in Conversation), which models multimodality and conversation context simultaneously.

### Main Contributions

1. **Re-examining the MM-ERC task**: The authors describe this as the first work to approach feature decoupling and fusion from both the multimodal and the context perspectives.
2. **Technical contributions**: Three mechanisms are proposed to decouple and fuse multimodal and context features:
   - **Dual-level Disentanglement Mechanism (DDM)**: Uses contrastive learning to decouple features into a modality space and an utterance space.
   - **Contribution-aware Fusion Mechanism (CFM)**: Dynamically manages the contribution of each modality's features so that fusion is controllable.
   - **Context Refusion Mechanism (CRM)**: Flexibly introduces historical conversation context, avoiding the prediction bias caused by over-relying on history.
3. **Empirical contributions**: The system achieves state-of-the-art performance on two widely used benchmark datasets, MELD and IEMOCAP.
4. **Application potential**: The authors expect the method to transfer to other conversational multimodal tasks.

### Method Overview

1. **Multimodal feature encoding**: Pre-trained models (such as RoBERTa) extract text features, while openSMILE and DenseNet extract audio and video features.
2. **Dual-level Disentanglement Mechanism (DDM)**: Contrastive learning decouples features at the modality level and at the utterance level, reducing the influence of irrelevant features.
3. **Contribution-aware Fusion Mechanism (CFM)**: Fusion weights are allocated dynamically according to the probability each modality assigns to the true class, which makes feature fusion more controllable and effective.
4. **Context Refusion Mechanism (CRM)**: Prototype vectors are learned to measure how consistent the multimodal features are, and that consistency determines how much historical context to introduce.
5. **Prediction and learning**: The final fused features are used for emotion recognition, and the model is trained with a cross-entropy loss.

### Experimental Results

DF-ERC significantly outperforms the existing baseline models on both MELD and IEMOCAP, in particular on weighted F1-score (W-F1) and accuracy (Acc). These results support the effectiveness and stability of the proposed mechanisms for multimodal emotion recognition.
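To make the contribution-aware fusion step from the method overview concrete, the sketch below turns per-modality true-class probabilities into fusion weights and applies them as a weighted sum. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy logits and features, and the choice to normalize the probabilities by their sum are all hypothetical. It also uses the gold label directly, which is only possible during training; the summary does not say how the weights are obtained at inference time.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def true_class_probability(logits, label):
    """Probability that a unimodal classifier assigns to the gold label."""
    return softmax(logits)[label]


def contribution_weights(unimodal_logits, label):
    """Normalize per-modality true-class probabilities into fusion weights.

    Assumption: weights are the probabilities divided by their sum.
    """
    tcp = {
        name: true_class_probability(logits, label)
        for name, logits in unimodal_logits.items()
    }
    total = sum(tcp.values())
    return {name: p / total for name, p in tcp.items()}


def fuse(features, weights):
    """Weighted sum of equal-length per-modality feature vectors."""
    dim = len(next(iter(features.values())))
    return [
        sum(weights[name] * features[name][i] for name in features)
        for i in range(dim)
    ]


# Hypothetical utterance with three emotion classes; the gold label is class 0.
logits = {
    "text": [2.0, 0.0, 0.0],   # confident and correct
    "audio": [0.0, 0.0, 0.0],  # uninformative
    "video": [0.0, 2.0, 0.0],  # confident but wrong
}
features = {
    "text": [1.0, 0.0],
    "audio": [0.0, 1.0],
    "video": [1.0, 1.0],
}

weights = contribution_weights(logits, label=0)
fused = fuse(features, weights)
print(weights)
print(fused)
```

In this toy case the text modality receives the largest weight and the video modality the smallest, so a modality that is confidently wrong contributes least to the fused representation.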