Abstract: It has been a hot research topic to enable machines to understand human emotions in multimodal contexts under dialogue scenarios, which is tasked with multimodal emotion analysis in conversation (MM-ERC). MM-ERC has received consistent attention in recent years, where a diverse range of methods has been proposed for securing better task performance. Most existing works treat MM-ERC as a standard multimodal classification problem and perform multimodal feature disentanglement and fusion for maximizing feature utility. Yet after revisiting the characteristic of MM-ERC, we argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps. In this work, we target further pushing the task performance by taking full consideration of the above insights. On the one hand, during feature disentanglement, based on the contrastive learning technique, we devise a Dual-level Disentanglement Mechanism (DDM) to decouple the features into both the modality space and utterance space. On the other hand, during the feature fusion stage, we propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration, respectively. They together schedule the proper integrations of multimodal and context features. Specifically, CFM explicitly manages the multimodal feature contributions dynamically, while CRM flexibly coordinates the introduction of dialogue contexts. On two public MM-ERC datasets, our system achieves new state-of-the-art performance consistently. Further analyses demonstrate that all our proposed mechanisms greatly facilitate the MM-ERC task by making full use of the multimodal and context features adaptively. Note that our proposed methods have the great potential to facilitate a broader range of other conversational multimodal tasks.

What problem does this paper attempt to address?

The paper asks how to effectively disentangle and fuse modality features and conversation context when performing multimodal emotion recognition in conversation (MM-ERC), so as to improve task performance. Specifically, it points out that most existing methods treat MM-ERC as a standard multimodal classification problem: they focus on disentangling and fusing multimodal features but ignore the relationship between the conversation context and the consistency of multimodal features. The paper therefore proposes a new framework, DF-ERC (Disentanglement & Fusion for Emotion Recognition in Conversation), which models modality and conversation context simultaneously to further improve task performance.

### Main Contributions

1. **Re-examining the MM-ERC task**: The paper claims to be the first to propose feature disentanglement and fusion from both the modality and the context perspectives to enhance task performance.
2. **Technical contributions**: Three mechanisms are proposed to disentangle and fuse multimodal and context features:
   - **Dual-level Disentanglement Mechanism (DDM)**: Based on contrastive learning, decouples features into the modality space and the utterance space.
   - **Contribution-aware Fusion Mechanism (CFM)**: Dynamically manages the contribution of each modality's features to achieve controllable feature coordination.
   - **Context Refusion Mechanism (CRM)**: Flexibly introduces historical conversation context to avoid the prediction bias caused by over-relying on historical information.
3. **Empirical contributions**: The system achieves state-of-the-art performance on two widely used benchmark datasets, MELD and IEMOCAP.
4. **Application potential**: The proposed method has broad application potential and can benefit other multimodal conversation tasks.

### Method Overview

1. **Multimodal feature encoding**: Pre-trained models (such as RoBERTa) extract text features, while openSMILE and DenseNet extract audio and video features.
2. **Dual-level Disentanglement Mechanism (DDM)**: Contrastive learning decouples features at the modality level and at the utterance level to reduce the influence of irrelevant features.
3. **Contribution-aware Fusion Mechanism (CFM)**: Fusion weights are allocated dynamically according to the true classification probability of each modality, which improves the controllability and effectiveness of feature fusion.
4. **Context Refusion Mechanism (CRM)**: Prototype vector learning is used to compute the degree of consistency of the multimodal features and to decide flexibly how much historical context information to introduce.
5. **Prediction and learning**: The final fused features are used for emotion recognition, and the model is trained with the cross-entropy loss function.

### Experimental Results

The experimental results show that DF-ERC significantly outperforms the existing baseline models on both datasets, MELD and IEMOCAP, especially in terms of weighted F1-score (W-F1) and accuracy (Acc). These results support the effectiveness and stability of the proposed mechanisms in improving multimodal emotion recognition performance.
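To make the Contribution-aware Fusion Mechanism from the method overview more concrete, the following is a minimal sketch of weighting modalities by their true classification probability. It is not the authors' implementation. The function names, the toy feature vectors, and the choice to normalize the probabilities by their sum are all assumptions made for illustration. It also assumes the ground-truth label is available, which only holds during training; at inference a system would need some estimate of this probability.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def true_class_probability(logits, label):
    """Probability that a unimodal classifier assigns to the ground-truth class."""
    return softmax(logits)[label]


def contribution_aware_fusion(features, logits, label):
    """Fuse per-modality features with weights derived from true-class probability.

    features: dict mapping modality name -> feature vector (lists of equal length)
    logits:   dict mapping modality name -> unimodal classifier logits
    label:    index of the ground-truth emotion class (training-time assumption)
    """
    modalities = sorted(features)
    tcp = [true_class_probability(logits[m], label) for m in modalities]
    total = sum(tcp)
    # Assumption: weights are the true-class probabilities normalized to sum to 1.
    weights = {m: p / total for m, p in zip(modalities, tcp)}
    dim = len(features[modalities[0]])
    fused = [sum(weights[m] * features[m][i] for m in modalities) for i in range(dim)]
    return fused, weights


# Hypothetical toy inputs: three modalities, two feature dimensions, three classes.
features = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "video": [0.5, 0.5]}
logits = {
    "text": [2.0, 0.0, 0.0],   # confident in the correct class 0
    "audio": [0.0, 0.0, 0.0],  # uninformative
    "video": [0.0, 1.0, 0.0],  # leans toward a wrong class
}
fused, weights = contribution_aware_fusion(features, logits, label=0)
print(weights)
print(fused)
```

In this toy example the text modality is most confident about the correct class, so it receives the largest fusion weight, while the video modality, which leans toward a wrong class, receives the smallest.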

Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations

AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation

Fine-grained Disentangled Representation Learning for Multimodal Emotion Recognition

Revisiting Multi-modal Emotion Learning with Broad State Space Models and Probability-guidance Fusion

Cross-modal fusion techniques for utterance-level emotion recognition from text and speech

Fusing pairwise modalities for emotion recognition in conversations

A Contextual Attention Network for Multimodal Emotion Recognition in Conversation

MultiEMO: an Attention-Based Correlation-Aware Multimodal Fusion Framework for Emotion Recognition in Conversations

Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation

Multiplex graph aggregation and feature refinement for unsupervised incomplete multimodal emotion recognition

Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation

A twin disentanglement Transformer Network with Hierarchical-Level Feature Reconstruction for robust multimodal emotion recognition

Target and Source Modality Co-Reinforcement for Emotion Understanding from Asynchronous Multimodal Sequences

EffMulti: Efficiently Modeling Complex Multimodal Interactions for Emotion Analysis

Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition

Contextual and Cross-Modal Interaction for Multi-Modal Speech Emotion Recognition

Multimodal Emotion Recognition Based on Cascaded Multichannel and Hierarchical Fusion