Abstract: Enabling machines to understand human emotions in multimodal contexts under dialogue scenarios, a task known as multimodal emotion analysis in conversation (MM-ERC), has been a hot research topic. MM-ERC has received consistent attention in recent years, and a diverse range of methods has been proposed to improve task performance. Most existing works treat MM-ERC as a standard multimodal classification problem and perform multimodal feature disentanglement and fusion to maximize feature utility. Yet after revisiting the characteristics of MM-ERC, we argue that both feature multimodality and conversational contextualization should be modeled simultaneously during the feature disentanglement and fusion steps. In this work, we aim to further improve task performance by fully accounting for these insights. On the one hand, during feature disentanglement, we devise a Dual-level Disentanglement Mechanism (DDM), based on contrastive learning, to decouple the features into both the modality space and the utterance space. On the other hand, during the feature fusion stage, we propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration, respectively. Together they schedule the proper integration of multimodal and context features. Specifically, CFM explicitly and dynamically manages the contributions of the multimodal features, while CRM flexibly coordinates the introduction of dialogue contexts. On two public MM-ERC datasets, our system consistently achieves new state-of-the-art performance. Further analyses demonstrate that all of our proposed mechanisms greatly facilitate the MM-ERC task by adaptively making full use of the multimodal and context features. Note that our proposed methods have great potential to facilitate a broader range of other conversational multimodal tasks.
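
To make the idea of dynamically managed modality contributions concrete, the following is a minimal sketch of softmax-weighted fusion over per-utterance modality features. It is not the CFM described in the abstract, whose formulation is not given here; it is a generic weighting scheme shown for illustration only. The names `contribution_weights`, `fuse`, and `scorer`, the dot-product scoring, and the toy feature vectors are all assumptions.

```python
import math
from typing import Dict, List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contribution_weights(features: Dict[str, Vector], scorer: Vector) -> Dict[str, float]:
    """Score each modality with a dot product against a (hypothetical)
    scoring vector, then normalise the scores into weights that sum to 1."""
    names = sorted(features)
    scores = [sum(w * x for w, x in zip(scorer, features[n])) for n in names]
    return dict(zip(names, softmax(scores)))


def fuse(features: Dict[str, Vector], scorer: Vector) -> Vector:
    """Weighted sum of modality features using the weights above."""
    weights = contribution_weights(features, scorer)
    dim = len(scorer)
    return [sum(weights[n] * features[n][i] for n in features) for i in range(dim)]


# Toy utterance with three modalities (made-up 2-d features).
utterance = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "visual": [0.0, 0.0]}
scorer = [1.0, 0.0]  # assumed fixed here; in a real model it would be learned

weights = contribution_weights(utterance, scorer)
fused = fuse(utterance, scorer)
```

In this toy setup the text modality scores highest and therefore receives the largest weight, so the fused vector leans toward the text features. A trained model would learn the scoring function rather than fix it by hand.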

MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

GraphMFT: A Graph Network Based Multimodal Fusion Technique for Emotion Recognition in Conversation

Bi-stream graph learning based multimodal fusion for emotion recognition in conversation

MF-Net: a multimodal fusion network for emotion recognition based on multiple physiological signals

Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation

A multi-stage dynamical fusion network for multimodal emotion recognition

Fusion with Hierarchical Graphs for Mulitmodal Emotion Recognition

Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition

MMDAG: Multimodal Directed Acyclic Graph Network for Emotion Recognition in Conversation

Fusing pairwise modalities for emotion recognition in conversations

MFGCN: Multimodal fusion graph convolutional network for speech emotion recognition

GA2MIF: Graph and Attention Based Two-Stage Multi-Source Information Fusion for Conversational Emotion Detection

M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation

A Contextual Attention Network for Multimodal Emotion Recognition in Conversation

A novel feature fusion network for multimodal emotion recognition from EEG and eye movement signals

MultiEMO: an Attention-Based Correlation-Aware Multimodal Fusion Framework for Emotion Recognition in Conversations

AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning