Abstract: Emotion Recognition in Conversations (ERC) aims to identify the emotion label of each utterance in a conversation and has significant application value in human–computer interaction. Existing research suggests that introducing commonsense knowledge (CSK) and multimodal information improves model performance on ERC tasks. However, several challenges persist: (1) the complex psychological influences between utterances are neglected; (2) modal information contains noise; (3) emotion labels with few samples are hard to predict, particularly when utterances are semantically similar but belong to different emotion categories. To address these problems, we propose a Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning (MKIN-MCL). First, we build a knowledge aggregation graph to capture the CSK dependencies between utterances in a conversation, and aggregate the relevant knowledge to enhance text features. In parallel, we apply feature filters to the acoustic and visual modalities to remove noise and improve feature quality. Furthermore, we implement an interactive attention module by stacking purpose-designed Cross-modal Interactive Transformers (CITs), which repeatedly explore the relevance between the interacting parties in their respective semantic spaces, improving the effectiveness of modality interaction while reducing the noise generated during interaction. Finally, we employ a Mixed Contrastive Learning (MCL) strategy to improve the model’s handling of few-shot labels: unsupervised contrastive learning improves the representation capability of the multimodal fusion features, and supervised contrastive learning extracts information from few-shot labels. Extensive experiments on two benchmark datasets, IEMOCAP and MELD, validate the effectiveness and superiority of the proposed model.
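The abstract does not give the loss formulation behind the MCL strategy, so the following is only a minimal sketch of how an unsupervised and a supervised contrastive term might be combined. It assumes a standard InfoNCE-style loss for the unsupervised part and a SupCon-style loss for the supervised part. The function names, the `alpha` mixing weight, and the `temperature` value are illustrative assumptions, not details taken from the paper.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (guarding against the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(view_a, view_b, temperature=0.5):
    """Unsupervised term (assumed form): the positive for sample i in
    view_a is sample i in view_b; all other samples in view_b are negatives."""
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]
    total = 0.0
    for i in range(len(a)):
        logits = [_dot(a[i], b[j]) / temperature for j in range(len(b))]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        total += -(logits[i] - log_denom)
    return total / len(a)


def sup_con(features, labels, temperature=0.5):
    """Supervised term (assumed form): positives for sample i are the
    other samples in the batch that share its label."""
    z = [_normalize(v) for v in features]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # a label seen only once has no positive pair
        logits = {j: _dot(z[i], z[j]) / temperature for j in others}
        log_denom = math.log(sum(math.exp(l) for l in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


def mixed_contrastive_loss(view_a, view_b, labels, alpha=0.5, temperature=0.5):
    """Hypothetical mix: alpha weights the unsupervised term,
    (1 - alpha) weights the supervised term."""
    unsup = info_nce(view_a, view_b, temperature)
    sup = sup_con(view_a, labels, temperature)
    return alpha * unsup + (1 - alpha) * sup


# Toy fused features: two tight clusters in 2-D.
features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned_labels = [0, 0, 1, 1]     # labels agree with the clusters
misaligned_labels = [0, 1, 0, 1]  # labels cut across the clusters
```

In this toy setting the supervised term is lower when the labels agree with the feature clusters than when they cut across them. This is the property that would push same-emotion utterances together, including those from classes with few samples.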

Adapted Dynamic Memory Network for Emotion Recognition in Conversation

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Dynamic Interactive Multiview Memory Network for Emotion Recognition in Conversation

MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations

A Contextual Attention Network for Multimodal Emotion Recognition in Conversation

MMGCN: Multimodal Fusion Via Deep Graph Convolution Network for Emotion Recognition in Conversation

EmotionIC: emotional inertia and contagion-driven dependency modeling for emotion recognition in conversation

Emotion Recognition in Conversation Based on a Dynamic Complementary Graph Convolutional Network

Speaker-aware cognitive network with cross-modal attention for multimodal emotion recognition in conversation

Watch the Speakers: A Hybrid Continuous Attribution Network for Emotion Recognition in Conversation With Emotion Disentanglement

Conversational emotion recognition studies based on graph convolutional neural networks and a dependent syntactic analysis

Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation

An Emotion Evolution Network for Emotion Recognition in Conversation

SpikEmo: Enhancing Emotion Recognition With Spiking Temporal Dynamics in Conversations

An Iterative Emotion Interaction Network for Emotion Recognition in Conversations

Dynamic Emotion-Dependent Network with Relational Subgraph Interaction for Multimodal Emotion Recognition

HAAN-ERC: Hierarchical Adaptive Attention Network for Multimodal Emotion Recognition in Conversation

DialogueEIN: Emotion Interaction Network for Dialogue Affective Analysis

Dialogue emotion model based on local–global context encoder and commonsense knowledge fusion attention

MRSLN: A Multimodal Residual Speaker-LSTM Network to alleviate the over-smoothing issue for Emotion Recognition in Conversation