Abstract: Emotion Recognition in Conversations (ERC) aims to identify the emotion label of each utterance in a conversation and has significant application value in human–computer interaction. Existing research suggests that introducing commonsense knowledge (CSK) and multimodal information improves model performance on ERC tasks. However, several challenges persist: (1) the complex psychological influences between utterances are neglected; (2) modal information contains noise; (3) emotion labels with few samples are hard to predict, particularly when utterances are semantically similar but belong to different emotion categories. To address these problems, we propose a Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning (MKIN-MCL). First, we build a knowledge aggregation graph to capture the CSK dependencies between utterances in a conversation, and we aggregate the relevant knowledge to enhance the text features. Simultaneously, we apply feature filters to the acoustic and visual modalities to remove noise and improve feature quality. Furthermore, we implement an interactive attention module by stacking purpose-designed Cross-modal Interactive Transformers (CITs), which continuously explore the relevance between the interacting parties in their respective semantic spaces, improving the effectiveness of modality interaction while reducing the noise generated during the interaction. Lastly, we employ a Mixed Contrastive Learning (MCL) strategy to improve the model’s handling of few-shot labels: unsupervised contrastive learning improves the representation capability of the multimodal fusion features, and supervised contrastive learning extracts information from few-shot labels. Extensive experiments on two benchmark datasets, IEMOCAP and MELD, validate the effectiveness and superiority of the proposed model.
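
The abstract does not give the exact form of the MCL objective, so the following is only a minimal sketch of how an unsupervised and a supervised contrastive term could be combined. It assumes an InfoNCE-style loss over two views of each fused feature and a SupCon-style loss over emotion labels. The function names, the `alpha` weighting, and the `temperature` value are all hypothetical and are not taken from the paper.

```python
import numpy as np


def contrastive_loss(features, positive_mask, temperature=0.1):
    """Generic contrastive loss: pull each anchor toward its positives.

    features: (n, d) array; positive_mask: (n, n) boolean array.
    """
    z = np.asarray(features, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    n = z.shape[0]
    self_mask = np.eye(n, dtype=bool)
    # Scaled similarities; a sample is never compared with itself.
    sim = np.where(self_mask, -np.inf, z @ z.T / temperature)
    # Numerically stable row-wise log-softmax over the other samples.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom
    pos = np.asarray(positive_mask, dtype=bool) & ~self_mask
    n_pos = pos.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        return 0.0  # no anchor has a positive, so there is nothing to pull together
    summed = np.where(pos, log_prob, 0.0).sum(axis=1)
    return float((-summed[has_pos] / n_pos[has_pos]).mean())


def unsupervised_loss(view_a, view_b, temperature=0.1):
    # Assumption: two views of the same sample form the only positive pair.
    n = len(view_a)
    features = np.concatenate([view_a, view_b], axis=0)
    idx = np.arange(2 * n)
    pos = np.zeros((2 * n, 2 * n), dtype=bool)
    pos[idx, (idx + n) % (2 * n)] = True
    return contrastive_loss(features, pos, temperature)


def supervised_loss(features, labels, temperature=0.1):
    # Assumption: samples sharing an emotion label are positives.
    labels = np.asarray(labels)
    pos = labels[:, None] == labels[None, :]
    return contrastive_loss(features, pos, temperature)


def mixed_loss(view_a, view_b, labels, alpha=0.5, temperature=0.1):
    # Hypothetical weighting; the paper may combine the two terms differently.
    unsup = unsupervised_loss(view_a, view_b, temperature)
    sup = supervised_loss(view_a, labels, temperature)
    return alpha * unsup + (1.0 - alpha) * sup
```

In this sketch the supervised term is smaller when same-label features already sit close together, which is the behaviour a few-shot class would benefit from, since every labelled sample of that class acts as a positive for the others.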
