Abstract: Emotion Recognition in Conversations (ERC) aims to identify the emotion label of each utterance in a conversation and has significant application value in human–computer interaction. Existing research suggests that introducing commonsense knowledge (CSK) and multimodal information improves model performance on ERC tasks. However, several challenges persist: (1) complex psychological influences between utterances are neglected; (2) modal information contains noise; (3) emotion labels with few samples are hard to predict, particularly across categories that are semantically similar yet emotionally distinct. To address these problems, we propose a Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning (MKIN-MCL). First, we establish a knowledge aggregation graph to capture the CSK dependencies between utterances in a conversation, and we aggregate the relevant knowledge to enhance text features. Simultaneously, we apply feature filters to the acoustic and visual modalities to remove noise and improve feature quality. Furthermore, we implement an interactive attention module by stacking purpose-designed Cross-modal Interactive Transformers (CITs), which continuously explore the relevance between the interacting parties in their respective semantic spaces, improving the effectiveness of modality interaction while reducing the noise generated during the interaction. Finally, we employ a Mixed Contrastive Learning (MCL) strategy to improve the model's handling of few-shot labels. This strategy uses unsupervised contrastive learning to strengthen the representation capability of the multimodal fusion features and supervised contrastive learning to extract information from few-shot labels. Extensive experiments on two benchmark datasets, IEMOCAP and MELD, validate the effectiveness and superiority of the proposed model.
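
To make the idea of mixing unsupervised and supervised contrastive learning more concrete, the following is a minimal NumPy sketch, not the MKIN-MCL implementation. The abstract does not give the loss formulation, so everything here is an assumption: the unsupervised term is written as a SimCLR-style NT-Xent loss over two views of the fused features, the supervised term as a SupCon-style loss over label-sharing pairs, and the function names, the mixing weight `alpha`, and the `temperature` value are hypothetical.

```python
import numpy as np


def _normalize(x):
    # L2-normalize each row (assumes no all-zero feature vectors).
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def _log_softmax_excluding_self(sim):
    # Row-wise log-softmax over similarity logits, with the diagonal masked
    # out so that an anchor is never contrasted against itself.
    n = sim.shape[0]
    sim = np.where(np.eye(n, dtype=bool), -np.inf, sim)
    m = sim.max(axis=1, keepdims=True)
    lse = m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    return sim - lse


def unsupervised_contrastive_loss(view_a, view_b, temperature=0.1):
    # Assumed NT-Xent form: row i of view_a and row i of view_b are two
    # views of the same utterance and act as each other's only positive.
    n = view_a.shape[0]
    z = _normalize(np.concatenate([view_a, view_b], axis=0))
    logp = _log_softmax_excluding_self(z @ z.T / temperature)
    idx = np.arange(2 * n)
    partner = (idx + n) % (2 * n)
    return float(-logp[idx, partner].mean())


def supervised_contrastive_loss(features, labels, temperature=0.1):
    # Assumed SupCon form: every other sample with the same emotion label
    # is a positive; anchors without any positive are skipped.
    labels = np.asarray(labels)
    z = _normalize(features)
    logp = _log_softmax_excluding_self(z @ z.T / temperature)
    pos = (labels[:, None] == labels[None, :]) & ~np.eye(len(labels), dtype=bool)
    n_pos = pos.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        return 0.0
    sum_logp = np.where(pos, logp, 0.0).sum(axis=1)
    return float(-(sum_logp[has_pos] / n_pos[has_pos]).mean())


def mixed_contrastive_loss(view_a, view_b, labels, alpha=0.5, temperature=0.1):
    # Hypothetical mixing rule: a convex combination of the two terms.
    unsup = unsupervised_contrastive_loss(view_a, view_b, temperature)
    sup = supervised_contrastive_loss(view_a, labels, temperature)
    return alpha * unsup + (1.0 - alpha) * sup


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fused = rng.normal(size=(8, 16))                      # stand-in fused features
    augmented = fused + 0.05 * rng.normal(size=(8, 16))   # stand-in second view
    emotions = [0, 0, 1, 1, 2, 2, 3, 3]                   # stand-in emotion labels
    print(mixed_contrastive_loss(fused, augmented, emotions))
```

In this sketch the supervised term pulls together utterances that share a label, which is one plausible way to give rare emotion classes a tighter cluster, while the unsupervised term needs no labels at all. How the actual model builds its views and weights the two terms is not stated in the abstract.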

RL-EMO: A Reinforcement Learning Framework for Multimodal Emotion Recognition

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

An Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals

MultiEMO: an Attention-Based Correlation-Aware Multimodal Fusion Framework for Emotion Recognition in Conversations

Multi-Scale Receptive Field Graph Model for Emotion Recognition in Conversations

Real-Time Video Emotion Recognition Based on Reinforcement Learning and Domain Knowledge

A Contextual Attention Network for Multimodal Emotion Recognition in Conversation

Human-Robot Emotional Interaction Model Based on Reinforcement Learning

Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation

Multi-Modal Attentive Prompt Learning for Few-shot Emotion Recognition in Conversations

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Fine-grained Disentangled Representation Learning for Multimodal Emotion Recognition

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG

A Multi-Level Alignment and Cross-Modal Unified Semantic Graph Refinement Network for Conversational Emotion Recognition

A Contextualized Real-Time Multimodal Emotion Recognition for Conversational Agents using Graph Convolutional Networks in Reinforcement Learning

Explainable Multimodal Emotion Reasoning: a Promising Way to Open-set Emotion Recognition

Self-adaptive Context and Modal-interaction Modeling For Multimodal Emotion Recognition

Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation

Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation

Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition

EmotionIC: emotional inertia and contagion-driven dependency modeling for emotion recognition in conversation