Abstract: Emotion Recognition in Conversations (ERC) aims to identify the emotion label of each utterance in a conversation and has significant application value in human–computer interaction. Existing research suggests that introducing commonsense knowledge (CSK) and multimodal information enhances model performance on ERC tasks. However, several challenges persist: (1) the complex psychological influences between utterances are neglected; (2) modal information contains noise; (3) emotion labels with few samples are hard to predict, particularly when utterances are semantically similar but belong to different emotion categories. To address these problems, we propose a Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning (MKIN-MCL). First, we establish a knowledge aggregation graph to capture the CSK dependencies between utterances during a conversation, and we aggregate the relevant knowledge to enhance text features. Simultaneously, we apply feature filters to the acoustic and visual modalities to eliminate noise and enhance feature quality. Furthermore, we implement an interactive attention module by stacking our Cross-modal Interactive Transformers (CITs), which continuously explore the relevance between the interacting parties in their respective semantic spaces, improving the effectiveness of modality interaction while reducing the noise generated during the interaction. Lastly, we employ a Mixed Contrastive Learning (MCL) strategy to enhance the model's ability to handle few-shot labels. This strategy uses unsupervised contrastive learning to improve the representation capability of the multimodal fusion features and supervised contrastive learning to extract information from few-shot labels. Extensive experiments on two benchmark datasets, IEMOCAP and MELD, validate the effectiveness and superiority of the proposed model.
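
The abstract does not give the MCL loss itself, so the following is only a minimal sketch of how an unsupervised and a supervised contrastive term might be combined. It assumes an InfoNCE-style loss over L2-normalized features, two "views" of each fused utterance feature for the unsupervised term, and a simple weighted sum; the names `contrastive_loss`, `mixed_contrastive_loss`, `weight`, and `temperature` are hypothetical and are not taken from the paper.

```python
import numpy as np

def contrastive_loss(features, labels, temperature=0.1):
    """InfoNCE-style loss: samples sharing a label are positives.

    Illustrative assumption, not the paper's exact formulation.
    """
    z = np.asarray(features, dtype=np.float64)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)  # L2-normalize
    labels = np.asarray(labels)
    n = z.shape[0]
    self_mask = np.eye(n, dtype=bool)
    # Positives: same label as the anchor, excluding the anchor itself.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0  # no anchor has a positive, so there is nothing to contrast
    # Scaled cosine similarities, with self-similarity removed from the softmax.
    sim = np.where(self_mask, -np.inf, z @ z.T / temperature)
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    # Mean log-probability of the positives for each anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())

def mixed_contrastive_loss(view_a, view_b, labels, weight=0.5, temperature=0.1):
    """Weighted sum of an unsupervised and a supervised term (hypothetical mix)."""
    labels = np.asarray(labels)
    n = labels.shape[0]
    both = np.concatenate([view_a, view_b], axis=0)
    # Unsupervised term: the only positive of a sample is its other view.
    instance_ids = np.tile(np.arange(n), 2)
    unsup = contrastive_loss(both, instance_ids, temperature)
    # Supervised term: every sample with the same emotion label is a positive.
    sup = contrastive_loss(both, np.tile(labels, 2), temperature)
    return weight * unsup + (1.0 - weight) * sup
```

In this sketch the supervised term pulls together all samples of a class, including rare ones, which is the intuition the abstract gives for handling few-shot labels. How the two views are produced and how the terms are weighted are left open here.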

Smile: Spiking Multi-Modal Interactive Label-Guided Enhancement Network for Emotion Recognition

MFDR: Multiple-stage Fusion and Dynamically Refined Network for Multimodal Emotion Recognition

SMIN: Semi-supervised Multi-modal Interaction Network for Conversational Emotion Recognition

Investigating Multisensory Integration in Emotion Recognition Through Bio-Inspired Computational Models

Self-adaptive Context and Modal-interaction Modeling For Multimodal Emotion Recognition

A multimodal shared network with a cross-modal distribution constraint for continuous emotion recognition

Multimodal Emotion Recognition by Extracting Common and Modality-Specific Information

Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation

Cross-Modal Guiding Neural Network for Multimodal Emotion Recognition From EEG and Eye Movement Signals

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

SpikEmo: Enhancing Emotion Recognition With Spiking Temporal Dynamics in Conversations

MF-Net: a multimodal fusion network for emotion recognition based on multiple physiological signals

EffMulti: Efficiently Modeling Complex Multimodal Interactions for Emotion Analysis

Multimodal Emotion Recognition based on Facial Expressions, Speech, and EEG

Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition

A multi-stage dynamical fusion network for multimodal emotion recognition

A Two-Stage Multimodal Emotion Recognition Model Based on Graph Contrastive Learning

Multi-modal fusion network with complementarity and importance for emotion recognition

MLGAT: multi-layer graph attention networks for multimodal emotion recognition in conversations

MMDAG: Multimodal Directed Acyclic Graph Network for Emotion Recognition in Conversation

Multi-head attention fusion networks for multi-modal speech emotion recognition