Abstract:Emotion Recognition in Conversations (ERC) aims to accurately identify the emotional labels of each utterance in a conversation, holding significant application value in human–computer interaction. Existing research suggests introducing commonsense knowledge (CSK) and multimodal information enhances model performance in ERC tasks. However, several challenges persist: (1) the neglect of complex psychological influences between utterances; (2) noise issues within modal information; (3) prediction challenges for emotion labels with few samples in different categories that exhibit semantic similarity but distinct emotional categories. To address the above problems, we propose a Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning (MKIN-MCL). Firstly, we establish a knowledge aggregation graph to capture the dependencies of commonsense knowledge (CSK) between utterances during a conversation. We actively aggregate relevant knowledge information to enhance text features. Simultaneously, we apply feature filters for acoustic and visual modalities to eliminate noise and enhance feature quality. Furthermore, we implement an interactive attention module by stacking designed Cross-modal Interactive Transformers (CITs) to continuously explore the relevance between the interacting parties in their respective semantic spaces, thus improving the effectiveness of modality interaction while reducing noise generated during the interaction. Lastly, we employ the Mixed Contrastive Learning (MCL) strategy to enhance the model’s ability to handle few-shot labels. This strategy utilizes unsupervised contrastive learning to improve the representation capability of the multimodal fusion features and supervised contrastive learning to extract information from few-shot labels. Extensive experiments on two benchmark datasets, IEMOCAP and MELD, validate the effectiveness and superiority of the proposed model.

Speaker-Independent Speech Emotion Recognition Based On Two-Layer Multiple Kernel Learning

Emotional Speech Clustering Based Robust Speaker Recognition System

Scores Selection for Emotional Speaker Recognition

Learning Polynomial Function Based Neutral-Emotion Gmm Transformation For Emotional Speaker Recognition

Emotional speaker recognition based on similar neighbor phenomenon

Combining Feature Selection And Representation For Speech Emotion Recognition

Speaker-Independent Speech Emotion Recognition Based On Cnn-Blstm And Multiple Svms

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Speech Emotion Recognition Based on Linear Discriminant Analysis and Support Vector Machine Decision Tree

Speech Emotion Recognition Based on Syllable-Level Feature Extraction

Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation

Feature Aggregation with Two-Layer Ensemble Framework for Multilingual Speech Emotion Recognition

Machine learning techniques for speech emotion recognition using paralinguistic acoustic features

Effective MLP and CNN based ensemble learning for speech emotion recognition

MRSLN: A Multimodal Residual Speaker-LSTM Network to alleviate the over-smoothing issue for Emotion Recognition in Conversation

Hybrid Deep Neural Network--Hidden Markov Model (DNN-HMM) Based Speech Emotion Recognition

A Discriminative Feature Representation Method Based on Cascaded Attention Network With Adversarial Strategy for Speech Emotion Recognition

Speech & Song Emotion Recognition Using Multilayer Perceptron and Standard Vector Machine

Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition

An efficient algorithm for recognition of emotions from speaker and language independent speech using deep learning

Comprehensive Speech Emotion Recognition System Employing Multi-Layer Perceptron (MLP) Classifier and libRosa Feature Extraction