Abstract: In recent years, Intelligent Personal Assistants (IPAs) have emerged as important tools in human–computer interaction, with a wide range of applications such as voice assistants, virtual customer service, and navigation. Capturing and understanding users' prominent emotional needs is important for improving the quality of service of IPAs. Multimodal emotion recognition in conversation (MMERC), which aims to automatically identify and track the emotional states of speakers during a dialogue, has become a crucial component for building emotional IPAs and has attracted increasing attention. Current research in this field is based on graph simulation of cross-modal and single-modal interactions. However, these methods ignore the severe class imbalance inherent in MMERC, which reduces the generalization ability of the model and leaves it unable to recognize minority emotion classes effectively. Data mining methods use oversampling to address imbalanced classification, but they are unsuitable for MMERC because they disrupt the conversational coherence and modality alignment of multimodal emotion recognition datasets. To overcome these problems, this paper proposes IMBA-MMERC, an effective framework for addressing the pervasive issue of class imbalance in MMERC. Within this framework, sample generation for multimodal conversation tackles the application challenges posed by multimodal conversational emotion recognition datasets, and a well-classified encouraging loss mitigates the performance degradation of the model on certain majority classes caused by decision boundary deviations. On two English benchmark datasets and one Chinese public dataset, we use two performance indicators to demonstrate the effectiveness and superiority of the proposed IMBA-MMERC. Ablation experiments, a case study, and histogram visualizations further verify the strong performance of the proposed framework.

Multimodal Emotion Recognition Calibration in Conversations

Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation

Deep Imbalanced Learning for Multimodal Emotion Recognition in Conversations

Revisiting Multi-modal Emotion Learning with Broad State Space Models and Probability-guidance Fusion

Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition

Generating and encouraging: An effective framework for solving class imbalance in multimodal emotion recognition conversation

CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation

A Facial Expression-Aware Multimodal Multi-task Learning Framework for Emotion Recognition in Multi-party Conversations

Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation

Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation

Multiple Knowledge-Enhanced Interactive Graph Network for Multimodal Conversational Emotion Recognition

Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum

UniMEEC: Towards Unified Multimodal Emotion Recognition and Emotion Cause

Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment

A Persona-Infused Cross-Task Graph Network for Multimodal Emotion Recognition with Emotion Shift Detection in Conversations

Speaker-aware cognitive network with cross-modal attention for multimodal emotion recognition in conversation

Improving Multimodal Emotion Recognition by Leveraging Acoustic Adaptation and Visual Alignment

MMGCN: Multimodal Fusion Via Deep Graph Convolution Network for Emotion Recognition in Conversation

Cross-modal fusion techniques for utterance-level emotion recognition from text and speech

Self-adaptive Context and Modal-interaction Modeling For Multimodal Emotion Recognition

Multimodal emotion recognition based on audio and text by using hybrid attention networks