Abstract: In recent years, Intelligent Personal Assistants (IPAs) have emerged as important tools in human–computer interaction, with a wide range of applications such as voice assistants, virtual customer service, and navigation. Capturing and understanding the prominent emotional needs of users is important for improving the quality of service of IPAs. Multimodal emotion recognition in conversation (MMERC), which aims to automatically identify and track the emotional states of speakers over the course of a dialogue, has become a crucial component for building emotional IPAs and has attracted increasing attention. Current research in this field largely relies on graph-based modeling of cross-modal and single-modal interactions. However, these methods ignore the severe class imbalance inherent in MMERC, which reduces the generalization ability of the model and leaves it unable to recognize minority emotion classes effectively. Data mining methods address imbalanced classification through oversampling, but they are unsuitable for MMERC because they disrupt the conversational coherence and modality alignment of multimodal emotion recognition datasets. To overcome these problems, this paper proposes IMBA-MMERC, an effective framework for addressing the pervasive class imbalance in MMERC. Within this framework, sample generation for multimodal conversation addresses the application challenges posed by multimodal conversational emotion recognition datasets, and a well-classified encouraging loss mitigates the performance degradation on certain majority classes caused by decision boundary deviations. On two English benchmark datasets and one Chinese public dataset, we use two performance metrics to demonstrate the effectiveness and superiority of the proposed IMBA-MMERC. Ablation experiments, a case study, and histogram visualizations further verify the strong performance of the proposed framework.

Generating and encouraging: An effective framework for solving class imbalance in multimodal emotion recognition conversation