Abstract: Emotion Recognition in Conversation (ERC) has attracted widespread attention in the natural language processing field due to its enormous potential for practical applications. Existing ERC methods struggle to generalize to diverse scenarios because of insufficient modeling of context, ambiguous capture of dialogue relationships, and overfitting in speaker modeling. In this work, we present a Hybrid Continuous Attributive Network (HCAN) to address these issues from the perspective of emotional continuation and emotional attribution. Specifically, HCAN adopts a hybrid recurrent and attention-based module to model global emotion continuity. Then a novel Emotional Attribution Encoding (EAE) is proposed to model intra- and inter-emotional attribution for each utterance. Moreover, to enhance the robustness of the model in speaker modeling and improve its performance in different scenarios, a comprehensive loss function, the emotional cognitive loss $\mathcal{L}_{\rm EC}$, is proposed to alleviate emotional drift and overcome the overfitting of the model to speaker modeling. Our model achieves state-of-the-art performance on three datasets, demonstrating the superiority of our work. Extensive comparative experiments and ablation studies on three benchmarks provide evidence for the efficacy of each module. Further generalization experiments show the plug-and-play nature of the EAE module in our method.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper aims to address several key challenges in Emotion Recognition in Conversation (ERC):

1. **Insufficient Context Modeling**:
   - Existing ERC methods struggle to capture global emotional continuity in long conversations. While methods based on recurrent neural networks can establish natural temporal correlations, they may not effectively capture emotional continuity in long conversations. On the other hand, methods based on attention mechanisms can aggregate emotional features at multiple levels but may not be as effective as temporal models in capturing emotional continuity over long time spans.
2. **Ambiguous Dialogue Relationships**:
   - Current methods lack detailed modeling of emotional influences in dialogue relationships. In real conversations, direct dialogue relationships often lead to more direct emotional transmission, but the ERC field still lacks methods that model emotional influences from the perspective of dialogue relationships in detail.
3. **Overfitting in Speaker Modeling**:
   - In ERC tasks, different speakers exhibit significant differences in emotional expression. Although some studies utilize complex network designs to leverage fine-grained information, these methods are prone to overfitting in different dialogue scenarios, thus limiting their effectiveness.

To address these issues, the authors propose a Hybrid Continuous Attributive Network (HCAN), which improves existing methods through the following approaches:

- **Emotional Continuation Encoding (ECE)**: Combines recurrent units and attention modules to extract more robust features in different dialogue scenarios, particularly excelling in long conversation samples.
- **Emotional Attribution Encoding (EAE)**: Based on the IA-attention mechanism, it models internal and external emotional attributions of each sentence from an attribution perspective, providing more direct and accurate human emotional understanding.
- **Emotional Cognitive Loss ($\mathcal{L}_{\rm EC}$)**: Enhances the model's robustness and generalization ability by combining cross-entropy, KL divergence, and adversarial emotional decoupling loss, mitigating emotional drift and overfitting in speaker modeling.

Through these innovations, HCAN achieves state-of-the-art performance on three benchmark datasets, demonstrating its superiority and generalization ability in different application scenarios.
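The way the Emotional Cognitive Loss combines several terms can be illustrated with a minimal sketch. This is not the authors' formulation: the weighted-sum form, the direction of the KL term, the weights `kl_weight` and `adv_weight`, and the function names are all assumptions made for illustration, and the adversarial emotional decoupling term is passed in as a precomputed scalar because the summary does not specify how it is computed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the gold emotion label."""
    return -math.log(probs[target_index])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def combined_loss(logits, target_index, reference_probs,
                  adversarial_term=0.0, kl_weight=1.0, adv_weight=1.0):
    """Hypothetical weighted sum of the three terms named in the summary.

    reference_probs is an assumed reference distribution for the KL term;
    adversarial_term stands in for the adversarial emotional decoupling
    loss, whose exact form is not given here.
    """
    probs = softmax(logits)
    ce = cross_entropy(probs, target_index)
    kl = kl_divergence(reference_probs, probs)
    return ce + kl_weight * kl + adv_weight * adversarial_term


if __name__ == "__main__":
    # Three emotion classes, gold label 0, uniform reference distribution.
    loss = combined_loss([2.0, 0.5, -1.0], 0, [1 / 3, 1 / 3, 1 / 3])
    print(f"combined loss: {loss:.4f}")
```

In this sketch the cross-entropy term drives classification, while the KL and adversarial terms act as regularizers whose relative influence is controlled by the two assumed weights.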

Watch the Speakers: A Hybrid Continuous Attribution Network for Emotion Recognition in Conversation With Emotion Disentanglement

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Self-attention Transfer Networks for Speech Emotion Recognition

EmotionIC: emotional inertia and contagion-driven dependency modeling for emotion recognition in conversation

Speaker-aware cognitive network with cross-modal attention for multimodal emotion recognition in conversation

A Contextual Attention Network for Multimodal Emotion Recognition in Conversation

A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation

SI-LSTM: Speaker Hybrid Long-short Term Memory and Cross Modal Attention for Emotion Recognition in Conversation

Conversational emotion recognition studies based on graph convolutional neural networks and a dependent syntactic analysis

Emotion Recognition in Conversation Based on a Dynamic Complementary Graph Convolutional Network

Emotion-Anchored Contrastive Learning Framework for Emotion Recognition in Conversation

Context- and Sentiment-Aware Networks for Emotion Recognition in Conversation

LR-GCN: Latent Relation-Aware Graph Convolutional Network for Conversational Emotion Recognition

Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation

DialoguePCN: Perception and Cognition Network for Emotion Recognition in Conversations

A Discriminative Feature Representation Method Based on Cascaded Attention Network With Adversarial Strategy for Speech Emotion Recognition

Emotion recognition in conversations with emotion shift detection based on multi-task learning

InstructERC: Reforming Emotion Recognition in Conversation with Multi-task Retrieval-Augmented Large Language Models

A Multi-Level Alignment and Cross-Modal Unified Semantic Graph Refinement Network for Conversational Emotion Recognition

Dialogue emotion model based on local–global context encoder and commonsense knowledge fusion attention