TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation

Taeyang Yun, Hyunkuk Lim, Jeonghwan Lee, Min Song

2024-03-31

Abstract: Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to effectively respond to user requests. The emotions in a conversation can be identified by the representations from various modalities, such as audio, visual, and text. However, due to the weak contribution of non-verbal modalities to recognize emotions, multimodal ERC has always been considered a challenging task. In this paper, we propose Teacher-leading Multimodal fusion network for ERC (TelME). TelME incorporates cross-modal knowledge distillation to transfer information from a language model acting as the teacher to the non-verbal students, thereby optimizing the efficacy of the weak modalities. We then combine multimodal features using a shifting fusion approach in which student networks support the teacher. TelME achieves state-of-the-art performance in MELD, a multi-speaker conversation dataset for ERC. Finally, we demonstrate the effectiveness of our components through additional experiments.

Computation and Language, Machine Learning, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses Emotion Recognition in Conversation (ERC), particularly the challenges of emotion recognition in multimodal contexts. Specifically:

1. **Weak Contribution of Non-verbal Modalities**: Non-verbal modalities (such as audio and visual) contribute only weakly to emotion recognition, which makes multimodal emotion recognition difficult. The paper proposes a Teacher-leading Multimodal fusion network (TelME) that strengthens the information extracted from non-verbal modalities through cross-modal knowledge distillation.
2. **Modality Heterogeneity**: The heterogeneity between modalities makes effective multimodal fusion challenging. TelME alleviates this heterogeneity by using the text modality as the teacher model and applying a knowledge distillation strategy, thereby improving the emotion recognition performance of the non-verbal modalities.
3. **Multimodal Information Fusion**: The paper proposes an Attention-based modality Shifting Fusion method in which the student networks support the teacher model in the reverse fusion process, so that the information provided by the non-verbal modalities is fully utilized.

In summary, the main objective of this paper is to enhance the effectiveness of non-verbal modalities and improve overall performance in multimodal emotion recognition by leveraging cross-modal knowledge distillation.
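To make the idea of cross-modal knowledge distillation more concrete, the following is a minimal sketch of a generic response-based distillation loss, in which a student's emotion distribution is pulled toward the teacher's softened distribution. It is not the loss used in TelME, whose exact formulation is not given in this summary. The function names, the temperature value, and the example logits are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions.

    Generic response-based distillation (an assumption for illustration,
    not the TelME objective). The temperature**2 factor is the usual
    scaling that keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # teacher, e.g. the text model
    q = softmax(student_logits, temperature)  # student, e.g. the audio model
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over three emotion classes for one utterance.
teacher = [4.0, 1.0, 0.5]         # confident text teacher
student_far = [0.2, 0.1, 0.3]     # nearly uniform non-verbal student
student_close = [3.5, 1.2, 0.4]   # student after learning from the teacher

loss_far = distillation_loss(teacher, student_far)
loss_close = distillation_loss(teacher, student_close)
loss_same = distillation_loss(teacher, teacher)
```

In this sketch the loss shrinks as the student's predictions approach the teacher's and is zero when the two agree exactly, which is the sense in which a strong text teacher can guide weaker audio and visual students.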

M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation

Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

Multimodal Emotional Classification Based on Meaningful Learning

MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations

ITEACH-Net: Inverted Teacher-studEnt seArCH Network for Emotion Recognition in Conversation

Fusing pairwise modalities for emotion recognition in conversations

A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations

Speaker-aware cognitive network with cross-modal attention for multimodal emotion recognition in conversation

Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation

MultiEMO: an Attention-Based Correlation-Aware Multimodal Fusion Framework for Emotion Recognition in Conversations

Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation

MF-Net: a multimodal fusion network for emotion recognition based on multiple physiological signals

M-MELD: A Multilingual Multi-Party Dataset for Emotion Recognition in Conversations

Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation

Cross-modal fusion techniques for utterance-level emotion recognition from text and speech

CTNet: Conversational Transformer Network for Emotion Recognition

AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

Multimodal Emotion Recognition based on the Fusion of EEG Signals and Eye Movement Data

MRSLN: A Multimodal Residual Speaker-LSTM Network to alleviate the over-smoothing issue for Emotion Recognition in Conversation