TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation

Taeyang Yun, Hyunkuk Lim, Jeonghwan Lee, Min Song
2024-03-31
Abstract: Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to effectively respond to user requests. The emotions in a conversation can be identified by the representations from various modalities, such as audio, visual, and text. However, due to the weak contribution of non-verbal modalities to recognize emotions, multimodal ERC has always been considered a challenging task. In this paper, we propose Teacher-leading Multimodal fusion network for ERC (TelME). TelME incorporates cross-modal knowledge distillation to transfer information from a language model acting as the teacher to the non-verbal students, thereby optimizing the efficacy of the weak modalities. We then combine multimodal features using a shifting fusion approach in which student networks support the teacher. TelME achieves state-of-the-art performance in MELD, a multi-speaker conversation dataset for ERC. Finally, we demonstrate the effectiveness of our components through additional experiments.
Computation and Language, Machine Learning, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses Emotion Recognition in Conversation (ERC), and in particular the difficulty of recognizing emotions from multiple modalities. Specifically:

1. **Weak Contribution of Non-verbal Modalities**: Audio and visual signals contribute far less to emotion recognition than text, which makes multimodal ERC difficult. The paper proposes a Teacher-leading Multimodal fusion network (TelME) that strengthens the representations learned from non-verbal modalities through cross-modal knowledge distillation.
2. **Modality Heterogeneity**: Differences between modalities make effective multimodal fusion challenging. TelME reduces this heterogeneity by treating the text modality as the teacher and distilling its knowledge into the audio and visual students, which improves the emotion recognition performance of the non-verbal modalities.
3. **Multimodal Information Fusion**: The paper proposes an Attention-based modality Shifting Fusion method that reverses the distillation roles, so that the student networks support the teacher model during fusion and the information carried by the non-verbal modalities is put to use.

In summary, the main objective of the paper is to make non-verbal modalities more effective, and thereby improve overall performance in multimodal emotion recognition, by leveraging cross-modal knowledge distillation.
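
To make the teacher-student idea concrete, the following is a minimal sketch of a generic response-based distillation loss, in which a student is pulled toward a teacher's temperature-softened class distribution. It is an illustration only: the summary above does not specify TelME's actual loss terms, so the KL formulation, the `temperature` value, the function names, and the example logits are all assumptions rather than the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Hypothetical helper: a common form of knowledge distillation,
    not necessarily the loss used in TelME.
    """
    p = softmax(teacher_logits, temperature)  # text teacher
    q = softmax(student_logits, temperature)  # audio or visual student
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


# Made-up logits over three emotion classes, for illustration only.
text_teacher_logits = [2.5, 0.3, -1.0]
audio_student_logits = [0.4, 0.2, 0.1]
loss = distillation_loss(text_teacher_logits, audio_student_logits)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal that would push a weak non-verbal student toward the stronger text teacher.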