Leveraging Label Information for Multimodal Emotion Recognition

Peiying Wang, Sunlu Zeng, Junqing Chen, Lu Fan, Meng Chen, Youzheng Wu, Xiaodong He
2023-09-05
Abstract: Multimodal emotion recognition (MER) aims to detect the emotional status of a given expression by combining the speech and text information. Intuitively, label information should be capable of helping the model locate the salient tokens/frames relevant to the specific emotion, which finally facilitates the MER task. Inspired by this, we propose a novel approach for MER by leveraging label information. Specifically, we first obtain the representative label embeddings for both text and speech modalities, then learn the label-enhanced text/speech representations for each utterance via label-token and label-frame interactions. Finally, we devise a novel label-guided attentive fusion module to fuse the label-aware text and speech representations for emotion classification. Extensive experiments were conducted on the public IEMOCAP dataset, and experimental results demonstrate that our proposed approach outperforms existing baselines and achieves new state-of-the-art performance.
Computation and Language, Artificial Intelligence, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses how to use label information in Multimodal Emotion Recognition (MER) so that the model can better identify the text tokens and speech frames relevant to a specific emotion. Most existing methods use labels only as supervisory signals and ignore the semantic information that the labels themselves carry. The authors argue that exploiting this information helps the model understand the input utterance and locate the salient tokens/frames for each emotion, which in turn improves recognition. To this end, the paper first obtains label embeddings for both the text and speech modalities, then learns label-enhanced text and speech representations through label-token and label-frame interactions, and finally fuses the label-aware representations with a label-guided attentive fusion module for emotion classification. The reported experiments on the public IEMOCAP dataset show that the method outperforms existing baselines and achieves new state-of-the-art performance.
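The idea of a label-token (or label-frame) interaction can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the scaled dot-product scoring, the random stand-in features, the choice of four emotion classes, and the score-summing fusion step are all assumptions made for illustration, and the paper's actual fusion module is likely more elaborate.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def label_aware_pooling(features, label_emb):
    """Hypothetical label-token interaction.

    features:  (T, d) token or frame vectors for one utterance
    label_emb: (K, d) one embedding per emotion label
    Returns a (K, d) label-aware summary and the (T, K) attention weights.
    """
    d = features.shape[1]
    scores = features @ label_emb.T / np.sqrt(d)  # (T, K) similarity
    weights = softmax(scores, axis=0)             # normalise over tokens/frames per label
    pooled = weights.T @ features                 # (K, d) weighted summary per label
    return pooled, weights

def classify(text_feats, speech_feats, text_labels, speech_labels):
    # Placeholder fusion (an assumption): score each emotion by how well its
    # pooled summary matches its label embedding, then sum both modalities.
    text_pooled, _ = label_aware_pooling(text_feats, text_labels)
    speech_pooled, _ = label_aware_pooling(speech_feats, speech_labels)
    logits = (text_pooled * text_labels).sum(axis=1) \
           + (speech_pooled * speech_labels).sum(axis=1)
    return softmax(logits, axis=0)

# Random stand-ins for encoder outputs and label embeddings.
rng = np.random.default_rng(0)
K, d = 4, 8                               # hypothetical: 4 emotion classes, 8-dim features
text_feats = rng.normal(size=(6, d))      # 6 tokens
speech_feats = rng.normal(size=(20, d))   # 20 frames
text_labels = rng.normal(size=(K, d))
speech_labels = rng.normal(size=(K, d))

_, weights = label_aware_pooling(text_feats, text_labels)
probs = classify(text_feats, speech_feats, text_labels, speech_labels)
print(weights.shape, probs.shape)
```

Each column of `weights` is a distribution over tokens for one emotion label, so inspecting it shows which tokens a given label attends to most, which is the "locating salient tokens/frames" intuition described above.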