Multimodal Emotion Recognition Based on Multilevel Acoustic and Textual Information

Ya Zhou, Yuyi Xing, Guimin Huang, Qingkai Guo, Nanxiao Deng
DOI: https://doi.org/10.1117/12.3009468
2023-01-01
Abstract: Multimodal emotion recognition has attracted growing attention in recent years and remains one of the challenging tasks in affective computing. We propose a multimodal speech emotion recognition model that uses multilevel acoustic and textual information, with transcribed text complementing the speech data to enable accurate emotion recognition. In the unimodal branches, we employ AlexNet, BiGRU, and HuBERT to extract multilevel acoustic features from speech, and a RoBERTa encoder to extract text features. We then fuse speech and text with a co-attentive mechanism that extracts complementary information across modalities and eliminates inter-modality noise, which enhances the emotional representation of the target modality. Finally, the fused features are used to predict the emotion category. Our model achieves a weighted accuracy of 77.41% and an unweighted accuracy of 78.66% on emotion recognition.
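The co-attentive fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so the sketch assumes single-head scaled dot-product cross-attention in each direction, a residual connection, mean pooling over time, and concatenation. The names `cross_attention` and `co_attentive_fusion`, the feature dimension, the sequence lengths, and the number of emotion classes are all hypothetical, and random arrays stand in for the acoustic and RoBERTa features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(target, source, w_q, w_k, w_v):
    """Enhance `target` (T_t, d) with information attended from `source` (T_s, d).

    Assumed form: single-head scaled dot-product attention plus a residual.
    """
    q = target @ w_q                              # queries from the target modality
    k = source @ w_k                              # keys from the other modality
    v = source @ w_v                              # values from the other modality
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (T_t, T_s)
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return target + weights @ v, weights

def co_attentive_fusion(speech, text, params):
    # Each modality attends to the other; pooling and concatenation are assumptions.
    speech_enh, _ = cross_attention(speech, text, *params["speech_from_text"])
    text_enh, _ = cross_attention(text, speech, *params["text_from_speech"])
    return np.concatenate([speech_enh.mean(axis=0), text_enh.mean(axis=0)])

rng = np.random.default_rng(0)
d = 16                                            # hypothetical shared feature size
speech = rng.normal(size=(50, d))                 # stand-in for acoustic frame features
text = rng.normal(size=(12, d))                   # stand-in for text token features
params = {
    name: [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]
    for name in ("speech_from_text", "text_from_speech")
}

fused = co_attentive_fusion(speech, text, params)
num_classes = 4                                   # hypothetical number of emotion classes
logits = fused @ rng.normal(size=(2 * d, num_classes))
predicted_class = int(np.argmax(logits))
```

In a trained model the projection matrices and the classifier would be learned jointly with the encoders; here they are random, so `predicted_class` carries no meaning beyond showing the data flow from two feature sequences to a single emotion prediction.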