A three-stage multimodal emotion recognition network based on text low-rank fusion

Linlin Zhao, Youlong Yang, Tong Ning
DOI: https://doi.org/10.1007/s00530-024-01345-5
IF: 3.9
2024-05-08
Multimedia Systems
Abstract: Multimodal emotion recognition achieves good results by fusing information from multiple modalities, such as audio, text, and visual signals. How to use multimodal interaction and fusion to transform sparse unimodal representations into a compact multimodal representation has become a key research topic in the field. Two difficulties remain: the extracted unimodal features must be representative, and multimodal fusion can lose feature information. To address these problems, this paper proposes a three-stage multimodal emotion recognition network based on text low-rank fusion, which extracts unimodal features, combines bimodal features, and fuses multimodal features. Specifically, the first (feature extraction) stage introduces a Residual-based Attention Mechanism that filters out redundant information and extracts valuable unimodal information. The second stage uses a Cross-modal Transformer to perform the inter-modal interaction. Finally, a Text-based Low-rank Fusion Module enhances multimodal fusion by leveraging the complementarity between modalities, so that the fused features are comprehensive. The proposed model reaches accuracies of 82.1%, 80.8%, and 83.0% on the CMU-MOSEI, CMU-MOSI, and IEMOCAP datasets, respectively. Extensive ablation experiments are also conducted to verify the effectiveness and generalization of the model.
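The abstract does not spell out how the Text-based Low-rank Fusion Module is built, so the following is only a minimal sketch of the generic low-rank fusion idea, not the authors' module. It assumes each modality vector is extended with a constant 1 and projected by rank-R modality-specific factors, with the projections multiplied elementwise and summed over the rank. All names, dimensions, and random weights (`low_rank_fusion`, `random_factors`, the feature values) are hypothetical and chosen for illustration. The helper `full_tensor_fusion` computes the same result through the explicit three-way outer product, which shows what the low-rank form avoids materializing.

```python
import random


def project(z, W):
    """Return z^T W for a vector z (length d) and a matrix W (d x d_out)."""
    return [sum(z[i] * W[i][k] for i in range(len(z))) for k in range(len(W[0]))]


def low_rank_fusion(features, factors):
    """Fuse modality vectors using rank-R modality-specific factors.

    features: dict name -> feature vector (list of floats)
    factors:  dict name -> list of R matrices, each (d_m + 1) x d_out
    """
    names = list(features)
    rank = len(factors[names[0]])
    d_out = len(factors[names[0]][0][0])
    fused = [0.0] * d_out
    for r in range(rank):
        term = [1.0] * d_out
        for name in names:
            z = features[name] + [1.0]  # appended 1 keeps lower-order terms
            p = project(z, factors[name][r])
            term = [t * q for t, q in zip(term, p)]  # elementwise product
        fused = [f + t for f, t in zip(fused, term)]  # sum over the rank
    return fused


def full_tensor_fusion(za, zt, zv, Wa, Wt, Wv):
    """Reference: contract the explicit outer product with the full weight tensor."""
    za, zt, zv = za + [1.0], zt + [1.0], zv + [1.0]
    d_out = len(Wa[0][0])
    out = [0.0] * d_out
    for k in range(d_out):
        for i, a in enumerate(za):
            for j, t in enumerate(zt):
                for l, v in enumerate(zv):
                    # Weight tensor entry rebuilt from its rank-R factors.
                    w = sum(Wa[r][i][k] * Wt[r][j][k] * Wv[r][l][k]
                            for r in range(len(Wa)))
                    out[k] += a * t * v * w
    return out


def random_factors(rng, rank, d_in, d_out):
    """Hypothetical random factors; a trained model would learn these."""
    return [[[rng.uniform(-1, 1) for _ in range(d_out)]
             for _ in range(d_in + 1)] for _ in range(rank)]


rng = random.Random(0)
features = {"audio": [0.2, -0.1], "text": [0.5, 0.3, -0.4], "visual": [0.1, 0.7]}
factors = {name: random_factors(rng, rank=2, d_in=len(vec), d_out=4)
           for name, vec in features.items()}

fused = low_rank_fusion(features, factors)
reference = full_tensor_fusion(features["audio"], features["text"], features["visual"],
                               factors["audio"], factors["text"], factors["visual"])
```

In this sketch `fused` and `reference` agree up to floating-point error. The low-rank path only ever handles vectors of the output size, whereas the reference loops over every combination of audio, text, and visual entries.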
computer science, information systems, theory & methods
What problem does this paper attempt to address?