TMFER: Multimodal Fusion Emotion Recognition Algorithm Based on Transformer

Jingru Cui,Zunying Qin,Xingyuan Chen,Guodong Li
DOI: https://doi.org/10.1109/ICHCI58871.2023.10277906
2023-08-04
Abstract:According to the problems of the existing emotion recognition algorithms, which are not rich in emotion information, weak in feature representation and not high in recognition accuracy, this paper proposes a multimodal fusion emotion recognition algorithm based on Transformer (TMFER), which fuses three modalities of text, speech and image information for emotion recognition. For the different characteristics of each modal information, Bert model pre-training processing, MFCC feature extraction and CNN feature extractor extraction methods are used to extract features for each modality respectively, to explore deeper features. To address the problem of unreasonable combination of multi-modal features, the Transformer Encode multi-headed attention mechanism is used to build a feature fusion module to extract and combine potential feature information in different modalities in parallel. The fused data are fed into the algorithm classification module for sentiment recognition classification, and a joint supervised loss function based on large margin learning is customized to solve the problem of unbalanced classification and feature confounding in the baseline model. Finally, based on the IEMOCAP and MELD multimodal datasets, the TMFER algorithm is experimentally compared with current algorithms in the field that are more effective in emotion recognition classification. The experimental results show that the TMFER algorithm outperforms other algorithms in all evaluation metrics.
Computer Science
What problem does this paper attempt to address?