TETFN: A Text Enhanced Transformer Fusion Network for Multimodal Sentiment Analysis

Di Wang, Xutong Guo, Yumin Tian, Jinhui Liu, LiHuo He, Xuemei Luo
DOI: https://doi.org/10.1016/j.patcog.2022.109259
IF: 8
2023-01-01
Pattern Recognition
Abstract: Multimodal sentiment analysis (MSA), which aims to recognize sentiment expressed by speakers in videos utilizing textual, visual and acoustic cues, has attracted extensive research attention in recent years. However, textual, visual and acoustic modalities often contribute differently to sentiment analysis. In general, text contains more intuitive sentiment-related information and outperforms nonlinguistic modalities in MSA. Seeking a strategy to take advantage of this property to obtain a fusion representation containing more sentiment-related information and simultaneously preserving inter- and intra-modality relationships becomes a significant challenge. To this end, we propose a novel method named Text Enhanced Transformer Fusion Network (TETFN), which learns text-oriented pairwise cross-modal mappings for obtaining effective unified multimodal representations. In particular, it incorporates textual information in learning sentiment-related nonlinguistic representations through text-based multi-head attention. In addition to preserving consistency information by cross-modal mappings, it also retains the differentiated information among modalities through unimodal label prediction. Furthermore, the vision pre-trained model Vision-Transformer is utilized to extract visual features from the original videos to preserve both global and local information of a human face. Extensive experiments on benchmark datasets CMU-MOSI and CMU-MOSEI demonstrate the superior performance of the proposed TETFN over state-of-the-art methods. (c) 2022 Elsevier Ltd. All rights reserved.
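The abstract does not spell out how the text-based multi-head attention is wired, so the following is only a minimal sketch of one plausible reading: text features act as queries over a nonlinguistic (acoustic or visual) sequence, which supplies keys and values. The function names, the choice of text as the query side, and the omission of learned projection matrices are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        )
    return out


def split_heads(seq, num_heads):
    """Split each feature vector into num_heads contiguous chunks."""
    dim = len(seq[0])
    assert dim % num_heads == 0, "feature size must divide evenly into heads"
    width = dim // num_heads
    return [
        [vec[h * width:(h + 1) * width] for vec in seq]
        for h in range(num_heads)
    ]


def text_guided_attention(text, other, num_heads=2):
    """Hypothetical text-guided cross-modal attention.

    text:  list of T_text feature vectors (queries; an assumption).
    other: list of T_other nonlinguistic feature vectors (keys and values).
    Learned projections are omitted, so this is not a trained layer.
    Returns T_text vectors with the same width as the inputs.
    """
    heads = [
        attention(q_head, kv_head, kv_head)
        for q_head, kv_head in zip(
            split_heads(text, num_heads), split_heads(other, num_heads)
        )
    ]
    # Concatenate the per-head outputs back into full-width vectors.
    return [sum((head[t] for head in heads), []) for t in range(len(text))]


if __name__ == "__main__":
    text_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    audio_feats = [[0.2, 0.1, 0.0, 0.3], [0.9, 0.4, 0.5, 0.1], [0.0, 0.7, 0.2, 0.6]]
    for row in text_guided_attention(text_feats, audio_feats):
        print([round(x, 3) for x in row])
```

Each output row is a text-position-specific weighted average of the nonlinguistic frames, which is the sense in which text can steer what is retained from the other modality. A real model would add learned query, key and value projections, residual connections and normalization.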