Cross-modal Enhancement Network for Multimodal Sentiment Analysis
Di Wang,Shuai Liu,Quan Wang,Yumin Tian,Lihuo He,Xinbo Gao
DOI: https://doi.org/10.1109/tmm.2022.3183830
IF: 7.3
2022-01-01
IEEE Transactions on Multimedia
Abstract:Multimodal sentiment analysis (MSA) plays an important role in many applications, such as intelligent question-answering, computer-assisted psychotherapy and video understanding, and has attracted considerable attention in recent years. It leverages multimodal signals including verbal language, facial gestures, and acoustic behaviors to identify sentiments in videos. Language modality typically outperforms nonverbal modalities in MSA. Therefore, strengthening the significance of language in MSA will be a vital way to promote recognition accuracy. Considering that the meaning of a sentence often varies in different nonverbal contexts, combining nonverbal information with text representations is conducive to understanding the exact emotion conveyed by an utterance. In this paper, we propose a Cross-modal Enhancement Network (CENet) model to enhance text representations by integrating visual and acoustic information into a language model. Specifically, it embeds a Cross-modal Enhancement (CE) module, which enhances each word representation according to long-range emotional cues implied in unaligned nonverbal data, into a transformer-based pre-trained language model. Moreover, a feature transformation strategy is introduced for acoustic and visual modalities to reduce the distribution differences between the initial representations of verbal and nonverbal modalities, thereby facilitating the fusion of distinct modalities. Extensive experiments on benchmark datasets demonstrate the significant gains of CENet over state-of-the-art methods. The implementation codes are available on https://github.com/Say2L/CENet.
computer science, information systems,telecommunications, software engineering