CTHFNet: contrastive translation and hierarchical fusion network for text–video–audio sentiment analysis

Qiaohong Chen, Shufan Xie, Xian Fang, Qi Sun
DOI: https://doi.org/10.1007/s00371-024-03668-w
IF: 2.835
2024-10-12
The Visual Computer
Abstract: Multimodal sentiment analysis aims to predict human sentiment polarity or intensity from heterogeneous information sources such as text, audio, and video. Previous research has focused on exploring multimodal fusion strategies while neglecting intra-modal noise. Both matter for sentiment prediction, as sentiment information may be dispersed across modalities or aggregated within a single modality. This paper presents a novel framework called contrastive translation and hierarchical fusion network (CTHFNet) to model complex relationships within and between modalities. Specifically, CTHFNet leverages a modality translator based on contrastive learning and the Seq2seq model to translate non-verbal modalities into the textual modality, filtering unimodal noise. In addition, CTHFNet utilizes a cross-hierarchical multimodal fusion network that captures interactions between modalities at different hierarchies and classes with a contrastive learning task. Extensive experiments on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that our approach outperforms state-of-the-art methods in almost all metrics. The implementation of this work is available at https://zenodo.org/records/12492274.
computer science, software engineering