Text-Centric Multimodal Contrastive Learning for Sentiment Analysis

Heng Peng, Xue Gu, Jian Li, Zhaodan Wang, Hao Xu
DOI: https://doi.org/10.3390/electronics13061149
IF: 2.9
2024-03-21
Electronics
Abstract: Multimodal sentiment analysis aims to acquire and integrate sentimental cues from different modalities to identify the sentiment expressed in multimodal data. Despite the widespread adoption of pre-trained language models in recent years to enhance model performance, current research in multimodal sentiment analysis still faces several challenges. Firstly, although pre-trained language models have significantly elevated the density and quality of text features, the present models adhere to a balanced design strategy that lacks a concentrated focus on textual content. Secondly, prevalent feature fusion methods often hinge on spatial consistency assumptions, neglecting essential information about modality interactions and sample relationships within the feature space. In order to surmount these challenges, we propose a text-centric multimodal contrastive learning framework (TCMCL). This framework centers around text and augments text features separately from audio and visual perspectives. In order to effectively learn feature space information from different cross-modal augmented text features, we devised two contrastive learning tasks based on instance prediction and sentiment polarity; this promotes implicit multimodal fusion and obtains more abstract and stable sentiment representations. Our model demonstrates performance that surpasses the current state-of-the-art methods on both the CMU-MOSI and CMU-MOSEI datasets.
Categories: Engineering, Electrical & Electronic; Computer Science, Information Systems; Physics, Applied
What problem does this paper attempt to address?
This paper addresses how to make better use of textual information in multimodal sentiment analysis and how to obtain more abstract and stable sentiment representations through contrastive learning. It identifies two main challenges in current multimodal sentiment analysis:
1. Although pre-trained language models have significantly improved the density and quality of text features, existing models still follow a balanced design strategy and do not concentrate on textual content.
2. Existing feature fusion methods typically rely on spatial consistency assumptions and neglect important information about modality interactions and sample relationships within the feature space.
To address these issues, the authors propose a text-centric multimodal contrastive learning framework (TCMCL). The framework centers on text and augments text features separately from the audio and visual perspectives. Two contrastive learning tasks, one based on instance prediction and one on sentiment polarity, then promote implicit multimodal fusion and yield more abstract and stable sentiment representations. The authors report that the model outperforms existing state-of-the-art methods on the CMU-MOSI and CMU-MOSEI datasets.
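To make the two contrastive tasks more concrete, the following is a minimal NumPy sketch. It is not the paper's implementation: this summary does not state the exact loss functions, so the sketch assumes a standard InfoNCE loss for the instance-level task and a supervised contrastive loss for the sentiment-polarity task. The function names, the `temperature` value, and the inputs `text_audio` and `text_visual` (standing in for audio-augmented and visual-augmented text features) are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def instance_contrastive_loss(text_audio, text_visual, temperature=0.1):
    """Assumed InfoNCE form: the two augmented views of the same sample are
    positives, and every other sample in the batch is a negative."""
    a = l2_normalize(text_audio)
    v = l2_normalize(text_visual)
    logits = a @ v.T / temperature          # (batch, batch) similarity matrix
    log_prob = log_softmax(logits)
    return -np.mean(np.diag(log_prob))      # matching pairs lie on the diagonal


def polarity_contrastive_loss(features, polarity, temperature=0.1):
    """Assumed supervised contrastive form: samples that share a sentiment
    polarity label are positives. Expects a batch of at least two samples."""
    z = l2_normalize(features)
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    logits = np.where(not_self, z @ z.T / temperature, -np.inf)  # drop self-pairs
    log_prob = log_softmax(logits)
    positives = (polarity[:, None] == polarity[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0                 # skip anchors with no positive
    if not has_pos.any():
        return 0.0
    summed = np.where(positives, log_prob, 0.0).sum(axis=1)
    return -np.mean(summed[has_pos] / pos_count[has_pos])
```

In a training loop, one plausible use (again an assumption, not a detail from the paper) would be to add both terms, with weighting coefficients, to the main sentiment prediction loss. Both functions return lower values when the augmented views of a sample, or samples with the same polarity, lie closer together in feature space than unrelated samples do.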