Text-Centric Multimodal Contrastive Learning for Sentiment Analysis

Heng Peng, Xue Gu, Jian Li, Zhaodan Wang, Hao Xu

DOI: https://doi.org/10.3390/electronics13061149

IF: 2.9

2024-03-21

Electronics

Abstract: Multimodal sentiment analysis aims to acquire and integrate sentimental cues from different modalities to identify the sentiment expressed in multimodal data. Despite the widespread adoption of pre-trained language models in recent years to enhance model performance, current research in multimodal sentiment analysis still faces several challenges. Firstly, although pre-trained language models have significantly elevated the density and quality of text features, the present models adhere to a balanced design strategy that lacks a concentrated focus on textual content. Secondly, prevalent feature fusion methods often hinge on spatial consistency assumptions, neglecting essential information about modality interactions and sample relationships within the feature space. In order to surmount these challenges, we propose a text-centric multimodal contrastive learning framework (TCMCL). This framework centers around text and augments text features separately from audio and visual perspectives. In order to effectively learn feature space information from different cross-modal augmented text features, we devised two contrastive learning tasks based on instance prediction and sentiment polarity; this promotes implicit multimodal fusion and obtains more abstract and stable sentiment representations. Our model demonstrates performance that surpasses the current state-of-the-art methods on both the CMU-MOSI and CMU-MOSEI datasets.

Engineering, Electrical & Electronic; Computer Science, Information Systems; Physics, Applied

What problem does this paper attempt to address?

This paper addresses how to make better use of textual information in multimodal sentiment analysis and how to obtain more abstract and stable sentiment representations through contrastive learning. It identifies two main challenges in current multimodal sentiment analysis:

1. Although pre-trained language models have significantly improved the density and quality of text features, existing models still follow a balanced design strategy and do not concentrate on textual content.
2. Existing feature fusion methods typically rely on spatial consistency assumptions, neglecting important information about modality interactions and sample relationships within the feature space.

To address these issues, the authors propose a text-centric multimodal contrastive learning framework (TCMCL). The framework centers on text and augments text features separately from the audio and visual perspectives. Two contrastive learning tasks, one based on instance prediction and one based on sentiment polarity, learn feature space information from these cross-modal augmented text features and promote implicit multimodal fusion, yielding more abstract and stable sentiment representations. According to the authors, the model outperforms existing state-of-the-art methods on the CMU-MOSI and CMU-MOSEI datasets.
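
The two contrastive tasks described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the summary gives no loss formulas, so the sketch assumes a standard InfoNCE objective for instance prediction and a standard supervised contrastive objective for sentiment polarity. The function names, the temperature value, and the inputs `text_audio` and `text_visual` (standing in for audio-augmented and visual-augmented text features of the same batch) are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax_rows(logits):
    """Numerically stable log-softmax over each row."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def instance_contrastive_loss(text_audio, text_visual, temperature=0.1):
    """Assumed InfoNCE form: row i of one view should identify row i of the other."""
    a = l2_normalize(np.asarray(text_audio, dtype=float))
    v = l2_normalize(np.asarray(text_visual, dtype=float))
    logits = a @ v.T / temperature
    # Positive pairs sit on the diagonal; average both prediction directions.
    loss_av = -np.diag(log_softmax_rows(logits)).mean()
    loss_va = -np.diag(log_softmax_rows(logits.T)).mean()
    return 0.5 * (loss_av + loss_va)


def polarity_contrastive_loss(features, polarity, temperature=0.1):
    """Assumed supervised contrastive form: same-polarity samples are positives.

    Expects at least two samples and at least one pair sharing a label.
    """
    z = l2_normalize(np.asarray(features, dtype=float))
    polarity = np.asarray(polarity)
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    # Exclude each sample's similarity with itself from the softmax denominator.
    logits = np.where(not_self, z @ z.T / temperature, -np.inf)
    log_prob = log_softmax_rows(logits)
    positives = (polarity[:, None] == polarity[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    # Mean log-probability of the positives for each anchor that has any.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return -(pos_log_prob[has_pos] / pos_count[has_pos]).mean()
```

In a training loop, losses of this kind would typically be added to the main sentiment prediction loss with weighting coefficients. How TCMCL combines or weights its objectives is not stated in the text above, so that step is left out of the sketch.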

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection

Cross-modal contrastive learning for multimodal sentiment recognition

Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis

Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

TSCL-FHFN: two-stage contrastive learning and feature hierarchical fusion network for multimodal sentiment analysis

Multimodal Sentiment Analysis Representations Learning via Contrastive Learning with Condense Attention Fusion

Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

Self-HCL: Self-Supervised Multitask Learning with Hybrid Contrastive Learning Strategy for Multimodal Sentiment Analysis

CTHFNet: contrastive translation and hierarchical fusion network for text–video–audio sentiment analysis

Dynamic Weighted Multitask Learning and Contrastive Learning for Multimodal Sentiment Analysis

A text guided multi-task learning network for multimodal sentiment analysis

Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis

Multi-level Contrastive Learning: Hierarchical Alleviation of Heterogeneity in Multimodal Sentiment Analysis

CSMF-SPC: Multimodal Sentiment Analysis Model with Effective Context Semantic Modality Fusion and Sentiment Polarity Correction

Sentiment-aware Multimodal Pre-Training for Multimodal Sentiment Analysis

TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis

Text-oriented Modality Reinforcement Network for Multimodal Sentiment Analysis from Unaligned Multimodal Sequences

Multimodal Sentiment Analysis with Preferential Fusion and Distance-aware Contrastive Learning