Abstract: Multimodal sentiment classification is an active research field that aims to distill sentiment information and classify the sentiment tendency of sequential multimodal data. Most existing sentiment recognition algorithms explore multimodal fusion schemes and achieve good performance. However, two key challenges remain. First, inter- and intra-modality features must be extracted effectively prior to fusion, while ambiguity is reduced at the same time. Second, the model must learn modality-invariant representations that capture the underlying similarities between modalities. In this paper, we present a modality-invariant temporal learning technique and a new gated inter-modality attention mechanism to address these issues. For the first challenge, the proposed gated inter-modality attention mechanism performs modality interactions and adaptively filters out inconsistencies among the modalities. We also use parallel structures to learn more comprehensive sentiment information from modality pairs (i.e., acoustic and visual). For the second challenge, we treat each modality as a multivariate Gaussian distribution (considering each timestamp as a single Gaussian distribution) and use the KL divergence to capture the implicit temporal distribution-level similarities. These strategies help reduce the domain shift between modalities and extract effective sequential modality-invariant representations. We conducted experiments on several public datasets (i.e., YouTube and MOUD), and the results show that the proposed method outperforms state-of-the-art multimodal sentiment classification methods.
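To make the distribution-level similarity idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each timestamp of a modality is summarized by a diagonal-covariance Gaussian (a mean vector and a variance vector), and it averages a symmetrized KL divergence over aligned timestamps. The abstract does not state the covariance structure, whether the divergence is symmetrized, or how timestamps are aggregated, so those choices and the names `gaussian_kl_diag` and `temporal_symmetric_kl` are illustrative assumptions.

```python
import math


def gaussian_kl_diag(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ) for one timestamp.

    Closed form for diagonal Gaussians, summed over feature dimensions.
    The diagonal covariance is an assumption made for this sketch.
    """
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def temporal_symmetric_kl(seq_a, seq_b):
    """Average symmetrized KL over aligned timestamps of two modalities.

    seq_a, seq_b: lists of (mean_vector, variance_vector) pairs,
    one pair per timestamp. Both sequences must have the same length.
    Symmetrizing and averaging are assumptions, not taken from the abstract.
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    for (mu_a, var_a), (mu_b, var_b) in zip(seq_a, seq_b):
        kl_ab = gaussian_kl_diag(mu_a, var_a, mu_b, var_b)
        kl_ba = gaussian_kl_diag(mu_b, var_b, mu_a, var_a)
        total += 0.5 * (kl_ab + kl_ba)
    return total / len(seq_a)


# Hypothetical toy data: 3 timestamps, 2 feature dimensions per modality.
acoustic = [([0.0, 0.0], [1.0, 1.0]) for _ in range(3)]
visual = [([1.0, 1.0], [1.0, 1.0]) for _ in range(3)]

loss_same = temporal_symmetric_kl(acoustic, acoustic)
loss_diff = temporal_symmetric_kl(acoustic, visual)
```

Used as a training loss, a term of this kind would be smaller when the per-timestamp distributions of two modalities are closer, which matches the stated goal of reducing the shift between modalities. In a real model the means and variances would be produced by learned encoders rather than fixed by hand.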

Sentiment-aware Multimodal Pre-Training for Multimodal Sentiment Analysis

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis

Text-Centric Multimodal Contrastive Learning for Sentiment Analysis

Multimodal Sentiment Analysis with Preferential Fusion and Distance-aware Contrastive Learning

Multimodal Pretraining from Monolingual to Multilingual

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

Multimodal Sentiment Analysis With Two-Phase Multi-Task Learning

Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis

Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis

Self-HCL: Self-Supervised Multitask Learning with Hybrid Contrastive Learning Strategy for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Based on Pre-LN Transformer Interaction

Learning Speaker-Independent Multimodal Representation for Sentiment Analysis

Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

Sentiment-Aware Word and Sentence Level Pre-training for Sentiment Analysis

MF-BERT: Multimodal Fusion in Pre-Trained BERT for Sentiment Analysis

Enhancing Sentence Representation with Visually-supervised Multimodal Pre-training

Sense-aware BERT and Multi-task Fine-tuning for Multimodal Sentiment Analysis

Multiple Contrastive Learning for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis based on Supervised Contrastive Learning and Cross-modal Translation under Modalities Missing

Dynamic Weighted Multitask Learning and Contrastive Learning for Multimodal Sentiment Analysis