Abstract: Large-scale vision-and-language representation learning has improved performance on various joint vision-language downstream tasks. In this work, we aim to extend it effectively to multimodal sentiment analysis and to address two pressing challenges in this field: (1) the low contribution of the visual modality and (2) the design of an effective multimodal fusion architecture. To overcome the imbalance between the visual and textual modalities, we propose an inter-frame hybrid transformer that extends the recent CLIP and TimeSformer architectures. This module extracts spatio-temporal features from sparsely sampled video frames, capturing body movement as well as facial expressions, and thus provides a more comprehensive visual representation than the conventional direct use of pre-extracted facial features. Additionally, we tackle modality heterogeneity in the fusion architecture by introducing a new scheme that prompts and aligns the video and text information before fusing them. Specifically, we generate discriminative text prompts based on the video content to enhance the text representation, and we align the unimodal video and text features using a video-text contrastive loss. Using only two modalities, our end-to-end trainable model achieves state-of-the-art performance on three widely used datasets: MOSI, MOSEI, and CH-SIMS. These experimental results validate the effectiveness of our approach for multimodal sentiment analysis.
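The abstract does not give the form of the video-text contrastive loss. As a rough illustration only, the sketch below assumes the common symmetric InfoNCE formulation used in CLIP-style models: paired video and text embeddings in a batch are treated as positives and all other pairings as negatives. The function names, the temperature of 0.07, and the use of plain Python lists in place of tensors are assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def video_text_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    video_feats[i] and text_feats[i] are assumed to describe the same clip.
    The temperature value is an assumed default, not the paper's setting.
    """
    v = [l2_normalize(x) for x in video_feats]
    t = [l2_normalize(x) for x in text_feats]
    # Cosine similarity of every video with every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
           for vi in v]
    sim_transposed = [list(col) for col in zip(*sim)]
    video_to_text = cross_entropy_diagonal(sim)
    text_to_video = cross_entropy_diagonal(sim_transposed)
    return 0.5 * (video_to_text + text_to_video)


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_text = [[0.0, 1.0], [1.0, 0.0]]
    print(video_text_contrastive_loss(video, aligned_text))
    print(video_text_contrastive_loss(video, shuffled_text))
```

In this toy batch, the loss is close to zero when each video embedding matches its own text embedding and becomes large when the pairing is shuffled, which is the behavior an alignment objective of this kind is meant to encourage before fusion.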

CLAP: Contrastive Language-Audio Pre-training Model for Multi-modal Sentiment Analysis

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

Masked Audio Modeling with CLAP and Multi-Objective Learning

Self-HCL: Self-Supervised Multitask Learning with Hybrid Contrastive Learning Strategy for Multimodal Sentiment Analysis

CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

CLAPSep: Leveraging Contrastive Pre-trained Model for Multi-Modal Query-Conditioned Target Sound Extraction

T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining

SoftMCL: Soft Momentum Contrastive Learning for Fine-grained Sentiment-aware Pre-training

Multimodal Sentiment Analysis With Two-Phase Multi-Task Learning

Text-Centric Multimodal Contrastive Learning for Sentiment Analysis

Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

Multi-level Contrastive Learning: Hierarchical Alleviation of Heterogeneity in Multimodal Sentiment Analysis

Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis

CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training

M$^{3}$SA: Multimodal Sentiment Analysis Based on Multi-Scale Feature Extraction and Multi-Task Learning

Multimodal Pretraining from Monolingual to Multilingual

Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment

Multimodal Sentiment Analysis with Preferential Fusion and Distance-aware Contrastive Learning

Text-oriented Modality Reinforcement Network for Multimodal Sentiment Analysis from Unaligned Multimodal Sequences

CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations