Abstract: Multimodal sentiment analysis aims to extract the sentiment that users express through multimodal data, including linguistic, acoustic, and visual cues. However, the heterogeneity of multimodal data produces disparities in modal distributions, which limits the model’s ability to integrate complementary and redundant information across modalities. Additionally, existing approaches often fuse modalities directly after obtaining their representations, overlooking potential emotional correlations between them. To tackle these challenges, we propose a Multiview Collaborative Perception (MVCP) framework for multimodal sentiment analysis. The framework consists of two main modules: Multimodal Disentangled Representation Learning (MDRL) and Cross-Modal Context Association Mining (CMCAM). The MDRL module employs a joint learning layer composed of a common encoder and an exclusive encoder. This layer maps multimodal data onto a hypersphere and learns common and exclusive representations for each modality, thereby mitigating the semantic gap arising from modal heterogeneity. To further bridge semantic gaps and capture complex inter-modal correlations, the CMCAM module uses multiple attention mechanisms to mine cross-modal and contextual sentiment associations, yielding joint representations with rich multimodal semantic interactions. At this stage, CMCAM mines correlations only among the common representations, so that the exclusive representation of each modality is preserved. Finally, a multitask learning framework shares parameters across the single-modal tasks to improve sentiment prediction performance. Experimental results on the MOSI and MOSEI datasets demonstrate the effectiveness of the proposed method.
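The abstract describes MDRL and CMCAM only at a high level. The following is a minimal, hypothetical sketch of the data flow as we read it, not the authors' implementation. It uses a common encoder whose weights are shared by all modalities and one exclusive encoder per modality, with both outputs L2-normalized onto the unit hypersphere. Attention is then applied only among the common representations, and the exclusive representations are concatenated unchanged. Several details are assumptions: the names `JointLearningLayer`, `cross_modal_attention`, and `joint_representation`, the pre-projection of every modality to one input dimension, and the use of a single scaled dot-product attention step in place of the paper's "multiple attention mechanisms". Weights are random and untrained, and training losses and the multitask heads are omitted.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit L2 norm, i.e. project it onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def matvec(matrix, vec):
    """Multiply a [rows][cols] matrix (list of rows) by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def rand_matrix(rows, cols, rng):
    """Random, untrained weights; a real model would learn these."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class JointLearningLayer:
    """Hypothetical MDRL-style layer: one common encoder whose weights are shared
    by all modalities, plus one exclusive encoder per modality. Assumes every
    modality has already been projected to the same input dimension `dim`."""

    def __init__(self, modalities, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.common = rand_matrix(hidden, dim, rng)
        self.exclusive = {m: rand_matrix(hidden, dim, rng) for m in modalities}

    def __call__(self, inputs):
        # Both outputs are L2-normalized, so they lie on the unit hypersphere.
        common = {m: l2_normalize(matvec(self.common, x)) for m, x in inputs.items()}
        exclusive = {m: l2_normalize(matvec(self.exclusive[m], x))
                     for m, x in inputs.items()}
        return common, exclusive


def cross_modal_attention(common):
    """Hypothetical CMCAM-style step: each modality's common vector (query)
    attends over the common vectors of the other modalities (keys = values)
    with scaled dot-product attention. Exclusive vectors are never passed in."""
    hidden = len(next(iter(common.values())))
    attended = {}
    for m, query in common.items():
        others = [v for k, v in common.items() if k != m]
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(hidden)
                  for key in others]
        weights = softmax(scores)
        attended[m] = [sum(w * v[i] for w, v in zip(weights, others))
                       for i in range(hidden)]
    return attended


def joint_representation(inputs, layer):
    """Concatenate attended common vectors with untouched exclusive vectors."""
    common, exclusive = layer(inputs)
    attended = cross_modal_attention(common)
    joint = []
    for m in sorted(inputs):  # fixed modality order gives a stable layout
        joint.extend(attended[m])
        joint.extend(exclusive[m])
    return joint


rng = random.Random(1)
dim, hidden = 8, 4
inputs = {m: [rng.gauss(0.0, 1.0) for _ in range(dim)]
          for m in ("text", "audio", "vision")}
layer = JointLearningLayer(inputs.keys(), dim, hidden)
z = joint_representation(inputs, layer)
print(len(z))  # 24: 3 modalities x (4 attended + 4 exclusive)
```

The exclusive vectors are kept out of `cross_modal_attention` on purpose. This mirrors the abstract's statement that CMCAM mines correlations only among the common representations so that each modality's exclusive information is preserved. The paper's contextual (temporal) associations are not modeled in this sketch.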

Multimodal Sentiment Analysis based on Supervised Contrastive Learning and Cross-modal Translation under Modalities Missing

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Multimodal Sentiment Analysis with Missing Modality: A Knowledge-Transfer Approach

Cross-modal contrastive learning for multimodal sentiment recognition

Text-Centric Multimodal Contrastive Learning for Sentiment Analysis

Modality translation-based multimodal sentiment analysis under uncertain missing modalities

Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Missing Modality Reconstruction Network Based on Shared-Specific Features

Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

Toward Robust Multimodal Learning using Multimodal Foundational Models

Multimodal Sentiment Analysis Based on Disentangled Representation Learning and Cross-Modal-context Association Mining

Multiple Contrastive Learning for Multimodal Sentiment Analysis

Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention

Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis

Multimodal Sentiment Analysis Representations Learning via Contrastive Learning with Condense Attention Fusion

Multimodal Sentiment Analysis Based on Pre-LN Transformer Interaction

MissModal: Increasing Robustness to Missing Modality in Multimodal Sentiment Analysis

Tag-assisted Multimodal Sentiment Analysis under Uncertain Missing Modalities

Learning Modality-Complementary and Eliminating-Redundancy Representations with Multi-Task Learning for Multimodal Sentiment Analysis