Abstract: Multimodal sentiment analysis is important for human-computer interaction. Because unlabeled online resources are far easier to collect than annotated data, which is costly to produce, researchers need semi-supervised methods that leverage unlabeled data to improve model performance. Existing semi-supervised approaches, particularly those applied to simple image classification tasks, are not suitable for multimodal regression tasks because they rely on task-specific augmentation and on thresholds designed for classification. To address this limitation, we propose the Multimodal Consistency-based Teacher (MC-Teacher), which incorporates a consistency-based pseudo-label technique into semi-supervised multimodal sentiment analysis. We first propose a synergistic consistency assumption, which focuses on the consistency among bimodal representations. Building on this assumption, we develop a learnable filter network that autonomously learns to identify misleading instances, replacing threshold-based methods. The filter is trained using both the implicit discriminant consistency on unlabeled instances and explicit guidance from training data constructed with labeled instances. Additionally, we design a self-adaptive exponential moving average strategy that decouples the student and teacher networks using a heuristic momentum coefficient. Quantitative and qualitative experiments on two benchmark datasets demonstrate the strong performance of the proposed MC-Teacher approach. Furthermore, detailed analysis experiments and case studies are provided for each crucial component to elucidate its inner mechanism and further validate its effectiveness.
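As a rough illustration of the teacher-student mechanism mentioned in the abstract, the sketch below shows the standard exponential moving average (EMA) update used in mean-teacher style training, where the teacher's parameters track a smoothed copy of the student's. It is not the paper's method: the abstract does not specify how the self-adaptive momentum coefficient is computed, so `momentum` is simply passed in as an argument, and the names `ema_update` and `Params` are hypothetical.

```python
from typing import Dict, List

# Hypothetical parameter container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, momentum: float) -> Params:
    """Return new teacher parameters: m * teacher + (1 - m) * student.

    This is the plain EMA rule. A self-adaptive scheme would choose
    `momentum` per step; that rule is not given in the abstract, so the
    caller supplies the value here (assumption).
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must be in [0, 1]")
    return {
        name: [
            momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


teacher = {"w": [1.0, 2.0]}
student = {"w": [3.0, 6.0]}
new_teacher = ema_update(teacher, student, momentum=0.5)
frozen_teacher = ema_update(teacher, student, momentum=1.0)
```

With `momentum` close to 1 the teacher changes slowly and stays loosely coupled to the student; with `momentum` close to 0 it nearly copies the student at every step.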

Multimodal Consistency-based Teacher for Semi-supervised Multimodal Sentiment Analysis
